R-bloggers: R news and tutorials contributed by hundreds of R bloggers

#2: Even Easier Package Registration

Fri, 03/31/2017 - 05:31

Welcome to the second post in the rambling random R recommendation series, or R4 for short.

Two days ago I posted the initial (actual) post. It provided context for why we need package registration entries (tl;dr: because R CMD check now tests for it, and because it is The Right Thing to do; see the documentation referenced in that post). I also showed how generating such a file, src/init.c, was essentially free, as all it took was a single call to a new helper function added to R-devel by Brian Ripley and Kurt Hornik.

Now, to actually use R-devel you obviously need to have it accessible. There are myriad ways to achieve that: just compile it locally as I have done for years, use a Docker image as I showed in that post — or be creative with, e.g., Travis or win-builder, both of which give you access to R-devel if you're clever about it.

But as no good deed goes unpunished, I was of course told off today for showing a Docker example, as Docker was not "easy". I think the formal answer to that is baloney. But let's leave that aside, and promise to discuss setting up Docker at another time.

R is, after all, … just R. So below please find a script you can save as, say, ~/bin/pnrrs.r. Calling it — even with R-release — will generate the same code snippet I showed via Docker. Call it a one-off backport of the new helper function — with a half-life of a few weeks at best, as R 3.4.0 will become the default in just a few weeks. The script will then reduce to just its final line, as the helper will ship with R 3.4.0.

#!/usr/bin/r library(tools) .find_calls_in_package_code <- tools:::.find_calls_in_package_code .read_description <- tools:::.read_description ## all what follows is from R-devel aka R 3.4.0 to be package_ff_call_db <- function(dir) { ## A few packages such as CDM use base::.Call ff_call_names <- c(".C", ".Call", ".Fortran", ".External", "base::.C", "base::.Call", "base::.Fortran", "base::.External") predicate <- function(e) { (length(e) > 1L) && !is.na(match(deparse(e[[1L]]), ff_call_names)) } calls <- .find_calls_in_package_code(dir, predicate = predicate, recursive = TRUE) calls <- unlist(Filter(length, calls)) if(!length(calls)) return(NULL) attr(calls, "dir") <- dir calls } native_routine_registration_db_from_ff_call_db <- function(calls, dir = NULL, character_only = TRUE) { if(!length(calls)) return(NULL) ff_call_names <- c(".C", ".Call", ".Fortran", ".External") ff_call_args <- lapply(ff_call_names, function(e) args(get(e, baseenv()))) names(ff_call_args) <- ff_call_names ff_call_args_names <- lapply(lapply(ff_call_args, function(e) names(formals(e))), setdiff, "...") if(is.null(dir)) dir <- attr(calls, "dir") package <- # drop name as.vector(.read_description(file.path(dir, "DESCRIPTION"))["Package"]) symbols <- character() nrdb <- lapply(calls, function(e) { if (startsWith(deparse(e[[1L]]), "base::")) e[[1L]] <- e[[1L]][3L] ## First figure out whether ff calls had '...'. pos <- which(unlist(Map(identical, lapply(e, as.character), "..."))) ## Then match the call with '...' dropped. ## Note that only .NAME could be given by name or ## positionally (the other ff interface named ## arguments come after '...'). if(length(pos)) e <- e[-pos] ## drop calls with only ... if(length(e) < 2L) return(NULL) cname <- as.character(e[[1L]]) ## The help says ## ## '.NAME' is always matched to the first argument ## supplied (which should not be named). ## ## But some people do (Geneland ...). nm <- names(e); nm[2L] <- ""; names(e) <- nm e <- match.call(ff_call_args[[cname]], e) ## Only keep ff calls where .NAME is character ## or (optionally) a name. s <- e[[".NAME"]] if(is.name(s)) { s <- deparse(s)[1L] if(character_only) { symbols <<- c(symbols, s) return(NULL) } } else if(is.character(s)) { s <- s[1L] } else { ## expressions symbols <<- c(symbols, deparse(s)) return(NULL) } ## Drop the ones where PACKAGE gives a different ## package. Ignore those which are not char strings. if(!is.null(p <- e[["PACKAGE"]]) && is.character(p) && !identical(p, package)) return(NULL) n <- if(length(pos)) { ## Cannot determine the number of args: use ## -1 which might be ok for .External(). -1L } else { sum(is.na(match(names(e), ff_call_args_names[[cname]]))) - 1L } ## Could perhaps also record whether 's' was a symbol ## or a character string ... cbind(cname, s, n) }) nrdb <- do.call(rbind, nrdb) nrdb <- as.data.frame(unique(nrdb), stringsAsFactors = FALSE) if(NROW(nrdb) == 0L || length(nrdb) != 3L) stop("no native symbols were extracted") nrdb[, 3L] <- as.numeric(nrdb[, 3L]) nrdb <- nrdb[order(nrdb[, 1L], nrdb[, 2L], nrdb[, 3L]), ] nms <- nrdb[, "s"] dups <- unique(nms[duplicated(nms)]) ## Now get the namespace info for the package. info <- parseNamespaceFile(basename(dir), dirname(dir)) ## Could have ff calls with symbols imported from other packages: ## try dropping these eventually. imports <- info$imports imports <- imports[lengths(imports) == 2L] imports <- unlist(lapply(imports, `[[`, 2L)) info <- info$nativeRoutines[[package]] ## Adjust native routine names for explicit remapping or ## namespace .fixes. 
if(length(symnames <- info$symbolNames)) { ind <- match(nrdb[, 2L], names(symnames), nomatch = 0L) nrdb[ind > 0L, 2L] <- symnames[ind] } else if(!character_only && any((fixes <- info$registrationFixes) != "")) { ## There are packages which have not used the fixes, e.g. utf8latex ## fixes[1L] is a prefix, fixes[2L] is an undocumented suffix nrdb[, 2L] <- sub(paste0("^", fixes[1L]), "", nrdb[, 2L]) if(nzchar(fixes[2L])) nrdb[, 2L] <- sub(paste0(fixes[2L]), "$", "", nrdb[, 2L]) } ## See above. if(any(ind <- !is.na(match(nrdb[, 2L], imports)))) nrdb <- nrdb[!ind, , drop = FALSE] ## Fortran entry points are mapped to l/case dotF <- nrdb$cname == ".Fortran" nrdb[dotF, "s"] <- tolower(nrdb[dotF, "s"]) attr(nrdb, "package") <- package attr(nrdb, "duplicates") <- dups attr(nrdb, "symbols") <- unique(symbols) nrdb } format_native_routine_registration_db_for_skeleton <- function(nrdb, align = TRUE, include_declarations = FALSE) { if(!length(nrdb)) return(character()) fmt1 <- function(x, n) { c(if(align) { paste(format(sprintf(" {\"%s\",", x[, 1L])), format(sprintf(if(n == "Fortran") "(DL_FUNC) &F77_NAME(%s)," else "(DL_FUNC) &%s,", x[, 1L])), format(sprintf("%d},", x[, 2L]), justify = "right")) } else { sprintf(if(n == "Fortran") " {\"%s\", (DL_FUNC) &F77_NAME(%s), %d}," else " {\"%s\", (DL_FUNC) &%s, %d},", x[, 1L], x[, 1L], x[, 2L]) }, " {NULL, NULL, 0}") } package <- attr(nrdb, "package") dups <- attr(nrdb, "duplicates") symbols <- attr(nrdb, "symbols") nrdb <- split(nrdb[, -1L, drop = FALSE], factor(nrdb[, 1L], levels = c(".C", ".Call", ".Fortran", ".External"))) has <- vapply(nrdb, NROW, 0L) > 0L nms <- names(nrdb) entries <- substring(nms, 2L) blocks <- Map(function(x, n) { c(sprintf("static const R_%sMethodDef %sEntries[] = {", n, n), fmt1(x, n), "};", "") }, nrdb[has], entries[has]) decls <- c( "/* FIXME: ", " Add declarations for the native routines registered below.", "*/") if(include_declarations) { decls <- c( "/* FIXME: ", " Check these declarations against the C/Fortran source code.", "*/", if(NROW(y <- nrdb$.C)) { args <- sapply(y$n, function(n) if(n >= 0) paste(rep("void *", n), collapse=", ") else "/* FIXME */") c("", "/* .C calls */", paste0("extern void ", y$s, "(", args, ");")) }, if(NROW(y <- nrdb$.Call)) { args <- sapply(y$n, function(n) if(n >= 0) paste(rep("SEXP", n), collapse=", ") else "/* FIXME */") c("", "/* .Call calls */", paste0("extern SEXP ", y$s, "(", args, ");")) }, if(NROW(y <- nrdb$.Fortran)) { args <- sapply(y$n, function(n) if(n >= 0) paste(rep("void *", n), collapse=", ") else "/* FIXME */") c("", "/* .Fortran calls */", paste0("extern void F77_NAME(", y$s, ")(", args, ");")) }, if(NROW(y <- nrdb$.External)) c("", "/* .External calls */", paste0("extern SEXP ", y$s, "(SEXP);")) ) } headers <- if(NROW(nrdb$.Call) || NROW(nrdb$.External)) c("#include <R.h>", "#include <Rinternals.h>") else if(NROW(nrdb$.Fortran)) "#include <R_ext/RS.h>" else character() c(headers, "#include <stdlib.h> // for NULL", "#include <R_ext/Rdynload.h>", "", if(length(symbols)) { c("/*", " The following symbols/expresssions for .NAME have been omitted", "", strwrap(symbols, indent = 4, exdent = 4), "", " Most likely possible values need to be added below.", "*/", "") }, if(length(dups)) { c("/*", " The following name(s) appear with different usages", " e.g., with different numbers of arguments:", "", strwrap(dups, indent = 4, exdent = 4), "", " This needs to be resolved in the tables and any declarations.", "*/", "") }, decls, "", unlist(blocks, use.names = FALSE), ## We cannot use names 
with '.' in: WRE mentions replacing with "_" sprintf("void R_init_%s(DllInfo *dll)", gsub(".", "_", package, fixed = TRUE)), "{", sprintf(" R_registerRoutines(dll, %s);", paste0(ifelse(has, paste0(entries, "Entries"), "NULL"), collapse = ", ")), " R_useDynamicSymbols(dll, FALSE);", "}") } package_native_routine_registration_db <- function(dir, character_only = TRUE) { calls <- package_ff_call_db(dir) native_routine_registration_db_from_ff_call_db(calls, dir, character_only) } package_native_routine_registration_db <- function(dir, character_only = TRUE) { calls <- package_ff_call_db(dir) native_routine_registration_db_from_ff_call_db(calls, dir, character_only) } package_native_routine_registration_skeleton <- function(dir, con = stdout(), align = TRUE, character_only = TRUE, include_declarations = TRUE) { nrdb <- package_native_routine_registration_db(dir, character_only) writeLines(format_native_routine_registration_db_for_skeleton(nrdb, align, include_declarations), con) } package_native_routine_registration_skeleton(".") ## when R 3.4.0 is out you only need this line

Here I use /usr/bin/r as I happen to like littler a lot, but you can use Rscript the same way.
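For completeness, here is what the workflow reduces to once R 3.4.0 is your default, since the helper then ships with the tools package. A minimal sketch (the package name in the NAMESPACE directive is just a placeholder):

## With R >= 3.4.0, run from the top-level directory of the package source
tools::package_native_routine_registration_skeleton(".")
## Copy the generated skeleton into src/init.c (reviewing its FIXME comments),
## and make sure NAMESPACE contains the registration directive, e.g.
##   useDynLib(mypackage, .registration = TRUE)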

Easy enough now?

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

Take your data frames to the next level.

Fri, 03/31/2017 - 04:27

(This article was first published on R – Real Data, and kindly contributed to R-bloggers)

 

While finishing up R rockstar Hadley Wickham's book (free online: R for Data Science), I came across something pretty cool in the section on model building that I had no idea about – list columns.

Most of us have probably seen the following data frame column format:

df <- data.frame("col_uno" = c(1,2,3),"col_dos" = c('a','b','c'), "col_tres" = factor(c("google", "apple", "amazon")))

And the output:

df
##   col_uno col_dos col_tres
## 1       1       a   google
## 2       2       b    apple
## 3       3       c   amazon

This is an awesome way to organize data and one of R’s strong points. However, we can use list functionality to go deeper. Check this out:

library(tidyverse)
library(datasets)

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

nested <- iris %>%
  group_by(Species) %>%
  nest()
## # A tibble: 3 × 2
##      Species              data
##       <fctr>            <list>
## 1     setosa <tibble [50 × 4]>
## 2 versicolor <tibble [50 × 4]>
## 3  virginica <tibble [50 × 4]>

Using nest we can compartmentalize our data frame for readability and more efficient iteration. Here we can use map from the purrr package to compute the mean of each column in our nested data.

means <- map(nested$data, colMeans)
## [[1]]
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##        5.006        3.428        1.462        0.246
##
## [[2]]
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##        5.936        2.770        4.260        1.326
##
## [[3]]
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##        6.588        2.974        5.552        2.026
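Because a list column can hold arbitrary R objects, the same pattern extends to the model-building use case from Hadley's book, for example storing one fitted model per group. A minimal sketch (the model formula and column names are purely illustrative):

library(tidyverse)  # loaded above; repeated so the snippet stands alone

models <- iris %>%
  group_by(Species) %>%
  nest() %>%
  mutate(fit  = map(data, ~ lm(Sepal.Length ~ Petal.Length, data = .x)),
         r_sq = map_dbl(fit, ~ summary(.x)$r.squared))

models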

Once you're done messing around with data-ception, use unnest to revert your data to its original state.

head(unnest(nested))
## # A tibble: 6 × 5
##   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
##    <fctr>        <dbl>       <dbl>        <dbl>       <dbl>
## 1  setosa          5.1         3.5          1.4         0.2
## 2  setosa          4.9         3.0          1.4         0.2
## 3  setosa          4.7         3.2          1.3         0.2
## 4  setosa          4.6         3.1          1.5         0.2
## 5  setosa          5.0         3.6          1.4         0.2
## 6  setosa          5.4         3.9          1.7         0.4

I was pretty excited to learn about this property of data.frames and will definitely make use of it in the future. If you have any neat examples of nested dataset usage, please feel free to share in the comments.  As always, I’m happy to answer questions or talk data!

Kiefer Smith


Who is old? Visualizing the concept of prospective ageing with animated population pyramids

Fri, 03/31/2017 - 02:00

(This article was first published on Ilya Kashnitsky, and kindly contributed to R-bloggers)

This post is about illustrating the concept of prospective ageing, a relatively fresh approach in demography to refine our understanding of population ageing. This visualization was created in collaboration with my colleague Michael Boissonneault: (mostly) his idea and (mostly) my implementation. The animated visualization builds upon Michael's viz prepared for submission to the highly anticipated event at the end of June 2017 – Rostock Retreat Visualization. My visualization of the provided Swedish dataset can be found in the previous post.

Prospective ageing

Over the past decades, alarmist views of an upcoming population ageing disaster have become widespread. True, with a growing number of countries approaching the end of the Demographic Transition, the average/median age of their populations is increasing rapidly, which is unprecedented in documented human history. But does that imply an unbearable burden of elderly population in the near future? Not necessarily.

The demographic prospects depend a lot on how we define ageing. Quite recently, Warren Sanderson and Sergei Scherbov proposed 1 2 a new way to look at population ageing, which they called Prospective Ageing. The underlying idea is really simple – age is not static: a person aged 65 (the conventional border delimiting the elderly population) today is in many aspects not the same as a person aged 65 half a century ago. Health and lifespans have improved a lot in recent decades, meaning that today people generally have many more remaining years of life at the moment they are recognized as elderly by the conventional standard. Thus, Sanderson and Scherbov proposed to define the elderly population based on the expected remaining length of life rather than on years lived. Such a refined view of population ageing undermines the alarmist claims of an approaching demographic collapse. The would-be paradoxical title of one of the latest papers by Sanderson and Scherbov 3 summarizes the phenomenon nicely: Faster Increases in Human Life Expectancy Could Lead to Slower Population Aging.

Of course, the choice of the new ageing threshold is rather arbitrary 4. It has become usual to set this threshold at a remaining life expectancy of 15 years.
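To make that threshold concrete, here is a minimal sketch of how it can be computed from a life table with an ex (remaining life expectancy) column, assuming the Swedish dataset df_swe loaded in the Data preparation section below: for each year and sex, the prospective old-age threshold is the youngest age at which ex drops to 15 years or below.

library(tidyverse)

# prospective old-age threshold by year and sex (assumes df_swe as loaded below)
df_swe %>%
  filter(ex <= 15) %>%
  group_by(Year, Sex) %>%
  summarise(threshold_age = min(Age)) %>%
  ungroup()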

Population pyramids for Sweden

The population pyramid is a simple and nice way to represent population composition and to compare changes in the age structure of a given population over time. We show the difference between the conventional and prospective approaches to defining the elderly population using Swedish data for the last half-century. Sweden is a natural choice for demographers aiming to play with rich and reliable data.

The data used for this visualization come from the Human Mortality Database. They can easily be accessed from an R session using the HMDHFDplus package by Tim Riffe (for examples see my previous posts – one and two). For this exercise, I will use the dataset for Sweden that was provided as an application task for Rostock Retreat Visualization 5.

Data preparation

# load packages
library(tidyverse)
library(extrafont)
myfont <- "Ubuntu Mono"

# download data
df_swe <- read_csv("http://www.rostock-retreat.org/files/application2017/SWE.csv")
# copy at https://ikashnitsky.github.io/doc/misc/application-rostock-retreat/SWE.csv

# define the selection of years to visualize
years <- c(seq(1965, 2010, 5), 2014)

df <- df_swe %>%
  select(Year, Sex, Age, Exposure, ex) %>%
  filter(Year %in% years) %>%
  mutate(old_c = Age >= 65, old_p = ex <= 15) %>%
  gather("type", "old", contains("old")) %>%
  group_by(Year, Sex, type) %>%
  mutate(share = Exposure / sum(Exposure)) %>%
  ungroup() %>%
  mutate(share = ifelse(Sex == 'f', share, -share))

names(df) <- names(df) %>% tolower()

df_old <- df %>%
  filter(old == T) %>%
  group_by(year, sex, type, old) %>%
  summarise(cum_old = sum(share)) %>%
  ungroup()

Visualization

Let’s first have a look at the pyramids in 1965, 1990, and 2014 (the latest available year).

gg_three <- ggplot(df %>% filter(year %in% c(1965, 1990, 2014))) + geom_bar(aes(x = age, y = share, fill = sex, alpha = old), stat = 'identity', width = 1)+ geom_vline(xintercept = 64.5, size = .5, color = 'gold')+ scale_y_continuous(breaks = c(-.01, 0, .01), labels = c(.01, 0, .01), limits = c(-.02, .02), expand = c(0,0))+ facet_grid(year~type) + theme_minimal(base_family = 'Ubuntu Mono') + theme(strip.text = element_blank(), legend.position = 'none', plot.title = element_text(hjust = 0.5, size = 20), plot.caption = element_text(hjust = 0, size = 10)) + coord_flip() + labs(y = NULL, x = 'Age') + geom_text(data = data_frame(type = c('old_c', 'old_p'), label = c('CONVENTIONAL', 'PROSPECTIVE')), aes(label = label), y = 0, x = 50, size = 5, vjust = 1, family = 'Ubuntu Mono') + geom_text(data = df_old %>% filter(year %in% c(1965, 1990, 2014), sex == 'f'), aes(label = year), y = 0, x = 30, vjust = 1, hjust = .5, size = 7, family = 'Ubuntu Mono') + geom_text(data = df_old %>% filter(year %in% c(1965, 1990, 2014), sex == 'f'), aes(label = paste('Elderly\nfemales\n', round(cum_old*100,1), '%')), y = .0125, x = 105, vjust = 1, hjust = .5, size = 4, family = 'Ubuntu Mono') + geom_text(data = df_old %>% filter(year %in% c(1965, 1990, 2014), sex == 'm'), aes(label = paste('Elderly\nmales\n', round(-cum_old*100,1), '%')), y = -.0125, x = 105, vjust = 1, hjust = .5, size = 4, family = 'Ubuntu Mono') #ggsave("figures/three-years.png", gg_three, width = 6, height = 8)

Animated pyramid

To get an animated pyramid, I simply saved all the separate plots and then used GIFCreator, a very convenient free online tool, to make an animated image 6.

note <- 'The population pyramid can be used to compare change in the age structure of a given population over time. In many cases, doing so gives the impression of rapid aging. This is due to the fact that age is represented as a static variable; however, as Sanderson and Scherbov showed repeatedly, age is not static: a person age 65 in 1965 is in many aspects not the same as a person age 65 in 2015. In the right panel, old age is considered to start when the period remaining life expectancy reaches 15 years, thereby providing another look at the change in the age structure of a population. The gold line deliminates the conventional border of old age at 65. Elderly populations are filled with non-transparent colors. Authors: Michael Boissonneault, Ilya Kashnitsky (NIDI)' # I will store the plots in a list plots <- list() for (i in 1:length(years)){ gg <- ggplot(df %>% filter(year == years[[i]])) + geom_bar(aes(x = age, y = share, fill = sex, alpha = old), stat = 'identity', width = 1)+ geom_vline(xintercept = 64.5, size = .5, color = 'gold')+ scale_y_continuous(breaks = c(-.01, 0, .01), labels = c(.01, 0, .01), limits = c(-.02, .02), expand = c(0,0))+ facet_wrap(~type, ncol = 2) + theme_minimal(base_family = 'Ubuntu Mono') + theme(strip.text = element_blank(), legend.position = 'none', plot.title = element_text(hjust = 0.5, size = 20), plot.caption = element_text(hjust = 0, size = 10)) + coord_flip() + labs(title = paste("Sweden", years[i]), caption = paste(strwrap(note, width = 106), collapse = '\n'), y = NULL, x = 'Age') + geom_text(data = data_frame(type = c('old_c', 'old_p'), label = c('CONVENTIONAL', 'PROSPECTIVE')), aes(label = label), y = 0, x = 115, size = 5, vjust = 1, family = 'Ubuntu Mono') + geom_text(data = df_old %>% filter(year == years[[i]], sex == 'f'), aes(label = paste('Elderly\nfemales\n', round(cum_old*100,1), '%')), y = .0125, x = 105, vjust = 1, hjust = .5, size = 4, family = 'Ubuntu Mono') + geom_text(data = df_old %>% filter(year == years[[i]], sex == 'm'), aes(label = paste('Elderly\nmales\n', round(-cum_old*100,1), '%')), y = -.0125, x = 105, vjust = 1, hjust = .5, size = 4, family = 'Ubuntu Mono') plots[[i]] <- gg } # # a loop to save the plots # for (i in 1:length(years)){ # ggsave(paste0('figures/swe-', years[i], '.png'), plots[[i]], # width = 8, height = 5.6) # }
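As an alternative to the online tool (and to the gganimate issue mentioned in the footnotes), the saved PNG frames could also be stitched together locally with the magick package. A sketch, assuming the figures/swe-<year>.png files written by the commented ggsave() loop above exist:

library(magick)

# read all saved frames, animate at one frame per second, write the GIF
frames <- image_read(list.files("figures", pattern = "^swe-.*\\.png$", full.names = TRUE))
animation <- image_animate(frames, fps = 1)
image_write(animation, "figures/swe-pyramids.gif")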

  1. Sanderson, W. C., & Scherbov, S. (2005). Average remaining lifetimes can increase as human populations age. Nature, 435(7043), 811–813. Retrieved from http://www.nature.com/nature/journal/v435/n7043/abs/nature03593.html 

  2. Sanderson, W. C., & Scherbov, S. (2010). Remeasuring Aging. Science, 329(5997), 1287–1288. https://doi.org/10.1126/science.1193647 

  3. Sanderson, W. C., & Scherbov, S. (2015). Faster Increases in Human Life Expectancy Could Lead to Slower Population Aging. PLoS ONE, 10(4), e0121922. http://doi.org/10.1371/journal.pone.0121922 

  4. See the working paper of my colleagues devoted to this question 

  5. By using this data, I agree to the user agreement 

  6. I did try to play with the package gganimate, though it produced a strange output. 


Building meaningful machine learning models for disease prediction

Fri, 03/31/2017 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

Webinar for the ISDS R Group

This document presents the code I used to produce the example analysis and figures shown in my webinar on building meaningful machine learning models for disease prediction.

My webinar slides are available on GitHub.

Description: Dr Shirin Glander will go over her work on building machine-learning models to predict the course of different diseases. She will walk through building a model, evaluating its performance, and answering or addressing different disease-related questions using machine learning. Her talk will cover the theory of machine learning as it is applied using R.

Setup

All analyses are done in R using RStudio. For detailed session information including R version, operating system and package versions, see the sessionInfo() output at the end of this document.

All figures are produced with ggplot2.

The dataset

The dataset I am using in these example analyses is the Breast Cancer Wisconsin (Diagnostic) Dataset. The data was downloaded from the UC Irvine Machine Learning Repository.

The first dataset looks at the predictor classes:

  • malignant or
  • benign breast mass.

The features characterize cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses:

  • Sample ID (code number)
  • Clump thickness
  • Uniformity of cell size
  • Uniformity of cell shape
  • Marginal adhesion
  • Single epithelial cell size
  • Number of bare nuclei
  • Bland chromatin
  • Number of normal nucleoli
  • Mitosis
  • Classes, i.e. diagnosis
bc_data <- read.table("datasets/breast-cancer-wisconsin.data.txt",
                      header = FALSE,
                      sep = ",")

colnames(bc_data) <- c("sample_code_number",
                       "clump_thickness",
                       "uniformity_of_cell_size",
                       "uniformity_of_cell_shape",
                       "marginal_adhesion",
                       "single_epithelial_cell_size",
                       "bare_nuclei",
                       "bland_chromatin",
                       "normal_nucleoli",
                       "mitosis",
                       "classes")

bc_data$classes <- ifelse(bc_data$classes == "2", "benign",
                          ifelse(bc_data$classes == "4", "malignant", NA))

Missing data

bc_data[bc_data == "?"] <- NA

# how many NAs are in the data
length(which(is.na(bc_data)))
## [1] 16

# how many samples would we lose if we removed them?
nrow(bc_data)
## [1] 699
nrow(bc_data[is.na(bc_data), ])
## [1] 16

Missing values are imputed with the mice package.

# impute missing data
library(mice)

bc_data[, 2:10] <- apply(bc_data[, 2:10], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data[, 2:10], print = FALSE)
bc_data <- cbind(bc_data[, 11, drop = FALSE], mice::complete(dataset_impute, 1))

bc_data$classes <- as.factor(bc_data$classes)

# how many benign and malignant cases are there?
summary(bc_data$classes)

Data exploration
  • Response variable for classification
library(ggplot2)

ggplot(bc_data, aes(x = classes, fill = classes)) + geom_bar()

  • Response variable for regression
ggplot(bc_data, aes(x = clump_thickness)) + geom_histogram(bins = 10)

  • Principal Component Analysis
library(pcaGoPromoter) library(ellipse) # perform pca and extract scores pcaOutput <- pca(t(bc_data[, -1]), printDropped = FALSE, scale = TRUE, center = TRUE) pcaOutput2 <- as.data.frame(pcaOutput$scores) # define groups for plotting pcaOutput2$groups <- bc_data$classes centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean) conf.rgn <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t) data.frame(groups = as.character(t), ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]), centre = as.matrix(centroids[centroids$groups == t, 2:3]), level = 0.95), stringsAsFactors = FALSE))) ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) + geom_point(size = 2, alpha = 0.6) + scale_color_brewer(palette = "Set1") + labs(color = "", fill = "", x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"), y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance"))

  • Features
library(tidyr)

gather(bc_data, x, y, clump_thickness:mitosis) %>%
  ggplot(aes(x = y, color = classes, fill = classes)) +
  geom_density(alpha = 0.3) +
  facet_wrap( ~ x, scales = "free", ncol = 3)

Machine Learning packages for R

caret

# configure multicore
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

library(caret)

Training, validation and test data

set.seed(42)
index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
train_data <- bc_data[index, ]
test_data  <- bc_data[-index, ]

library(dplyr)

rbind(data.frame(group = "train", train_data),
      data.frame(group = "test", test_data)) %>%
  gather(x, y, clump_thickness:mitosis) %>%
  ggplot(aes(x = y, color = group, fill = group)) +
  geom_density(alpha = 0.3) +
  facet_wrap( ~ x, scales = "free", ncol = 3)

Regression set.seed(42) model_glm <- caret::train(clump_thickness ~ ., data = train_data, method = "glm", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE, verboseIter = FALSE)) model_glm ## Generalized Linear Model ## ## 490 samples ## 9 predictor ## ## Pre-processing: scaled (9), centered (9) ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 441, 441, 440, 442, 441, 440, ... ## Resampling results: ## ## RMSE Rsquared ## 1.974296 0.5016141 ## ## predictions <- predict(model_glm, test_data) # model_glm$finalModel$linear.predictors == model_glm$finalModel$fitted.values data.frame(residuals = resid(model_glm), predictors = model_glm$finalModel$linear.predictors) %>% ggplot(aes(x = predictors, y = residuals)) + geom_jitter() + geom_smooth(method = "lm")

# y == train_data$clump_thickness data.frame(residuals = resid(model_glm), y = model_glm$finalModel$y) %>% ggplot(aes(x = y, y = residuals)) + geom_jitter() + geom_smooth(method = "lm")

data.frame(actual = test_data$clump_thickness, predicted = predictions) %>% ggplot(aes(x = actual, y = predicted)) + geom_jitter() + geom_smooth(method = "lm")

Classification

Decision trees

rpart

library(rpart)
library(rpart.plot)

set.seed(42)
fit <- rpart(classes ~ .,
             data = train_data,
             method = "class",
             control = rpart.control(xval = 10, minbucket = 2, cp = 0),
             parms = list(split = "information"))

rpart.plot(fit, extra = 100)

Random Forests

Random Forest predictions are based on the generation of multiple classification trees. They can be used for both classification and regression tasks. Here, I show a classification task.

set.seed(42)
model_rf <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10,
                                                  repeats = 10,
                                                  savePredictions = TRUE,
                                                  verboseIter = FALSE))

When you specify savePredictions = TRUE, you can access the cross-validation results with model_rf$pred.
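For example, the held-out predictions can be summarised per resample. A quick sketch (caret stores them with columns such as pred, obs, mtry and Resample; I filter to the finally selected mtry):

library(dplyr)

model_rf$pred %>%
  filter(mtry == model_rf$bestTune$mtry) %>%   # keep only the final tuning setting
  group_by(Resample) %>%
  summarise(fold_accuracy = mean(pred == obs)) %>%
  head()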

model_rf$finalModel$confusion
##           benign malignant class.error
## benign       313         8  0.02492212
## malignant      4       165  0.02366864

  • Feature Importance
imp <- model_rf$finalModel$importance imp[order(imp, decreasing = TRUE), ] ## uniformity_of_cell_size uniformity_of_cell_shape ## 54.416003 41.553022 ## bland_chromatin bare_nuclei ## 29.343027 28.483842 ## normal_nucleoli single_epithelial_cell_size ## 19.239635 18.480155 ## clump_thickness marginal_adhesion ## 13.276702 12.143355 ## mitosis ## 3.081635 # estimate variable importance importance <- varImp(model_rf, scale = TRUE) plot(importance)

  • predicting test data
confusionMatrix(predict(model_rf, test_data), test_data$classes) ## Confusion Matrix and Statistics ## ## Reference ## Prediction benign malignant ## benign 133 2 ## malignant 4 70 ## ## Accuracy : 0.9713 ## 95% CI : (0.9386, 0.9894) ## No Information Rate : 0.6555 ## P-Value [Acc > NIR] : <2e-16 ## ## Kappa : 0.9369 ## Mcnemar's Test P-Value : 0.6831 ## ## Sensitivity : 0.9708 ## Specificity : 0.9722 ## Pos Pred Value : 0.9852 ## Neg Pred Value : 0.9459 ## Prevalence : 0.6555 ## Detection Rate : 0.6364 ## Detection Prevalence : 0.6459 ## Balanced Accuracy : 0.9715 ## ## 'Positive' Class : benign ## results <- data.frame(actual = test_data$classes, predict(model_rf, test_data, type = "prob")) results$prediction <- ifelse(results$benign > 0.5, "benign", ifelse(results$malignant > 0.5, "malignant", NA)) results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE) ggplot(results, aes(x = prediction, fill = correct)) + geom_bar(position = "dodge")

ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) + geom_jitter(size = 3, alpha = 0.6)

Extreme gradient boosting trees

Extreme gradient boosting (XGBoost) is a faster and improved implementation of gradient boosting for supervised learning.

“XGBoost uses a more regularized model formalization to control over-fitting, which gives it better performance.” Tianqi Chen, developer of xgboost

XGBoost is a tree ensemble model, which means that it sums the predictions of a set of classification and regression trees (CART). In that respect, XGBoost is similar to Random Forests, but it uses a different approach to model training. It can be used for classification and regression tasks. Here, I show a classification task.
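The regularisation mentioned in the quote above appears as explicit parameters when xgboost is called directly. For reference, a rough sketch of a comparable direct call, before the caret-based training below (parameter values are illustrative only, not those tuned by caret):

library(xgboost)

# features as a numeric matrix, label coded 0/1 (benign = 0, malignant = 1)
dtrain <- xgb.DMatrix(data.matrix(train_data[, -1]),
                      label = as.numeric(train_data$classes) - 1)

params <- list(objective = "binary:logistic",
               max_depth = 3,   # tree depth
               eta = 0.3,       # learning rate
               gamma = 0,       # minimum loss reduction to split (regularisation)
               lambda = 1)      # L2 regularisation on leaf weights

model_xgb_direct <- xgb.train(params, dtrain, nrounds = 50)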

set.seed(42)
model_xgb <- caret::train(classes ~ .,
                          data = train_data,
                          method = "xgbTree",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv",
                                                   number = 10,
                                                   repeats = 10,
                                                   savePredictions = TRUE,
                                                   verboseIter = FALSE))

  • Feature Importance
importance <- varImp(model_xgb, scale = TRUE) plot(importance)

  • predicting test data
confusionMatrix(predict(model_xgb, test_data), test_data$classes) ## Confusion Matrix and Statistics ## ## Reference ## Prediction benign malignant ## benign 132 2 ## malignant 5 70 ## ## Accuracy : 0.9665 ## 95% CI : (0.9322, 0.9864) ## No Information Rate : 0.6555 ## P-Value [Acc > NIR] : <2e-16 ## ## Kappa : 0.9266 ## Mcnemar's Test P-Value : 0.4497 ## ## Sensitivity : 0.9635 ## Specificity : 0.9722 ## Pos Pred Value : 0.9851 ## Neg Pred Value : 0.9333 ## Prevalence : 0.6555 ## Detection Rate : 0.6316 ## Detection Prevalence : 0.6411 ## Balanced Accuracy : 0.9679 ## ## 'Positive' Class : benign ## results <- data.frame(actual = test_data$classes, predict(model_xgb, test_data, type = "prob")) results$prediction <- ifelse(results$benign > 0.5, "benign", ifelse(results$malignant > 0.5, "malignant", NA)) results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE) ggplot(results, aes(x = prediction, fill = correct)) + geom_bar(position = "dodge")

ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) + geom_jitter(size = 3, alpha = 0.6)

Feature Selection

Performing feature selection on the whole dataset would lead to prediction bias; we therefore need to run the whole modeling process on the training data alone!

  • Correlation

Correlations between all features are calculated and visualised with the corrplot package. I then remove all features with a correlation higher than 0.7, keeping the feature with the lower mean.

library(corrplot)

# calculate correlation matrix
corMatMy <- cor(train_data[, -1])
corrplot(corMatMy, order = "hclust")

#Apply correlation filter at 0.70, highlyCor <- colnames(train_data[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)] ## Compare row 2 and column 3 with corr 0.899 ## Means: 0.696 vs 0.575 so flagging column 2 ## Compare row 3 and column 7 with corr 0.736 ## Means: 0.654 vs 0.55 so flagging column 3 ## All correlations <= 0.7 # which variables are flagged for removal? highlyCor ## [1] "uniformity_of_cell_size" "uniformity_of_cell_shape" #then we remove these variables train_data_cor <- train_data[, which(!colnames(train_data) %in% highlyCor)]

  • Recursive Feature Elimination (RFE)

Another way to choose features is with Recursive Feature Elimination. RFE uses a Random Forest algorithm to test combinations of features and rate each with an accuracy score. The combination with the highest score is usually preferred.

set.seed(7) results_rfe <- rfe(x = train_data[, -1], y = train_data$classes, sizes = c(1:9), rfeControl = rfeControl(functions = rfFuncs, method = "cv", number = 10)) # chosen features predictors(results_rfe) ## [1] "bare_nuclei" "uniformity_of_cell_size" ## [3] "clump_thickness" "uniformity_of_cell_shape" ## [5] "bland_chromatin" "marginal_adhesion" ## [7] "normal_nucleoli" "single_epithelial_cell_size" ## [9] "mitosis" train_data_rfe <- train_data[, c(1, which(colnames(train_data) %in% predictors(results_rfe)))]

  • Genetic Algorithm (GA)

The Genetic Algorithm (GA) has been developed based on evolutionary principles of natural selection: It aims to optimize a population of individuals with a given set of genotypes by modeling selection over time. In each generation (i.e. iteration), each individual’s fitness is calculated based on their genotypes. Then, the fittest individuals are chosen to produce the next generation. This subsequent generation of individuals will have genotypes resulting from (re-) combinations of the parental alleles. These new genotypes will again determine each individual’s fitness. This selection process is iterated for a specified number of generations and (ideally) leads to fixation of the fittest alleles in the gene pool.

This concept of optimization can be applied to non-evolutionary models as well, like feature selection processes in machine learning.

set.seed(27) model_ga <- gafs(x = train_data[, -1], y = train_data$classes, iters = 10, # generations of algorithm popSize = 10, # population size for each generation levels = c("malignant", "benign"), gafsControl = gafsControl(functions = rfGA, # Assess fitness with RF method = "cv", # 10 fold cross validation genParallel = TRUE, # Use parallel programming allowParallel = TRUE)) plot(model_ga) # Plot mean fitness (AUC) by generation

train_data_ga <- train_data[, c(1, which(colnames(train_data) %in% model_ga$ga$final))]

Grid search with caret
  • Automatic Grid
set.seed(42) model_rf_tune_auto <- caret::train(classes ~ ., data = train_data, method = "rf", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE, verboseIter = FALSE, search = "random"), tuneLength = 15) model_rf_tune_auto ## Random Forest ## ## 490 samples ## 9 predictor ## 2 classes: 'benign', 'malignant' ## ## Pre-processing: scaled (9), centered (9) ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 442, 441, 441, 441, 441, 441, ... ## Resampling results across tuning parameters: ## ## mtry Accuracy Kappa ## 1 0.9692153 0.9323624 ## 2 0.9704277 0.9350498 ## 5 0.9645085 0.9216721 ## 6 0.9639087 0.9201998 ## 7 0.9632842 0.9186919 ## 8 0.9626719 0.9172257 ## 9 0.9636801 0.9195036 ## ## Accuracy was used to select the optimal model using the largest value. ## The final value used for the model was mtry = 2. plot(model_rf_tune_auto)

  • Manual Grid

  • mtry: Number of variables randomly sampled as candidates at each split.

set.seed(42) grid <- expand.grid(mtry = c(1:10)) model_rf_tune_man <- caret::train(classes ~ ., data = train_data, method = "rf", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE, verboseIter = FALSE, search = "random"), tuneGrid = grid) model_rf_tune_man ## Random Forest ## ## 490 samples ## 9 predictor ## 2 classes: 'benign', 'malignant' ## ## Pre-processing: scaled (9), centered (9) ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 442, 441, 441, 441, 441, 441, ... ## Resampling results across tuning parameters: ## ## mtry Accuracy Kappa ## 1 0.9696153 0.9332392 ## 2 0.9706440 0.9354737 ## 3 0.9696194 0.9330647 ## 4 0.9661495 0.9253163 ## 5 0.9649252 0.9225586 ## 6 0.9653209 0.9233806 ## 7 0.9634881 0.9192265 ## 8 0.9624718 0.9169227 ## 9 0.9641005 0.9203072 ## 10 0.9628760 0.9176675 ## ## Accuracy was used to select the optimal model using the largest value. ## The final value used for the model was mtry = 2. plot(model_rf_tune_man)

Grid search with h2o

The R package h2o provides a convenient interface to H2O, which is an open-source machine learning and deep learning platform. H2O offers distributed implementations of a wide range of common machine learning algorithms for classification, regression and deep learning.

library(h2o) h2o.init(nthreads = -1) ## ## H2O is not running yet, starting it now... ## ## Note: In case of errors look at the following log files: ## C:\Users\s_glan02\AppData\Local\Temp\RtmpwDqf33/h2o_s_glan02_started_from_r.out ## C:\Users\s_glan02\AppData\Local\Temp\RtmpwDqf33/h2o_s_glan02_started_from_r.err ## ## ## Starting H2O JVM and connecting: . Connection successful! ## ## R is connected to the H2O cluster: ## H2O cluster uptime: 1 seconds 815 milliseconds ## H2O cluster version: 3.10.3.6 ## H2O cluster version age: 1 month and 10 days ## H2O cluster name: H2O_started_from_R_s_glan02_tvy462 ## H2O cluster total nodes: 1 ## H2O cluster total memory: 3.54 GB ## H2O cluster total cores: 8 ## H2O cluster allowed cores: 8 ## H2O cluster healthy: TRUE ## H2O Connection ip: localhost ## H2O Connection port: 54321 ## H2O Connection proxy: NA ## R Version: R version 3.3.3 (2017-03-06) bc_data_hf <- as.h2o(bc_data) ## | | | 0% | |=================================================================| 100% h2o.describe(bc_data_hf) %>% gather(x, y, Zeros:Sigma) %>% mutate(group = ifelse(x %in% c("Min", "Max", "Mean"), "min, mean, max", ifelse(x %in% c("NegInf", "PosInf"), "Inf", "sigma, zeros"))) %>% ggplot(aes(x = Label, y = as.numeric(y), color = x)) + geom_point(size = 4, alpha = 0.6) + scale_color_brewer(palette = "Set1") + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + facet_grid(group ~ ., scales = "free") + labs(x = "Feature", y = "Value", color = "")

library(reshape2) # for melting bc_data_hf[, 1] <- h2o.asfactor(bc_data_hf[, 1]) cor <- h2o.cor(bc_data_hf) rownames(cor) <- colnames(cor) melt(cor) %>% mutate(Var2 = rep(rownames(cor), nrow(cor))) %>% mutate(Var2 = factor(Var2, levels = colnames(cor))) %>% mutate(variable = factor(variable, levels = colnames(cor))) %>% ggplot(aes(x = variable, y = Var2, fill = value)) + geom_tile(width = 0.9, height = 0.9) + scale_fill_gradient2(low = "white", high = "red", name = "Cor.") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) + labs(x = "", y = "")

Training, validation and test data splits <- h2o.splitFrame(bc_data_hf, ratios = c(0.7, 0.15), seed = 1) train <- splits[[1]] valid <- splits[[2]] test <- splits[[3]] response <- "classes" features <- setdiff(colnames(train), response) summary(train$classes, exact_quantiles = TRUE) ## classes ## benign :317 ## malignant:174 summary(valid$classes, exact_quantiles = TRUE) ## classes ## benign :71 ## malignant:35 summary(test$classes, exact_quantiles = TRUE) ## classes ## benign :70 ## malignant:32 pca <- h2o.prcomp(training_frame = train, x = features, validation_frame = valid, transform = "NORMALIZE", impute_missing = TRUE, k = 3, seed = 42) ## | | | 0% | |==================================================== | 80% | |=================================================================| 100% eigenvec <- as.data.frame(pca@model$eigenvectors) eigenvec$label <- features library(ggrepel) ggplot(eigenvec, aes(x = pc1, y = pc2, label = label)) + geom_point(color = "navy", alpha = 0.7) + geom_text_repel()

Classification Random Forest hyper_params <- list( ntrees = c(25, 50, 75, 100), max_depth = c(10, 20, 30), min_rows = c(1, 3, 5) ) search_criteria <- list( strategy = "RandomDiscrete", max_models = 50, max_runtime_secs = 360, stopping_rounds = 5, stopping_metric = "AUC", stopping_tolerance = 0.0005, seed = 42 ) rf_grid <- h2o.grid(algorithm = "randomForest", # h2o.randomForest, # alternatively h2o.gbm # for Gradient boosting trees x = features, y = response, grid_id = "rf_grid", training_frame = train, validation_frame = valid, nfolds = 25, fold_assignment = "Stratified", hyper_params = hyper_params, search_criteria = search_criteria, seed = 42 ) # performance metrics where smaller is better -> order with decreasing = FALSE sort_options_1 <- c("mean_per_class_error", "mse", "err", "logloss") for (sort_by_1 in sort_options_1) { grid <- h2o.getGrid("rf_grid", sort_by = sort_by_1, decreasing = FALSE) model_ids <- grid@model_ids best_model <- h2o.getModel(model_ids[[1]]) h2o.saveModel(best_model, path="models", force = TRUE) } # performance metrics where bigger is better -> order with decreasing = TRUE sort_options_2 <- c("auc", "precision", "accuracy", "recall", "specificity") for (sort_by_2 in sort_options_2) { grid <- h2o.getGrid("rf_grid", sort_by = sort_by_2, decreasing = TRUE) model_ids <- grid@model_ids best_model <- h2o.getModel(model_ids[[1]]) h2o.saveModel(best_model, path = "models", force = TRUE) } files <- list.files(path = "models") rf_models <- files[grep("rf_grid_model", files)] for (model_id in rf_models) { path <- paste0("U:\\Github_blog\\Webinar\\Webinar_ML_for_disease\\models\\", model_id) best_model <- h2o.loadModel(path) mse_auc_test <- data.frame(model_id = model_id, mse = h2o.mse(h2o.performance(best_model, test)), auc = h2o.auc(h2o.performance(best_model, test))) if (model_id == rf_models[[1]]) { mse_auc_test_comb <- mse_auc_test } else { mse_auc_test_comb <- rbind(mse_auc_test_comb, mse_auc_test) } } mse_auc_test_comb %>% gather(x, y, mse:auc) %>% ggplot(aes(x = model_id, y = y, fill = model_id)) + facet_grid(x ~ ., scales = "free") + geom_bar(stat = "identity", alpha = 0.8, position = "dodge") + scale_fill_brewer(palette = "Set1") + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1), plot.margin = unit(c(0.5, 0, 0, 1.5), "cm")) + labs(x = "", y = "value", fill = "")

for (model_id in rf_models) { best_model <- h2o.getModel(model_id) finalRf_predictions <- data.frame(model_id = rep(best_model@model_id, nrow(test)), actual = as.vector(test$classes), as.data.frame(h2o.predict(object = best_model, newdata = test))) finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict, "yes", "no") finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, "benign", ifelse(finalRf_predictions$malignant > 0.8, "malignant", "uncertain")) finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes", ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no")) if (model_id == rf_models[[1]]) { finalRf_predictions_comb <- finalRf_predictions } else { finalRf_predictions_comb <- rbind(finalRf_predictions_comb, finalRf_predictions) } } ## | | | 0% | |=================================================================| 100% ## | | | 0% | |=================================================================| 100% ## | | | 0% | |=================================================================| 100% finalRf_predictions_comb %>% ggplot(aes(x = actual, fill = accurate)) + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Set1") + facet_wrap(~ model_id, ncol = 3) + labs(fill = "Were\npredictions\naccurate?", title = "Default predictions")

finalRf_predictions_comb %>% subset(accurate_stringent != "na") %>% ggplot(aes(x = actual, fill = accurate_stringent)) + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Set1") + facet_wrap(~ model_id, ncol = 3) + labs(fill = "Were\npredictions\naccurate?", title = "Stringent predictions")

rf_model <- h2o.loadModel("models/rf_grid_model_6") h2o.varimp_plot(rf_model)

#h2o.varimp(rf_model) h2o.mean_per_class_error(rf_model, train = TRUE, valid = TRUE, xval = TRUE) ## train valid xval ## 0.024674571 0.007042254 0.023097284 h2o.confusionMatrix(rf_model, valid = TRUE) ## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.293125896751881: ## benign malignant Error Rate ## benign 70 1 0.014085 =1/71 ## malignant 0 35 0.000000 =0/35 ## Totals 70 36 0.009434 =1/106 plot(rf_model, timestep = "number_of_trees", metric = "classification_error")

plot(rf_model, timestep = "number_of_trees", metric = "logloss")

plot(rf_model, timestep = "number_of_trees", metric = "AUC")

plot(rf_model, timestep = "number_of_trees", metric = "rmse")

h2o.auc(rf_model, train = TRUE) ## [1] 0.989521 h2o.auc(rf_model, valid = TRUE) ## [1] 0.9995976 h2o.auc(rf_model, xval = TRUE) ## [1] 0.9890496 perf <- h2o.performance(rf_model, test) perf ## H2OBinomialMetrics: drf ## ## MSE: 0.03673598 ## RMSE: 0.1916663 ## LogLoss: 0.1158835 ## Mean Per-Class Error: 0.0625 ## AUC: 0.990625 ## Gini: 0.98125 ## ## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold: ## benign malignant Error Rate ## benign 70 0 0.000000 =0/70 ## malignant 4 28 0.125000 =4/32 ## Totals 74 28 0.039216 =4/102 ## ## Maximum Metrics: Maximum metrics at their respective thresholds ## metric threshold value idx ## 1 max f1 0.735027 0.933333 25 ## 2 max f2 0.294222 0.952381 37 ## 3 max f0point5 0.735027 0.972222 25 ## 4 max accuracy 0.735027 0.960784 25 ## 5 max precision 1.000000 1.000000 0 ## 6 max recall 0.294222 1.000000 37 ## 7 max specificity 1.000000 1.000000 0 ## 8 max absolute_mcc 0.735027 0.909782 25 ## 9 max min_per_class_accuracy 0.424524 0.937500 31 ## 10 max mean_per_class_accuracy 0.294222 0.942857 37 ## ## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)` plot(perf)

h2o.logloss(perf) ## [1] 0.1158835 h2o.mse(perf) ## [1] 0.03673598 h2o.auc(perf) ## [1] 0.990625 head(h2o.metric(perf)) ## Metrics for Thresholds: Binomial metrics as a function of classification thresholds ## threshold f1 f2 f0point5 accuracy precision recall ## 1 1.000000 0.171429 0.114504 0.340909 0.715686 1.000000 0.093750 ## 2 0.998333 0.222222 0.151515 0.416667 0.725490 1.000000 0.125000 ## 3 0.998000 0.270270 0.187970 0.480769 0.735294 1.000000 0.156250 ## 4 0.997222 0.315789 0.223881 0.535714 0.745098 1.000000 0.187500 ## 5 0.996210 0.358974 0.259259 0.583333 0.754902 1.000000 0.218750 ## 6 0.994048 0.400000 0.294118 0.625000 0.764706 1.000000 0.250000 ## specificity absolute_mcc min_per_class_accuracy mean_per_class_accuracy ## 1 1.000000 0.257464 0.093750 0.546875 ## 2 1.000000 0.298807 0.125000 0.562500 ## 3 1.000000 0.335794 0.156250 0.578125 ## 4 1.000000 0.369755 0.187500 0.593750 ## 5 1.000000 0.401478 0.218750 0.609375 ## 6 1.000000 0.431474 0.250000 0.625000 ## tns fns fps tps tnr fnr fpr tpr idx ## 1 70 29 0 3 1.000000 0.906250 0.000000 0.093750 0 ## 2 70 28 0 4 1.000000 0.875000 0.000000 0.125000 1 ## 3 70 27 0 5 1.000000 0.843750 0.000000 0.156250 2 ## 4 70 26 0 6 1.000000 0.812500 0.000000 0.187500 3 ## 5 70 25 0 7 1.000000 0.781250 0.000000 0.218750 4 ## 6 70 24 0 8 1.000000 0.750000 0.000000 0.250000 5 finalRf_predictions <- data.frame(actual = as.vector(test$classes), as.data.frame(h2o.predict(object = rf_model, newdata = test))) ## | | | 0% | |=================================================================| 100% finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict, "yes", "no") finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, "benign", ifelse(finalRf_predictions$malignant > 0.8, "malignant", "uncertain")) finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes", ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no")) finalRf_predictions %>% group_by(actual, predict) %>% dplyr::summarise(n = n()) ## Source: local data frame [3 x 3] ## Groups: actual [?] ## ## actual predict n ## <fctr> <fctr> <int> ## 1 benign benign 62 ## 2 benign malignant 8 ## 3 malignant malignant 32 finalRf_predictions %>% group_by(actual, predict_stringent) %>% dplyr::summarise(n = n()) ## Source: local data frame [4 x 3] ## Groups: actual [?] ## ## actual predict_stringent n ## <fctr> <chr> <int> ## 1 benign benign 61 ## 2 benign uncertain 9 ## 3 malignant malignant 26 ## 4 malignant uncertain 6 finalRf_predictions %>% ggplot(aes(x = actual, fill = accurate)) + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Set1") + labs(fill = "Were\npredictions\naccurate?", title = "Default predictions")

finalRf_predictions %>% subset(accurate_stringent != "na") %>% ggplot(aes(x = actual, fill = accurate_stringent)) + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Set1") + labs(fill = "Were\npredictions\naccurate?", title = "Stringent predictions")

df <- finalRf_predictions[, c(1, 3, 4)] thresholds <- seq(from = 0, to = 1, by = 0.1) prop_table <- data.frame(threshold = thresholds, prop_true_b = NA, prop_true_m = NA) for (threshold in thresholds) { pred <- ifelse(df$benign > threshold, "benign", "malignant") pred_t <- ifelse(pred == df$actual, TRUE, FALSE) group <- data.frame(df, "pred" = pred_t) %>% group_by(actual, pred) %>% dplyr::summarise(n = n()) group_b <- filter(group, actual == "benign") prop_b <- sum(filter(group_b, pred == TRUE)$n) / sum(group_b$n) prop_table[prop_table$threshold == threshold, "prop_true_b"] <- prop_b group_m <- filter(group, actual == "malignant") prop_m <- sum(filter(group_m, pred == TRUE)$n) / sum(group_m$n) prop_table[prop_table$threshold == threshold, "prop_true_m"] <- prop_m } prop_table %>% gather(x, y, prop_true_b:prop_true_m) %>% ggplot(aes(x = threshold, y = y, color = x)) + geom_point() + geom_line() + scale_color_brewer(palette = "Set1") + labs(y = "proportion of true predictions", color = "b: benign cases\nm: malignant cases")

h2o.shutdown()

If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.

sessionInfo() ## R version 3.3.3 (2017-03-06) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## Running under: Windows 7 x64 (build 7601) Service Pack 1 ## ## locale: ## [1] LC_COLLATE=English_United States.1252 ## [2] LC_CTYPE=English_United States.1252 ## [3] LC_MONETARY=English_United States.1252 ## [4] LC_NUMERIC=C ## [5] LC_TIME=English_United States.1252 ## ## attached base packages: ## [1] stats4 parallel stats graphics grDevices utils datasets ## [8] methods base ## ## other attached packages: ## [1] ggrepel_0.6.5 reshape2_1.4.2 h2o_3.10.3.6 ## [4] corrplot_0.77 plyr_1.8.4 xgboost_0.6-4 ## [7] randomForest_4.6-12 dplyr_0.5.0 caret_6.0-73 ## [10] lattice_0.20-35 doParallel_1.0.10 iterators_1.0.8 ## [13] foreach_1.4.3 tidyr_0.6.1 pcaGoPromoter_1.18.0 ## [16] Biostrings_2.42.1 XVector_0.14.0 IRanges_2.8.1 ## [19] S4Vectors_0.12.1 BiocGenerics_0.20.0 ellipse_0.3-8 ## [22] ggplot2_2.2.1.9000 ## ## loaded via a namespace (and not attached): ## [1] Rcpp_0.12.10 class_7.3-14 assertthat_0.1 ## [4] rprojroot_1.2 digest_0.6.12 R6_2.2.0 ## [7] backports_1.0.5 MatrixModels_0.4-1 RSQLite_1.1-2 ## [10] evaluate_0.10 e1071_1.6-8 zlibbioc_1.20.0 ## [13] lazyeval_0.2.0 minqa_1.2.4 data.table_1.10.4 ## [16] SparseM_1.76 car_2.1-4 nloptr_1.0.4 ## [19] Matrix_1.2-8 rmarkdown_1.4 labeling_0.3 ## [22] splines_3.3.3 lme4_1.1-12 stringr_1.2.0 ## [25] RCurl_1.95-4.8 munsell_0.4.3 mgcv_1.8-17 ## [28] htmltools_0.3.5 nnet_7.3-12 tibble_1.2 ## [31] codetools_0.2-15 MASS_7.3-45 bitops_1.0-6 ## [34] ModelMetrics_1.1.0 grid_3.3.3 nlme_3.1-131 ## [37] jsonlite_1.3 gtable_0.2.0 DBI_0.6 ## [40] magrittr_1.5 scales_0.4.1 stringi_1.1.3 ## [43] RColorBrewer_1.1-2 tools_3.3.3 Biobase_2.34.0 ## [46] pbkrtest_0.4-7 yaml_2.1.14 AnnotationDbi_1.36.2 ## [49] colorspace_1.3-2 memoise_1.0.0 knitr_1.15.1 ## [52] quantreg_5.29


Customising Shiny Server HTML Pages

Fri, 03/31/2017 - 01:16

(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)

Mark Sellors
Head of Data Engineering

At Mango we work with a great many clients using the Shiny framework for R. Many of those use Shiny Server or Shiny Server Pro to publish their Shiny apps within their organisations. Shiny Server Pro in particular is a great product, but some of the stock HTML pages are a little plain, so I asked the good folks at RStudio if it was possible to customise them to match corporate themes and so on. It turns out that it's a documented feature that has been available for around three years now!

The stock Shiny Server Pro login screen

If you want to try this yourself, check out the Shiny Server Admin Guide (http://docs.rstudio.com/shiny-server/#custom-templates), but it's pretty simple to do. My main point of interest in this is to customise the Shiny Server Pro login screen, but you can customise several other pages too. Though it's worth noting that if you create a custom 404, it will not be available from within an application, only in places where it would be rendered by Shiny Server rather than by Shiny itself.

To get a feel for how this works, I asked Mango’s resident web dev, Ben, to rustle me up a quick login page for a fake company and I then set about customising that to fit the required format. The finished article can be seen below and we hope you’ll agree, it’s a marked improvement on the stock one. (Eagle eyed readers and sci-fi/film-buffs will hopefully recognise the OCP logo!)

Our new, customised Shiny Server Pro login screen

This new login screen works exactly like the original and is configured for a single app only on the server, rather than all apps. We do have an additional “Forgot” button, that could be used to direct users to a support portal or to mail an administrator or similar.

Customisations are very simple, and use the reasonably common handlebars/mustache format. A snippet of our custom page is below. Values placed within handlebars, like `{{value}}`, are actually variables that Shiny Server will replace with the correct info when it renders the page. To keep things simple, I stripped out some stuff from the original template that we didn't need, like some logic that is only used if the server is configured with different authentication mechanisms.


<form action="{{url}}" method="POST">
<input id="successUrl" name="success_url" type="hidden" value="{{success_url}}" />
<input id="username" name="username" required="required" type="text" value="{{username}}" />
<input id="password" name="password" required="required" type="password" value="Password" />
<button>
<i class="fa fa-question-circle"></i> Forgot
</button>
<button type="submit">
<i class="fa fa-sign-in"></i> Login
</button>
</form>

Hopefully this post has inspired you to check out this feature. It’s an excellent way to provide custom corporate branding, especially as it can be applied on an app-by-app basis. It’s worth knowing that this feature is not yet available in RStudio Connect, but hopefully it will arrive in a future update. If you do create any customisations of your own be sure to let us know! You’ll find us on twitter @MangoTheCat.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions » R Blog.

Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger

Thu, 03/30/2017 - 23:25

(This article was first published on bnosac :: open analytical helpers, and kindly contributed to R-bloggers)

Parts of Speech (POS) tagging is a crucial part of natural language processing. It consists of labelling each word in a text document with a certain category like noun, verb, adverb, pronoun, … . At BNOSAC, we use it on a daily basis in order to select only nouns before we do topic detection or in specific NLP flows. For R users working with different languages, the number of POS tagging options is small, and all have upsides and downsides. The following taggers are commonly used.

  • The Stanford Part-Of-Speech Tagger, which is terribly slow; the language set is limited to English/French/German/Spanish/Arabic/Chinese (no Dutch). R packages for this are available at http://datacube.wu.ac.at.
  • Treetagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger) contains more languages but is only usable for non-commercial purposes (it can be used via the koRpus R package).
  • OpenNLP is faster and allows POS tagging for Dutch, Spanish, Polish, Swedish, English, Danish and German, but no French or Eastern-European languages. R packages for this are available at http://datacube.wu.ac.at.
  • Package pattern.nlp (https://github.com/bnosac/pattern.nlp) allows Parts of Speech tagging and lemmatisation for Dutch, French, English, German, Spanish and Italian, but needs Python installed, which is not always easy to request at IT departments.
  • SyntaxNet and Parsey McParseface (https://github.com/tensorflow/models/tree/master/syntaxnet) have good accuracy for POS tagging but need tensorflow installed, which might be too much installation hassle in a corporate setting, not to mention the computational resources needed.

Enter RDRPOSTagger, which BNOSAC released at https://github.com/bnosac/RDRPOSTagger. It has the following features:

  1. Easily installable in a corporate environment as a simple R package based on rJava
  2. Covering more than 40 languages:
    UniversalPOS annotation for languages: Ancient_Greek, Ancient_Greek-PROIEL, Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Czech-CAC, Czech-CLTT, Danish, Dutch, Dutch-LassySmall, English, English-LinES, Estonian, Finnish, Finnish-FTB, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Kazakh, Latin, Latin-ITTB, Latin-PROIEL, Latvian, Norwegian, Old_Church_Slavonic, Persian, Polish, Portuguese, Portuguese-BR, Romanian, Russian-SynTagRus, Slovenian, Slovenian-SST, Spanish, Spanish-AnCora, Swedish, Swedish-LinES, Tamil, Turkish. Prepend UD_ to the language name if you want to use these models.
    MORPH annotation for languages: Bulgarian, Czech, Dutch, French, German, Portuguese, Spanish, Swedish
    POS annotation for languages: English, French, German, Hindi, Italian, Thai, Vietnamese
  3. Fast tagging as the Single Classification Ripple Down Rules are easy to execute and hence are quick on larger text volumes
  4. Competitive accuracy in comparison to state-of-the-art POS and morphological taggers
  5. Cross-platform running on Windows/Linux/Mac
  6. It allows morphological tagging, POS tagging and universal POS tagging of sentences

The Ripple Down Rules are basic binary classification trees built on top of the Universal Dependencies datasets available at http://universaldependencies.org. The methodology is explained in detail in the paper ‘A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging’, available at http://content.iospress.com/articles/ai-communications/aic698. If you just want to apply POS tagging on your text, you can go ahead as follows:

library(RDRPOSTagger)
rdr_available_models()

## POS annotation
x <- c("Oleg Borisovich Kulik is a Ukrainian-born Russian performance artist")
tagger <- rdr_model(language = "English", annotation = "POS")
rdr_pos(tagger, x = x)

## MORPH/POS annotation
x <- c("Dus godvermehoeren met pus in alle puisten , zei die schele van Van Bukburg .",
       "Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont",
       "  ", "", NA)
tagger <- rdr_model(language = "Dutch", annotation = "MORPH")
rdr_pos(tagger, x = x)

## Universal POS tagging annotation
tagger <- rdr_model(language = "UD_Dutch", annotation = "UniversalPOS")
rdr_pos(tagger, x = x)

## This gives the following output
sentence.id word.id             word word.type
           1       1              Dus       ADV
           1       2   godvermehoeren      VERB
           1       3              met       ADP
           1       4              pus      NOUN
           1       5               in       ADP
           1       6             alle      PRON
           1       7          puisten      NOUN
           1       8                ,     PUNCT
           1       9              zei      VERB
           1      10              die      PRON
           1      11           schele       ADJ
           1      12              van       ADP
           1      13              Van     PROPN
           1      14          Bukburg     PROPN
           1      15                .     PUNCT
           2       1               Er       ADV
           2       2              was       AUX
           2       3             toen     SCONJ
           2       4              dat     SCONJ
           2       5           liedje      NOUN
           2       6              van       ADP
           2       7 tietenkonttieten      VERB
           2       8             kont     PROPN
           2       9           tieten      VERB
           2      10     kontkontkont     PROPN
           2      11                .     PUNCT
           3       0             <NA>      <NA>
           4       0             <NA>      <NA>
           5       0             <NA>      <NA>

The function rdr_pos requires as input a vector of sentences. If you need to transform your text data into sentences, just use tokenize_sentences from the tokenizers package.
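
As a rough sketch of that combination (the sample text is just for illustration, and the tokenizers package is assumed to be installed):

library(tokenizers)
library(RDRPOSTagger)
txt <- "This is the first sentence. And here is a second one."
sentences <- unlist(tokenize_sentences(txt))            # split raw text into sentences
tagger <- rdr_model(language = "English", annotation = "POS")
rdr_pos(tagger, x = sentences)                          # tag each sentence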

Good luck with text mining.
If you need our help for a text mining project, let us know; we’ll be glad to get you started.

To leave a comment for the author, please follow the link and comment on their blog: bnosac :: open analytical helpers.

Live Presentation: Kyle Walker on the tigris Package

Thu, 03/30/2017 - 19:08

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

Kyle Walker

Next Tuesday (April 4) at 9am PT I will be hosting a webinar with Kyle Walker about the tigris package in R. In addition to creating the tigris package, Kyle is also a Professor of Geography at Texas Christian University.

If you are interested in R, Mapping and US Census Geography then I encourage you to attend!

What tigris Does

When I released choroplethr back in 2014, many people loved that it allows you to easily map US States, Counties and ZIP Codes. What many people don’t realize is that:

  1. These maps come from the US Census Bureau
  2. The Census Bureau publishes many more maps than choroplethr ships with

What tigris does is remarkable: rather than relying on someone creating an R package with the US map you want, it allows you to download shapefiles directly from the US Census Bureau and import them into your R session. All without leaving the R console! From there you can render the map with your favorite graphics library.
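
To give a flavour of that workflow, here is a minimal sketch (the tigris function names are real, but the choice of state and the cb = TRUE option for generalized boundaries are purely illustrative):

library(tigris)
tx_counties <- counties(state = "TX", cb = TRUE)  # download county boundaries from the Census Bureau
plot(tx_counties)                                 # render with base graphics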

Kyle will explain what shapefiles the Census Bureau has available, and also give a demonstration of how to use tigris. After the presentation, there will be a live Q&A session.

If you are unable to attend live, please register so I can email you a link to the replay.

SAVE MY SEAT

The post Live Presentation: Kyle Walker on the tigris Package appeared first on AriLamstein.com.

To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.

Learning Scrabble strategy from robots, using R

Thu, 03/30/2017 - 18:42

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

While you might think of Scrabble as that game you play with your grandparents on a rainy Sunday, some people take it very seriously. There's an international competition devoted to Scrabble, and no end of guides and strategies for competitive play. James Curley, a psychology professor at Columbia University, has used an interesting method to collect data about what plays are most effective in Scrabble: by having robots play against each other, thousands of times.

The data were generated with a Visual Basic script that automated two AI players completing a game in Quackle. Quackle emulates the Scrabble board, and provides a number of AI players; the simulation used the "Speedy Player" AI which tends to make tactical scoring moves while missing some longer-term strategic plays (like most reasonably skilled Scrabble players). He recorded the results of 2566 games between two such computer players and provided the resulting plays and boards in an R package. With these data, you can see some interesting statistics on long-term outcomes from competitive Scrabble games, like this map (top-left) of which squares on the board are most used in games (darker means more frequently), and also for just the Q, Z and blank tiles. Scrabble games in general tend to follow the diagonals where the double-word score squares are located, while the high-scoring Q and Z tiles tend to be used on double- and triple-letter squares. The zero-point blank tile, by comparison, is used fairly uniformly across the board.

 

Further analysis of the actual plays during the simulated games reveals some interesting Scrabble statistics:

It's best to play first. Player 1 won 54.6% of games, while Player 2 won 44.9%, a statistically significant difference. (The remaining 0.4% of the games were ties.) 

Some uncommon words are frequently used. The 10 most frequently-played words were QI, QAT, QIN, XI, OX, EUOI, XU, ZO, ZA, and EX; not necessarily words you'd use in casual conversation, but all words very familiar to competitive scrabble players. As we'll see later though, not all are high-scoring plays. Sometimes it's a good idea to get rid of a high-scoring but restrictive letter (QI, the life energy in Chinese philosophy), or simply to shake up a vowel-heavy rack for more opportunities next turn (EUOI, a cry of impassioned rapture in ancient Bacchic revels).

For high-scoring plays, go for the bingo. A Scrabble Bingo, where you lay down all 7 tiles in your rack in one play, comes with a 50-point bonus. The top three highest-scoring plays in the simulation were all bingo plays: REPIQUED (239 points), CZARISTS (230 points), and IDOLIZED. (Remember though, that this is from just a couple of thousand simulated games; there are many many more potentially high-scoring words.)

High-scoring non-bingo plays can be surprisingly long words. It's super-satisfying to lay down a short word like EX with the X making a second word on a triple-letter tile (for a 9x bonus), so I was surprised to see the top 10 highest-scoring non-bingo plays were still fairly long words: XENURINE (144 points), CYANOSES (126 pts) and SNAPWEED (126 points), all using at least 2 tiles already on the board. The shortest word in the top 10 of this list was ZITS.

Some tiles just don't work well together. The pain of getting a Q without a U seems obvious, but it turns out getting two U's is way worse in point-scoring potential. From the simulation, you can estimate the point-scoring potential of any pair of tiles in your rack: lighter is better, and darker is worse.

Managing the scoring potential of the tiles in your rack is a big part of Scrabble strategy, as we saw in another Scrabble analysis using R a few years ago. The lowly zero-point blank is actually worth a lot of potential points, while the highest-scoring tile Q is actually a liability. Here are the findings from that analysis:

  • The blank is worth about 30 points to a good player, mainly by making 50-point "bingo" plays possible.
  • Each S is worth about 10 points to the player who draws it.
  • The Q is a burden to whichever player receives it, effectively serving as a 5-point penalty: it reduces bingo opportunities, since you need either a U or a blank for a chance at a bingo and its 50-point bonus.
  • The J is essentially neutral pointwise.
  • The X and the Z are each worth about 3-5 extra points to the player who receives them. Their difficulty in playing in bingoes is mitigated by their usefulness in other short words.

For more of James Curley's recent Scrabble analysis, including the R code using his scrabblr package, follow the link below.

RPubs: Analyzing Scrabble Games

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Delaporte package: The SPARCmonster is sated

Thu, 03/30/2017 - 18:26

Finally, finally after months of pulling out my hair, the Delaporte project on CRAN passes all of its checks on Solaris SPARC. The last time it did that, it was still using serial C++. Now it uses OpenMP-based parallel Fortran 2003. What a relief! One of these days I should write up what I did and why, but for now, I’ll be glad to put the project down for a month or seventeen!

R-Lab #2: Shiny & maps + earthquakes | Milan, April 11th

Thu, 03/30/2017 - 16:22

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

 

R-Lab #2 – April 11th, Milano Shiny & Maps: a geo representation of earthquakes evolution with R

[The map above, which represents the earthquakes of magnitude > 5 in the last 3 years, was made following this great post by Andrew Collier. The nice colors were easy to choose thanks to the amazing colourpicker addin.]

Hi everybody! We are ready to announce our second R-Lab! Whether you are an R expert, a beginner, or just curious, you are welcome to join us! The event will be on April 11th at Mikamai, Milano, and you can register here: https://www.meetup.com/R-Lab-Milano/events/238824405/

This time the R topic is very intriguing: Shiny and maps! As many of you will know, Shiny is the R framework for interactive visualization, while packages such as ggmap and Leaflet provide several functions for georeferencing and mapping data.

These tools will be very useful for handling the task proposed by our first presenter: EarthCloud, an organisation aimed at building a close link between information technologies and earth sciences related to risk prevention. With their help we will work on a very hot topic: mapping the earthquakes that happened along a specific fault, in order to study the fault's evolution over time and provide qualitative and quantitative insights about it. As always, we will work in teams, hands on R code!

 

Agenda

18:45 : Meeting at Mikamai

19:00 : Intro: Earthquakes data and fault evolution + R tools for mapping and Shiny

19:30 : Get a pizza (for those who want it)

19:45 : Coding together while eating the pizza.

The goal is to build a Shiny app to map the fault evolution over time, enriching the app as much as possible with qualitative and quantitative insights about the earthquakes.

22:30 : Greetings and see you soon!

 

A bit more about the proposer – EarthCloud

EarthCloud is an association that aims at creating a strong link between new information technologies (distributed systems, Cloud, serverless, etc.) and earth sciences related to risk prevention (seismic, geological and environmental), in order to increase population protection.

 

Who is this event for

Whether you are an R expert, a basic user or just curious, you are welcome! The R-Lab is a space for learning and sharing knowledge, working together on a challenging goal. If you are a geologist or an expert in earth science, you are welcome as well! We will appreciate your contribution on the meaning of what we are doing.

 

What to bring

Be sure to bring your own computer, possibly with the latest version of RStudio.

 

Where to go

We will be hosted by Mikamai and LinkMe at their location in Via Giulio e Corrado Venini, 42 (very close to the Pasteur metro station). The doorbell is Mikamai: when you enter, go straight, cross the inner courtyard of the building, and you will face a metal door with a sign reading Mikamai. The office is on the top (and only) floor.

 

For any additional info, please contact us via meetup: https://www.meetup.com/it-IT/R-Lab-Milano/

Looking forward to seeing you there!!!

 

If you still don’t know the R-Lab project… What is an R-Lab?

The R-Labs are evening mini-hackathons where we work together with R on a real problem.

In short, this is what we do:

  • A company/institution/university proposes a real problem, and teaches something about that issue
  • We work together on the solution, possibly having fun
  • Everything we do is released on Github, exactly here

 

Where can I join the group?

You can join the group on the Meetup platform here:

https://www.meetup.com/it-IT/R-Lab-Milano/

 

 

The post R-Lab #2: Shiny & maps + earthquakes | Milan, April 11th appeared first on MilanoR.

To leave a comment for the author, please follow the link and comment on their blog: MilanoR.

R Weekly Bulletin Vol – II

Thu, 03/30/2017 - 13:42

This week’s R bulletin will cover function calls, sorting data frames, creating time series objects, and functions like is.na, na.omit, paste, help, rep, and seq. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. To show files – Ctrl+5
2. To show plots – Ctrl+6
3. To show packages – Ctrl+7

Problem Solving Ideas

Calling a function in an R script

If you want to use a custom-built function that is defined in another R script, you can use the “exists” function along with the “source” function. See the example below:

Example:

if (!exists("daily_price_data", mode = "function")) source("Stock price data.R")

In this case, the expression checks whether a function called “daily_price_data” already exists in the current environment; if it does not, the “Stock price data.R” script, which defines the function, is sourced, making the function available. We can then use the function any number of times in our script by providing the relevant arguments.

Convert dates from Google finance to a time series object

When we download stock price data from Google Finance, the “DATE” column shows a date in the yyyymmdd format. This numeric format is not recognized as a date by R. To convert the dates from Google Finance into a time series object, one can use the ymd function from the lubridate package. The ymd function accepts dates in the form year, month, day. In the case of dates in other formats, the lubridate package has functions like ydm, mdy, myd, dmy, and dym, which can be used to convert them.

Example:

library(lubridate)
dt = ymd(20160523)
print(dt)

[1] “2016-05-23”
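
The other lubridate parsers work the same way; for example, a date written in day-month-year order:

library(lubridate)
dmy("23-05-2016")

[1] "2016-05-23"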

Sorting a data frame in an ascending or descending order

The arrange function from the dplyr package can be used to sort a data frame. The first argument is the data.frame and the next argument is the variable to sort by, either in an ascending or in a descending order.

In the example below, we create a two-column data frame comprising stock symbols and their respective percentage price changes. We then sort the Percent_Change column, first in ascending order and then in descending order.

Example:

library(dplyr)
# Create a dataframe
Ticker = c("UNITECH", "RCOM", "VEDL", "CANBK")
Percent_Change = c(2.3, -0.25, 0.5, 1.24)
df = data.frame(Ticker, Percent_Change)
print(df)

   Ticker Percent_Change
1 UNITECH           2.30
2    RCOM          -0.25
3    VEDL           0.50
4   CANBK           1.24

# Sort in an ascending order
df_ascending = arrange(df, Percent_Change)
print(df_ascending)

   Ticker Percent_Change
1    RCOM          -0.25
2    VEDL           0.50
3   CANBK           1.24
4 UNITECH           2.30

# Sort in a descending order
df_descending = arrange(df, desc(Percent_Change))
print(df_descending)

   Ticker Percent_Change
1 UNITECH           2.30
2   CANBK           1.24
3    VEDL           0.50
4    RCOM          -0.25

Functions Demystified

paste function

paste is a very useful function in R; it concatenates (joins) the arguments supplied to it. To control the separator placed between the arguments, use the “sep” argument.

Example 1: Combining a string of words and a function using paste

x = c(20:45)
paste("Mean of x is", mean(x), sep = " ")

[1] “Mean of x is 32.5”

Example 2: Creating a filename using the dirPath, symbol, and the file extension name as the arguments to the paste function.

dirPath = "C:/Users/MyFolder/"
symbol = "INFY"
filename = paste(dirPath, symbol, ".csv", sep = "")
print(filename)

[1] “C:/Users/MyFolder/INFY.csv”

is.na and na.omit function

The is.na function checks whether there are any NA values in the given data set, whereas the na.omit function removes all the NA values from the given data set.

Example: Consider a data frame comprising open and close prices for a stock on each date.

date = c(20160501, 20160502, 20160503, 20160504)
open = c(234, NA, 236.85, 237.45)
close = c(236, 237, NA, 238)
df = data.frame(date, open, close)
print(df)

      date   open close
1 20160501 234.00   236
2 20160502     NA   237
3 20160503 236.85    NA
4 20160504 237.45   238

Let us check whether the data frame has any NA values using the is.na function.

is.na(df)

      date  open close
[1,] FALSE FALSE FALSE
[2,] FALSE  TRUE FALSE
[3,] FALSE FALSE  TRUE
[4,] FALSE FALSE FALSE

As you can see from the result, it has two NA values. Let us now use the na.omit function, and view the results.

na.omit(df)

      date   open close
1 20160501 234.00   236
4 20160504 237.45   238

As can be seen from the result, the rows containing NA values were omitted, and the resultant data frame now comprises non-NA values only.

These functions can be used to check for any NA values in large data sets on which we wish to apply some computations. The presence of NA values can cause the computations to give unwanted results, and hence such NA values need to be either removed or replaced by relevant values.

rep and seq function

The rep function repeats its arguments a specified number of times, while the seq function is used to generate a sequence of numbers. Note that in the seq function the arguments are separated by commas (e.g. seq(1, 5)), unlike the colon operator (1:5).

Example 1:

rep("Strategy", times = 3)

[1] “Strategy” “Strategy” “Strategy”

rep(1:3, 2)

[1] 1 2 3 1 2 3

Example 2:

seq(1, 5)

[1] 1 2 3 4 5

seq(1, 5, 0.5)

[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

help and example function

The help function provides the documentation for a given topic, while the example function runs the examples associated with that topic.

help(sum)
example(sum)

To access the R help files associated with a specific function in a particular package, pass the function name as the first argument to the help function and the package name via the “package” argument.

Example:

help(barplot, package="graphics")

Alternatively one can also type a question mark followed by the function name (e.g. ?barplot) and execute the command to know more about the function.

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

The post R Weekly Bulletin Vol – II appeared first on .

Are you fluent in R?

Thu, 03/30/2017 - 10:00

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

A few weeks ago, I wrote an article saying that you should master R. The basic argument is that if you want to actually work as a data scientist, you need to know the essential tools backwards and forwards.

In response, a reader left a comment. I have to say that it’s unfortunate to read something like this, but sadly it’s very common.

Essentially, he wrote that he took an online course, but still can’t write code:

I was able to take a data analysis course on edX with no problem, following all the instruction and guides they provide, acing the course, but as you noted astutely, not learning or remembering much of it, because I had mostly cut and pasted the code. I tried to do some elementary analysis recently, after almost a year later, and was not able to even do that”

Does this sound like you?

If you’re an aspiring data scientist, and you’re taking online courses and reading books, but not getting results, you need to regroup.

How many online courses have you taken?

How many data science books have you bought?

After taking courses and buying books, can you write R code rapidly and from memory?

… or instead, are you doing Google searches to find out how to execute basic data science techniques in R?

Which one are you? Are you fluent in R, or are you constantly struggling to remember the essentials?

If you’re still struggling to fluently write code, keep reading.

Are you fluent in R?

fluent
– Able to speak a language accurately, rapidly, and confidently – in a flowing way.

Usage notes
In casual use, “fluency” refers to language proficiency broadly, while in narrow use it refers to speaking a language flowingly, rather than haltingly.

– Wiktionary

I find it a little odd that the concept of “fluency” is used so rarely among programmers and technologists. Honestly, this idea of “fluency” almost perfectly encapsulates the skill level you need in order to achieve your goals.

Read the definition. To be fluent, is to be able to execute proficiently, accurately, rapidly, and confidently.

Is that how you write R data science code?

Do you write code proficiently, accurately, rapidly, and confidently? Do you write your code from memory?



Or do you write your code slowly? Laboriously? Can you remember the syntax at all? Are you constantly looking things up?

Be honest.

You need to be honest with yourself, because if you have weaknesses as a data scientist (or data science candidate), the only way you can correct those weaknesses is by being honest about where you need to improve.

The reality is that many data science students are not fluent in the essential techniques.

To be clear, when I say “essential techniques” I’m not talking about advanced techniques, like machine learning, deep learning, etc. If you’re a beginner or intermediate data science student, it’s normal to still struggle with machine learning.

No. I’m talking about the essential, foundational techniques, like data manipulation, data visualization, and data analysis.

Most data science students (and even some practitioners) can’t do these things fluently.

If that sounds like you, you need to rethink your learning strategy. If you can’t fluently execute essential techniques – like visualization, manipulation, and analysis – then you need to revisit those things and focus on them.

To get a data science job, to keep a data science job, and to excel in a data science job, you need to master the foundational data science techniques.

Getting a job as a data scientist without fluency in R (or another data language) is like trying to get a job as a writer for a Spanish magazine without having basic fluency in Spanish.

Don’t kid yourself.

Your first milestone: fluency with the essential techniques

Your real first milestone as an aspiring data scientist is achieving basic fluency in writing R code. More specifically, you need to be fluent in writing R code to perform data visualization, data manipulation, and data analysis. These are the foundations, and you need to be able to execute them proficiently, rapidly, from memory.

If you can’t do visualization, manipulation, and analysis rapidly and from memory then you’re probably not ready to do real data science work. Which means, you’re not ready to apply for a data science job.

Your first milestone: fluency in the essentials.

Let’s break that down more. Here are some of the things you should be able to execute “with your eyes closed”:

  1. Data visualization basics
    – Bar charts
    – Line charts
    – Histograms
    – Scatterplots
    – Box plots
    – Small multiples (these are rarely used, but very useful)
  2. Intermediate visualization
    – Manipulating colors
    – Manipulating size (i.e., bubble charts)
    – Dealing with data visualization problems (e.g., overplotting)
    – Formatting plots
  3. Data Manipulation:
    – How to read in a dataset (from a file, or inline)
    – How to add variables to a dataset
    – How to remove variables from a dataset
    – How to aggregate data
    – …

… I could go on.

This is a brief (and incomplete) list of things that you should be able to execute without even thinking about it. These are basic, essential tools. If you can’t do them fluently, you need to refocus your efforts until you can.
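
To make that concrete, here are a couple of the kinds of one-liners (using the built-in mtcars data) you should be able to type from memory; they are only meant to illustrate the level of fluency in question:

library(ggplot2)
library(dplyr)
ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()          # a basic bar chart
mtcars %>% group_by(cyl) %>% summarise(mpg = mean(mpg))    # aggregate a dataset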

How to become fluent in R

This sounds great in theory. “Fluent in R!” It sounds good.

But how do you accomplish this?

I’ve said it before, and I’ll repeat:

Practice.

To become fluent in data science in R, you need to practice the essential techniques. You need to drill these essential techniques until they become second nature.

It’s not enough to just cut-and-paste some code one time and assume that you’ve learned it.

And it’s not enough to watch a few videos. To be clear: videos are an excellent way to learn something new the very first time. Good video lectures can teach you how something works and offer explanations. They are good for giving you basic understanding.

However, learning a technique from a lecture video is not the same thing as practiced, internalized skill. Lots of people watch a video or a lecture and say “yep, that makes sense.” Great.

But they don’t actually practice the technique, so they never internalize it. Actually, what happens is that they “learn” the technique from the video, but forget the technique soon after. They forget because they fail to practice.

Example: learning R is like learning a foreign language

As I’ve already suggested, learning R is much like learning a foreign language, like Spanish.

Let’s use that as an example. Let’s say you’re learning Spanish.

One day, in a lecture, you learn a little piece of grammar. As you learn that piece of grammar in the lecture, you understand it. Because you “learned” it, you’ll be able to use that grammatical construct by simply repeating it. You’ll also likely be able to use it for a few minutes or hours after class (although, you’re likely to struggle a little bit).

Next, you leave the classroom and don’t practice that grammatical construct.

A week later, do you think you’d still be able to use it? Would you remember it? How “fluent” will you be with that piece of grammar?

Here’s my bet: if you don’t practice that piece of grammar, you will forget it.

Foreign language vocabulary is the same. To remember a vocabulary word, it’s not enough to learn the word a single time. I bet you’ve had that experience. You learn the Spanish word for “cat,” and you can remember it for a few minutes, but if you don’t practice it, you will forget it.

In foreign language, if you learn grammar and words, and you want to remember them in the long run, you need to practice them. You need to drill. The best way to remember a word, is to learn it, and then practice it repeatedly over time.

Guess what? Learning a programming language is almost exactly the same.

To learn and remember programming language syntax, you need to practice. You need to drill the basic “vocabulary” and syntax of the programming language until you know it without thinking about it.

If you do this … if you learn the basic syntax, and practice that syntax until you can write it fluidly and quickly … you will achieve fluency.

I will repeat: identify the essential techniques and practice them relentlessly.

How long does it take to achieve basic fluency in R?

You might be asking how long this will take.

Actually, it depends on how good you are at learning. Learning itself is a technical skill.

If you don’t know how to practice, this could take years. I know people who started learning R years ago, and they still aren’t fluent. They bought dozens of books, but still can’t write code very well because they never really practiced. Again, it’s like foreign languages. I know people who have been studying Spanish for years and they still can’t have conversations.

This is one of your major risks. It might take you years to achieve basic fluency in R.

Even worse: you might fail altogether.

The problem here is that most online courses will not show you how to practice. They might show you syntax and explain how the language works, but they don’t show you how to practice to achieve mastery.

On the other hand, there is some good news …

If you know how to practice programming languages, you could achieve basic fluency as fast as about 6 weeks.

I won’t go into the details here, but if you know how to “hack your memory” you can learn R very, very quickly. Essentially, you need to know exactly how to practice for maximum gains and efficiency.

If you know how to do this, and you practice diligently every day, it’s possible to master the foundations of R within 6-8 weeks. (In fact, it’s probably possible to do faster, if you really hustle.)

To succeed as a data scientist, become fluent in the essentials

I strongly believe that to succeed as a data scientist, you need fluency. You need a rapid, unconscious mastery of the essential syntax and techniques for data science in R.

And that requires practice.

If you want to be a data scientist, here is my recommendation. Learn and drill the major techniques from the following R packages:

  • ggplot2
  • dplyr
  • tidyr
  • lubridate
  • stringr
  • forcats
  • readr

These give you essential tools for manipulating, cleaning, wrangling, visualizing and analyzing data.

If you can become fluent with these, you’ll have all of the tools that you need to get things done at an entry level.

You’ll be prepared to work on your own projects.

You’ll be prepared for more advanced topics.

And you’ll be well on your way to becoming a top-performer.

Our data science course opens next week

If you’re interested in rapidly mastering data science, then sign up for our list right now.

Next week, we will re-open registration for our flagship course, Starting Data Science.

Starting Data Science will teach you the essentials of R, including ggplot2, dplyr, tidyr, stringr, and lubridate.

It will also give you a practice system that you can use to rapidly master the tools from these packages.

If you sign up for our email list, you’ll get an exclusive invitation to join the course when it opens.

SIGN UP NOW

The post Are you fluent in R? appeared first on SHARP SIGHT LABS.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS.

Most Popular Learners in mlr

Thu, 03/30/2017 - 02:00

(This article was first published on mlr-org, and kindly contributed to R-bloggers)

For the development of mlr, as well as for a “machine learning expert”, it can be handy to know which learners are the most popular.
Not necessarily to see which methods perform best, but to see what is used “out there” in the real world.
Thanks to the nice little package cranlogs from metacran you can at least get a rough estimate, as I will show in the following…

First we need to install the cranlogs package using devtools:

devtools::install_github("metacran/cranlogs")

Now let’s load all the packages we will need:

library(mlr)
library(stringi)
library(cranlogs)
library(data.table)

To obtain a neat table of all available learners in mlr we can call listLearners().
This table also contains a column with the packages needed for each learner, separated by commas.

# obtain used packages for all learners
lrns = as.data.table(listLearners())
all.pkgs = stri_split(lrns$package, fixed = ",")

Note: You might get some warnings here because you likely did not install all packages that mlr suggests – which is totally fine.

Now we can obtain the download counts from the RStudio CRAN mirror for the last month.
We use data.table to easily sum up the download counts of each day.

all.downloads = cran_downloads(packages = unique(unlist(all.pkgs)), when = "last-month")
all.downloads = as.data.table(all.downloads)
monthly.downloads = all.downloads[, list(monthly = sum(count)), by = package]

As some learners need multiple packages we will use the download count of the package with the least downloads.

lrn.downloads = sapply(all.pkgs, function(pkgs) {
  monthly.downloads[package %in% pkgs, min(monthly)]
})

Let’s put these numbers in our table:

lrns$downloads = lrn.downloads
lrns = lrns[order(downloads, decreasing = TRUE), ]
lrns[, .(class, name, package, downloads)]

Here are the first 5 rows of the table:

class | name | package | downloads
surv.coxph | Cox Proportional Hazard Model | survival | 153681
classif.naiveBayes | Naive Bayes | e1071 | 102249
classif.svm | Support Vector Machines (libsvm) | e1071 | 102249
regr.svm | Support Vector Machines (libsvm) | e1071 | 102249
classif.lda | Linear Discriminant Analysis | MASS | 55852

Now let’s get rid of the duplicates introduced by the distinction between the types classif and regr, and we already have our…

nearly final table

lrns.small = lrns[, .SD[1, ], by = .(name, package)]
lrns.small[, .(class, name, package, downloads)]

The top 20 according to the RStudio CRAN mirror:

class | name | package | downloads
surv.coxph | Cox Proportional Hazard Model | survival | 153681
classif.naiveBayes | Naive Bayes | e1071 | 102249
classif.svm | Support Vector Machines (libsvm) | e1071 | 102249
classif.lda | Linear Discriminant Analysis | MASS | 55852
classif.qda | Quadratic Discriminant Analysis | MASS | 55852
classif.randomForest | Random Forest | randomForest | 52094
classif.gausspr | Gaussian Processes | kernlab | 44812
classif.ksvm | Support Vector Machines | kernlab | 44812
classif.lssvm | Least Squares Support Vector Machine | kernlab | 44812
cluster.kkmeans | Kernel K-Means | kernlab | 44812
regr.rvm | Relevance Vector Machine | kernlab | 44812
classif.cvglmnet | GLM with Lasso or Elasticnet Regularization (Cross Validated Lambda) | glmnet | 41179
classif.glmnet | GLM with Lasso or Elasticnet Regularization | glmnet | 41179
surv.cvglmnet | GLM with Regularization (Cross Validated Lambda) | glmnet | 41179
surv.glmnet | GLM with Regularization | glmnet | 41179
classif.cforest | Random forest based on conditional inference trees | party | 36492
classif.ctree | Conditional Inference Trees | party | 36492
regr.cforest | Random Forest Based on Conditional Inference Trees | party | 36492
regr.mob | Model-based Recursive Partitioning Yielding a Tree with Fitted Models Associated with each Terminal Node | party,modeltools | 36492
surv.cforest | Random Forest based on Conditional Inference Trees | party,survival | 36492

As we are just looking for the packages let’s compress the table a bit further and come to our…

final table

lrns[, list(learners = paste(class, collapse = ",")), by = .(package, downloads)]

Here are the first 20 rows of the table:

package | downloads | learners
survival | 153681 | surv.coxph
e1071 | 102249 | classif.naiveBayes, classif.svm, regr.svm
MASS | 55852 | classif.lda, classif.qda
randomForest | 52094 | classif.randomForest, regr.randomForest
kernlab | 44812 | classif.gausspr, classif.ksvm, classif.lssvm, cluster.kkmeans, regr.gausspr, regr.ksvm, regr.rvm
glmnet | 41179 | classif.cvglmnet, classif.glmnet, regr.cvglmnet, regr.glmnet, surv.cvglmnet, surv.glmnet
party | 36492 | classif.cforest, classif.ctree, multilabel.cforest, regr.cforest, regr.ctree
party,modeltools | 36492 | regr.mob
party,survival | 36492 | surv.cforest
fpc | 33664 | cluster.dbscan
rpart | 28609 | classif.rpart, regr.rpart, surv.rpart
RWeka | 20583 | classif.IBk, classif.J48, classif.JRip, classif.OneR, classif.PART, cluster.Cobweb, cluster.EM, cluster.FarthestFirst, cluster.SimpleKMeans, cluster.XMeans, regr.IBk
gbm | 19554 | classif.gbm, regr.gbm, surv.gbm
nnet | 19538 | classif.multinom, classif.nnet, regr.nnet
caret,pls | 18106 | classif.plsdaCaret
pls | 18106 | regr.pcr, regr.plsr
FNN | 16107 | classif.fnn, regr.fnn
earth | 15824 | regr.earth
neuralnet | 15506 | classif.neuralnet
class | 14493 | classif.knn, classif.lvq1

Remarks

This is not really representative of how popular each learner is, as some packages have multiple purposes (e.g. multiple learners).
Furthermore it would be great to have access to the trending list.
Also, the most-starred repositories on GitHub give a better view of what developers are interested in.
Looking at machine learning packages there, we see e.g. xgboost, h2o and tensorflow.

To leave a comment for the author, please follow the link and comment on their blog: mlr-org.

Data Pseudo-Science: Developing a Biorhythm Calculator

Thu, 03/30/2017 - 01:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Data science is a serious occupation. Just like any other science, however, it can also be used for spurious topics, such as calculating your biorhythm.

This post provides an example of data pseudo-science through a function that calculates and visualises your biorhythm. Based on the graph, I must be having a great day right now.

The broader and more pertinent message in this post is that data pseudo-science is more common than you would think. Is our belief in machine learning justified or are some of these models also a pseudo-science with not much more reliability than a biorhythm?

Biorhythm Theory

The idea that our physical states follow a predetermined rhythm has been around as long as mathematics. The basic concept of biorhythm is that a regular sinusoid cycle accurately describes our physical, emotional and intellectual states. Each of these three cycles has a different wavelength (λ):

  • physical: λ = 23 days
  • emotional: λ = 28 days
  • intellectual: λ = 33 days

The cycle is calculated with sin(2πt/λ), where t indicates the number of days since birth. This idea was developed by German surgeon Wilhelm Fliess in the late 19th century and was popularised in the United States in the late 1970s. There is no scientific evidence of the validity of this theory but it is an entertaining way to play with data.

The combination of the 23- and 28-day cycles repeats every 644 days, while the triple combination of the 23-, 28-, and 33-day cycles repeats every 21,252 days, or 58 years, two months and three weeks. You can, by the way, never reach a point where all cycles are maximised. The best you can achieve is 299.7 out of a maximum of 300, which occurs when you are 17,003 days old.
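
You can check that claim with a few lines of R, summing the three cycles (each scaled to 100) over a grid of days; this is just a quick sketch based on the cycle formula above:

days <- 1:25000
total <- (sin(2 * pi * days / 23) + sin(2 * pi * days / 28) + sin(2 * pi * days / 33)) * 100
days[which.max(total)]   # the age in days at which the combined score peaks
max(total)               # just under the theoretical maximum of 300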

Calculating your Biorhythm

When I was a teenager in the 1980s, several books and magazines described computer code to calculate your biorhythm. I used to play with these functions on my Atari 130XE computer.

Building a biorhythm calculator in R is easy. This function takes two dates as input and plots the biorhythm for the two weeks before and after the date. To calculate your biorhythm, run the function with your date of birth and target date: biorhythm(“yyyy-mm-dd”). The default version uses today as the target.

library(ggplot2)
library(reshape2)

biorhythm <- function(dob, target = Sys.Date()) {
    dob <- as.Date(dob)
    target <- as.Date(target)
    t <- round(as.numeric(difftime(target, dob)))
    days <- (t - 14):(t + 14)
    period <- data.frame(Date = seq.Date(from = target - 15, by = 1, length.out = 29),
                         Physical = sin(2 * pi * days / 23) * 100,
                         Emotional = sin(2 * pi * days / 28) * 100,
                         Intellectual = sin(2 * pi * days / 33) * 100)
    period <- melt(period, id.vars = "Date", variable.name = "Biorhythm", value.name = "Percentage")
    ggplot(period, aes(x = Date, y = Percentage, col = Biorhythm)) +
        geom_line() +
        ggtitle(paste("DoB:", format(dob, "%d %B %Y"))) +
        geom_vline(xintercept = as.numeric(target))
}
biorhythm("1969-09-12", "2017-03-30")

Biorhythms are an early attempt by human beings to predict the future. Although there is no relationship between this algorithm and reality, many people believed in its efficacy. Does the same hold true for the hyped capabilities of machine learning?

Data Pseudo-Science

Data pseudo-science is not only an issue when people use spurious mathematical relationships such as biorhythms or astrology. This post is also written as a warning not to only rely on numerical models to predict qualitative aspects of life.

The recent failures in predicting the results of elections, even days before the event, are a case in point. There are many reasons machine learning methods can go wrong. When machine learning algorithms fail, they are often just as useful as a biorhythm. It would be fun to write a predictive analysis package for R using only pseudoscientific approaches such as I-Ching, astrology or biorhythm.

The post Data Pseudo-Science: Developing a Biorhythm Calculator appeared first on The Devil is in the Data.

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.

R and Singularity

Wed, 03/29/2017 - 18:25

(This article was first published on RStudio, and kindly contributed to R-bloggers)

by Bryan Lewis

R (https://www.r-project.org) is a premier system for statistical and scientific computing and data science. At its core, R is a very carefully curated high-level interface to low-level numerical libraries. True to this principle, R packages have greatly expanded the scope and number of these interfaces over the years, among them interfaces to a large number of distributed and parallel computing tools. Despite its impressive breadth of sophisticated high-performance computing (HPC) tools, R is not that widely used for “big” problems.

I believe the idiosyncrasies of most HPC technologies represent the major road block to their adoption (in any language or system). HPC technologies are often difficult to set up, use, and manage. They often rely on frequently changing and complex software library dependencies, and sometimes highly specific library versions. Managing all this boils down to spending more time on system administration, and less time on research.

How do we make things easier? One approach to help accelerate the adoption of HPC technology by the R community uses Singularity, a modern application containerization technique suited to HPC (http://singularity.lbl.gov/).

Containers

A container is a collection of the software requirements to run an application. Importantly, containers are defined and generated from a simple text recipe that can be easily communicated and versioned. Containers leverage modern operating system capabilities for virtualizing process and name spaces in a high-performance, low-overhead way. Container technology allows us to quickly turn recipes into runnable applications, and then deploy them anywhere.

The success of Docker, CoreOS, and related systems in enterprise business applications shows that there is a huge demand for lightweight, versionable, and portable containers. Notably, these technologies have not been all that widely successful in HPC settings, despite significant effort. Shifter (https://github.com/NERSC/shifter) is the most successful application of Docker to HPC, and while it is very impressive, it suffers from a few important drawbacks. The root-capable daemon program used by Docker is difficult to accommodate in many HPC environments. And the relatively heavy-weight nature of Docker virtualization can degrade the performance of high-performance hardware resources like Infiniband networking.

Singularity is a lightweight and very simple container technology that is particularly well-suited to HPC environments. Singularity virtualizes the minimum amount necessary to compute, allowing applications full access to fast hardware resources like Infiniband networks and GPUs. And Singularity runs without a server at all, eliminating possible server security exploits. The minimalist philosophy of Singularity makes it easy to install and run on everything from laptops to supercomputers, promoting the ability to quickly test containers before using them across large systems. Singularity is now widely available in supercomputer centers across the world.

Reproducible research

Publishing results with code and data that can be reproduced and validated by others is an obviously important concept that has seen increased urgency these days. The idea is an old one that has been supported by S, S+ and R from the beginning with tools like Sweave and, more recently, knitr and R Markdown. R even promotes reproducible simulation in distributed/parallel settings by including high-quality, reproducible, distributed random number generators out of the box.

However, as R integrates with an increasing number of external libraries and frameworks like cuDNN, Spark, and others, the ability to reproduce the software environment that R runs in is becoming both more important and more complex. Containers help us define these complex set ups with simple, versionable text files, and then portably run them in diverse environments.

Examples

The following examples assume that Singularity is installed on your system. See http://singularity.lbl.gov/ for details — it’s very easy to install. The examples can be run from nearly any modern Unix operating system, although the processor architecture must be supported by the container operating system.

Hello TensorFlow

The first example below shows a canonical “hello world” program. Instead of a completely trivial example, we print “Hello, TensorFlow!” using TensorFlow from R via Python (https://github.com/tensorflow/tensorflow, https://github.com/python/cpython), introducing a complex but typical software dependency chain. A test program validates operation by printing the “hello world” message from R through Tensorflow. The container generically will run any R program named main.R in its working directory.
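
The test program itself is not reproduced here, but a minimal main.R along these lines (the TensorFlow 1.x-style “hello world” from the tensorflow R package) would do the job:

library(tensorflow)
sess <- tf$Session()                           # TensorFlow 1.x session API
hello <- tf$constant("Hello, TensorFlow!")
sess$run(hello)                                # prints the greeting via TensorFlow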

Here is the Singularity container definition file for the example using the Ubuntu Xenial operating system. (Note that you can build a container from this definition file on any Singularity-supported operating system.)

TIP If you’re running on Red Hat or CentOS, you’ll need the debootstrap program: sudo yum install debootstrap. See the Singularity documentation for more information.

Assuming that the above definition file is named tensorflow.def, you can bootstrap a Singularity container image named tensorflow.img with:

The %post section of the definition file installs R, Python, Tensorflow and miscellaneous utilities into the container. The %test section runs the “hello world” program as an example to verify things are working. The %run section of this example simply runs an arbitrary user R program named main.R in the container’s working directory.

Run the “hello world” %test script with:

I love Singularity’s ability to include unit tests in container definition files — it reminds me of building R packages! I encourage using the test section judiciously to confirm that the container will work as intended.

You can run an arbitrary R program in the container by creating a main.R file in the container working directory and running:

Full-genome variant Principal Components

The previous example illustrated a complex tool chain, but only running on a single computer. This example is closer to a complete distributed R application.

Genomic variants record differences in a genome relative to a reference. Many types of differences exist, see for instance https://en.wikipedia.org/wiki/Structural_variation. This example focuses on differences among the 2,504 whole human genomes curated by the 1000 Genomes Project (see: “A global reference for human genetic variation”, The 1000 Genomes Project Consortium, Nature 526, 68-74 (01 October 2015) doi:10.1038/nature15393). The example downloads whole genome data files in VCF 4.1 format. Although the 1000 Genome Project data files are used here, the example will work for any input set of VCF files (it processes all files named *.vcf.gz in the working directory).

The example constructs a sparse 2,504 row (people) by 81,271,844 column (genomic variants) R matrix from the VCF data files. The matrix entries are one if a particular variant occurs in the person, or a zero otherwise. Because not every person exhibits every variant, the matrix is very sparse with about 9.8 billion nonzero-elements, or about 2% fill-in. Rather than construct a single giant sparse matrix, the example partitions the data and saves many smaller sub-matrices each with CHUNKSIZE non-zero elements as R data files in the working directory, where CHUNKSIZE is an optional user-defined parameter that defaults to a value based on system memory size.
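
To make the data layout concrete, here is a toy illustration (not the example’s actual parsing code; the indices are made up) of how such a sub-matrix chunk can be represented with the Matrix package:

library(Matrix)
chunk <- sparseMatrix(i = c(1, 1, 3),        # rows: person indices
                      j = c(10, 250, 10),    # columns: variant indices
                      x = 1,                 # 1 means "this person has this variant"
                      dims = c(2504, 1000))  # a chunk covering 1,000 variant columns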

The example computes the first NCOMP principal components of the sparse genomic variant VCF files, where NCOMP is an environment variable specified by the user. The example is very general, requiring an arbitrary number of VCF data files as input and running on any number of computers. It uses MPI to coordinate parallel activity across computers, along with the Rmpi, doMPI, and foreach packages in R (a bare-bones skeleton of this structure is sketched after the list below). The choice of MPI is well-suited to supercomputer deployment, and the example assumes that MPI is available along with the following assumptions:

  • Launched by MPI
  • One or more gzip-compressed variant files ending in “.vcf.gz” (the program will use all files matching this pattern)
  • The input variant files are split up among working directories across the worker computers — each worker will parse and process only the variant files in its local working directory
  • Optional CHUNKSIZE environment variable giving the maximum number of nonzero matrix elements per chunk
  • Optional NCOMP environment variable specifying the number of principal components to return, defaulting to 3

A successful run produces the following output:

  • A file ‘pca.rdata’ in serialized R format containing the largest NCOMP singular values and corresponding principal component vectors of the variant data

This example was designed for deployment with supercomputer systems in mind. See https://github.com/bwlewis/1000_genomes_examples for other implementations that don’t require MPI.

Singularity encapsulates the program logic and the external library dependency chain (MPI, etc.) required by the computation in the following definition file:

Build and bootstrap a Singularity container using the variant_pca.def definition file with:
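As before, a Singularity 2.x sketch; the image name and size are assumptions:

sudo singularity create --size 4096 variant_pca.img
sudo singularity bootstrap variant_pca.img variant_pca.def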

Unit test

The container includes a simple unit test that verifies MPI operation; invoke it with:
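For example (a sketch, assuming the image name used above):

singularity test variant_pca.img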

Small example

A small, fast-running example computes principal components for the first 10,000 variants from the 1000 Genomes Project chromosomes 21 and 22 as follows:

Read the output pca.rdata file from R using readRDS(). The following code plots the first three estimated principal components.
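A minimal sketch, assuming the saved object is a list whose u component holds the principal component vectors (one row per person); inspect it with str() first if your version of the program stores the result differently:

p <- readRDS("pca.rdata")
str(p)   # check what was saved
pairs(p$u[, 1:3], pch = ".",
      labels = c("PC1", "PC2", "PC3"),
      main = "First three principal components")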

We see some obvious clusters in the data, but the clusters are not all that well-defined because we only use data from two smaller chromosomes (21 and 22) in this example. The clusters correspond to distinct genetic superpopulations. See the following example for a refined plot using the whole genomes.

Full-sized example

Finally, compute the whole genome principal components across all chromosomes and all 2,504 people in the 1000 Genomes project with:

When running on more than one computer, first scatter the *.vcf.gz files across the working directories of the worker computers, for instance using scp. Each computer will only process the files located in its working directory, so copy a distinct subset of the files to each machine.

The Singularity container image must also be available on each computer, so copy the image to each one as well.

Let’s assume for this example that we have four computers in total. Then we need to invoke the program on 4 + 1 = 5 total MPI hosts, as outlined in https://cran.r-project.org/web/packages/doMPI/vignettes/doMPI.pdf (the first listed host operates as the R master program in a master/slave configuration).

Assume that our four host computers are listed, comma-separated, in the environment variable HOSTS, for instance:
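(The host names below are placeholders; substitute your own machines.)

export HOSTS=host1,host2,host3,host4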

Then a typical openmpi invocation is (for our four hosts):
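The line below is a sketch rather than the post’s exact command: the flags depend on your Open MPI version, the image name is assumed from the definition file above, and the first listed host runs the R master process:

mpirun -np 5 --host localhost,$HOSTS singularity run variant_pca.img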

Replace the host list and -np 5 with the number of computers available in your cluster plus one. Or, submit the job using an available cluster job manager like Slurm. See https://cran.r-project.org/web/packages/doMPI/vignettes/doMPI.pdf for more details on using MPI with R.

Example output

To give you an idea of performance, I ran this example on four Amazon EC2 r4.4xlarge instances. The parsing step completed in about 20 minutes, and the principal component computation took about 11 minutes (680 seconds).

As with the small example above, we can read the output file and plot the principal components:

The resulting clusters are much better defined, splitting into four or five well-separated groups that correspond almost exactly to the NIH superpopulation categories for each person. Some of the clusters exhibit further sub-cluster structure.

Additional Notes

The computation uses an R program downloaded from https://raw.githubusercontent.com/bwlewis/1000_genomes_examples/master/pca-mpi.R that we don’t reproduce here. See that file and https://github.com/bwlewis/1000_genomes_examples/blob/master/PCA_whole_genome.Rmd for additional notes.

The computation proceeds in two sequential phases, first processing the raw VCF files into chunks of sparse R matrices corresponding to the variant data, and then computing principal components on the R matrices. Parallel computation is used within each phase.

Sparse matrix chunk size is specified by the user with the environment variable CHUNKSIZE to indicate the maximum number of nonzero matrix elements per chunk. If unspecified, CHUNKSIZE is automatically determined based on a heuristic using the host computer’s memory size.

The first processing phase stores the R sparse matrix chunks corresponding to the available input VCF files so that the algorithm can re-use them iteratively. In other words, the algorithm processes the chunked variant data out of core. Alternative versions of the program instead pin the sparse matrix chunks in memory on each computer and avoid intermediate file system use, which can obviously be more efficient. But, importantly, the file system approach scales easily: this program will run (slowly) on a single laptop even if the total variant sparse matrix vastly exceeds available RAM. The example thus trades peak performance for flexibility. Even so, performance can be excellent, thanks to the efficient algorithm used and the fact that the files are cached in each computer’s buffer cache when memory permits.

To leave a comment for the author, please follow the link and comment on their blog: RStudio. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Massively-parallel computations on Azure clusters with R, made easy with doAzureParallel

Wed, 03/29/2017 - 18:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by JS Tan (Program Manager, Microsoft)

For users of the R language, scaling up their work to take advantage of cloud-based computing has generally been a complex undertaking. We are therefore excited to announce doAzureParallel, a lightweight R package built on Azure Batch that allows you to easily use Azure’s flexible compute resources right from your R session. The doAzureParallel package complements Microsoft R Server and provides the infrastructure you need to run massively parallel simulations on Azure directly from R.

The doAzureParallel package is a parallel backend for the popular foreach package, making it possible to execute multiple processes across a cluster of Azure virtual machines with just a few lines of R code. The package helps you create and manage the cluster in Azure, and register it as a parallel backend to be used with foreach.

With doAzureParallel, there is no need to manually create, configure and manage a cluster of individual VMs. Running jobs at scale is as easy as running algorithms on your local machine. With Azure Batch’s autoscaling capabilities, you can also increase or decrease your cluster size to fit your workloads, saving you time and money. doAzureParallel also uses the Azure Data Science Virtual Machine (DSVM), allowing Azure Batch to configure the appropriate environment quickly and easily.

doAzureParallel is ideal for running embarrassingly parallel work such as parametric sweeps or Monte Carlo simulations, making it a great fit for many financial modelling algorithms (back-testing, portfolio scenario modelling, etc).
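To give a flavour of the workflow, here is a minimal sketch based on the package’s README at the time of writing; the function names and configuration-file layout are assumptions that may differ between releases, so treat this as illustrative rather than definitive:

library(doAzureParallel)
library(foreach)

# generate template configuration files, then fill in your Azure Batch and
# storage credentials plus the desired cluster size
generateCredentialsConfig("credentials.json")
generateClusterConfig("cluster.json")

setCredentials("credentials.json")
cluster <- makeCluster("cluster.json")   # provisions (or reuses) the pool
registerDoAzureParallel(cluster)         # register as the foreach backend

# an embarrassingly parallel job: each iteration runs on the Azure pool
results <- foreach(i = 1:100) %dopar% mean(rnorm(1e6))

stopCluster(cluster)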

The doAzureParallel package is available for download on GitHub under the open-source MIT license, and there is no additional cost for these capabilities – you only pay for the Azure VMs you use.

Performance Gains with doAzureParallel

To illustrate the kind of speed-up you can get with doAzureParallel, we ran a test comparing the performance of computing the Mandelbrot set on a single machine with 2 cores, a 5-node cluster with 10 cores, a 10-node cluster with 20 cores, and finally a 20-node cluster with 40 cores.

Time in seconds for computing the Mandelbrot set

From the graph, we can see that using doAzureParallel on Azure can get you a speed up of about 2X with a cluster of 5 nodes, a speed up of about 3X with a cluster of 10 nodes, and a speed up of about 6X with a cluster of 20 nodes. From a cost perspective, running the 20-node cluster for about 50 seconds with 6X performance ended up costing only $0.03 when using a cluster of Standard_F2 Linux VMs. (This was for the WestUS region. Pricing varies by region and can change over time.)

The doAzureParallel package is available now on GitHub, where you can also find detailed documentation. For more information, including installation steps and demo code, check out the announcement on the Microsoft Azure blog linked below.

Microsoft Azure blog: doAzureParallel: Take advantage of Azure’s flexible compute directly from your R session

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Coordinatized Data: A Fluid Data Specification

Wed, 03/29/2017 - 17:33

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Authors: John Mount and Nina Zumel.

Introduction

It’s been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion to and from row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting).


Boris Artzybasheff illustration

Real trust and understanding of this concept doesn’t fully form until one realizes that rows and columns are inessential implementation details when reasoning about your data. Many algorithms are sensitive to how data is arranged in rows and columns, so there is a need to convert between representations. However, confusing representation with semantics slows down understanding.

In this article we will try to separate representation from semantics. We will advocate for thinking in terms of coordinatized data, and demonstrate advanced data wrangling in R.

Example

Consider four data scientists who perform the same set of modeling tasks, but happen to record the data differently.

In each case the data scientist was asked to test two decision tree regression models (a and b) on two test-sets (x and y) and record both the model quality on the test sets under two different metrics (AUC and pseudo R-squared). The two models differ in tree depth (in this case model a has depth 5, and model b has depth 3), which is also to be recorded.

Data Scientist 1

Data scientist 1 is an experienced modeler, and records their data as follows:

library("tibble") d1 <- frame_data( ~model, ~depth, ~testset, ~AUC, ~pR2, 'a', 5, 'x', 0.4, 0.2, 'a', 5, 'y', 0.6, 0.3, 'b', 3, 'x', 0.5, 0.25, 'b', 3, 'y', 0.5, 0.25 ) print(d1) ## # A tibble: 4 × 5 ## model depth testset AUC pR2 ## <chr> <dbl> <chr> <dbl> <dbl> ## 1 a 5 x 0.4 0.20 ## 2 a 5 y 0.6 0.30 ## 3 b 3 x 0.5 0.25 ## 4 b 3 y 0.5 0.25

Data Scientist 1 uses what is called a denormalized form. In this form each row contains all of the facts we want ready to go. If we were thinking about "column roles" (a concept we touched on briefly in Section A.3.5 "How to Think in SQL" of Practical Data Science with R, Zumel, Mount; Manning 2014), then we would say the columns model and testset are key columns (together they form a composite key that uniquely identifies rows), the depth column is derived (it is a function of model), and AUC and pR2 are payload columns (they contain data).

Denormalized forms are the most ready for tasks that reason across columns, such as training or evaluating machine learning models.

Data Scientist 2

Data Scientist 2 has data warehousing experience and records their data in a normal form:

models2 <- frame_data( ~model, ~depth, 'a', 5, 'b', 3 ) d2 <- frame_data( ~model, ~testset, ~AUC, ~pR2, 'a', 'x', 0.4, 0.2, 'a', 'y', 0.6, 0.3, 'b', 'x', 0.5, 0.25, 'b', 'y', 0.5, 0.25 ) print(models2) ## # A tibble: 2 × 2 ## model depth ## <chr> <dbl> ## 1 a 5 ## 2 b 3 print(d2) ## # A tibble: 4 × 4 ## model testset AUC pR2 ## <chr> <chr> <dbl> <dbl> ## 1 a x 0.4 0.20 ## 2 a y 0.6 0.30 ## 3 b x 0.5 0.25 ## 4 b y 0.5 0.25

The idea is: since depth is a function of the model name, it should not be recorded as a column unless needed. In a normal form such as above, every item of data is written only one place. This means that we cannot have inconsistencies such as accidentally entering two different depths for a given model. In this example all our columns are either key or payload.

Data Scientist 2 is not concerned about any difficulty that might arise by this format as they know they can convert to Data Scientist 1’s format by using a join command:

library("dplyr", warn.conflicts= FALSE) d1_2 <- left_join(d2, models2, by='model') %>% select(model, depth, testset, AUC, pR2) %>% arrange(model, testset) print(d1_2) ## # A tibble: 4 × 5 ## model depth testset AUC pR2 ## <chr> <dbl> <chr> <dbl> <dbl> ## 1 a 5 x 0.4 0.20 ## 2 a 5 y 0.6 0.30 ## 3 b 3 x 0.5 0.25 ## 4 b 3 y 0.5 0.25 all.equal(d1, d1_2) ## [1] TRUE

Relational data theory (the science of joins) is the basis of Structured Query Language (SQL) and a topic any data scientist must master.

Data Scientist 3

Data Scientist 3 has a lot of field experience, and prefers an entity/attribute/value notation. They log each measurement as a separate row:

d3 <- frame_data( ~model, ~depth, ~testset, ~measurement, ~value, 'a', 5, 'x', 'AUC', 0.4, 'a', 5, 'x', 'pR2', 0.2, 'a', 5, 'y', 'AUC', 0.6, 'a', 5, 'y', 'pR2', 0.3, 'b', 3, 'x', 'AUC', 0.5, 'b', 3, 'x', 'pR2', 0.25, 'b', 3, 'y', 'AUC', 0.5, 'b', 3, 'y', 'pR2', 0.25 ) print(d3) ## # A tibble: 8 × 5 ## model depth testset measurement value ## <chr> <dbl> <chr> <chr> <dbl> ## 1 a 5 x AUC 0.40 ## 2 a 5 x pR2 0.20 ## 3 a 5 y AUC 0.60 ## 4 a 5 y pR2 0.30 ## 5 b 3 x AUC 0.50 ## 6 b 3 x pR2 0.25 ## 7 b 3 y AUC 0.50 ## 8 b 3 y pR2 0.25

In this form model, testset, and measurement are key columns. depth is still running around as a derived column and the new value column holds the measurements (which could in principle have different types in different rows!).

Data Scientist 3 is not worried about their form causing problems as they know how to convert into Data Scientist 1’s format with an R command:

library("tidyr") d1_3 <- d3 %>% spread('measurement', 'value') %>% select(model, depth, testset, AUC, pR2) %>% # to guarantee column order arrange(model, testset) # to guarantee row order print(d1_3) ## # A tibble: 4 × 5 ## model depth testset AUC pR2 ## <chr> <dbl> <chr> <dbl> <dbl> ## 1 a 5 x 0.4 0.20 ## 2 a 5 y 0.6 0.30 ## 3 b 3 x 0.5 0.25 ## 4 b 3 y 0.5 0.25 all.equal(d1, d1_3) ## [1] TRUE

You can read a bit on spread() here.

We will use the term moveValuesToColumns() for this operation later in this note; the spread() call above will then be written as follows.

moveValuesToColumns(data = d3, columnToTakeKeysFrom = 'measurement', columnToTakeValuesFrom = 'value', rowKeyColumns = c('model', 'testset'))

The above operation is a bit exotic, and it and its inverse already go under a number of different names:

  • pivot / un-pivot (Microsoft Excel)
  • pivot / anti-pivot (databases)
  • crosstab / un-crosstab (databases)
  • unstack / stack (R)
  • cast / melt (reshape, reshape2)
  • spread / gather (tidyr)
  • "widen" / "narrow" (colloquial)
  • moveValuesToColumns() and moveValuesToRows() (this writeup)

And we are certainly neglecting other namings of the concept. We find none of these particularly evocative (though cheatsheets help), so one purpose of this note will be to teach these concepts in terms of the deliberately verbose ad-hoc terms: moveValuesToColumns() and moveValuesToRows().

Note: often the data re-arrangement operation is only exposed as part of a larger aggregating or tabulating operation. Also moveValuesToColumns() is considered the harder transform direction (as it has to group rows to work), so it is often supplied in packages, whereas analysts often use ad-hoc methods for the simpler moveValuesToRows() operation (to be defined next).

Data Scientist 4

Data Scientist 4 picks a form that makes models unique keys, and records the results as:

d4 <- frame_data( ~model, ~depth, ~x_AUC, ~x_pR2, ~y_AUC, ~y_pR2, 'a', 5, 0.4, 0.2, 0.6, 0.3, 'b', 3, 0.5, 0.25, 0.5, 0.25 ) print(d4) ## # A tibble: 2 × 6 ## model depth x_AUC x_pR2 y_AUC y_pR2 ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 a 5 0.4 0.20 0.6 0.30 ## 2 b 3 0.5 0.25 0.5 0.25

This is not a problem as it is possible to convert to Data Scientist 3’s format.

d3_4 <- d4 %>% gather('meas', 'value', x_AUC, y_AUC, x_pR2, y_pR2) %>% separate(meas, c('testset', 'measurement')) %>% select(model, depth, testset, measurement, value) %>% arrange(model, testset, measurement) print(d3_4) ## # A tibble: 8 × 5 ## model depth testset measurement value ## <chr> <dbl> <chr> <chr> <dbl> ## 1 a 5 x AUC 0.40 ## 2 a 5 x pR2 0.20 ## 3 a 5 y AUC 0.60 ## 4 a 5 y pR2 0.30 ## 5 b 3 x AUC 0.50 ## 6 b 3 x pR2 0.25 ## 7 b 3 y AUC 0.50 ## 8 b 3 y pR2 0.25 all.equal(d3, d3_4) ## [1] TRUE

We will replace the gather operation with moveValuesToRows() and the call will look like the following.

moveValuesToRows(data = d4, nameForNewKeyColumn = 'meas', nameForNewValueColumn = 'value', columnsToTakeFrom = c('x_AUC', 'y_AUC', 'x_pR2', 'y_pR2'))

moveValuesToRows() is (under some restrictions) an inverse of moveValuesToColumns().
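A quick round-trip check with tidyr directly (the verbose wrappers are defined further below); for this data the result should equal d1:

d3_rt <- d1 %>% gather('measurement', 'value', AUC, pR2)
d1_rt <- d3_rt %>% spread('measurement', 'value') %>%
  select(model, depth, testset, AUC, pR2) %>%
  arrange(model, testset)
all.equal(d1, d1_rt)   # should be TRUE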



Although we implement moveValuesToRows() and moveValuesToColumns() as thin wrappers of tidyr’s gather and spread, we find the more verbose naming (and calling interface) more intuitive. So we encourage you to think directly in terms of moveValuesToRows() as moving values to different rows (in the same column), and moveValuesToColumns() as moving values to different columns (in the same row). It will usually be apparent from your problem which of these operations you want to use.

The Theory of Coordinatized Data

When you are working with transformations you look for invariants to keep your bearings. All of the above data share an invariant property we call being coordinatized data. In this case the invariant is so strong that one can think of all of the above examples as being equivalent, and the row/column transformations as merely changes of frame of reference.

Let’s define coordinatized data by working with our examples. In all the above examples a value carrying (or payload) cell or entry can be uniquely named as follows:

c(Table=tableName, (KeyColumn=KeyValue)*, ValueColumn=ValueColumnName)

The above notations are the coordinates of the data item (hence "coordinatized data").

For instance, the AUC of 0.6 is in a cell that each of our scientists would name as follows:

  • Data Scientist 1: c(Table='d1', model='a', testset='y', ValueColumn='AUC')
  • Data Scientist 2: c(Table='d2', model='a', testset='y', ValueColumn='AUC')
  • Data Scientist 3: c(Table='d3', model='a', testset='y', measurement='AUC', ValueColumn='value')
  • Data Scientist 4: c(Table='d4', model='a', ValueColumn= paste('y', 'AUC', sep= '_'))

From our point of view these keys all name the same data item. The fact that we are interpreting one position as a table name and another as a column name is just convention. We can even write R code that uses these keys on all our scientists’ data without performing any reformatting:

# take a map from names to scalar conditions and return a value. # inefficient method; notional only lookup <- function(key) { table <- get(key[['Table']]) col <- key[['ValueColumn']] conditions <- setdiff(names(key), c('Table', 'ValueColumn')) for(ci in conditions) { table <- table[table[[ci]]==key[[ci]], , drop= FALSE] } table[[col]][[1]] } k1 <- c(Table='d1', model='a', testset='y', ValueColumn='AUC') k2 <- c(Table='d2', model='a', testset='y', ValueColumn='AUC') k3 <- c(Table='d3', model='a', testset='y', measurement='AUC', ValueColumn='value') k4 = c(Table='d4', model='a', ValueColumn= paste('y', 'AUC', sep= '_')) print(lookup(k1)) ## [1] 0.6 print(lookup(k2)) ## [1] 0.6 print(lookup(k3)) ## [1] 0.6 print(lookup(k4)) ## [1] 0.6

The lookup() procedure was able to treat all these keys and key positions uniformly. This illustrates that what is in tables versus what is in rows versus what is in columns is just an implementation detail. Once we understand that all of these data scientists recorded the same data we should not be surprised we can convert between representations.

The thing to remember: coordinatized data is in cells, and every cell has unique coordinates. We are going to use this invariant as our enforced precondition before any data transform, which will guarantee our data meets this invariant as a postcondition. I.e., if we restrict ourselves to coordinatized data and exclude wild data, the operations moveValuesToColumns() and moveValuesToRows() become well-behaved and much easier to comprehend. In particular, they are invertible. (In math terms, the operators moveValuesToColumns() and moveValuesToRows() form a groupoid acting on coordinatized data.)


The 15 puzzle: another groupoid

By "wild" data we mean data where cells don’t have unique lookup() addresses. This often happens in data that has repeated measurements. Wild data is simply tamed by adding additional keying columns (such as an arbitrary experiment repetition number). Hygienic data collection practice nearly always produces coordinatized data, or at least data that is easy to coordinatize. Our position is that your data should always be coordinatized; if it’s not, you shouldn’t be working with it yet.
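As a small sketch of taming wild data (the numbers are made up): the same coordinates appear twice, so we add an explicit repetition column to restore unique addresses.

d_wild <- frame_data(
  ~model, ~testset, ~AUC,
  'a',    'x',      0.40,
  'a',    'x',      0.42)   # repeated measurement: cell coordinates collide
d_tamed <- d_wild %>%
  group_by(model, testset) %>%
  mutate(rep = row_number()) %>%   # rep becomes part of the key
  ungroup()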

Rows and Columns

Many students are initially surprised that row/column conversions are considered "easy." Thus, it is worth taking a little time to review moving data between rows and columns.

Moving From Columns to Rows ("Thinifying data")

Moving data from columns to rows (i.e., from Scientist 1 to Scientist 3) is easy to demonstrate and explain.

The only thing hard about this operation is remembering the name of the operation ("gather()") and the arguments. We can remove this inessential difficulty by writing a helper function (to check our preconditions) and a verbose wrapper function (also available as a package from CRAN or Github):

library("wrapr") checkColsFormUniqueKeys <- function(data, keyColNames, allowNAKeys = FALSE) { # check for NA keys if((!allowNAKeys) && (length(keyColNames)>0)) { allGood <- data %>% dplyr::select(dplyr::one_of(keyColNames)) %>% complete.cases() %>% all() if(!allGood) { stop("saw NA in keys") } } # count the number of rows ndata <- nrow(data) if(ndata<=1) { return(TRUE) } # count the number of rows identifiable by keys nunique <- min(1, ndata) if(length(keyColNames)>0) { nunique <- data %>% dplyr::select(dplyr::one_of(keyColNames)) %>% dplyr::distinct() %>% nrow() } # compare return(nunique==ndata) } moveValuesToRows <- function(data, nameForNewKeyColumn, nameForNewValueColumn, columnsToTakeFrom) { cn <- colnames(data) dcols <- setdiff(cn, columnsToTakeFrom) if(!checkColsFormUniqueKeys(dplyr::select(data, dplyr::one_of(dcols)), dcols)) { stop("moveValuesToRows: rows were not uniquely keyed") } # assume gather_ is going to be deprecated, as is happening # with dplyr methods : # https://github.com/hadley/dplyr/blob/da7fc6ecef1c6d329f014feb96c9c99d6cebc880/R/select-vars.R wrapr::let(c(NAMEFORNEWKEYCOLUMM= nameForNewKeyColumn, NAMEFORNEWVALUECOLUMN= nameForNewValueColumn), tidyr::gather(data, key= NAMEFORNEWKEYCOLUMM, value= NAMEFORNEWVALUECOLUMN, dplyr::one_of(columnsToTakeFrom)) ) }

In this notation moving from Data Scientist 1’s records to Data Scientist 3’s looks like the following.

d3from1 <- moveValuesToRows(data=d1, nameForNewKeyColumn= 'measurement', nameForNewValueColumn= 'value', columnsToTakeFrom = c('AUC', 'pR2')) %>% select(model, depth, testset, measurement, value) %>% arrange(model, testset, measurement) print(d3from1) ## # A tibble: 8 × 5 ## model depth testset measurement value ## <chr> <dbl> <chr> <chr> <dbl> ## 1 a 5 x AUC 0.40 ## 2 a 5 x pR2 0.20 ## 3 a 5 y AUC 0.60 ## 4 a 5 y pR2 0.30 ## 5 b 3 x AUC 0.50 ## 6 b 3 x pR2 0.25 ## 7 b 3 y AUC 0.50 ## 8 b 3 y pR2 0.25 all.equal(d3, d3from1) ## [1] TRUE

In a moveValuesToRows() operation each row of the data frame is torn up and used to make many rows. Each of the columns we specify that we want measurements from gives us a new row from each of the original data rows.

The pattern is more obvious if we process a single row of d1 on its own:

row <- d1[3,] print(row) ## # A tibble: 1 × 5 ## model depth testset AUC pR2 ## <chr> <dbl> <chr> <dbl> <dbl> ## 1 b 3 x 0.5 0.25 moveValuesToRows(data=row, nameForNewKeyColumn= 'measurement', nameForNewValueColumn= 'value', columnsToTakeFrom = c('AUC', 'pR2')) %>% select(model, depth, testset, measurement, value) %>% arrange(model, testset, measurement) ## # A tibble: 2 × 5 ## model depth testset measurement value ## <chr> <dbl> <chr> <chr> <dbl> ## 1 b 3 x AUC 0.50 ## 2 b 3 x pR2 0.25

Moving From Rows to Columns ("Widening data")

Moving data from rows to columns (i.e., from Scientist 3 to Scientist 1) is a bit harder to explain, and usually not explained well.

In moving from rows to columns we group a set of rows that go together (match on keys) and then combine them into one row by adding additional columns.



Note: to move data from rows to columns we must know which set of rows go together. That means some set of columns is working as keys, even though this is not emphasized in the spread() calling interface or explanations. For invertible data transforms, we want a set of columns (rowKeyColumns) that define a composite key that uniquely identifies each row of the result. For this to be true, the rowKeyColumns plus the column we are taking value keys from must uniquely identify each row of the input.

To make things easier to understand and remember, we introduce another wrapping function.

moveValuesToColumns <- function(data, columnToTakeKeysFrom, columnToTakeValuesFrom, rowKeyColumns) { cn <- colnames(data) # we insist that the rowKeyColumns plus # columnToTakeKeysFrom are unique keys over the input rows if(!checkColsFormUniqueKeys(data, c(rowKeyColumns, columnToTakeKeysFrom))) { stop(paste0("\n moveValuesToColumns: specified", "\n rowKeyColumns plus columnToTakeKeysFrom", "\n isn't unique across rows")) } # we are also checking that other columns don't prevent us # from matching values dcols <- setdiff(colnames(data), c(columnToTakeKeysFrom, columnToTakeValuesFrom)) dsub <- data %>% dplyr::select(dplyr::one_of(dcols)) %>% dplyr::distinct() if(!checkColsFormUniqueKeys(dsub, rowKeyColumns)) { stop(paste0("\n some columns not in", "\n c(rowKeyColumns, columnToTakeKeysFrom, columnToTakeValuesFrom)", "\n are splitting up row groups")) } wrapr::let(c(COLUMNTOTAKEKEYSFROM= columnToTakeKeysFrom, COLUMNTOTAKEVALUESFROM= columnToTakeValuesFrom), tidyr::spread(data, key= COLUMNTOTAKEKEYSFROM, value= COLUMNTOTAKEVALUESFROM) ) }

This lets us rework the example of moving from Data Scientist 3’s format to Data Scientist 1’s:

d1from3 <- moveValuesToColumns(data= d3, columnToTakeKeysFrom= 'measurement', columnToTakeValuesFrom= 'value', rowKeyColumns= c('model', 'testset')) %>% select(model, depth, testset, AUC, pR2) %>% arrange(model, testset) print(d1from3) ## # A tibble: 4 × 5 ## model depth testset AUC pR2 ## <chr> <dbl> <chr> <dbl> <dbl> ## 1 a 5 x 0.4 0.20 ## 2 a 5 y 0.6 0.30 ## 3 b 3 x 0.5 0.25 ## 4 b 3 y 0.5 0.25 all.equal(d1, d1from3) ## [1] TRUE

If the structure of our data doesn’t match our expected keying we can have problems. We emphasize that these problems arise from trying to work with non-coordinatized data, and not from the transforms themselves.

Too little keying

If our keys don’t contain enough information to match rows together, we can have a problem. Suppose our testset record was damaged or not present, and look at what happens with a direct call to spread:

d3damaged <- d3 d3damaged$testset <- 'z' print(d3damaged) ## # A tibble: 8 × 5 ## model depth testset measurement value ## <chr> <dbl> <chr> <chr> <dbl> ## 1 a 5 z AUC 0.40 ## 2 a 5 z pR2 0.20 ## 3 a 5 z AUC 0.60 ## 4 a 5 z pR2 0.30 ## 5 b 3 z AUC 0.50 ## 6 b 3 z pR2 0.25 ## 7 b 3 z AUC 0.50 ## 8 b 3 z pR2 0.25 spread(d3damaged, 'measurement', 'value') ## Error: Duplicate identifiers for rows (1, 3), (5, 7), (2, 4), (6, 8)

This happens because the precondition is not met: the columns (model, testset, measurement) don’t uniquely represent each row of the input. Catching the error is good, and we emphasize that in our wrapper.

moveValuesToColumns(data= d3damaged, columnToTakeKeysFrom= 'measurement', columnToTakeValuesFrom= 'value', rowKeyColumns= c('model', 'testset')) ## Error in moveValuesToColumns(data = d3damaged, columnToTakeKeysFrom = "measurement", : ## moveValuesToColumns: specified ## rowKeyColumns plus columnToTakeKeysFrom ## isn't unique across rows

The above issue is often fixed by adding additional columns (such as measurement number or time of measurement).

Too much keying

Columns can also contain too fine a key structure. For example, suppose our data was damaged and depth is no longer a function of the model id, but contains extra detail. In this case a direct call to spread produces a result that is far too large, because the extra detail prevents it from matching rows.

d3damaged <- d3 d3damaged$depth <- seq_len(nrow(d3damaged)) print(d3damaged) ## # A tibble: 8 × 5 ## model depth testset measurement value ## <chr> <int> <chr> <chr> <dbl> ## 1 a 1 x AUC 0.40 ## 2 a 2 x pR2 0.20 ## 3 a 3 y AUC 0.60 ## 4 a 4 y pR2 0.30 ## 5 b 5 x AUC 0.50 ## 6 b 6 x pR2 0.25 ## 7 b 7 y AUC 0.50 ## 8 b 8 y pR2 0.25 spread(d3damaged, 'measurement', 'value') ## # A tibble: 8 × 5 ## model depth testset AUC pR2 ## * <chr> <int> <chr> <dbl> <dbl> ## 1 a 1 x 0.4 NA ## 2 a 2 x NA 0.20 ## 3 a 3 y 0.6 NA ## 4 a 4 y NA 0.30 ## 5 b 5 x 0.5 NA ## 6 b 6 x NA 0.25 ## 7 b 7 y 0.5 NA ## 8 b 8 y NA 0.25

The frame d3damaged does not match the user’s probable intent: that the columns (model, testset) should uniquely specify row groups, or in other words, they should uniquely identify each row of the result.

In the above case we feel it is good to allow the user to declare intent (hence the extra rowKeyColumns argument) and throw an exception if the data is not structured how the user expects (instead of allowing this data to possibly ruin a longer analysis in some unnoticed manner).

moveValuesToColumns(data= d3damaged, columnToTakeKeysFrom= 'measurement', columnToTakeValuesFrom= 'value', rowKeyColumns= c('model', 'testset')) ## Error in moveValuesToColumns(data = d3damaged, columnToTakeKeysFrom = "measurement", : ## some columns not in ## c(rowKeyColumns, columnToTakeKeysFrom, columnToTakeValuesFrom) ## are splitting up row groups

The above issue is usually fixed by one of two solutions (which one is appropriate depends on the situation):

  1. Stricter control (via dplyr::select()) of which columns are in the analysis. In our example, we would select all the columns of d3damaged except depth.
  2. Aggregating or summing out the problematic columns. For example if the problematic column in our example were runtime, which could legitimately vary for the same model and dataset, we could use dplyr::group_by/summarize to create a data frame with columns (model, testset, mean_runtime, measurement, value), so that (model, testset) does uniquely specify row groups.

Conclusion

The concept to remember is: organize your records so data cells have unique consistent abstract coordinates. For coordinatized data the actual arrangement of data into tables, rows, and columns is an implementation detail or optimization that does not significantly change what the data means.

For coordinatized data different layouts of rows and columns are demonstrably equivalent. We document and maintain this equivalence by asking the analyst to describe their presumed keying structure to our methods, which then use this documentation to infer intent and check preconditions on the transforms.

It pays to think fluidly in terms of coordinatized data and delay any format conversions until you actually need them. You will eventually need transforms as most data processing steps have a preferred format. For example, machine learning training usually requires a denormalized form.

We feel the methods moveValuesToRows() and moveValuesToColumns() are easier to learn and remember than abstract terms such as "stack/unstack", "melt/cast", or "gather/spread" and thus are a good way to teach. Perhaps they are even a good way to document (and confirm) your intent in your own projects.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

#1: Easy Package Registration

Wed, 03/29/2017 - 13:52

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Welcome to the first actual post in the R4 series, following the short announcement earlier this week.

Context

Last month, Brian Ripley announced on r-devel that registration of routines would now be tested for by R CMD check in r-devel (which by next month will become R 3.4.0). A NOTE will be issued now; this will presumably turn into a WARNING at some point. Writing R Extensions has an updated introduction to the topic.

Package registration has long been available, and applies to all native (i.e. "compiled") functions via the .C(), .Call(), .Fortran() or .External() interfaces. If you use any of those — and .Call() may be the only truly relevant one here — then this is of interest to you.

Brian Ripley and Kurt Hornik also added a new helper function: tools::package_native_routine_registration_skeleton(). It parses the R code of your package and collects all native function entrypoints in order to autogenerate the registration. It is available in R-devel now, will be in R 3.4.0 and makes adding such registration truly trivial.

But as of today, it requires that you have R-devel. Once R 3.4.0 is out, you can call the helper directly.

As for R-devel, there have always been at least two ways to use it: build it locally (which we may cover in another R4 installment), or use Docker. Here we will focus on the latter by relying on the Rocker project by Carl and myself.

Use the Rocker drd Image

We assume you can run Docker on your system. How to add it on Windows, macOS or Linux is beyond our scope here today, but also covered extensively elsewhere. So we assume you can execute docker and e.g. bring in the ‘daily r-devel’ image drd from our Rocker project via

~$ docker pull rocker/drd

With that, we can use R-devel to create the registration file very easily in a single call (which is a long command-line we have broken up with one line-break for the display below).

The following is real-life example when I needed to add registration to the RcppTOML package for this week’s update:

~/git/rcpptoml(master)$ docker run --rm -ti -v $(pwd):/mnt rocker/drd \ ## line break RD --slave -e 'tools::package_native_routine_registration_skeleton("/mnt")' #include <R.h> #include <Rinternals.h> #include <stdlib.h> // for NULL #include <R_ext/Rdynload.h> /* FIXME: Check these declarations against the C/Fortran source code. */ /* .Call calls */ extern SEXP RcppTOML_tomlparseImpl(SEXP, SEXP, SEXP); static const R_CallMethodDef CallEntries[] = { {"RcppTOML_tomlparseImpl", (DL_FUNC) &RcppTOML_tomlparseImpl, 3}, {NULL, NULL, 0} }; void R_init_RcppTOML(DllInfo *dll) { R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); R_useDynamicSymbols(dll, FALSE); } edd@max:~/git/rcpptoml(master)$

Decompose the Command

We can understand the docker command invocation above through its components:

  • docker run is the basic call to a container
  • --rm -ti does subsequent cleanup (--rm) and gives a terminal (-t) that is interactive (-i)
  • -v $(pwd):/mnt uses the -v a:b options to make local directory a available as b in the container; here $(pwd) calls print working directory to get the local directory which is now mapped to /mnt in the container
  • rocker/drd invokes the ‘drd’ container of the Rocker project
  • RD is a shorthand to the R-devel binary inside the container, and the main reason we use this container
  • -e 'tools::package_native_routine_registration_skeleton("/mnt")' calls the helper function of R (currently in R-devel only, so we use RD) to compute a possible init.c file based on the current directory — which is /mnt inside the container

That’s it. We get a call to the R function executed inside the Docker container, examining the package in the working directory and creating a registration file for it, which is displayed on the console.

Copy the Output to src/init.c

We simply copy the output to a file src/init.c; I often fold one opening brace one line up.

Update NAMESPACE

We also change one line in NAMESPACE from (for this package) useDynLib("RcppTOML") to useDynLib("RcppTOML", .registration=TRUE). Adjust accordingly for other package names.

That’s it!

And with that we have a package which no longer provokes the NOTE, as seen on the checks page. Calls to native routines are now safer (less chance of name clashing) and quicker as we skip the symbol search (see the WRE discussion), and best of all this applies to all native routines whether written by hand or via a generator such as Rcpp Attributes.

So give this a try to get your package up-to-date.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Burtin’s Antibiotics visualization in Plotly and R

Wed, 03/29/2017 - 12:35

(This article was first published on R – Modern Data, and kindly contributed to R-bloggers)

In this post, we’ll try to re-create Burtin’s antibiotics visualization using Plotly. The post follows Mike Bostock’s original re-creation in Protovis. See here.

library(plotly) library(reshape2) library(dplyr) # Data df <- read.csv("https://cdn.rawgit.com/plotly/datasets/5360f5cd/Antibiotics.csv", stringsAsFactors = F) N <- nrow(df) # Melting for easier use later on df$Seq <- 1:N df <- melt(df, id.vars = c("Bacteria", "Gram", "Seq")) %>% arrange(Seq) # Angle generation theta.start <- (5/2)*pi - pi/10 theta.end <- pi/2 + pi/10 theta.range <- seq(theta.start, theta.end, length.out = N * 4 - 1) dtheta <- diff(theta.range)[1] # Fine adjustment for larger ticks (black) inc <- 0.04 # Angles for larger ticks big_ticks <- theta.range[1:length(theta.range) %% 4 == 0] big_ticks <- c(theta.range[1] - dtheta, big_ticks, theta.range[length(theta.range)] + dtheta) # Angles for smaller ticks small_ticks <- theta.range[1:length(theta.range) %% 4 != 0] # Set inner and outer radii inner.radius <- 0.3 outer.radius <- 1 # Set colors cols <- c("#0a3264","#c84632","#000000") pos_col <- "rgba(174, 174, 184, .8)" neg_col <- "rgba(230, 130, 110, .8)" # Function to calculate radius given minimum inhibitory concentration # Scaling function is sqrt(log(x)) radiusFUNC <- function(x){ min <- sqrt(log(0.001 * 1e4)) max <- sqrt(log(1000 * 1e4)) a <- (outer.radius - inner.radius)/(min - max) b <- inner.radius - a * max rad <- a * sqrt(log(x * 1e4)) + b return(rad) } SEQ <- function(start, by, length){ vec <- c() vec[1] <- start for(i in 2:length){ vec[i] <- vec[i-1] + by } return(vec) } # Generate x and y coordinates for large ticks radial_gridlines <- data.frame(theta = big_ticks, x = (inner.radius - inc) * cos(big_ticks), y = (inner.radius - inc) * sin(big_ticks), xend = (outer.radius + inc) * cos(big_ticks), yend = (outer.radius + inc) * sin(big_ticks)) # Generate x and y coordinates for antibiotics antibiotics <- df antibiotics$x <- inner.radius * cos(small_ticks) antibiotics$y <- inner.radius * sin(small_ticks) antibiotics$xend <- radiusFUNC(df$value) * cos(small_ticks) antibiotics$yend <- radiusFUNC(df$value) * sin(small_ticks) antibiotics$text <- with(antibiotics, paste("<b>Bacteria:</b>", Bacteria, "<br>", "<b>Antibiotic:</b>", variable, "<br>", "<b>Min. 
conc:</b>", value, "<br>")) # Generate x and y coordinates for white circles (grid) rad <- c(100, 10, 1, 0.1, 0.01, 0.001) rad <- c(inner.radius, radiusFUNC(rad)) theta <- seq(0, 2 * pi, length.out = 100) circles <- lapply(rad, function(x){ x.coord <- x * cos(theta) y.coord <- x * sin(theta) return(data.frame(label = which(rad == x), x = x.coord, y = y.coord)) }) circles <- do.call(rbind, lapply(circles, data.frame)) # Generate gram-negative polygon theta <- seq(big_ticks[1], big_ticks[10], length.out = 100) gram.neg <- data.frame(theta = theta, x = inner.radius * cos(theta), y = inner.radius * sin(theta)) theta <- rev(theta) gram.neg <- rbind(gram.neg, data.frame(theta = theta, x = outer.radius * cos(theta), y = outer.radius * sin(theta))) # Generate gram-positive polygon theta <- seq(big_ticks[10], big_ticks[length(big_ticks)], length.out = 100) gram.pos <- data.frame(theta = theta, x = inner.radius * cos(theta), y = inner.radius * sin(theta)) theta <- rev(theta) gram.pos <- rbind(gram.pos, data.frame(theta = theta, x = outer.radius * cos(theta), y = outer.radius * sin(theta))) # Text annotations - bacteria name bacteria.df <- data.frame(bacteria = unique(antibiotics$Bacteria), theta = big_ticks[-length(big_ticks)] - 0.17, stringsAsFactors = F) bacteria.df$x <- (outer.radius + inc) * cos(bacteria.df$theta) bacteria.df$y <- (outer.radius + inc) * sin(bacteria.df$theta) bacteria.df$textangle <- SEQ(-70, by = 21, length = nrow(bacteria.df)) bacteria.df$textangle[9:nrow(bacteria.df)] <- (bacteria.df$textangle[9:nrow(bacteria.df)] - 90) - 90 bacteria.df <- lapply(1:nrow(bacteria.df), function(x){ list(x = outer.radius*cos(bacteria.df$theta[x]) , y = outer.radius*sin(bacteria.df$theta[x]), xref = "plot", yref = "plot", xanchor = "center", yanchor = "middle", text = bacteria.df$bacteria[x], showarrow = F, font = list(size = 12, family = "arial"), textangle = bacteria.df$textangle[x]) }) # Title bacteria.df[[17]] <- list(x = 0, y = 1, xref = "paper", yref = "paper", xanchor = "left", yanchor = "top", text = "<b>Burtin’s Antibiotics</b>", showarrow = F, font = list(size = 30, family = "serif")) # Text annotations - scale scale.annotate <- data.frame(x = 0, y = c(inner.radius, radiusFUNC(c(100, 10, 1, 0.1, 0.01, 0.001))), scale = c("", "100", "10", "1", "0.1", "0.01","0.001")) # Plot p <- plot_ly(x = ~x, y = ~y, xend = ~xend, yend = ~yend, hoverinfo = "text", height = 900, width = 800) %>% # Gram negative sector add_polygons(data = gram.neg, x = ~x, y = ~y, line = list(color = "transparent"), fillcolor = neg_col, inherit = F, hoverinfo = "none") %>% # Gram positive sector add_polygons(data = gram.pos, x = ~x, y = ~y, line = list(color = "transparent"), fillcolor = pos_col, inherit = F, hoverinfo = "none") %>% # Antibiotics add_segments(data = antibiotics %>% filter(variable == "Penicillin"), text = ~text, line = list(color = cols[1], width = 7)) %>% add_segments(data = antibiotics %>% filter(variable == "Streptomycin"), text = ~text, line = list(color = cols[2], width = 7)) %>% add_segments(data = antibiotics %>% filter(variable == "Neomycin"), text = ~text, line = list(color = cols[3], width = 7)) %>% # Black large ticks add_segments(data = radial_gridlines, line = list(color = "black", width = 2), hoverinfo = "none") %>% # White circles add_polygons(data = circles, x = ~x, y = ~y, group_by = ~label, line = list(color = "#eeeeee", width = 1), fillcolor = "transparent", inherit = F, hoverinfo = "none") %>% # Scale labels add_text(data = scale.annotate, x = ~x, y = ~y, text = ~scale, inherit = F, 
textfont = list(size = 10, color = "black"), hoverinfo = "none") %>% # Gram labels add_markers(x = c(-0.05, -0.05), y = c(-1.4, -1.5), marker = list(color = c(neg_col, pos_col), size = 15), inherit = F, hoverinfo = "none") %>% add_text(x = c(0.15, 0.15), y = c(-1.4, -1.5), text = c("Gram Negative", "Gram Positive"), textfont = list(size = 10, color = "black"), inherit = F, hoverinfo = "none") %>% # Antibiotic legend add_markers(x = c(-0.15, -0.15, -0.15), y = c(0.1, 0, -0.1), marker = list(color = c(cols), size = 15, symbol = "square"), inherit = F, hoverinfo = "none") %>% add_text(x = c(0.05, 0.05, 0.05), y = c(0.1, 0, -0.1), text = c("Penicillin", "Streptomycin", "Neomycin"), textfont = list(size = 10, color = "black"), inherit = F, hoverinfo = "none") %>% layout(showlegend = F, xaxis = list(showgrid = F, zeroline = F, showticklabels = F, title = "", domain = c(0.05, 0.95)), yaxis = list(showgrid = F, zeroline = F, showticklabels = F, title = "", domain = c(0.05, 0.95)), plot_bgcolor = "rgb(240, 225, 210)", paper_bgcolor = "rgb(240, 225, 210)", # Annotations annotations = bacteria.df) p

To leave a comment for the author, please follow the link and comment on their blog: R – Modern Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Make ggplot2 purrr

Wed, 03/29/2017 - 06:45

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

I’ll be honest: the title is a bit misleading. I will not use purrr that much in this blog post. Actually, I will use one single purrr function, at the very end. I use dplyr much more. However Make ggplot2 purrr sounds better than Make ggplot dplyr or whatever the verb for dplyr would be.

Also, this blog post was inspired by a stackoverflow question and in particular one of the answers. So I don’t bring anything new to the table, but I found this stackoverflow answer so useful and so underrated (only 16 upvotes as I’m writing this!) that I wanted to write something about it.

Basically the idea of this blog post is to show how to create graphs using ggplot2, but by grouping by a factor variable beforehand. To illustrate this idea, let’s use the data from the Penn World Tables 9.0. The easiest way to get this data is to install the package called pwt9 with:

install.packages("pwt9")

and then load the data with:

data("pwt9.0")

Now, let’s load the needed packages. I am also using ggthemes, which makes theming your ggplots very easy. I’ll be making Tufte-style plots.

library(ggplot2) library(ggthemes) library(dplyr) library(purrr) library(pwt9)

First let’s select a list of countries:

country_list <- c("France", "Germany", "United States of America", "Luxembourg", "Switzerland", "Greece") small_pwt <- pwt9.0 %>% filter(country %in% country_list)

Let us also order the countries in the data frame as I have written them in country_list:

small_pwt <- small_pwt %>% mutate(country = factor(country, levels = country_list, ordered = TRUE))

You might be wondering why this is important. At the end of the article, we are going to save the plots to disk. If we do not re-order the countries inside the data frame as in country_list, the name of the files will not correspond to the correct plots!

Now when you want to plot the same variable by countries, say avh (Average annual hours worked by persons engaged), the usual way to do this is with one of facet_wrap() or facet_grid():

ggplot(data = small_pwt) + theme_tufte() + geom_line(aes(y = avh, x = year)) + facet_wrap(~country)

ggplot(data = small_pwt) + theme_tufte() + geom_line(aes(y = avh, x = year)) + facet_grid(country~.)

As you can see, for this particular example, facet_grid() is not very useful, but do notice its argument, country~., which is different from facet_wrap()’s argument. This way each country gets its own row, so the graphs are stacked vertically. If I had used facet_grid(~country) the graphs would be side by side and completely unreadable.

Now, let’s go to the meat of this post: what if you would like to have one single graph for each country? You’d probably think of using dplyr::group_by() to form the groups and then the graphs. This is the way to go, but you also have to use dplyr::do(). This is because as far as I understand, ggplot2 is not dplyr-aware, and using an arbitrary function with groups is only possible with dplyr::do().

plots <- small_pwt %>% group_by(country) %>% do(plot = ggplot(data = .) + theme_tufte() + geom_line(aes(y = avh, x = year)) + ggtitle(unique(.$country)) + ylab("Average annual hours worked by persons engaged") + xlab("Year"))

If you know dplyr at least a little bit, the above lines should be easy for you to understand. But notice how we get the title of the graphs, with ggtitle(unique(.$country)), which was actually the point of the stackoverflow question. What might be surprising though, is the object that is created by this code. Let’s take a look at plots:

print(plots) ## Source: local data frame [6 x 2] ## Groups: <by row> ## ## # A tibble: 6 × 2 ## country plot ## * <ord> <list> ## 1 France <S3: gg> ## 2 Germany <S3: gg> ## 3 United States of America <S3: gg> ## 4 Luxembourg <S3: gg> ## 5 Switzerland <S3: gg> ## 6 Greece <S3: gg>

As dplyr::do()’s documentation tells us, the return values get stored inside a list. And this is exactly what we get back; a list of plots! Lists are a very flexible and useful class, and you cannot spell list without purrr (at least not when you’re a neRd).

Here are the final lines that use purrr::map2() to save all these plots at once inside your working directory:

file_names <- paste0(country_list, ".pdf") map2(file_names, plots$plot, ggsave)

As I said before, if you do not re-order the countries inside the data frame, the names of the files and the plots will not match. Try running all the code without re-ordering, you’ll see!

I hope you found this post useful. You can follow me on twitter for blog updates.

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
