R news and tutorials contributed by 750 R bloggers

Why I rarely use apply

Sat, 07/14/2018 - 02:00

(This article was first published on Florian Privé, and kindly contributed to R-bloggers)

In this short post, I talk about why I’m moving away from using the apply function.

With matrices

It’s okay to use apply with a dense matrix, although you can often use an equivalent that is faster.

N <- M <- 8000
X <- matrix(rnorm(N * M), N)
system.time(res1 <- apply(X, 2, mean))
##    user  system elapsed 
##    0.73    0.05    0.78
system.time(res2 <- colMeans(X))
##    user  system elapsed 
##    0.05    0.00    0.05
stopifnot(isTRUE(all.equal(res2, res1)))

“Yeah, there are colSums and colMeans, but what about computing standard deviations?”

There are lots of apply-like functions in package {matrixStats}.

system.time(res3 <- apply(X, 2, sd))
##    user  system elapsed 
##    0.96    0.01    0.97
system.time(res4 <- matrixStats::colSds(X))
##    user  system elapsed 
##     0.2     0.0     0.2
stopifnot(isTRUE(all.equal(res4, res3)))

With data frames

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

apply(head(iris), 2, identity)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species 
## 1 "5.1"        "3.5"       "1.4"        "0.2"       "setosa"
## 2 "4.9"        "3.0"       "1.4"        "0.2"       "setosa"
## 3 "4.7"        "3.2"       "1.3"        "0.2"       "setosa"
## 4 "4.6"        "3.1"       "1.5"        "0.2"       "setosa"
## 5 "5.0"        "3.6"       "1.4"        "0.2"       "setosa"
## 6 "5.4"        "3.9"       "1.7"        "0.4"       "setosa"

A DATA FRAME IS NOT A MATRIX (it’s a list).

The first thing that apply does is convert the object to a matrix, which consumes memory and, in the previous example, turns all the data into strings (because a matrix can hold only one type).

What can you use as a replacement of apply with a data frame?

  • If you want to operate on all columns, since a data frame is just a list, you can use sapply instead (or map* if you are a purrrist).

    sapply(iris, typeof)
    ## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
    ##     "double"     "double"     "double"     "double"    "integer"
  • If you want to operate on all rows, I recommend watching this webinar.

With sparse matrices

The memory problem is even worse when using apply with sparse matrices, which makes apply very slow for such data.

library(Matrix)
X.sp <- rsparsematrix(N, M, density = 0.01)
## X.sp is converted to a dense matrix when using `apply`
system.time(res5 <- apply(X.sp, 2, mean))
##    user  system elapsed 
##    0.78    0.46    1.25
system.time(res6 <- Matrix::colMeans(X.sp))
##    user  system elapsed 
##    0.01    0.00    0.02
stopifnot(isTRUE(all.equal(res6, res5)))

You could implement your own apply-like function for sparse matrices by viewing a sparse matrix as a data frame with three columns (i and j storing the positions of the non-zero elements, and x storing their values). Then you could use a group_by–summarize approach, as sketched below.
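For illustration, here is a minimal dplyr sketch of that idea (the dplyr code is mine, not from the original post, and it adds dplyr as an extra dependency); it sums the non-zero values per column and divides by the number of rows to obtain the column means:

```r
library(Matrix)
library(dplyr)

## view the sparse matrix as (i, j, x) triplets (j is 0-based)
X.tr <- as(X.sp, "dgTMatrix")
sums <- data.frame(j = X.tr@j, x = X.tr@x) %>%
  group_by(j) %>%
  summarize(s = sum(x))

## columns with no non-zero entry simply keep a sum of 0
res <- numeric(ncol(X.sp))
res[sums$j + 1] <- sums$s
res <- res / nrow(X.sp)
stopifnot(isTRUE(all.equal(res, Matrix::colMeans(X.sp))))
```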

For instance, for the previous example, you can do this in base R:

apply2_sp <- function(X, FUN) {
  res <- numeric(ncol(X))
  X2 <- as(X, "dgTMatrix")
  tmp <- tapply(X2@x, X2@j, FUN)
  res[as.integer(names(tmp)) + 1] <- tmp
  res
}
system.time(res7 <- apply2_sp(X.sp, sum) / nrow(X.sp))
##    user  system elapsed 
##    0.03    0.00    0.03
stopifnot(isTRUE(all.equal(res7, res5)))

Conclusion

Using apply with a dense matrix is fine, but try to avoid it if you have a data frame or a sparse matrix.


To leave a comment for the author, please follow the link and comment on their blog: Florian Privé.

Seasonal decomposition of short time series

Sat, 07/14/2018 - 02:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

Many users have tried to do a seasonal decomposition with a short time series, and hit the error “Series has less than two periods”.
The problem is that the usual methods of decomposition (e.g., decompose and stl) estimate seasonality using at least as many degrees of freedom as there are seasonal periods. So you need at least two observations per seasonal period to be able to distinguish seasonality from noise.
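As a minimal illustration (my own toy example, not from the original post; the exact error wording differs between decompose() and stl()), a monthly series shorter than two full years triggers the error:

```r
x <- ts(rnorm(16), frequency = 12)  # 16 monthly observations: less than two full years
try(decompose(x))                   # errors: the series covers fewer than 2 periods
try(stl(x, s.window = "periodic"))  # errors for the same reason
```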


To leave a comment for the author, please follow the link and comment on their blog: R on Rob J Hyndman.

From Data to Viz | Find the graphic you need

Fri, 07/13/2018 - 15:14

(This article was first published on Blog – The R Graph Gallery, and kindly contributed to R-bloggers)

I’m delighted to announce a new dataviz project called ‘Data to Viz’.

—> data-to-viz.com

 

What it is

From Data to Viz is a classification of chart types based on input data format. It comes in the form of a decision tree leading to a set of potentially appropriate visualizations to represent the dataset.

 

The decision trees are available in a poster that was presented at useR! in Brisbane.

Philosophy

The project is built on two underlying philosophies. First, that most data analysis can be summarized in about twenty different dataset formats. Second, that both data and context determine the appropriate chart.

Thus, our suggested method consists of identifying and trying all feasible chart types to find out which suits your data and idea best.

Once this set of graphics is identified, data-to-viz.com aims to guide you toward the best choice.

 

Content

Several sections are available on top of the decision tree:

  • portfolio – an overview of all chart possibilities. For each, an extensive description is given, showing variations, pros and cons, common pitfalls and more.
  • stories – for each input data format, a real-life example is analyzed to illustrate the different chart types applicable to it.
  • caveat gallery – a list of common dataviz pitfalls, with suggested workarounds.

 

Link with R

From Data to Viz aims to give general advice on data visualization and does not specifically target R users.

However, 100% of the charts are made using R, mostly with ggplot2 and the tidyverse, and the reproducible code snippets are always available. Most of the website is built using R Markdown, with a good number of hacks described here.

The website is tightly linked with the R graph gallery. Once you’ve identified the graphic that suits your needs, you will be redirected to the appropriate section of the gallery to get the R code in minutes.

 

 

Next step

From Data to Viz is still in beta, and a lot remains to be done. The caveat gallery is incomplete and some chart types are missing. A few inaccuracies may also be present in the decision tree. Last but not least, the English is far from perfect and unfortunately is unlikely to change; I apologize for that.

If you find any mistake or potential improvement, please file an issue on GitHub, contact me at Yan.holtz.data@gmail.com or on twitter, or leave a comment below. Any feedback is very welcome.

 


To leave a comment for the author, please follow the link and comment on their blog: Blog – The R Graph Gallery.

Stencila – an office suite for reproducible research

Fri, 07/13/2018 - 10:59

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Stencila launches the first version of its (open source) word processor and spreadsheet editor designed for researchers.

By Michael Aufreiter, Substance, and Aleksandra Pawlik and Nokome Bentley, Stencila

Stencila is an open source office suite designed for researchers. It allows the authoring of interactive, data-driven publications in visual interfaces, similar to those in conventional office suites, but is built from the ground up for reproducibility.

Stencila aims to make it easier for researchers with differing levels of computational skills to collaborate on the same research article. Researchers used to tools like Microsoft Word and Excel will find Stencila’s interfaces intuitive and familiar. And those who use tools such as Jupyter Notebook or R Markdown are still able to embed code for data analysis within their research articles. Once published, Stencila documents are self-contained, interactive and reusable, containing all the text, media, code and data needed to fully support the narrative of research discovery.


Source: https://stenci.la

The Stencila project aims to be part of the wider vision to enable the next generation of research article – all the way from authoring through to publication as a reproducible, self-contained webpage. A key limitation of the current research publishing process is that conventional document formats (e.g. Word, PDF and LaTeX) do not support the inclusion of reproducible research elements, nor do they produce content in the structured format used for science publishing and dissemination (XML). Stencila aims to remove the need for manual conversion of content from source documents to XML and web (HTML) publishing formats, whilst enabling the inclusion of source data and computational methods within the manuscript. We hope that establishing a digital-first, reproducible archive format for publications will facilitate research communication that is faster and more open, and which lowers the barrier for collaboration and reuse. The development of Stencila is driven by community needs and in coordination with the goals of the Reproducible Document Stack, an initiative started by eLife, Substance and Stencila.

A word processor for creating journal-ready scientific manuscripts

Stencila’s article editor builds on Texture, an open source editor built for visually editing JATS XML documents (a standard widely used by scientific journals). Supporting all elements of a standardised research article, the editor features semantic content-oriented editing that allows the user to focus on the research without worrying about layout information, which is normally stripped during the publishing process. While Texture implements all static elements (abstract, figures, references, citations and so on), Stencila extends Texture with code cells which enable computed, data-driven figures.

Spreadsheets for source data and analysis

In Stencila, datasets are an integral part of the publication. They live as individual spreadsheet documents holding structured data. This data can then be referenced from the research article to drive analysis and plots. As within Excel, cells can contain formulas and function calls to run computations directly in a spreadsheet. But not only can users enter simple expressions, they can also add and execute code in a variety of supported programming languages (at the moment R, Python, SQL and Javascript).

A walk-through of some of the features of Stencila, using this Stencila Article. Source: YouTube; video CC-BY Stencila.

Code evaluation in the browser and beyond

Stencila’s user interfaces build on modern web technology and run entirely in the browser – making them available on all major operating systems. The predefined functions available in Stencila use Javascript for execution so they can be run directly in the editor. For example, the plotly() function generates powerful, interactive visualizations solely using Plotly’s Javascript library.

Stencila can also connect to R, Python and SQL sessions, allowing more advanced data analysis and visualization capabilities. Stencila’s execution engine keeps a track of the dependency between code cells, enabling a reactive, spreadsheet-like programming experience both in Stencila Articles and Sheets.

An example of using R within a Stencila Sheet. Source: YouTube; video CC-BY Stencila.

Reproducible Document Archive (Dar)

Stencila stores projects in an open file archive format called Dar. A Dar is essentially a folder with a number of files encompassing the manuscript itself (usually one XML per document) and all associated media.

The Dar format is open source: inspect it and provide feedback at https://github.com/substance/dar

Dar uses existing standards when possible. For instance, articles are represented as JATS XML, the standard preferred by a number of major publishers. The Dar format is a separate effort from Stencila, and aims to establish a strict standard for representing self-contained reproducible publications, which can be submitted directly to publishers. Any other tool should be able to easily read and write such archives, either by supporting it directly or by implementing converters.

Interoperability with existing tools and workflows

Stencila is developed not to replace existing tools, but to complement them. Interoperability is at the heart of the project, with the goal of supporting seamless collaboration between users of Jupyter Notebooks, R Markdown and spreadsheet applications. We are working closely with the communities of existing open source tools to improve interoperability. For instance, we are working with the Jupyter team on tools to turn notebooks into journal submissions. We are also evaluating whether the Stencila editor could be used as another interface to edit Jupyter Notebooks or R Markdown files: we hope this could help researchers who use existing tools to collaborate with peers who are used to other office tools, such as Word and Excel, and thus encourage wider adoption of reproducible computational research practices.

State of development

Over the past two years, we’ve built Stencila from the ground up as a set of modular components that support community-driven open standards for publishing and computation. Stencila Desktop is our prototype of a ‘researcher’s office suite’, built by combining these components into an integrated application. During this beta phase of the project, we are working to address bugs and add missing features, and welcome your feedback and suggestions (see below).

One of our next priorities will be to develop a toolset for generating a web page from a reproducible article in the Dar format. Using progressive enhancement, the reader should be able to reproduce a scientific article right from the journal’s website in various forms, ranging from a traditional static representation of the manuscript and its figures to a fully interactive, executable publication.

We will continue working on Stencila’s various software components, such as the converter module and execution contexts for R and Python, towards improved integration and interoperability with other tools in the open science toolbox (e.g. Jupyter, RStudio and Binder).

Get involved

We’d love to get your input to help shape Stencila. Download Stencila Desktop and take it for a test drive. You could also try porting an existing manuscript over to Stencila using the Stencila command line tool. Give us your feedback and contribute ideas on our community forum or in our chat channel, or drop us an email at aleksandra@stenci.la or nokome@stenci.la.

Acknowledgments

Development of Stencila has been generously supported by the Alfred P. Sloan Foundation and eLife.

This post was originally published on eLife Labs.

eLife welcomes comments, questions and feedback. Please annotate publicly on the article or contact us at innovation [at] elifesciences [dot] org.


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

Is GINA really about to die?!?

Fri, 07/13/2018 - 10:00

(This article was first published on R – Nathan Chaney, and kindly contributed to R-bloggers)

Introduction

During a recent negotiation of an informed consent form for use in a clinical trial, the opposing lawyer and I skirmished over the applicability of the Genetic Information Nondiscrimination Act of 2008, commonly known as GINA. Specifically, the opposing lawyer thought that guidance issued by the U.S. Office for Human Research Protections in 2009 was now outdated, in part because enforcement efforts were erratic. The argument was primarily driven by policy, rather than data.

Being a data-driven guy, I wanted to see whether the data supported the argument advanced by the other lawyer. Fortunately, the U.S. Equal Employment Opportunity Commission (EEOC), which is responsible for administering GINA complaints, maintains statistics regarding GINA claims and resolutions. I’m not great at making sense of numbers in a table, so I thought this presented the perfect opportunity to rvest some data!

libraries <- c("tidyverse", "rvest", "magrittr")
lapply(libraries, require, character.only = TRUE)

Data Scraping

In standard rvest fashion, we’ll read a URL, extract the table containing the GINA enforcement statistics, and then do some data cleaning. Once we read the table and gather all of the annual results into year/value pairs, we get the following results:

url <- "https://www.eeoc.gov/eeoc/statistics/enforcement/genetic.cfm"
GINA.resolutions <- read_html(url) %>%
  html_nodes("table") %>%
  extract2(1) %>%
  html_table(trim = TRUE, fill = TRUE, header = TRUE)
names(GINA.resolutions)[1] <- "metric" # Top left table cell is blank, will throw errors
names(GINA.resolutions) <- gsub("FY (.+)", "\\1", names(GINA.resolutions)) # Remove FY from year so we can convert to numeric
GINA.resolutions <- GINA.resolutions %>%
  filter(! metric == "") %>%                      # Remove percentage rows
  filter(! metric == "Resolutions By Type") %>%   # Remove blank line
  gather(year, value, 2:9) %>%                    # short and wide data to tall and skinny
  mutate(
    year = as.integer(year),
    value = gsub("[\\$\\%]", "", value)
  ) %>%
  mutate(
    value = as.numeric(value)
  ) %>%
  as.tibble()
GINA.resolutions
## # A tibble: 88 x 3
##    metric                      year value
##
##  1 Receipts                    2010   201
##  2 Resolutions                 2010    56
##  3 Settlements                 2010     3
##  4 Withdrawals w/Benefits      2010     2
##  5 Administrative Closures     2010    11
##  6 No Reasonable Cause         2010    38
##  7 Reasonable Cause            2010     2
##  8 Successful Conciliations    2010     1
##  9 Unsuccessful Conciliations  2010     1
## 10 Merit Resolutions           2010     7
## # ... with 78 more rows

Claim Numbers over Time

Now that we have the data in a format we can use, we’ll look at the volume of claims and resolutions over time:

GINA.resolutions %>%
  filter(metric == "Receipts" | metric == "Resolutions") %>%
  ggplot(aes(year, value, color = metric)) +
  geom_line() +
  labs(
    title = "EEOC Enforcement of GINA Charges",
    subtitle = "Claims and Resolutions, FY 2010 - FY 2017",
    caption = paste0("Source: ", url),
    x = "", y = ""
  ) +
  scale_color_brewer("", palette = "Paired")

The number of GINA claims rose for the first few years, but then declined down to enactment-year levels. This could represent support for the opposing lawyer’s argument that enforcement is waning. However, it could just as likely be showing that the deterrent effect of the law has proven effective, and most firms subject to GINA are now aware of the law and have taken appropriate steps toward compliance. Given these competing interpretations, we’ll need to look at little deeper to see if we can derive trends from the data.

GINA Claim Resolutions

One of the arguments made by the opposing lawyer is that the Obama administration was pushing GINA enforcement, and that the Trump administration hates the law and won’t enforce it. We can look at the resolution types to test this hypothesis:

GINA.resolutions %>%
  filter(metric != "Receipts" & metric != "Resolutions") %>%
  ggplot(aes(year, value)) +
  geom_line() +
  facet_wrap(~ metric, scales = "free_y") +
  labs(
    title = "EEOC Enforcement of GINA Charges",
    subtitle = "Resolutions by Type, FY 2010 - FY 2017",
    caption = paste0("Source: ", url),
    x = "", y = ""
  )

In 2017, the first year of the Trump administration, administrative closures were down and resolutions on the merits were up, which contradicts the opposing lawyer’s argument. While findings of no reasonable cause were up about 10-15%, findings of reasonable cause were up 50%; if anything, this also contradicts the opposing lawyer’s argument. Monetary settlements appear to be relatively flat from 2013 – 2017, and in any event a million dollars isn’t a ton of money in light of the EEOC’s annual budget of about $376 million (note that the EEOC handles many other types of charges besides GINA).

The resolution type that jumped most markedly in 2017 was “unsuccessful conciliation.” A conciliation is where the EEOC “attempt[s] to achieve a just resolution of all violations found and to obtain agreement that the respondent will eliminate the unlawful employment practice and provide appropriate affirmative relief.” 29 C.F.R. § 1601.24. It’s unclear why this jump occurred from the summary statistics provided by the EEOC.

Finally, I thought it was useful to plot all the resolution types together to show relative numbers:

GINA.resolutions %>%
  filter(metric != "Receipts" & metric != "Resolutions" &
         metric != "Monetary Benefits (Millions)*") %>%
  ggplot(aes(year, value, color = metric)) +
  geom_line() +
  labs(
    title = "EEOC Enforcement of GINA Charges",
    subtitle = "Resolutions by Type, FY 2010 - FY 2017",
    caption = paste0("Source: ", url),
    x = "", y = ""
  ) +
  # scale_y_sqrt() +
  scale_color_brewer("Resolution Type", palette = "Paired")

From this perspective, it does look like the long-term trend has been for the EEOC to dismiss a majority of GINA-related charges as unfounded (i.e., no reasonable cause). However, for the cases that do have merit, it appears the EEOC has reversed an initial trend showing a preference toward administrative closure.

Conclusion

In all, I didn’t find the opposing lawyer’s argument particularly compelling in light of the data from President Trump’s first year in office. However, the first month of 2017 was President Obama’s last in office, and there was a flurry of activity by many regulatory agencies. It wouldn’t surprise me if EEOC also participated in a high volume of lame-duck activity, and a lot of activity in January 2017 could have skewed the annual results. Monthly statistics would be nice but didn’t appear to be readily available. The goal with any R project is for it to be repeatable with additional data, so it will be interesting to see what the data from FY2018 shows.

This wasn’t a particularly complicated coding project – in fact, this writeup took me longer to produce than writing the actual code and coming to conclusions about whether GINA is on its last leg or not. Despite that fact, I thought it was a good example of how data science can be used to inform solutions to simple as well as complex problems.


To leave a comment for the author, please follow the link and comment on their blog: R – Nathan Chaney.

Introducing the Kernelheaping Package II

Fri, 07/13/2018 - 10:00

(This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers)

In the first part of Introducing the Kernelheaping Package I showed how to compute and plot kernel density estimates on rounded or interval censored data using the Kernelheaping package. Now, let’s make a big leap forward to the 2-dimensional case. Interval censoring can be generalised to rectangles or even arbitrary shapes. That may include counties, zip codes, electoral districts or administrative districts. Standard area-level mapping methods such as choropleth maps suffer from very different area sizes or odd area shapes which can greatly distort the visual impression. The Kernelheaping package provides a way to convert these area-level data to a smooth point estimate. For the German capital city of Berlin, for example, there exists an open data initiative, where data on e.g. demographics is publicly available. We first load a dataset on the Berlin population, which can be downloaded from: https://www.statistik-berlin-brandenburg.de/opendata/EWR201512E_Matrix.csv

```r
library(dplyr)
library(fields)
library(ggplot2)
library(Kernelheaping)
library(maptools)
library(RColorBrewer)
library(rgdal)
gpclibPermit()
```

```r
data <- read.csv2("EWR201512E_Matrix.csv")
```

This dataset contains the numbers of inhabitants in certain age groups for each administrative district. Afterwards, we load a shapefile with these administrative districts, available from: https://www.statistik-berlin-brandenburg.de/opendata/RBS_OD_LOR_2015_12.zip

```r
berlin <- readOGR("RBS_OD_LOR_2015_12/RBS_OD_LOR_2015_12.shp")
```

```r
berlin <- spTransform(berlin, CRS("+proj=longlat +datum=WGS84"))
```

Now, we form our input data set, which contains the area/polygon centers and the variable of interest, whose density should be calculated. In this case we'd like to calculate the spatial density of people between 65 and 80 years of age:

```r
dataIn <- lapply(berlin@polygons, function(x) x@labpt) %>%
  do.call(rbind, .) %>%
  cbind(data$E_E65U80)
```

In the next step we calculate the bivariate kernel density with the "dshapebivr" function (this may take some minutes) using the prepared data and the shape file:

```r
est <- dshapebivr(data = dataIn, burnin = 5, samples = 10, adaptive = FALSE,
                  shapefile = berlin, gridsize = 325, boundary = TRUE)
```

To plot the map with "ggplot2", we need to perform some additional data preparation steps:

```r
berlin@data$id <- as.character(berlin@data$PLR)
berlin@data$E_E65U80 <- data$E_E65U80
berlinPoints <- fortify(berlin, region = "id")
berlinDf <- left_join(berlinPoints, berlin@data, by = "id")
kData <- data.frame(expand.grid(long = est$Mestimates$eval.points[[1]],
                                lat = est$Mestimates$eval.points[[2]]),
                    Density = as.vector(est$Mestimates$estimate)) %>%
  filter(Density > 0)
```

Now, we are able to plot the density together with the administrative districts:

```r
ggplot(kData) +
  geom_raster(aes(long, lat, fill = Density)) +
  ggtitle("Bivariate density of Inhabitants between 65 and 80 years") +
  scale_fill_gradientn(colours = c("#FFFFFF", "#5c87c2", "#19224e")) +
  geom_path(color = "#000000", data = berlinDf, aes(long, lat, group = group)) +
  coord_quickmap()
```

This map gives a much better overall impression of the distribution of older people than a simple choropleth map:

```r
ggplot(berlinDf) +
  geom_polygon(aes(x = long, y = lat, fill = E_E65U80, group = id)) +
  ggtitle("Number of Inhabitants between 65 and 80 years by district") +
  scale_fill_gradientn(colours = c("#FFFFFF", "#5c87c2", "#19224e"), "n") +
  geom_path(color = "#000000", data = berlinDf, aes(long, lat, group = group)) +
  coord_quickmap()
```

Often, as is the case with Berlin, we may have large uninhabited areas such as woods or lakes. Furthermore, we may want to compute the proportion of older people relative to the overall population in a spatial setting. The third part of this series shows how you can compute boundary-corrected and smooth proportion estimates with the Kernelheaping package.

Further parts of the article series Introducing the Kernelheaping Package:


To leave a comment for the author, please follow the link and comment on their blog: INWT-Blog-RBloggers.

Using Machine Learning for Causal Inference

Fri, 07/13/2018 - 08:00

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

Machine Learning (ML) is still an underdog in the field of economics, although it has been gaining more and more recognition in recent years. One reason for this underdog status is that in economics and other social sciences one is not only interested in prediction but also in making causal inferences, so many "off-the-shelf" ML algorithms solve a fundamentally different problem. We here at STATWORX also face a variety of such problems, e.g. dynamic pricing optimization.

"Prediction by itself is only occasionally sufficient. The post office is happy with any method that predicts correct addresses from hand-written scrawls…[But] most statistical surveys have the identification of causal factors as their ultimate goal." – Bradley Efron

Introduction

However, the literature on combining ML and causal inference is growing by the day. One common problem in causal inference is the estimation of heterogeneous treatment effects. So, we will take a look at three interesting and different approaches to it and focus on a very recent paper by Athey et al., which is forthcoming in "The Annals of Statistics"1.

Model-based Recursive Partitioning

One of the earlier papers about causal trees is by Zeileis et al., 20082. They describe an algorithm for Model-based Recursive Partitioning (MOB), which applies recursive partitioning to more complex models. They first fit a parametric model to the data set using maximum likelihood, then test for parameter instability over a set of predefined variables, and finally split the model on the variable showing the highest parameter instability. These steps are repeated in each of the daughter nodes until a stopping criterion is reached. However, they do not provide statistical properties for the MOB, and the partitions are still quite large.

Bayesian Additive Regression Tree

Another paper uses Bayesian Additive Regression Trees (BART) for the estimation of heterogeneous treatment effects3. One advantage of this approach is that BART can detect and handle interactions and non-linearity in the response surface. It uses a sum-of-trees model: first a weak-learner tree is grown, the residuals are calculated, and the next tree is fitted to these residuals. Similar to boosting algorithms, BART needs to avoid overfitting. This is achieved with a regularization prior, which restricts overfitting and limits the contribution of each tree to the final result.

Generalized Random Forest

However, this and the next blog post will mainly focus on the Generalized Random Forest (GRF) by Athey et al., who have already been exploring the possibilities of ML in economics. It is a method for non-parametric statistical estimation that builds on the basic ideas of the random forest: it keeps the recursive partitioning, subsampling and random split selection. Nevertheless, the final outcome is not estimated via simple averaging over the trees. Instead, the forest is used to estimate an adaptive weighting function: we grow a set of trees, and each observation is weighted by how often it falls into the same leaf as the target observation. Those weights are then used to solve a "local GMM" model.

Another important piece of the GRF is the split selection algorithm, which emphasizes maximizing heterogeneity. With this framework, a wide variety of applications is possible, such as quantile regression but also the estimation of heterogeneous treatment effects, so the split selection must be suitable for many different purposes. As in Breiman's random forest, splits are selected greedily. However, in the case of general moment estimation, we don't have a direct loss criterion to minimize. Instead, we want to maximize a criterion ∆ that favors splits which increase the heterogeneity of our in-sample estimates. Maximizing ∆ directly, on the other hand, would be computationally costly, so Athey et al. use a gradient-based approximation. This results in computational performance similar to standard CART approaches.

Comparing the regression forest of GRF to standard random forest

Athey et al. claim in their paper that, in the special case of a regression forest, the GRF gives the same results as the standard random forest by Breiman (2001). One estimation method already implemented in the grf package4 is a regression forest, so I will compare its results with the random forest implementations of the randomForest package as well as the ranger package. For tuning purposes, I use a random search with 50 iterations for the randomForest and ranger packages, and for grf the built-in tune_regression_forest() function. The algorithms are benchmarked on three data sets, which have already been used in another blog post, using RMSE to compare the results. For easy handling, I wrapped regression_forest() in the caret framework; the code can be found on my GitHub.

Data Set  Metric   grf   ranger  randomForest
air       RMSE    0.25     0.24          0.24
bike      RMSE    2.90     2.41          2.67
gas       RMSE    36.0     32.6          34.4

The GRF performs a little worse than the other implementations. However, this could also be due to the parameter tuning, because there are more parameters to tune. According to their GitHub, the authors are planning to improve the tune_regression_forest() function.
One advantage of the GRF is that it produces unbiased confidence intervals for each estimation point. To do so, it performs honest tree splitting, which was first described in their paper about causal trees5. With honest tree splitting, one sample is used to make the splits and another, distinct sample is used to estimate the coefficients.
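As a rough sketch of what this looks like in practice (my own toy example, not from the post; argument names follow the grf package as I recall them, so double-check against the current documentation):

```r
library(grf)

## simulate a simple regression problem
n <- 2000; p <- 10
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] + 0.5 * X[, 2]^2 + rnorm(n)

## honest splitting is the default in regression_forest()
rf <- regression_forest(X, Y, num.trees = 2000)

## out-of-bag predictions with variance estimates for pointwise confidence intervals
pred <- predict(rf, estimate.variance = TRUE)
ci_lower <- pred$predictions - 1.96 * sqrt(pred$variance.estimates)
ci_upper <- pred$predictions + 1.96 * sqrt(pred$variance.estimates)
```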

However, standard regression is not the exciting part of the Generalized Random Forest. Therefore, I will take a look at how the GRF performs in estimating heterogeneous treatment effects with simulated data and compare it to the estimation results of the MOB and the BART in my next blog post.

References
  1. Athey, Tibshirani, Wager. Forthcoming. "Generalized Random Forests"
  2. Zeileis, Hothorn, Hornik. 2008. "Model-based Recursive Partitioning"
  3. Hill. 2011. "Bayesian Nonparametric Modeling for Causal Inference"
  4. https://github.com/swager/grf
  5. Athey and Imbens. 2016. "Recursive partitioning for heterogeneous causal effects."
About the author

Markus Berroth

Markus is a member of our Data Science team and a Python expert. In his free time, he likes to go on weekend city trips.

The post Using Machine Learning for Causal Inference first appeared on STATWORX.


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX.

Announcing the R Markdown Book

Fri, 07/13/2018 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

It is exciting for us to see the amazing growth of the R Markdown ecosystem over the four years since the creation of R Markdown in 2014. Now you can author many types of documents, and build a wide range of applications based on R Markdown. As an effort to unite and improve the documentation of the R Markdown base package (rmarkdown) and several other extensions (such as bookdown, blogdown, pkgdown, flexdashboard, tufte, xaringan, rticles, and learnr) in one place, we authored a book titled “R Markdown: The Definitive Guide”, which is to be published by Chapman & Hall/CRC in about two weeks.

You can pre-order a copy now if you like. Our publisher is generous enough to allow us to provide a complete online version of this book at https://bookdown.org/yihui/rmarkdown/, which you can always read for free. The full source of this book is also freely and publicly available in the Github repo https://github.com/rstudio/rmarkdown-book.

Please feel free to let us know if you have any questions or suggestions regarding this book. You are always welcome to send Github pull requests to help us improve the book.1 We hope you will find this book useful.

  1. When you are reading the online version of the book, you may click the Edit button on the toolbar to edit the source file of a page, and follow the guide on Github to create a pull request.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

How to Aggregate Data in R

Fri, 07/13/2018 - 01:30

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

The process involves two stages. First, collate the individual cases of raw data together with a grouping variable. Second, perform whichever calculation you want on each group of cases. These two stages are wrapped into a single function.

To perform aggregation, we need to specify three things in the code:

  • The data that we want to aggregate
  • The variable to group by within the data
  • The calculation to apply to the groups (what you want to find out)
Example data

The raw data shown below consists of one row per case. Each case is an employee at a restaurant.

Load the example data by running the following R code:

library(flipAPI)
data = DownloadXLSX("https://wiki.q-researchsoftware.com/images/1/1b/Aggregation_data.xlsx",
                    want.row.names = FALSE, want.data.frame = TRUE)

Perform aggregation with the following R code.

agg = aggregate(data, by = list(data$Role), FUN = mean)

This produces a table of the average salary and age by role, as below.

The aggregate function

The first argument to the function is usually a data.frame.

The by argument is a list of variables to group by. This must be a list even if there is only one variable, as in the example.

The FUN argument is the function which is applied to all columns (i.e., variables) in the grouped data. Because we cannot calculate the average of categorical variables such as Name and Shift, they result in empty columns, which I have removed for clarity.
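One way around those empty columns is to aggregate only the numeric columns. A hedged sketch follows: the exact column names in the example workbook, assumed here to be Salary and Age, may differ.

```r
## keep only the numeric columns before aggregating; column names are assumed
agg.numeric = aggregate(data[, c("Salary", "Age")], by = list(Role = data$Role), FUN = mean)
```

Naming the element of the by list (Role) also gives the grouping column a readable name in the output.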

Other aggregation functions

Any function that can be applied to a numeric variable can be used within aggregate. Maximum, minimum, count, standard deviation and sum are all popular.

For more specific purposes, it is also possible to write your own function in R and refer to that within aggregate. I’ve demonstrated this below where the second largest value of each group is returned, or the largest if the group has only one case. Note also that the groups are formed by Role and by Shift together.

second = function(x) {
  if (length(x) == 1) return(x)
  return(sort(x, decreasing = TRUE)[2])
}
agg = aggregate(data, by = list(data$Role, data$Shift), FUN = second)

Additional features

The aggregate function has a few more features to be aware of:

  • Grouping variable(s) and variables to be aggregated can be specified with R’s formula notation.
  • Setting drop = TRUE means that any groups with zero count are removed.
  • na.action controls the treatment of missing values within the data (see the sketch after this list).
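For instance, the formula interface and na.action can be combined as follows (a small sketch using the built-in mtcars data rather than the article's restaurant data):

```r
## average mpg and hp by number of cylinders, using formula notation
aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)

## rows with missing values in the model variables are dropped by default;
## na.action makes that choice explicit
aggregate(mpg ~ cyl, data = mtcars, FUN = mean, na.action = na.omit)
```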

TRY IT OUT
The analysis in this post was performed in Displayr using R. You can repeat or amend this analysis for yourself in Displayr.

 


To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.

New open data sets from Microsoft Research

Fri, 07/13/2018 - 00:33

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Microsoft has released a number of data sets produced by Microsoft Research and made them available for download at Microsoft Research Open Data.  

The Datasets in Microsoft Research Open Data are categorized by their primary research area, such as Physics, Social Science, Environmental Science, and Information Science. Many of the data sets have not been previously available to the public, and many are large and useful for research in AI and Machine Learning techniques. Many of the datasets also include links to associated papers from Microsoft Research. For example, the 10Gb DESM Word Embeddings dataset provides the IN and the OUT word2vec embeddings for 2.7M words trained on a Bing query corpus of 600M+ queries.

Other data sets of note include:

  • A collection of 38M tweets related to the 2012 US election
  • 3-D capture data from individuals performing a variety of hand gestures
  • Infer.NET, a framework for running Bayesian inference in graphical models
  • Images for 1 million celebrities, and associated tags
  • MS MARCO, a new large-scale dataset for reading comprehension and question answering

Most data sets are provided as plain text files, suitable for importing into Python, R, or other analysis tools. In addition to downloading the data, you can also deploy the datasets for analysis into Microsoft Azure: see the FAQ for details on this and other ways the datasets can be used.

As of this writing, the archive includes 51 datasets. You can explore and download the Microsoft Research datasets at the link below.

Microsoft: Microsoft Research Open Data


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Coloured output in the R console

Thu, 07/12/2018 - 21:29

(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

Just a little fun today… the R console isn’t the most interesting of things… text is typically either black or red (assuming default settings in RStudio). There’s a package though called crayon which allows one to change the style of text in terms of colour, background and some font-type settings. It could be an interesting way to spice up summaries (I’m not recommending it, but it’s a possibility. As far as packages are concerned, it’s just another dependency…)…

devtools::install_github("r-lib/crayon")
library(crayon)
cat(green(
  'I am a green line ' %+%
  blue$underline$bold('with a blue substring') %+%
  yellow$italic(' that becomes yellow and italicised!\n')
))


To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD.

My upcoming conference talks & workshops: M-cubed, ML Summit & data2day

Thu, 07/12/2018 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

I’ll be giving talks and workshops at the following three upcoming conferences; hope to meet some of you there!

Traditional machine learning workflows focus heavily on model training and optimization; the best model is usually chosen via performance measures like accuracy or error and we tend to assume that a model is good enough for deployment if it passes certain thresholds of these performance criteria. Why a model makes the predictions it makes, however, is generally neglected. But being able to understand and interpret such models can be immensely important for improving model quality, increasing trust and transparency and for reducing bias. Because complex machine learning models are essentially black boxes and too complicated to understand, we need to use approximations.

Required audience experience: Basic knowledge of machine learning

Objective of the talk: Listeners will get an overview of why understanding machine learning models is important, how it can help us improve models and help gain trust in their decisions. I will explain in detail how one popular approach to explaining complex models – LIME – works and show an example analysis.
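For readers who want to experiment before the talk, a rough sketch of applying LIME to a caret model in R might look like the following (my own toy example, not material from the talk; the function names are from the lime package, so check its documentation for current arguments):

```r
library(caret)
library(lime)

## train a black-box classifier on iris (illustrative only)
model <- train(Species ~ ., data = iris, method = "rf")

## build an explainer from the training data and the model
explainer <- lime(iris[, -5], model)

## explain the predictions for two observations using three features each
explanation <- explain(iris[1:2, -5], explainer, n_labels = 1, n_features = 3)
plot_features(explanation)
```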

M-cubed banner

Bildklassifikation leicht gemacht – mit Keras und TensorFlow (Image classification made easy – with Keras and TensorFlow)

Tuesday, 2 October 2018 | 10:00 – 13:00. For a long time, the automatic recognition of objects, people and scenes in images by computers was considered impossible. The complexity simply seemed too great to teach to an algorithm programmatically. But neural networks have changed this dramatically! Image recognition is now a widespread application of machine learning, and so-called “Convolutional Neural Networks” (ConvNets) are often used for it. In this workshop I will show how easy it is to build such a neural network yourself, using Keras and TensorFlow. We will first train a complete network from scratch: from reading in the images, through defining the network, to evaluating it on test images. We will then look at how transfer learning and pre-trained networks can deliver quick results even with only a few images of your own. The last part of the workshop covers how we can better understand these image-classification models, for example by visualizing the nodes in intermediate layers; this lets us find patterns and the image regions that matter for the classification, and trace how the model arrives at its decision. Installation notes: we will work with Python 3 in Google Colaboratory.

ML Summit banner

Rapidly growing data volumes have broadened the role of data scientists. Instead of building machine-learning models for one-off analyses, they increasingly work in concrete development projects in which prototypes are turned into production applications. Keras is a high-level interface that enables fast, simple and flexible prototyping of neural networks with TensorFlow. Together with Luigi, arbitrarily complex data-processing workflows can be built in Python. As a result, even non-developers can easily implement the end-to-end workflow of taking a Keras/TensorFlow model to production readiness.

data2day banner


To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

phylogram: dendrograms for evolutionary analysis

Thu, 07/12/2018 - 02:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Evolutionary biologists are increasingly using R for building,
editing and visualizing phylogenetic trees.
The reproducible code-based workflow and comprehensive array of tools
available in packages such as ape,
phangorn and
phytools make R an ideal platform for
phylogenetic analysis.
Yet the many different tree formats are not well integrated,
as pointed out in a recent
post.

The standard data structure for phylogenies in R is the “phylo”
object, a memory efficient, matrix-based tree representation.
However, non-biologists have tended to use a tree structure
called the “dendrogram”, which is a deeply nested list with
node properties defined by various attributes stored at each level.
While certainly not as memory efficient as the matrix-based format,
dendrograms are versatile and intuitive to manipulate, and hence
a large number of analytical and visualization functions exist
for this object type. A good example is the
dendextend package,
which features an impressive range of options for editing dendrograms
and plotting publication-quality trees.

To better integrate the phylo and dendrogram object types,
and hence increase the options available for both camps,
we developed the phylogram
package, which is now a part of the rOpenSci
project.
This small package features a handful of functions for tree conversion,
importing and exporting trees as parenthetic text, and manipulating
dendrograms for phylogenetic applications.
The phylogram package draws heavily on ape,
but currently has no other non-standard dependencies.

Installation

To download phylogram from CRAN and load the package, run

install.packages("phylogram")
library(phylogram)

Alternatively, to download the latest development version from GitHub,
first ensure that the devtools,
kmer, and
dendextend
packages are installed,
then run:

devtools::install_github("ropensci/phylogram", build_vignettes = TRUE)
library(phylogram)

Tree import/export

A wide variety of tree formats can be parsed as phylo objects using either the
well-optimized ape::read.tree function
(for Newick
strings),
or the suite of specialized functions in the versatile
treeio package.
To convert a phylo object to a dendrogram, the phylogram package includes
the function as.dendrogram, which retains node height attributes and can handle
non-ultrametric trees.

For single-line parsing of dendrograms from Newick text,
the read.dendrogram function wraps ape::read.tree
and converts the resulting phylo class object to a dendrogram using as.dendrogram.

Similarly, the functions write.dendrogram and as.phylo are used to
export dendrogram objects to parenthetic text and phylo objects, respectively.
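As a quick sketch of that round trip (my own minimal example; see the package documentation for the exact file and return-value behaviour of write.dendrogram):

```r
library(phylogram)

## parse a small Newick tree directly into a dendrogram
x <- read.dendrogram(text = "((A:1,B:1):1,C:2);")

## back to a "phylo" object and to parenthetic (Newick) text
y <- as.phylo(x)
newick <- write.dendrogram(x, edges = TRUE)
```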

Tree editing

The phylogram package includes some new functions for manipulating
trees in dendrogram format.
Leaf nodes and internal branching nodes can be removed
using the function prune, which identifies and
recursively deletes nodes based on pattern
matching of “label” attributes.
This is slower than ape::drop.tip, but offers
the benefits of versatile string matching using regular expressions,
and the ability to remove inner nodes (and by extension all subnodes)
that feature matching “label” attributes.
To aid visualization, the function ladder rearranges
the tree, sorting nodes by the number of members
(analogous to ape::ladderize).

For more controlled subsetting or when creating trees from scratch
(e.g. from a standard nested list), the function remidpoint
recursively corrects all “midpoint”, “members” and “leaf” attributes.
Node heights can then be manipulated using either reposition, which
scales the heights of all nodes in a tree by a given constant, or
as.cladogram, which resets the “height” attributes of all terminal
leaf nodes to zero and progressively resets the heights of the inner nodes
in single incremental units.

As an example, a simple three-leaf dendrogram can be created from
a nested list as follows:

x <- list(1, list(2, 3))
## set class, midpoint, members and leaf attributes for each node
x <- remidpoint(x)
## set height attributes for each node
x <- as.cladogram(x)

A nice feature of the dendrogram object type is that tree
editing operations can be carried out recursively
using fast inbuilt functions in the “apply” family such as dendrapply
and lapply.

For example, to label each leaf node of the tree alphabetically we can
create a simple labeling function and apply it to the tree nodes recursively using
dendrapply.

set_label <- function(node){
  if(is.leaf(node)) attr(node, "label") <- LETTERS[node]
  return(node)
}
x <- dendrapply(x, set_label)
plot(x, horiz = TRUE)

Applications

One application motivating bi-directional conversion between phylo and
dendrogram objects involves creating publication-quality ‘tanglegrams’ using
the dendextend package.
For example, to see how well the fast, alignment-free k-mer distance
from the kmer package
performs in comparison to the standard Kimura 1980 distance measure,
we can create neighbor-joining trees using each method and plot them side by side
to check for incongruent nodes.

## load woodmouse data and remove columns with ambiguities
data(woodmouse, package = "ape")
woodmouse <- woodmouse[, apply(woodmouse, 2, function(v) !any(v == 0xf0))]
## compute Kimura 1980 pairwise distance matrix
dist1 <- ape::dist.dna(woodmouse, model = "K80")
## deconstruct alignment (not strictly necessary)
woodmouse <- as.list(as.data.frame(unclass(t(woodmouse))))
## compute kmer distance matrix
dist2 <- kmer::kdistance(woodmouse, k = 7)
## build and ladderize neighbor-joining trees
phy1 <- ape::nj(dist1)
phy2 <- ape::nj(dist2)
phy1 <- ape::ladderize(phy1)
phy2 <- ape::ladderize(phy2)
## convert phylo objects to dendrograms
dnd1 <- as.dendrogram(phy1)
dnd2 <- as.dendrogram(phy2)
## plot the tanglegram
dndlist <- dendextend::dendlist(dnd1, dnd2)
dendextend::tanglegram(dndlist, fast = TRUE, margin_inner = 5)

In this case, the trees are congruent and branch lengths are similar.
However, if we reduce the k-mer size from 7 to 6,
the accuracy of the tree reconstruction is affected, as shown by the
incongruence between the original K80 tree (left) and the tree derived
from the 6-mer distance matrix (right):

## compute kmer distance matrix
dist3 <- kmer::kdistance(woodmouse, k = 6)
phy3 <- ape::nj(dist3)
phy3 <- ape::ladderize(phy3)
dnd3 <- as.dendrogram(phy3)
dndlist <- dendextend::dendlist(dnd1, dnd3)
dendextend::tanglegram(dndlist, fast = TRUE, margin_inner = 5)

Hopefully users will find the package useful for a range of other applications.
Bug reports and other suggestions are welcomed, and can be directed to the
GitHub issues page
or the phylogram google group.
Thanks to Will Cornwell and
Ben J. Ward
for reviewing the code and suggesting improvements,
and to Scott Chamberlain
for handling the rOpenSci
onboarding process.

The phylogram package is available for download from
GitHub and
CRAN,
and a summary of the package is published in the
Journal of Open Source Software.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

Four Ways to Write Better Stan Code

Thu, 07/12/2018 - 01:00

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

1. Improve sampler efficiency by picking the right model

We need to address how we specify our models before even discussing computationally optimized code. Make sure that your model is parameterized in such a way that Stan can easily sample it using its algorithms – the No-U-Turn Sampler (NUTS) or Hamiltonian Monte Carlo (HMC). Simply selecting the right parameterization can make a big difference in terms of efficiency. Due to differences in implementations and algorithms, an efficient parameterization in Stan is not necessarily the one that worked best in other software we’ve tried.

A poorly specified model will require more samples to reach convergence and adequately explore the posterior distribution, or may not converge at all. For some models, reparameterization can be an effective means of improving sampling efficiency by replacing a complex distribution that Stan has difficulty sampling from with one from which it can draw more easily. One example discussed in Section 28.6 of the Stan manual involves reparameterizing the Cauchy distribution, which is a challenge for Stan to sample from because of its heavy tails. The difficulties can be fixed by instead sampling a uniform random variable and applying the probability integral transform.
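
To see why the transform works, here is the idea sketched in plain R (just an illustration of the mathematics, not Stan code): a Uniform(0, 1) draw pushed through the inverse CDF of the standard Cauchy yields a Cauchy draw.

u <- runif(1e5)             # uniform draws
x <- tan(pi * (u - 0.5))    # inverse CDF of the standard Cauchy
y <- rcauchy(1e5)           # direct Cauchy draws for comparison
qqplot(x, y)                # quantiles should fall roughly on a straight line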

2. Matrices or arrays, pick carefully!

It can also be confusing whether to use matrices or arrays when writing your code. There are actually four different ways to specify a two-dimensional collection of real numbers! But which one should you pick? This largely depends on the operations you need to perform. If you need to do matrix computations, you should be using a matrix. However, if you frequently need to index into the rows of the matrix it is more efficient to use arrays. In this situation, it will save you a headache to declare an array of type row_vector than to work with matrices.

Matrices and arrays should also be traversed in different orders. Loops over arrays should have the last index varying fastest, whereas the opposite is true for matrices. Additionally, traversing matrices is slightly more efficient. If, for example, your code involves an I x J array of matrices, each of dimension R x C, then the most efficient way to write a loop that traverses every element is:

matrix[R,C] a[I,J];
for (i in 1:I)
  for (j in 1:J)
    for (c in 1:C)
      for (r in 1:R)
        a[i,j,r,c] = ……

3. Let built-in functions and vectorization save you time

Stan has a number of optimized built-in functions for common computations such as dot products and special matrix multiplications. You should use these whenever possible to save yourself from having to write your own code to perform the calculations. There are also functions that will improve the speed by vectorizing code. For example, this loop

matrix[R,C] m;
for (r in 1:R)
  m[r] ~ normal(0, 1);

should be replaced with the faster, vectorized code:

matrix[R,C] m;
to_vector(m) ~ normal(0,1);

4. Customize with compiler optimizations

Because Stan relies on compiled C++ code, it may be possible for advanced users to further optimize by changing compiler flags in R’s version of a Makefile, known as a Makevars file. It is recommended to do this in a user-specific file located at ~/.R/Makevars. Be careful though – using overly aggressive compiler options can result in code that is not portable across machines and architectures. The trade-off, however, is code that is as fast as possible on your own computer. The RStan installation guide recommends adding CXXFLAGS = -O3 to the Makevars file for the highest level of overall optimization, but be warned – this can result in increased memory usage and larger binary file sizes!
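
If you prefer to set this up from R, one possible approach (a sketch only; back up any existing Makevars first and adjust the flags to your compiler) is:

makevars <- path.expand("~/.R/Makevars")
dir.create(dirname(makevars), showWarnings = FALSE, recursive = TRUE)
cat("CXXFLAGS = -O3\n", file = makevars, append = TRUE)  # appends the flag; edit the file by hand to remove it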

We hope these four tips help you improve the efficiency of your coded models! Check out more tips and tricks here. 

To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.

John Mount speaking on rquery and rqdatatable

Wed, 07/11/2018 - 20:04

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

rquery and rqdatatable are new R packages for data wrangling, either at scale (in databases, or big data systems such as Apache Spark) or in-memory. They speed up both execution (through optimizations) and development (through a good mental model and up-front error checking) for data wrangling tasks.




Win-Vector LLC’s John Mount will be speaking on the rquery and rqdatatable packages at The East Bay R Language Beginners Group on Tuesday, August 7, 2018 (Oakland, CA).

One may ask: why share new packages with an “R Language Beginners Group”? (Though I prefer the term “part-time R user”.) Because such packages can give such users easy, safe, consistent, and performant access to powerful data systems. rquery establishes a notation and links to external data providers such as PostgreSQL, Amazon Redshift, and Apache Spark. rqdatatable supplies an additional in-memory implementation of the rquery notation using the powerful data.table package.

rquery and rqdatatable can, if given a fair chance, be significant R tools. Users are that chance.

If you are around please come see the talk.

Some earlier articles on rquery and rqdatatable can be found here and here.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

New Course: Python for R Users

Wed, 07/11/2018 - 16:11

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Here is the course link.

Course Description

Python and R have seen immense growth in popularity in the "Machine Learning Age". They both are high-level languages that are easy to learn and write. The language you use will depend on your background and field of study and work. R is a language made by and for statisticians, whereas Python is a more general purpose programming language. Regardless of the background, there will be times when a particular algorithm is implemented in one language and not the other, a feature is better documented, or simply, the tutorial you found online uses Python instead of R. In either case, this would require the R user to work in Python to get his/her work done, or try to understand how something is implemented in Python for it to be translated into R. This course helps you cross the R-Python language barrier.

Chapter 1: The Basics (Free)

Learn about some of the most important data types (integers, floats, strings, and booleans) and data structures (lists, dictionaries, numpy arrays, and pandas DataFrames) in Python and how they compare to the ones in R.

Chapter 2: Control flow, Loops, and Functions

This chapter covers control flow statements (if-else if-else), for loops and shows you how to write your own functions in Python!

Chapter 3: Pandas

In this chapter you will learn more about one of the most important Python libraries, Pandas. In addition to DataFrames, pandas provides several data manipulation functions and methods.

Chapter 4: Plotting

You will learn about the rich ecosystem of visualization libraries in Python. This chapter covers matplotlib, the core visualization library in Python along with the pandas and seaborn libraries.

Chapter 5: Capstone

As a final capstone, you will apply your Python skills on the NYC Flights 2013 dataset.

Prerequisites

Writing Functions in R

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming.

[ggplot2] Welcome viridis !

Wed, 07/11/2018 - 10:52

(This article was first published on (en) The R Task Force, and kindly contributed to R-bloggers)

Let’s welcome the viridis palette into the new version of {ggplot2}!

Viri-what ?

viridis is one of the favorite color palettes of one of the members of the team (guess who).

The viridis palette was first developed for the python package matplotlib, and has been implemented in R since.

The strengths of this palette are that:

  • plots are beautiful (which is a good enough reason to use it)
  • colors are perceptually uniform, even when printed in black and white (yes, people still print stuff in 2018…)
  • colors remain distinguishable under the most common forms of color blindness (an important feature)

This palette is now the default color scheme for colormaps in the matplotlib Python package. It had been available as a color scale in R with the {viridisLite} package, then as a ggplot2 scale in {viridis}, and is now fully integrated into {ggplot2} with the latest release.

Why viridis? The story behind

As said in A Better Default Colormap for Matplotlib: “colormaps are interfaces between your data and your brain” — which means that the choice of colors can have a significant effect on the way you make decisions. We won’t go into the details of color theory (there is a lot of literature on that topic, and we are not best placed to speak in depth about it), but it’s pretty obvious that colors influence the way you perceive things.

For example, in western Europe (and I guess in many countries around the globe), we are culturally conditioned to think of a red sign as an alert. Green, on the other hand, tends to indicate something safe. If today I had to design a color scheme for error / warning / success, I would undoubtedly go for a red / orange / green solution. I’m not the only one, and the first example that comes to my mind is the {shinyalert} package, which has this exact model of colors. This seems simple (and it is), and there are a lot of articles and books about this paradigm applied to marketing and selling (I’ll let you Google-scholar that). For example, here is an “emotion color wheel” called Plutchik’s wheel of emotions, made by the psychologist of the same name:

NB: we do not share / endorse this simplified theory of color, it’s just used there as an example  

Let’s take a more serious example, borrowed from the talk I linked just before. In an article (that sounds like a lot of fun) called “Evaluation of artery visualizations for heart disease diagnosis” (link), the authors showed how a color palette can play a critical role in diagnosing the presence of a heart disease. I’ll let you dig into the details if you want to, but long story short, the color palette of the software used to visualise the results has an effect on the capacity of the professional to read and detect a specific condition.

Why are we talking about that? Because as we want R to be used in production for visualising data, it’s important to keep in mind that the color scheme we choose to use in the end product is not without impact: be it simply on readability or on the emotions conveyed by the graph.

Here comes a new challenger

The standard color palette in a lot of tools is called the Jet palette. You might be familiar with it, as it is implemented in a lot of software. This color scheme appears to many as a good choice, as we tend to think that this palette makes it easy to distinguish colors (spoiler: it doesn’t).

It looks like this:

library(matlab)
## Attaching package: 'matlab'
## The following object is masked from 'package:stats':
##     reshape
## The following objects are masked from 'package:utils':
##     find, fix
## The following object is masked from 'package:base':
##     sum

This example is taken from the viridis vignette, put into a function

with_palette <- function(palette) {
  x <- y <- seq(-8 * pi, 8 * pi, len = 40)
  r <- sqrt(outer(x^2, y^2, "+"))
  filled.contour(cos(r^2) * exp(-r / (2 * pi)),
                 axes = FALSE,
                 color.palette = palette,
                 asp = 1)
}
with_palette(jet.colors)

Let’s just for a second compare to the same data with viridis:

library(viridis)
## Loading required package: viridisLite
with_palette(viridis)

If it seems obvious to you that the viridis palette is easier to read… it’s because it is.

How to implement a new color palette

With the viridis colormap, the purpose was to have something that is:

  • colorful
  • pretty
  • sequential
  • perceptually uniform
  • working with black and white
  • accessible to colorblind

So how can one do that? How can one implement a new palette that answers these demands? Before getting into that, let’s analyse how the data go from your computer to your brain.

To be projected on screen, the data is transformed into RGB notation (for “Red Green Blue”), either as a hex code (#FFCC99), as a percentage (rgb(100%,80%,60%): 100% red, 80% green, 60% blue), or as a number between 0 and 255 (rgb(255,204,153)). All these notations are equivalent: the hexadecimal notation is used to represent numbers from 0 to 255 with 16 symbols (0-9 & A-F, 16^2 being 256).
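
You can check in base R that these notations describe the same color (nothing package-specific here):

rgb(1, 0.8, 0.6)                          # fractions of the maximum -> "#FFCC99"
rgb(255, 204, 153, maxColorValue = 255)   # 0-255 integers -> "#FFCC99"
col2rgb("#FFCC99")                        # hex code back to 0-255 integers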

In a lot of cases, with color, there are two symbols added to the end of the hex, used for indicating transparency. For example:

viridis(2)
## [1] "#440154FF" "#FDE725FF"

Note: this is made possible by the fact that a pixel is coded using 32 bits, and a color is represented by only 24 of these 32 bits (3*8), leaving room for another 8 bits.

Note bis: you remember that the hex notation allows 256 combinations, which corresponds to the number of possible combinations for 8 bits (2^8).

So once you have your lowest color, and your highest, you can use a color palette generator to get a range of color from one to the other. It’s done, for example, by the colorRampPalette function from {grDevices} :

colfunc <- colorRampPalette(c("black", "white"))

colfunc(10)
[1] "#000000" "#1C1C1C" "#383838" "#555555" "#717171" "#8D8D8D" "#AAAAAA" "#C6C6C6"
[9] "#E2E2E2" "#FFFFFF"

The computation of this range is out of the scope of this article, but you get the idea. Once we have decided what our palette should look like, it is then possible to scale your data to fit the palette: as your palette goes from one point of the color scale to another through a series of intermediary tones, you can give each value of your data a spot in your colormap. The lowest value of your data is mapped to the lowest value of the palette, and the highest to the highest.
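
As a minimal sketch of this mapping, using the black-to-white ramp defined above and a built-in dataset:

vals <- mtcars$mpg
pal  <- colfunc(100)                             # 100 colors from black to white
idx  <- cut(vals, breaks = 100, labels = FALSE)  # lowest value -> 1, highest -> 100
plot(mtcars$wt, mtcars$mpg, col = pal[idx], pch = 19)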

The viridis colormap

So now that we have the theory, let’s get to the big question: how can we choose a palette that is colorblind-friendly? The solution found by the viridis colormap is to avoid red and to go from blue to yellow, as these colors can be distinguished by most colorblind people.

Let’s use the {dichromat} package to simulate that :

library(dichromat)

Let’s see what the jet palette looks like when you simulate several cases of colorblindness:

library(purrr)
with_palette(
  compose(
    partial(dichromat, type = "deutan"),
    jet.colors
  )
)

with_palette( compose( partial(dichromat, type = "protan"), jet.colors ) )

with_palette( compose( partial(dichromat, type = "tritan"), jet.colors ) )

And with viridis:

with_palette( compose( partial(dichromat, type = "deutan"), viridis ) )

with_palette( compose( partial(dichromat, type = "protan"), viridis ) )

with_palette( compose( partial(dichromat, type = "tritan"), viridis ) )

As you can see, the contrast stays readable in every situation.

Finally, to suit printing needs, the palette had to go from dark to light. And the widest such range was from dark blue to bright yellow. Let’s just compare two examples:

with_palette( compose( colorspace::desaturate, jet.colors ) )

with_palette( compose( colorspace::desaturate, viridis ) )

Back to R

So, enough theory, time to get back to R.

viridis as a package

Let’s imagine for a minute that we are using the old version of ggplot2 (as many still do as we write these lines).

When using this (now old) version of {ggplot2}, we needed the {viridis} package to use this color palette.

library(ggplot2)
library(viridis)
ggplot(iris) +
  aes(Sepal.Length, Sepal.Width, color = Petal.Length) +
  geom_point() +
  scale_colour_viridis() # from the viridis package

Another way to do that was to get a vector of hex colors, with the {viridisLite} package (note that the functions from {viridis} call functions from {viridisLite}).

pal <- viridisLite::viridis(3)
pal
## [1] "#440154FF" "#21908CFF" "#FDE725FF"
ggplot(iris) +
  aes(Sepal.Length, Sepal.Width, color = Species) +
  geom_point() +
  scale_color_manual(values = pal)

 

viridis as a ggplot scale

But now, thanks to the new ggplot2 (version 3.0.0), you can call the viridis palette directly with the dedicated scale functions.

ggplot(iris) +
  aes(Sepal.Length, Sepal.Width, color = Species) +
  geom_point() +
  scale_color_viridis_d() # from ggplot2

There are two scales: a discrete one (scale_color_viridis_d) for factors and a continuous one (scale_color_viridis_c) for numeric inputs.

And there is of course the counterpart with fill :

ggplot(faithfuld) + aes(waiting, eruptions, fill = density) + geom_tile() + scale_fill_viridis_c()

Note the difference in readability compared to the default palette:

ggplot(faithfuld) + aes(waiting, eruptions, fill = density) + geom_tile()

Also, viridis is the new default palette for ordered factors :

diamonds %>% dplyr::count(cut, color) %>% ggplot() + aes(cut, n, fill = color) + geom_col()

Finally, there are 4 other available colormaps that I haven’t presented so far: “magma” (or “A”), “inferno” (or “B”), “plasma” (or “C”), and “cividis” (or “E”). They can be passed to the option argument in the various scales:

ggplot(mtcars) + aes(mpg, disp, color = cyl) + geom_point() + scale_color_viridis_c(option = "A")

ggplot(mtcars) + aes(mpg, disp, color = cyl) + geom_point() + scale_color_viridis_c(option = "B")

ggplot(mtcars) + aes(mpg, disp, color = cyl) + geom_point() + scale_color_viridis_c(option = "C")

ggplot(mtcars) + aes(mpg, disp, color = cyl) + geom_point() + scale_color_viridis_c(option = "E")

So, to sum up:

And in conclusion, here is why the inclusion of viridis as a native scale in ggplot2 is good news:

  • More beautiful plots (but in the end, that’s secondary)
  • Better and more inclusive end products: easier to read for people with colorblindness or vision problems

NB: of course, this was already possible before, but the fact that it is now native will make it more widely used and better known.

The post [ggplot2] Welcome viridis ! appeared first on (en) The R Task Force.

To leave a comment for the author, please follow the link and comment on their blog: (en) The R Task Force.

Data Science For Business: 3 Reasons You Need To Learn The Expected Value Framework

Wed, 07/11/2018 - 06:45

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

One of the most difficult and most critical parts of implementing data science in business is quantifying the return-on-investment or ROI. As a data scientist in an organization, it’s of chief importance to show the value that your improvements bring. In this article, we highlight three reasons you need to learn the Expected Value Framework, a framework that connects the machine learning classification model to ROI. Further, we’ll point you to a new video we released on the Expected Value Framework: Modeling Employee Churn With H2O that was recently taught as part of our flagship course: Data Science For Business (DS4B 201). The video serves as an overview of the steps involved in calculating ROI from reducing employee churn with H2O, tying the key H2O functions to the process. Last, we’ll go over some Expected Value Framework FAQ’s that are commonly asked in relation to applying Expected Value to machine learning classification problems in business.

Articles In This Series

If you’re interested in data science for business and / or the HR Analytics Employee Turnover problem and how it can be solved with H2O Automated Machine Learning, check out some of these tutorials.

Learning Trajectory

We’ll touch on the following Expected Value Framework topics in this article:

Alright, let’s get started!

Get The Best Resources In Data Science. Every Friday!

Sign up for our free “5 Topic Friday” Newsletter. Every week, I’ll send you the five coolest topics in data science for business that I’ve found that week. These could be new R packages, free books, or just some fun to end the week on.

Sign Up For Five-Topic-Friday!

3 Reasons You Need To Learn The Expected Value Framework

Here are the 3 reasons you need to know about Expected Value if you want to tie data science to ROI for a machine learning classifier. We’ll use examples from the DS4B 201 course that are related to employee churn (also called employee turnover or employee attrition).

Reason #1: We Can Avoid The Problem With Max F1

The problem with machine learning classification is that most data scientists use the threshold that maximizes F1 as the classification threshold.

The problem with machine learning classification is that most data scientists use the threshold that maximizes F1 as the classification threshold. Using the threshold that maximizes F1 is the default for most machine learning classification algorithms. For many problems outside of the business domain this is OK. However, for business problems, this is rarely the best choice.

For those who are wondering what the threshold at max F1 is, it’s the threshold that harmonically balances precision and recall (in other words, it aims to reduce both the false positives and the false negatives, finding a threshold that achieves a relative balance). The problem is that, in business, the costs associated with false positives (Type 1 Errors) and false negatives (Type 2 Errors) are rarely equal. In fact, in many cases false negatives are much more costly (by a factor of 3 to 1 or more!).

Example: Cost of Type 1 And Type 2 Errors For Employee Attrition

As an example, let’s focus on the cost difference between Type 1 (false positive) and Type 2 (false negative) errors in an employee attrition problem. Say we are considering a mandatory overtime reduction because we saw that employees flagged as working overtime are 5X more likely to quit. We develop a prediction algorithm using H2O Automated Machine Learning and then run our LIME algorithm to develop explanations at the employee level. The LIME results confirm our suspicion: overtime is a key feature supporting Employee Attrition.

Calculating Expected Attrition Cost From H2O + LIME Results

We develop a proposal to reduce overtime using our H2O classification model, which by default uses the threshold that maximizes F1 (treats Type 1 and Type 2 errors equally). We then begin targeting people for overtime reduction. We end up misclassifying people that leave as stay (Type 2 error) at roughly the same rate as we misclassify people that stay as leave (Type 1 error). The cost of overtime reduction for an employee is estimated at 30% of the lost productivity if the employee quits.

Here lies the problem: the cost of incorrectly reducing overtime for someone who stays is only 30% of the cost of missing the opportunity to reduce overtime for an employee who is predicted to stay but actually leaves. In other words, the Type 2 Error is 3X more costly than the Type 1 Error, yet we are treating them the same!

“Type 2 Error is 3X more costly than Type 1 Error, yet we are treating them the same!”

Because of this, the optimal threshold for business problems is almost always less than the F1 threshold. This leads us to our second reason you need to know the Expected Value Framework.

Reason #2: We Can Maximize Expected Profit Using Threshold Optimization

When we have a calculation to determine the expected value using business costs, we can perform the calculation iteratively to find the optimal threshold that maximizes the expected profit or savings of the business problem. By iteratively calculating the savings generated at different thresholds, we can see which threshold optimizes the targeting approach.

In the example below, we can see in the threshold optimization results that the maximum savings ($546K) occurs at a threshold of 0.149, which is 16% more savings than the savings at threshold at max F1 ($470K). It’s worth mentioning that the threshold that maximizes F1 was 0.280, and that for a test set containing 15% of the total population it cost $76K due to being sub-optimal ($546K – $470K). Extending this inefficiency to the full population (train + test data), this is a missed opportunity of $500K annually!

Threshold Optimization Results: 16% Benefit for a Test Set Containing 15% of the Total Population; Extending to the Full Set Is a $500K Savings

The Expected Value Framework enables us to find the optimal threshold that accounts for the business costs, thus weighting Type 2 Errors appropriately. As shown, this can result in huge savings over the F1.

The Expected Value Framework enables us to find the optimal threshold that accounts for the business costs, thus weighting Type 2 Errors appropriately.

Ok, now we know we need to use Expected Value Framework. But it’s worth mentioning that the model we just built using H2O is not reality.

Wait, what?!?!

Here’s what I mean: The model is based on a number of assumptions including the average overtime percentage, the anticipated net profit per employee, and so on. The assumptions are imperfect. This leads us to the third reason you need to learn the Expected Value Framework.

Reason #3: We Can Test Variability In Assumptions And Its Effect On Expected Profit

Let’s face it, we don’t know what the future will bring, and we need to put our assumptions in check. This is exactly what the third benefit enables: Sensitivity Analysis, or testing the effect of model assumptions on expected profit (or savings).

In the human resources example below, we tested a range of values for the average overtime percentage and the net revenue per employee, because our estimates for the future may be off. In the Sensitivity Analysis Results shown below, we can see in the profitability heat map that as long as the average overtime percentage is less than or equal to 25%, implementing a targeted overtime policy saves the organization money.

Sensitivity Analysis Results (Profitability Heat Map)

Wow! Not only can we test for the optimal threshold that maximizes the business case, we can use expected value to test for a range of inputs that are variable from year to year and person to person. This is huge for business analysis!

Interested in Learning The Expected Value Framework For Business?

If you’re interested in learning how to apply the Expected Value Framework using R Programming, we teach it in Chapters 7 and 8 of Data Science For Business (DS4B 201) as part of our end-to-end Employee Churn business data science project.

Get Started Today!

YouTube Expected Value Framework Overview

Here’s a 20-minute tutorial on the Expected Value Framework that applies to the Data Science For Business (DS4B 201) course where we use H2O Automated Machine Learning to develop high performance models identifying those likely to leave and then the expected value framework to calculate the savings due to various policy changes.

FAQS: Expected Value Framework

The Expected Value Framework can be confusing and there’s a lot to learn. Here are a few of the most frequently asked questions related to applying the EVF to machine learning classification problems in business.

FAQ 1. What Is Expected Value?

Expected value is a way of assigning a value to a decision using the probability of its occurrence. We can do this for almost any decision.

Example: Is A Face-To-Face Meeting Warranted?

Suppose you work for a company that makes high tech equipment. Your company has received a quote for a project estimated at $1M in gross profit. The downside is that your competition is preferred and 99% likely to get the award. A face-to-face meeting will cost your organization $5,000 for travel costs, time to develop a professional quotation, and time and materials involved in the sales meeting. Should you take the meeting or no-quote the project?

High Tech Semiconductor Manufacturer

This problem is an expected value problem. We can assign a value to the meeting. We know that the competition is 99% likely to win, but we still have a 1% shot. Is it worth it?

Applying expected value, we multiply the probability of a win, which is 1%, by the profit of the win, which is $1M. This totals $10,000. This is the expected benefit of the decision, excluding any costs. If our total spend for the face-to-face meeting and quotation process is $5,000, then it makes sense to take the meeting because that spend is lower than the $10,000 expected benefit.
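
The same arithmetic in R, using the figures from the example above:

p_win        <- 0.01
profit_win   <- 1e6
meeting_cost <- 5000
expected_benefit <- p_win * profit_win   # $10,000
expected_benefit - meeting_cost          # $5,000 expected net gain, so take the meeting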

FAQ 2. What Is The Expected Value Framework?

The Expected Value Framework is a way to apply expected value to a classification model. Essentially, it connects a machine learning classification model to ROI for the business. It enables us to combine:

  1. The threshold,
  2. Knowledge of costs and benefits, and
  3. The confusion matrix converted to expected error rates to account for the presence of false positives and false negatives.

We can use this combination to target based on positive class probability (think employee flight risk quantified as a probability) and gain even greater expected savings than an “all-or-none” approach without the framework.

Source: Data Science for Business by Foster Provost & Tom Fawcett

FAQ 3. What Is The Threshold?

It’s the dividing line that we (the data scientists) select for the positive class probability (“Yes” in the example shown below): values above it are converted to a positive prediction, values below it to a negative one. If we choose a threshold of, say, 0.30, only one observation meets this criterion.

Class Probabilities
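
In R, this conversion is a one-liner (the probabilities below are hypothetical, chosen so that only one exceeds the 0.30 threshold):

prob_yes  <- c(0.12, 0.45, 0.21, 0.05)   # hypothetical positive-class probabilities
threshold <- 0.30
ifelse(prob_yes >= threshold, "Yes", "No")
## [1] "No"  "Yes" "No"  "No"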

FAQ 4: What Are The Expected Rates?

The confusion matrix results can be converted to expected rates through a process called normalization. The “Actual Negative” and “Actual Positive” results are grouped and then converted to probabilities. The expected rates are then probabilities of correctly predicting an Actual value.

Expected Rates
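
As a sketch of the normalization step (the confusion-matrix counts here are hypothetical):

cm <- matrix(c(850,  50,
                30,  70),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("No", "Yes"), predicted = c("No", "Yes")))
prop.table(cm, margin = 1)   # row "No": TN and FP rates; row "Yes": FN and TP rates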

FAQ 5: What Is The Cost-Benefit Matrix?

The cost / benefit matrix is the final piece needed for the Expected Value Framework. We develop this using our intuition about the problem. This is the most difficult part because we need to apply critical thinking to the business problem and the methodology we intend to use when coding the problem.

For the Employee Churn problem, one way to think of the cost benefit is analyzing two states: an initial state (baseline) and a new state (after a policy change to reduce overtime). We’ll use an employee named Maggie from the DS4B 201 course.

Initial State: Baseline With Allowing Overtime For Maggie

First, we’ll look at the initial state. In this scenario, Maggie’s probability of leaving is 15%, and she has a cost of attrition of $167K. The cost of a policy change is zero because no policy change is implemented in the baseline case. There are four scenarios we need to account for with probabilities of their own:

  • The cases of true positives and true negatives when the algorithm predicts correctly
  • And the cases of false positives and false negatives when the algorithm predicts incorrectly.

The cost of each scenario is what we are concerned with.

Cost-Benefit Matrix: Initial State (Baseline)

Going through a cost-benefit analysis, we can address each of these costs:

  • First is true negatives. If Maggie stays, the cost associated with her staying is nothing.
  • Second is true positives. If Maggie leaves, the cost associated with her leaving is $167K, which is her attrition cost.
  • Third, is false positives. If Maggie was predicted to leave, but actually stays, the cost associated with this scenario is nothing. We did not implement a policy change for the baseline so we don’t have any costs.
  • Fourth, is false negatives. If Maggie leaves but was predicted to stay, we lose her attrition cost.

The expected cost associated with her leaving is $25K
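
In R, with the figures above:

p_leave        <- 0.15
attrition_cost <- 167000
p_leave * attrition_cost   # ~ $25K expected cost in the baseline (no policy change)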

New State: Targeting Maggie For Overtime Reduction

Let’s propose a new state, one that eliminates overtime for Maggie.

In this scenario, Maggie’s probability of leaving drops to 3%. Her cost of attrition is the same, but now we are expecting a higher cost of the policy change than we originally anticipated. The cost of the policy change in this scenario is 20% of her attrition cost, meaning she is working approximately 20% overtime. Should we make the policy change?

Like the initial state, there are four scenarios we need to account for with probabilities of their own

  • First is true negatives. If Maggie stays, the cost associated with her staying is 20% of her attrition cost, or $33K.
  • Second is true positives. If Maggie leaves, the cost associated with her leaving is her attrition cost of $167K plus the $33K policy change cost for reducing her overtime. This totals $200K.
  • Third, is false positives. If Maggie was predicted to leave, but actually stays, the cost associated with this scenario is $33K because she was targeted.
  • Fourth, is false negatives. If Maggie leaves but was predicted to stay, we lose her attrition cost plus the cost of targeting her, which total $200K

The expected cost is $38K for this scenario
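
Again in R, reusing the baseline figures:

p_leave_new <- 0.03
policy_cost <- 0.20 * attrition_cost              # ~ $33K overtime-reduction cost
p_leave_new * (attrition_cost + policy_cost) +    # Maggie leaves despite the policy
  (1 - p_leave_new) * policy_cost                 # Maggie stays, policy cost still paid
## ~ $38K, i.e. about $13K worse than the ~ $25K baseline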

Expected Savings (Benefit Of Expected Value Framework)

At an overtime percentage of 20%, the savings are negative (-$13K) versus the baseline. Therefore, we should not reduce Maggie’s overtime.

This is the benefit of using Expected Value. Since the model predicts Maggie having only a 15% risk of leaving when working overtime, she is not a high flight risk. We can use the model to target people that are much higher risk to retain.

Next Steps: Take The DS4B 201 Course!

If interested in learning more, definitely check out Data Science For Business (DS4B 201). In 10 weeks, the course covers all of the steps to solve the employee turnover problem with H2O in an integrated fashion.

The students love it. Here’s a comment we just received on Sunday morning from one of our students, Siddhartha Choudhury, Data Architect at Accenture.

“To be honest, this course is the best example of an end to end project I have seen from business understanding to communication.”

Siddhartha Choudhury, Data Architect at Accenture

See for yourself why our students have rated Data Science For Business (DS4B 201) a 9.0 of 10.0 for Course Satisfaction!


Get Started Today!

Learning More

Check out our other articles on Data Science For Business!

Business Science University

Business Science University is a revolutionary new online platform that gets you results fast.

Why learn from Business Science University? You could spend years trying to learn all of the skills required to confidently apply Data Science For Business (DS4B). Or you can take the first course in our Virtual Workshop, Data Science For Business (DS4B 201). In 10 weeks, you’ll learn:

  • A 100% ROI-driven Methodology – Everything we teach is to maximize ROI.

  • A clear, systematic plan that we’ve successfully used with clients

  • Critical thinking skills necessary to solve problems

  • Advanced technology: H2O Automated Machine Learning

  • How to do 95% of the skills you will need to use when wrangling data, investigating data, building high-performance models, explaining the models, evaluating the models, and building tools with the models

You can spend years learning this information, or you can learn it in 10 weeks (at a one-chapter-per-week pace). Get started today!


Sign Up Now!

DS4B Virtual Workshop: Predicting Employee Attrition

Did you know that an organization that loses 200 high performing employees per year is essentially losing $15M/year in lost productivity? Many organizations don’t realize this because it’s an indirect cost. It goes unnoticed. What if you could use data science to predict and explain turnover in a way that managers could make better decisions and executives would see results? You will learn the tools to do so in our Virtual Workshop. Here’s an example of a Shiny app you will create.


Get Started Today!

Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in HR 301

Our first Data Science For Business Virtual Workshop teaches you how to solve this employee attrition problem in four courses that are fully integrated:

The Virtual Workshop is code intensive (like these articles) but also teaches you fundamentals of data science consulting including CRISP-DM and the Business Science Problem Framework and many data science tools in an integrated fashion. The content bridges the gap between data science and the business, making you even more effective and improving your organization in the process.

Here’s what one of our students, Jason Aizkalns, Data Science Lead at Saint-Gobain had to say:

“In an increasingly crowded data science education space, Matt and the Business Science University team have found a way to differentiate their product offering in a compelling way. BSU offers a unique perspective and supplies you with the materials, knowledge, and frameworks to close the gap between just “doing data science” and providing/creating value for the business. Students will learn how to formulate their insights with a value-creation / ROI-first mindset which is critical to the success of any data science project/initiative in the “real world”. Not only do students work a business problem end-to-end, but the icing on the cake is “peer programming” with Matt, albeit virtually, who codes clean, leverages best practices + a good mix of packages, and talks you through the why behind his coding decisions – all of which lead to a solid foundation and better habit formation for the student.”

Jason Aizkalns, Data Science Lead at Saint-Gobain


Get Started Today!

Don’t Miss A Beat

Connect With Business Science

If you like our software (anomalize, tidyquant, tibbletime, timetk, and sweep), our courses, and our company, you can connect with us:

To leave a comment for the author, please follow the link and comment on their blog: business-science.io - Articles.

In case you missed it: June 2018 roundup

Wed, 07/11/2018 - 01:20

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from June of particular interest to R users.

An animated visualization of global migration, created in R by Guy Abel.

My take on the question, Should you learn R or Python for data science?

The BBC and Financial Times use R — without post-processing — for publication graphics.

"Handling Strings in R", a free e-book by Gaston Sanchez, has been updated.

The AI, Machine Learning and Data Science roundup for June 2018.

The PYPL Popularity of Languages Index ranks R as the 7th most popular programming language.

The "lime" package for R provides tools for interpreting machine learning models.

An R vignette by Paige Bailey on detecting unconscious bias in predictive models.

Microsoft R Open 3.5.0 has been released (with a subsequent fix for Debian systems).

Slides from the webinar, What's New in Azure for Machine Learning and AI.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

rOpenSci’s drake package

Tue, 07/10/2018 - 21:28

(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

If you don’t know rOpenSci, then I recommend checking them out. They write a lot of really good packages for R. A relatively new one seems to be drake. I’ve not played with it yet, but it looks to be very useful at indicating which parts of an analysis are subject to change, and at only rerunning those parts to speed up redoing an analysis (envisage the overlays some version control systems or Dropbox use to show the status of files, although it’s more complicated than that). Knitr has caching, which goes some way to handling this, but here you can see where the outdated parts are in the scope of the entire analysis…
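
For a flavour of the workflow, here is a hedged sketch based on the drake documentation of the time (treat the exact calls and the input file as assumptions):

library(drake)
plan <- drake_plan(
  raw   = read.csv(file_in("data.csv")),   # hypothetical input file
  model = lm(y ~ x, data = raw),
  summ  = summary(model)
)
make(plan)                 # builds every target the first time
config <- drake_config(plan)
outdated(config)           # after editing a step, lists only the targets that must rerun
vis_drake_graph(config)    # dependency graph showing up-to-date vs outdated targets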

It looks like a super tool!

To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD.
