Alleviating AWS Athena Aggravation with Asynchronous Assistance
(This article was first published on R – rud.is, and kindly contributed to R-bloggers)
I’ve blogged about how to use Amazon Athena with R before and if you are a regular Athena user, you’ve likely run into a situation where you prepare a dplyr chain, fire off a collect() and then wait.
And, wait.
And, wait.
And, wait.
Queries that take significant processing time or have large result sets do not play nicely with the provided ODBC and JDBC drivers. This means "hung" R sessions and severe frustration, especially when you can log in to the AWS Athena console and see that the results are right there!
I’ve been crafting SQL by hand or using sql_render() by hand to avoid this (when I remember to), but finally felt sufficient frustration to craft a better way, provided you can install and run rJava-based code (it’s 2018 and that still is not an easy given on many systems, unfortunately).
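For context, the sql_render() workaround looks something like this (a minimal sketch; `lazy_frame()` and `simulate_dbi()` stand in for the live Athena connection the post assumes, so the SQL can be rendered without connecting):

```r
library(dplyr)
library(dbplyr)

# simulate a remote table so the SQL can be rendered without a live connection;
# with a real Athena connection you would use tbl(con, "elb_logs") instead
elb_logs <- lazy_frame(requestip = character(0), con = simulate_dbi())

q <- elb_logs %>% filter(requestip == "253.89.30.138")

# render the SQL behind the dplyr chain; paste it into the Athena console
sql_render(q)
```

You then run the rendered query yourself and fetch the results from S3 when they are ready, which is exactly the manual dance the functions below automate.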
There are two functions below:
 collect_async(), and
 gather_results()
The collect_async() function is designed to be used like collect() but uses Athena components from the AWS SDK for Java to execute the SQL query behind the dplyr chain asynchronously. The companion function gather_results() takes the object created by collect_async() and checks to see if the results are ready. If they are, it will use the aws.s3 package to download them. Personally, I’d just aws s3 sync ... from the command line versus using the aws.s3 package, but that’s not everyone’s cup of tea.
Once I figure out the best package API for this I’ll add it to the metis package. There are many AWS idiosyncrasies that need to be accounted for and I’d rather ship this current set of functions via the blog so folks can use it (and tweak it to their needs) before waiting for perfection.
Here’s the code:
```r
library(rJava)
library(awsjavasdk)
library(aws.signature)
library(aws.s3)
library(odbc)
library(tidyverse)
library(dbplyr)

#' Collect Amazon Athena query results asynchronously
#'
#' Long-running Athena queries and Athena queries with large result
#' sets can seriously stall a `dplyr` processing chain due to poorly
#' implemented ODBC and JDBC drivers. The AWS SDK for Athena has
#' methods that support submitting a query asynchronously for "batch"
#' processing. All Athena results are stored in CSV files in S3 and it's
#' easy to use the R `aws.s3` package to grab these or perform an
#' `aws s3 sync ...` operation on the command line.
#'
#' @md
#' @param obj the `dplyr` chain
#' @param schema Athena schema (usually matches the `Schema` parameter to the
#'        Simba ODBC connection)
#' @param region your AWS region, all lower case with dashes (usually matches
#'        the `AwsRegion` parameter to the Simba ODBC connection)
#' @param results_bucket the S3 results bucket where query results are stored
#'        (usually matches the `S3OutputLocation` parameter to the Simba ODBC
#'        connection)
#' @return a `list` with the query execution ID and the S3 bucket. This object
#'         is designed to be passed to the companion `gather_results()` if you
#'         want to use the `aws.s3` package to retrieve the results. Otherwise,
#'         sync the file however you want using the query execution ID.
#' @note You may need to change up the authentication provider depending on how
#'       you use credentials with Athena
collect_async <- function(obj, schema, region, results_bucket) {

  ugly_query <- as.character(sql_render(obj))

  region <- toupper(region)
  region <- gsub("-", "_", region, fixed = TRUE)

  regions <- J("com.amazonaws.regions.Regions")

  available_regions <- grep("^[[:upper:][:digit:]_]+$", names(regions), value = TRUE)
  if (!region %in% available_regions) stop("Invalid region.", call. = FALSE)

  switch(
    region,
    "GovCloud" = regions$GovCloud,
    "US_EAST_1" = regions$US_EAST_1,
    "US_EAST_2" = regions$US_EAST_2,
    "US_WEST_1" = regions$US_WEST_1,
    "US_WEST_2" = regions$US_WEST_2,
    "EU_WEST_1" = regions$EU_WEST_1,
    "EU_WEST_2" = regions$EU_WEST_2,
    "EU_WEST_3" = regions$EU_WEST_3,
    "EU_CENTRAL_1" = regions$EU_CENTRAL_1,
    "AP_SOUTH_1" = regions$AP_SOUTH_1,
    "AP_SOUTHEAST_1" = regions$AP_SOUTHEAST_1,
    "AP_SOUTHEAST_2" = regions$AP_SOUTHEAST_2,
    "AP_NORTHEAST_1" = regions$AP_NORTHEAST_1,
    "AP_NORTHEAST_2" = regions$AP_NORTHEAST_2,
    "SA_EAST_1" = regions$SA_EAST_1,
    "CN_NORTH_1" = regions$CN_NORTH_1,
    "CN_NORTHWEST_1" = regions$CN_NORTHWEST_1,
    "CA_CENTRAL_1" = regions$CA_CENTRAL_1,
    "DEFAULT_REGION" = regions$DEFAULT_REGION
  ) -> region

  provider <- J("com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
  client <- J("com.amazonaws.services.athena.AmazonAthenaAsyncClientBuilder")

  my_client <- client$standard()
  my_client <- my_client$withRegion(region)
  my_client <- my_client$withCredentials(provider$getInstance())
  my_client <- my_client$build()

  queryExecutionContext <- .jnew("com.amazonaws.services.athena.model.QueryExecutionContext")
  context <- queryExecutionContext$withDatabase(schema)

  result <- .jnew("com.amazonaws.services.athena.model.ResultConfiguration")
  result$setOutputLocation(results_bucket)

  startQueryExecutionRequest <- .jnew("com.amazonaws.services.athena.model.StartQueryExecutionRequest")
  startQueryExecutionRequest$setQueryString(ugly_query)
  startQueryExecutionRequest$setQueryExecutionContext(context)
  startQueryExecutionRequest$setResultConfiguration(result)

  res <- my_client$startQueryExecutionAsync(startQueryExecutionRequest)

  r <- res$get()
  qex_id <- r$getQueryExecutionId()

  list(
    qex_id = qex_id,
    results_bucket = results_bucket
  )

}

#' Gather the results of an asynchronous query
#'
#' @md
#' @param async_result the result of a call to `collect_async()`
#' @return a data frame (tibble) or `NULL` if the query results are not ready yet
gather_results <- function(async_result) {
  if (bucket_exists(sprintf("%s/%s", async_result$results_bucket, async_result$qex_id))) {
    readr::read_csv(
      get_object(sprintf("%s/%s.csv", async_result$results_bucket, async_result$qex_id))
    )
  } else {
    message("Results are not in the designated bucket.")
    return(NULL)
  }
}
```

Now, we give it a go:
```r
# Setup the credentials you're using
use_credentials("personal")

# load the AWS Java SDK classes
awsjavasdk::load_sdk()

# necessary for Simba ODBC and the async query ops
aws_region <- "us-east-1"
athena_schema <- "sampledb"
athena_results_bucket <- "s3://aws-athena-query-results-redacted"

# connect to Athena and the sample database
DBI::dbConnect(
  odbc::odbc(),
  driver = "/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib",
  Schema = athena_schema,
  AwsRegion = aws_region,
  AuthenticationType = "IAM Profile",
  AwsProfile = "personal",
  S3OutputLocation = athena_results_bucket
) -> con

# the sample table in the sample db/schema
elb_logs <- tbl(con, "elb_logs")

# create your dplyr chain. This one is small so I don't incur charges.
# collect_async() MUST be the LAST item in the dplyr chain.
elb_logs %>%
  filter(requestip == "253.89.30.138") %>%
  collect_async(
    schema = athena_schema,
    region = aws_region,
    results_bucket = athena_results_bucket
  ) -> async_result

async_result
## $qex_id
## [1] "d5fe7754-919b-47c5-bd7d-3ccdb1a3a414"
##
## $results_bucket
## [1] "s3://aws-athena-query-results-redacted"

# For long queries we can wait a bit but the function will tell us if the results
# are there or not.
gather_results(async_result)
## Parsed with column specification:
## cols(
##   timestamp = col_datetime(format = ""),
##   elbname = col_character(),
##   requestip = col_character(),
##   requestport = col_integer(),
##   backendip = col_character(),
##   backendport = col_integer(),
##   requestprocessingtime = col_double(),
##   backendprocessingtime = col_double(),
##   clientresponsetime = col_double(),
##   elbresponsecode = col_integer(),
##   backendresponsecode = col_integer(),
##   receivedbytes = col_integer(),
##   sentbytes = col_integer(),
##   requestverb = col_character(),
##   url = col_character(),
##   protocol = col_character()
## )
## # A tibble: 1 x 16
##   timestamp           elbname requestip     requestport backendip     backendport
## 1 2014-09-29 03:24:38 lb-demo 253.89.30.138       20159 253.89.30.138        8888
## # ... with 10 more variables: requestprocessingtime, backendprocessingtime,
## #   clientresponsetime, elbresponsecode, backendresponsecode, receivedbytes,
## #   sentbytes, requestverb, url, protocol
```

If you do try this out and end up needing to tweak it, feedback on what you had to do (via the comments) would be greatly appreciated.
To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Why I rarely use apply
(This article was first published on Florian Privé, and kindly contributed to R-bloggers)
In this short post, I talk about why I’m moving away from using the function apply.
With matrices

It’s okay to use apply with a dense matrix, although you can often use an equivalent that is faster.
```r
N <- M <- 8000
X <- matrix(rnorm(N * M), N)

system.time(res1 <- apply(X, 2, mean))
##  user  system elapsed
##  0.73    0.05    0.78

system.time(res2 <- colMeans(X))
##  user  system elapsed
##  0.05    0.00    0.05

stopifnot(isTRUE(all.equal(res2, res1)))
```

“Yeah, there are colSums and colMeans, but what about computing standard deviations?”
There are lots of apply-like functions in the {matrixStats} package.
```r
system.time(res3 <- apply(X, 2, sd))
##  user  system elapsed
##  0.96    0.01    0.97

system.time(res4 <- matrixStats::colSds(X))
##  user  system elapsed
##   0.2     0.0     0.2

stopifnot(isTRUE(all.equal(res4, res3)))
```

With data frames

```r
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

apply(head(iris), 2, identity)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 "5.1"        "3.5"       "1.4"        "0.2"       "setosa"
## 2 "4.9"        "3.0"       "1.4"        "0.2"       "setosa"
## 3 "4.7"        "3.2"       "1.3"        "0.2"       "setosa"
## 4 "4.6"        "3.1"       "1.5"        "0.2"       "setosa"
## 5 "5.0"        "3.6"       "1.4"        "0.2"       "setosa"
## 6 "5.4"        "3.9"       "1.7"        "0.4"       "setosa"
```

A DATA FRAME IS NOT A MATRIX (it’s a list).
The first thing that apply does is convert the object to a matrix, which consumes memory and, in the previous example, converts all the data to strings (because a matrix can hold only one type).
What can you use as a replacement of apply with a data frame?

If you want to operate on all columns then, since a data frame is just a list, you can use sapply instead (or map* if you are a purrrist).
```r
sapply(iris, typeof)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
##     "double"     "double"     "double"     "double"    "integer"
```
If you want to operate on all rows, I recommend watching this webinar.
The memory problem is even more pronounced when using apply with sparse matrices, which makes apply very slow for such data.
```r
library(Matrix)

X.sp <- rsparsematrix(N, M, density = 0.01)

## X.sp is converted to a dense matrix when using `apply`
system.time(res5 <- apply(X.sp, 2, mean))
##  user  system elapsed
##  0.78    0.46    1.25

system.time(res6 <- Matrix::colMeans(X.sp))
##  user  system elapsed
##  0.01    0.00    0.02

stopifnot(isTRUE(all.equal(res6, res5)))
```

You could implement your own apply-like function for sparse matrices by viewing a sparse matrix as a data frame with 3 columns (i and j store the positions of the non-null elements; x stores the values of these elements). Then, you could use a group_by–summarize approach.
For instance, for the previous example, you can do this in base R:
```r
apply2_sp <- function(X, FUN) {
  res <- numeric(ncol(X))
  X2 <- as(X, "dgTMatrix")
  tmp <- tapply(X2@x, X2@j, FUN)
  res[as.integer(names(tmp)) + 1] <- tmp
  res
}

system.time(res7 <- apply2_sp(X.sp, sum) / nrow(X.sp))
##  user  system elapsed
##  0.03    0.00    0.03

stopifnot(isTRUE(all.equal(res7, res5)))
```

Conclusion

Using apply with a dense matrix is fine, but try to avoid it if you have a data frame or a sparse matrix.
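The group_by–summarize idea mentioned earlier can also be sketched with {dplyr} itself (a hypothetical alternative, not code from the post; the helper name `apply2_sp_dplyr` is mine):

```r
library(Matrix)
library(dplyr)

# hypothetical apply-like helper: summarise each column of a sparse matrix
# by grouping the triplet representation (slots i, j, x) on the column index j
apply2_sp_dplyr <- function(X, FUN) {
  X2 <- as(X, "dgTMatrix")                 # triplet form; @j is 0-based
  smry <- data.frame(j = X2@j, x = X2@x) %>%
    group_by(j) %>%
    summarise(val = FUN(x))
  res <- numeric(ncol(X))                  # columns with no non-zero entries stay 0
  res[smry$j + 1] <- smry$val
  res
}

X.sp <- rsparsematrix(1000, 200, density = 0.01)
res <- apply2_sp_dplyr(X.sp, sum) / nrow(X.sp)
stopifnot(isTRUE(all.equal(res, unname(Matrix::colMeans(X.sp)))))
```

The base-R tapply() version above avoids the data-frame construction, so it should stay faster; this variant just makes the grouping logic explicit.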
Seasonal decomposition of short time series
(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)
Many users have tried to do a seasonal decomposition with a short time series, and hit the error “Series has less than two periods”.
The problem is that the usual methods of decomposition (e.g., decompose and stl) estimate seasonality using at least as many degrees of freedom as there are seasonal periods. So you need at least two observations per seasonal period to be able to distinguish seasonality from noise.
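To make the degrees-of-freedom point concrete, here is a minimal sketch (the data and the regression workaround are illustrative assumptions on my part, not necessarily the remedy the original post goes on to recommend):

```r
library(forecast)

# 16 monthly observations: less than two full years,
# so stl()/decompose() refuse to estimate a seasonal component
y <- ts(rnorm(16, mean = 10), frequency = 12)
# stl(y, s.window = "periodic")  # errors: series has less than two periods

# with so few observations, one option is to estimate seasonality by
# regression on a trend plus seasonal dummies (every month appears at least once)
fit <- tslm(y ~ trend + season)
summary(fit)
```

This uses 13 degrees of freedom for 16 observations, which shows exactly why the decomposition functions demand more data: with fewer than two periods there is barely anything left to separate seasonality from noise.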
From Data to Viz – Find the graphic you need
(This article was first published on Blog – The R Graph Gallery, and kindly contributed to R-bloggers)
I’m delighted to announce a new dataviz project called ‘Data to Viz’.
What it is
From Data to Viz is a classification of chart types based on input data format. It comes in the form of a decision tree leading to a set of potentially appropriate visualizations to represent the dataset.
The decision trees are available in a poster that was presented at useR! in Brisbane.
Philosophy

The project is built on two underlying philosophies. First, that most data analyses can be summarized in about twenty different dataset formats. Second, that both the data and the context determine the appropriate chart.
Thus, our suggested method consists of identifying and trying all feasible chart types to find out which one suits your data and idea best.
Once this set of graphics is identified, data-to-viz.com aims to guide you toward the best decision.
Content
Several sections are available on top of the decision tree:
– portfolio – an overview of all chart possibilities. For each, an extensive description is given, showing variations, pros and cons, common pitfalls and more.
– stories – for each input data format, a real-life example is analyzed to illustrate the different chart types applicable to it.
– caveat gallery – a list of common dataviz pitfalls, with suggested workarounds.
Link with R
From Data to Viz aims to give general data visualization advice and is not targeted specifically at R users.
However, 100% of the charts are made using R, mostly with ggplot2 and the tidyverse. The reproducible code snippets are always available. Most of the website is built using R Markdown, with a good number of hacks described here.
The website is tightly linked with the R graph gallery. Once you’ve identified the graphic that suits your needs, you will be redirected to the appropriate section of the gallery to get the R code in minutes.
Next step
From Data to Viz is still in beta, and a lot remains to be done. The caveat gallery is incomplete and some chart types are missing. Also, a few inaccuracies may be present in the decision tree. Last but not least, the English is rough and unfortunately that is not likely to change; I apologize for it.
If you find any mistake or potential improvement, please file an issue on GitHub, contact me at Yan.holtz.data@gmail.com or on Twitter, or leave a comment below. Any feedback will be very welcome.
Stencila – an office suite for reproducible research
(This article was first published on Rposts.com, and kindly contributed to R-bloggers)
Stencila launches the first version of its (open source) word processor and spreadsheet editor designed for researchers.
By Michael Aufreiter, Substance, and Aleksandra Pawlik and Nokome Bentley, Stencila
Stencila is an open source office suite designed for researchers. It allows the authoring of interactive, data-driven publications in visual interfaces, similar to those in conventional office suites, but is built from the ground up for reproducibility.
Stencila aims to make it easier for researchers with differing levels of computational skills to collaborate on the same research article. Researchers used to tools like Microsoft Word and Excel will find Stencila’s interfaces intuitive and familiar. And those who use tools such as Jupyter Notebook or R Markdown are still able to embed code for data analysis within their research articles. Once published, Stencila documents are self-contained, interactive and reusable, containing all the text, media, code and data needed to fully support the narrative of research discovery.
Source: https://stenci.la
The Stencila project aims to be part of the wider vision to enable the next generation of research articles – all the way from authoring through to publication as a reproducible, self-contained webpage. A key limitation of the current research publishing process is that conventional document formats (e.g. Word, PDF and LaTeX) do not support the inclusion of reproducible research elements, nor do they produce content in the structured format used for science publishing and dissemination (XML). Stencila aims to remove the need for manual conversion of content from source documents to XML and web (HTML) publishing formats, whilst enabling the inclusion of source data and computational methods within the manuscript. We hope that establishing a digital-first, reproducible archive format for publications will facilitate research communication that is faster and more open, and which lowers the barrier for collaboration and reuse. The development of Stencila is driven by community needs and in coordination with the goals of the Reproducible Document Stack, an initiative started by eLife, Substance and Stencila.
A word processor for creating journal-ready scientific manuscripts

Stencila’s article editor builds on Texture, an open source editor built for visually editing JATS XML documents (a standard widely used by scientific journals). Supporting all elements of a standardised research article, the editor features semantic, content-oriented editing that allows the user to focus on the research without worrying about layout information, which is normally stripped during the publishing process. While Texture implements all static elements (abstract, figures, references, citations and so on), Stencila extends Texture with code cells which enable computed, data-driven figures.
Spreadsheets for source data and analysis

In Stencila, datasets are an integral part of the publication. They live as individual spreadsheet documents holding structured data. This data can then be referenced from the research article to drive analysis and plots. As within Excel, cells can contain formulas and function calls to run computations directly in a spreadsheet. But not only can users enter simple expressions, they can also add and execute code in a variety of supported programming languages (at the moment R, Python, SQL and Javascript).
A walkthrough of some of the features of Stencila, using this Stencila Article. Source: YouTube; video CC-BY Stencila.
Code evaluation in the browser and beyond

Stencila’s user interfaces build on modern web technology and run entirely in the browser – making them available on all major operating systems. The predefined functions available in Stencila use Javascript for execution so they can be run directly in the editor. For example, the plotly() function generates powerful, interactive visualizations solely using Plotly’s Javascript library.
Stencila can also connect to R, Python and SQL sessions, allowing more advanced data analysis and visualization capabilities. Stencila’s execution engine keeps track of the dependencies between code cells, enabling a reactive, spreadsheet-like programming experience both in Stencila Articles and Sheets.
An example of using R within a Stencila Sheet. Source: YouTube; video CC-BY Stencila.
Reproducible Document Archive (Dar)
Stencila stores projects in an open file archive format called Dar. A Dar is essentially a folder with a number of files encompassing the manuscript itself (usually one XML file per document) and all associated media.
The Dar format is open source: inspect it and provide feedback at https://github.com/substance/dar
Dar uses existing standards when possible. For instance, articles are represented as JATS XML, the standard preferred by a number of major publishers. The Dar format is a separate effort from Stencila, and aims to establish a strict standard for representing selfcontained reproducible publications, which can be submitted directly to publishers. Any other tool should be able to easily read and write such archives, either by supporting it directly or by implementing converters.
Interoperability with existing tools and workflows

Stencila is developed not to replace existing tools, but to complement them. Interoperability is at the heart of the project, with the goal of supporting seamless collaboration between users of Jupyter Notebooks, R Markdown and spreadsheet applications. We are working closely with the communities of existing open source tools to improve interoperability. For instance, we are working with the Jupyter team on tools to turn notebooks into journal submissions. We are also evaluating whether the Stencila editor could be used as another interface to edit Jupyter Notebooks or R Markdown files: we hope this could help researchers who use existing tools to collaborate with peers who are used to other office tools, such as Word and Excel, and thus encourage wider adoption of reproducible computational research practices.
State of development

Over the past two years, we’ve built Stencila from the ground up as a set of modular components that support community-driven open standards for publishing and computation. Stencila Desktop is our prototype of a ‘researcher’s office suite’, built by combining these components into an integrated application. During this beta phase of the project, we are working to address bugs and add missing features, and welcome your feedback and suggestions (see below).
One of our next priorities will be to develop a toolset for generating a web page from a reproducible article in the Dar format. Using progressive enhancement, the reader should be able to reproduce a scientific article right from the journal’s website in various forms, ranging from a traditional static representation of the manuscript and its figures to a fully interactive, executable publication.
We will continue working on Stencila’s various software components, such as the converter module and execution contexts for R and Python, towards improved integration and interoperability with other tools in the open science toolbox (e.g. Jupyter, RStudio and Binder).
Get involved

We’d love to get your input to help shape Stencila. Download Stencila Desktop and take it for a test drive. You could also try porting an existing manuscript over to Stencila using the Stencila command line tool. Give us your feedback and contribute ideas on our community forum or in our chat channel, or drop us an email at aleksandra@stenci.la or nokome@stenci.la.
Acknowledgments

Development of Stencila has been generously supported by the Alfred P. Sloan Foundation and eLife.
This post was originally published on eLife Labs.
eLife welcomes comments, questions and feedback. Please annotate publicly on the article or contact us at innovation [at] elifesciences [dot] org.
Is GINA really about to die?!?
(This article was first published on R – Nathan Chaney, and kindly contributed to R-bloggers)
Introduction

During a recent negotiation of an informed consent form for use in a clinical trial, the opposing lawyer and I skirmished over the applicability of the Genetic Information Nondiscrimination Act of 2008, commonly known as GINA. Specifically, the opposing lawyer thought that guidance issued by the U.S. Office for Human Research Protections in 2009 was now outdated, in part because enforcement efforts were erratic. The argument was primarily driven by policy, rather than data.
Being a datadriven guy, I wanted to see whether the data supported the argument advanced by the other lawyer. Fortunately, the U.S. Equal Employment Opportunity Commission (EEOC), which is responsible for administering GINA complaints, maintains statistics regarding GINA claims and resolutions. I’m not great at making sense of numbers in a table, so I thought this presented the perfect opportunity to rvest some data!
```r
libraries <- c("tidyverse", "rvest", "magrittr")
lapply(libraries, require, character.only = TRUE)
```

Data Scraping

In standard rvest fashion, we’ll read a URL, extract the table containing the GINA enforcement statistics, and then do some data cleaning. Once we read the table and gather all of the annual results into year/value pairs, we get the following results:
```r
url <- "https://www.eeoc.gov/eeoc/statistics/enforcement/genetic.cfm"

GINA.resolutions <- read_html(url) %>%
  html_nodes("table") %>%
  extract2(1) %>%
  html_table(trim = TRUE, fill = TRUE, header = TRUE)

names(GINA.resolutions)[1] <- "metric" # Top left table cell is blank, will throw errors
names(GINA.resolutions) <- gsub("FY (.+)", "\\1", names(GINA.resolutions)) # Remove FY from year so we can convert to numeric

GINA.resolutions <- GINA.resolutions %>%
  filter(! metric == "") %>%                     # Remove percentage rows
  filter(! metric == "Resolutions By Type") %>%  # Remove blank line
  gather(year, value, 2:9) %>%                   # short and wide data to tall and skinny
  mutate(
    year = as.integer(year),
    value = gsub("[\\$\\%]", "", value)
  ) %>%
  mutate(
    value = as.numeric(value)
  ) %>%
  as.tibble()

GINA.resolutions
## # A tibble: 88 x 3
##    metric                      year value
##  1 Receipts                    2010   201
##  2 Resolutions                 2010    56
##  3 Settlements                 2010     3
##  4 Withdrawals w/Benefits      2010     2
##  5 Administrative Closures     2010    11
##  6 No Reasonable Cause         2010    38
##  7 Reasonable Cause            2010     2
##  8 Successful Conciliations    2010     1
##  9 Unsuccessful Conciliations  2010     1
## 10 Merit Resolutions           2010     7
## # ... with 78 more rows
```

Claim Numbers over Time

Now that we have the data in a format we can use, we’ll look at the volume of claims and resolutions over time:
```r
GINA.resolutions %>%
  filter(metric == "Receipts" | metric == "Resolutions") %>%
  ggplot(aes(year, value, color = metric)) +
  geom_line() +
  labs(
    title = "EEOC Enforcement of GINA Charges",
    subtitle = "Claims and Resolutions, FY 2010 - FY 2017",
    caption = paste0("Source: ", url),
    x = "", y = ""
  ) +
  scale_color_brewer("", palette = "Paired")
```

The number of GINA claims rose for the first few years, but then declined to enactment-year levels. This could support the opposing lawyer’s argument that enforcement is waning. However, it could just as easily show that the law’s deterrent effect has proven effective and that most firms subject to GINA are now aware of the law and have taken appropriate steps toward compliance. Given these competing interpretations, we’ll need to look a little deeper to see if we can derive trends from the data.
GINA Claim Resolutions

One of the arguments made by the opposing lawyer is that the Obama administration was pushing GINA enforcement, and that the Trump administration hates the law and won’t enforce it. We can look at the resolution types to test this hypothesis:
```r
GINA.resolutions %>%
  filter(metric != "Receipts" & metric != "Resolutions") %>%
  ggplot(aes(year, value)) +
  geom_line() +
  facet_wrap(~ metric, scales = "free_y") +
  labs(
    title = "EEOC Enforcement of GINA Charges",
    subtitle = "Resolutions by Type, FY 2010 - FY 2017",
    caption = paste0("Source: ", url),
    x = "", y = ""
  )
```

In 2017, the first year of the Trump administration, administrative closures were down and resolutions on the merits were up, which contradicts the opposing lawyer’s argument. While findings of no reasonable cause were up about 10-15%, findings of reasonable cause were up 50%; if anything, this also contradicts the opposing lawyer’s argument. Monetary settlements appear to be relatively flat from 2013 – 2017, and in any event a million dollars isn’t a ton of money in light of the EEOC’s annual budget of about $376 million (note that the EEOC handles many other types of charges besides GINA).
The resolution type that jumped most markedly in 2017 was “unsuccessful conciliation.” A conciliation is where the EEOC “attempt[s] to achieve a just resolution of all violations found and to obtain agreement that the respondent will eliminate the unlawful employment practice and provide appropriate affirmative relief.” 29 C.F.R. § 1601.24. It’s unclear why this jump occurred from the summary statistics provided by the EEOC.
Finally, I thought it was useful to plot all the resolution types together to show relative numbers:
```r
GINA.resolutions %>%
  filter(metric != "Receipts" &
         metric != "Resolutions" &
         metric != "Monetary Benefits (Millions)*") %>%
  ggplot(aes(year, value, color = metric)) +
  geom_line() +
  labs(
    title = "EEOC Enforcement of GINA Charges",
    subtitle = "Resolutions by Type, FY 2010 - FY 2017",
    caption = paste0("Source: ", url),
    x = "", y = ""
  ) +
  # scale_y_sqrt() +
  scale_color_brewer("Resolution Type", palette = "Paired")
```

From this perspective, it does look like the long-term trend has been for the EEOC to dismiss a majority of GINA-related charges as unfounded (i.e., no reasonable cause). However, for the cases that do have merit, it appears the EEOC has reversed an initial trend showing a preference toward administrative closure.
Conclusion

In all, I didn’t find the opposing lawyer’s argument particularly compelling in light of the data from President Trump’s first year in office. However, the first month of 2017 was President Obama’s last in office, and there was a flurry of activity by many regulatory agencies. It wouldn’t surprise me if the EEOC also participated in a high volume of lame-duck activity, and a lot of activity in January 2017 could have skewed the annual results. Monthly statistics would be nice but didn’t appear to be readily available. The goal with any R project is for it to be repeatable with additional data, so it will be interesting to see what the data from FY2018 shows.
This wasn’t a particularly complicated coding project – in fact, this write-up took me longer to produce than writing the actual code and coming to conclusions about whether GINA is on its last legs or not. Despite that fact, I thought it was a good example of how data science can be used to inform solutions to simple as well as complex problems.
To leave a comment for the author, please follow the link and comment on their blog: R – Nathan Chaney. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Introducing the Kernelheaping Package II
(This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers)
In the first part of Introducing the Kernelheaping Package I showed how to compute and plot kernel density estimates on rounded or interval censored data using the Kernelheaping package. Now, let’s make a big leap forward to the 2-dimensional case. Interval censoring can be generalised to rectangles or alternatively even arbitrary shapes. That may include counties, zip codes, electoral districts or administrative districts. Standard area-level mapping methods such as choropleth maps suffer from very different area sizes or odd area shapes, which can greatly distort the visual impression. The Kernelheaping package provides a way to convert these area-level data to a smooth point estimate. For the German capital city of Berlin, for example, there exists an open data initiative, where data on e.g. demographics is publicly available. We first load a dataset on the Berlin population, which can be downloaded from: https://www.statistik-berlin-brandenburg.de/opendata/EWR201512E_Matrix.csv

```r
library(dplyr)
library(fields)
library(ggplot2)
library(Kernelheaping)
library(maptools)
library(RColorBrewer)
library(rgdal)
gpclibPermit()
```

```r
data <- read.csv2("EWR201512E_Matrix.csv")
```

This dataset contains the numbers of inhabitants in certain age groups for each administrative district. Afterwards, we load a shapefile with these administrative districts, available from: https://www.statistik-berlin-brandenburg.de/opendata/RBS_OD_LOR_2015_12.zip

```r
berlin <- readOGR("RBS_OD_LOR_2015_12/RBS_OD_LOR_2015_12.shp")
```

```r
berlin <- spTransform(berlin, CRS("+proj=longlat +datum=WGS84"))
```

Now, we form our input data set, which contains the area/polygon centers and the variable of interest, whose density should be calculated. In this case we’d like to calculate the spatial density of people between 65 and 80 years of age:

```r
dataIn <- lapply(berlin@polygons, function(x) x@labpt) %>%
  do.call(rbind, .) %>%
  cbind(data$E_E65U80)
```

In the next step we calculate the bivariate kernel density with the “dshapebivr” function (this may take some minutes), using the prepared data and the shapefile:

```r
est <- dshapebivr(data = dataIn, burnin = 5, samples = 10, adaptive = FALSE,
                  shapefile = berlin, gridsize = 325, boundary = TRUE)
```

To plot the map with "ggplot2", we need to perform some additional data preparation steps:

```r
berlin@data$id <- as.character(berlin@data$PLR)
berlin@data$E_E65U80 <- data$E_E65U80
berlinPoints <- fortify(berlin, region = "id")
berlinDf <- left_join(berlinPoints, berlin@data, by = "id")
kData <- data.frame(expand.grid(long = est$Mestimates$eval.points[[1]],
                                lat = est$Mestimates$eval.points[[2]]),
                    Density = as.vector(est$Mestimates$estimate)) %>%
  filter(Density > 0)
```

Now, we are able to plot the density together with the administrative districts:

```r
ggplot(kData) +
  geom_raster(aes(long, lat, fill = Density)) +
  ggtitle("Bivariate density of Inhabitants between 65 and 80 years") +
  scale_fill_gradientn(colours = c("#FFFFFF", "#5c87c2", "#19224e")) +
  geom_path(color = "#000000", data = berlinDf, aes(long, lat, group = group)) +
  coord_quickmap()
```

This map gives a much better overall impression of the distribution of older people than a simple choropleth map:

```r
ggplot(berlinDf) +
  geom_polygon(aes(x = long, y = lat, fill = E_E65U80, group = id)) +
  ggtitle("Number of Inhabitants between 65 and 80 years by district") +
  scale_fill_gradientn(colours = c("#FFFFFF", "#5c87c2", "#19224e"), "n") +
  geom_path(color = "#000000", data = berlinDf, aes(long, lat, group = group)) +
  coord_quickmap()
```

Often, as is the case with Berlin, we may have large uninhabited areas such as woods or lakes. Furthermore, we may like to compute the proportion of older people compared to the overall population in a spatial setting.
The third part of this series shows how you can compute boundary-corrected and smooth proportion estimates with the Kernelheaping package.

Further parts of the article series Introducing the Kernelheaping Package:
 Part 1: Univariate Kernel Density Estimation for Heaped Data
 Part 2: Bivariate Kernel Density Estimation for Heaped Data
Using Machine Learning for Causal Inference
(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)
Machine learning (ML) is still an underdog in the field of economics, though it has gained more and more recognition in recent years. One reason for this underdog status is that in economics and other social sciences one is not only interested in prediction but also in making causal inference. Thus, many "off-the-shelf" ML algorithms solve a fundamentally different problem. We at STATWORX also face a variety of such problems, e.g. dynamic pricing optimization.
"Prediction by itself is only occasionally sufficient. The post office is happy with any method that predicts correct addresses from handwritten scrawls…[But] most statistical surveys have the identification of causal factors as their ultimate goal." – Bradley Efron
Introduction

However, the literature on combining ML and causal inference is growing by the day. One common problem of causal inference is the estimation of heterogeneous treatment effects. So, we will take a look at three interesting and different approaches to it and focus on a very recent paper by Athey et al., which is forthcoming in "The Annals of Statistics"[1].
Model-based Recursive Partitioning

One of the earlier papers about causal trees is by Zeileis et al., 2008[2]. They describe an algorithm for model-based recursive partitioning (MOB), which extends recursive partitioning to more complex models. They first fit a parametric model to the data set using maximum likelihood, then test for parameter instability over a set of predefined variables, and lastly split the model on the variable exhibiting the highest parameter instability. These steps are repeated in each of the daughter nodes until a stopping criterion is reached. However, they do not provide statistical properties for the MOB, and the partitions are still quite large.
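The MOB algorithm itself isn't shown in the post; as an illustration, it is implemented in the partykit package as mob()/lmtree(). A minimal sketch on a built-in data set (the model and partitioning variables are my own choice, purely for illustration):

```r
library(partykit)  # assumed installed from CRAN

# Fit mpg ~ wt within each node; MOB tests the coefficients for
# instability across cyl and hp and splits on the variable with
# the strongest instability, repeating until a stopping criterion
fit <- lmtree(mpg ~ wt | cyl + hp, data = mtcars)
print(fit)  # shows the fitted partition and per-node coefficients
```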
Bayesian Additive Regression Trees

Another paper uses Bayesian Additive Regression Trees (BART) for the estimation of heterogeneous treatment effects[3]. One advantage of this approach is that BART can detect and handle interactions and non-linearity in the response surface. It uses a sum-of-trees model: first, a weak-learning tree is grown, the residuals are calculated, and the next tree is fitted to these residuals. Similar to boosting algorithms, BART aims to avoid overfitting. This is achieved by using a regularization prior, which restricts the contribution of each tree to the final result and thus guards against overfitting.
Generalized Random Forest

However, this and the next blog post will mainly focus on the Generalized Random Forest (GRF) by Athey et al., who have already been exploring the possibilities of ML in economics. It is a method for non-parametric statistical estimation that keeps the basic ideas of the random forest: recursive partitioning, subsampling and random split selection. Nevertheless, the final outcome is not estimated via simple averaging over the trees. Instead, the forest is used to estimate an adaptive weighting function: we grow a set of trees, and each observation is weighted according to how often it falls into the same leaf as the target observation. Those weights are then used to solve a local GMM model.
Another important piece of the GRF is the split selection algorithm, which emphasizes maximizing heterogeneity. Within this framework, a wide variety of applications is possible, such as quantile regression but also the estimation of heterogeneous treatment effects. Therefore, the split selection must be suitable for many different purposes. As in Breiman's random forest, splits are selected greedily. However, in the case of general moment estimation, we don't have a direct loss criterion to minimize. Instead, we want to maximize a criterion Δ that favors splits which increase the heterogeneity of our in-sample estimates. Maximizing Δ directly would be computationally costly, so Athey et al. use a gradient-based approximation. This results in computational performance similar to standard CART approaches.
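As an illustration of what this machinery is used for, here is a hedged sketch (my own simulation, not from the post) of estimating heterogeneous treatment effects with grf::causal_forest():

```r
library(grf)  # assumed installed from CRAN

set.seed(42)
n <- 1000; p <- 4
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)           # binary treatment assignment
tau <- pmax(X[, 1], 0)           # true heterogeneous treatment effect
Y <- tau * W + X[, 2] + rnorm(n)

cf <- causal_forest(X, Y, W)
tau_hat <- predict(cf)$predictions  # out-of-bag CATE estimate per observation
```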
Comparing the regression forest of GRF to a standard random forest

Athey et al. claim in their paper that in the special case of a regression forest, the GRF obtains the same results as the standard random forest by Breiman (2001). One estimation method already implemented in the grf package[4] is a regression forest, so I will compare its results with the random forest implementations of the randomForest package as well as the ranger package. For tuning purposes, I will use a random search with 50 iterations for the randomForest and ranger packages, and for grf the built-in tune_regression_forest() function. The algorithms are benchmarked on three data sets, which have already been used in another blog post, using the RMSE to compare the results. For easy handling, I implemented regression_forest() into the caret framework, which can be found on my GitHub.
| Data Set | Metric | grf  | ranger | randomForest |
|----------|--------|------|--------|--------------|
| air      | RMSE   | 0.25 | 0.24   | 0.24         |
| bike     | RMSE   | 2.90 | 2.41   | 2.67         |
| gas      | RMSE   | 36.0 | 32.6   | 34.4         |

The GRF performs slightly worse than the other implementations. However, this could also be due to the tuning of the parameters, because there are more parameters to tune. According to their GitHub, the authors are planning to improve the tune_regression_forest() function.
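The post's benchmark data isn't shown, but a comparison of this kind can be sketched on simulated data (my own toy setup, not the post's benchmark; both packages assumed installed from CRAN):

```r
library(grf)
library(ranger)

set.seed(1)
n <- 500; p <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] + rnorm(n)

rf_grf  <- regression_forest(X, Y)
oob_grf <- predict(rf_grf)$predictions   # grf's out-of-bag predictions

rf_rng  <- ranger(y ~ ., data = data.frame(y = Y, X))
oob_rng <- rf_rng$predictions            # ranger's out-of-bag predictions

rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))
c(grf = rmse(Y, oob_grf), ranger = rmse(Y, oob_rng))
```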
One advantage of the GRF is that it produces unbiased confidence intervals for each estimation point. To do so, it performs honest tree splitting, which was first described in the authors' paper about causal trees[5]. With honest tree splitting, one sample is used to make the splits and another, distinct sample is used to estimate the coefficients.
However, standard regression is not the exciting part of the Generalized Random Forest. In my next blog post, I will therefore take a look at how the GRF performs in estimating heterogeneous treatment effects on simulated data and compare it to the estimation results of the MOB and the BART.
References

[1] Athey, Tibshirani, Wager. Forthcoming. "Generalized Random Forests"
[2] Zeileis, Hothorn, Hornik. 2008. "Model-based Recursive Partitioning"
[3] Hill. 2011. "Bayesian Nonparametric Modeling for Causal Inference"
[4] https://github.com/swager/grf
[5] Athey and Imbens. 2016. "Recursive partitioning for heterogeneous causal effects."
The post Using Machine Learning for Causal Inference first appeared on STATWORX.
Announcing the R Markdown Book
(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)
It is exciting for us to see the amazing growth of the R Markdown ecosystem over the four years since the creation of R Markdown in 2014. Now you can author many types of documents, and build a wide range of applications based on R Markdown. As an effort to unite and improve the documentation of the R Markdown base package (rmarkdown) and several other extensions (such as bookdown, blogdown, pkgdown, flexdashboard, tufte, xaringan, rticles, and learnr) in one place, we authored a book titled “R Markdown: The Definitive Guide”, which is to be published by Chapman & Hall/CRC in about two weeks.
You can pre-order a copy now if you like. Our publisher is generous enough to allow us to provide a complete online version of this book at https://bookdown.org/yihui/rmarkdown/, which you can always read for free. The full source of this book is also freely and publicly available in the GitHub repo https://github.com/rstudio/rmarkdownbook.
Please feel free to let us know if you have any questions or suggestions regarding this book. You are always welcome to send GitHub pull requests to help us improve the book.[1] We hope you will find this book useful.
[1] When you are reading the online version of the book, you may click the Edit button on the toolbar to edit the source file of a page, and follow the guide on GitHub to create a pull request.
How to Aggregate Data in R
(This article was first published on R – Displayr, and kindly contributed to R-bloggers)
The process involves two stages. First, collate individual cases of raw data together with a grouping variable. Second, perform the calculation you want on each group of cases. These two stages are wrapped into a single function.
To perform aggregation, we need to specify three things in the code:
 The data that we want to aggregate
 The variable to group by within the data
 The calculation to apply to the groups (what you want to find out)
The raw data shown below consists of one row per case. Each case is an employee at a restaurant.
Load the example data by running the following R code:
library(flipAPI)
data = DownloadXLSX("https://wiki.qresearchsoftware.com/images/1/1b/Aggregation_data.xlsx", want.row.names = FALSE, want.data.frame = TRUE)

Perform aggregation with the following R code.
agg = aggregate(data, by = list(data$Role), FUN = mean)

This produces a table of the average salary and age by role, as below.
The first argument to the function is usually a data.frame.
The by argument is a list of variables to group by. This must be a list even if there is only one variable, as in the example.
The FUN argument is the function which is applied to all columns (i.e., variables) in the grouped data. Because we cannot calculate the average of categorical variables such as Name and Shift, they result in empty columns, which I have removed for clarity.
Other aggregation functions

Any function that can be applied to a numeric variable can be used within aggregate. Maximum, minimum, count, standard deviation and sum are all popular.
For more specific purposes, it is also possible to write your own function in R and refer to that within aggregate. I’ve demonstrated this below where the second largest value of each group is returned, or the largest if the group has only one case. Note also that the groups are formed by Role and by Shift together.
second = function(x) {
  if (length(x) == 1) return(x)
  return(sort(x, decreasing = TRUE)[2])
}
agg = aggregate(data, by = list(data$Role, data$Shift), FUN = second)

Additional features

The aggregate function has a few more features to be aware of:
 Grouping variable(s) and variables to be aggregated can be specified with R’s formula notation.
 Setting drop = TRUE means that any groups with zero count are removed.
 na.action controls the treatment of missing values within the data.
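Since the post's spreadsheet isn't needed to try these features, here is a self-contained sketch on a made-up data frame (the names and values are my own, chosen to mirror the restaurant example):

```r
# Hypothetical data mirroring the post's restaurant example
staff <- data.frame(
  Name   = c("Ann", "Bob", "Cat", "Dan", "Eve"),
  Role   = c("Cook", "Cook", "Server", "Server", "Server"),
  Shift  = c("Day", "Night", "Day", "Night", "Day"),
  Salary = c(30, 32, 24, 26, NA),
  Age    = c(40, 35, 22, 28, 31)
)

# Formula notation: aggregate Salary and Age by Role in one call
agg_f <- aggregate(cbind(Salary, Age) ~ Role, data = staff, FUN = mean)

# na.action: the default (na.omit) silently drops rows with missing
# values first; na.pass keeps them, so FUN must then handle NAs itself
agg_na <- aggregate(Salary ~ Role + Shift, data = staff,
                    FUN = sum, na.action = na.pass)
```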
TRY IT OUT
The analysis in this post was performed in Displayr using R. You can repeat or amend this analysis for yourself in Displayr.
New open data sets from Microsoft Research
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
Microsoft has released a number of data sets produced by Microsoft Research and made them available for download at Microsoft Research Open Data.
The datasets in Microsoft Research Open Data are categorized by their primary research area, such as Physics, Social Science, Environmental Science, and Information Science. Many of the data sets have not been previously available to the public, and many are large and useful for research in AI and machine learning techniques. Many of the datasets also include links to associated papers from Microsoft Research. For example, the 10GB DESM Word Embeddings dataset provides the IN and OUT word2vec embeddings for 2.7M words trained on a Bing query corpus of 600M+ queries.
Other data sets of note include:
 A collection of 38M tweets related to the 2012 US election
 3D capture data from individuals performing a variety of hand gestures
 Infer.NET, a framework for running Bayesian inference in graphical models
 Images for 1 million celebrities, and associated tags
 MS MARCO, a new large-scale dataset for reading comprehension and question answering
Most data sets are provided as plain text files, suitable for importing into Python, R, or other analysis tools. In addition to downloading the data, you can also deploy the datasets for analysis into Microsoft Azure: see the FAQ for details on this and other ways the datasets can be used.
As of this writing, the archive includes 51 datasets. You can explore and download the Microsoft Research datasets at the link below.
Microsoft: Microsoft Research Open Data
Coloured output in the R console
(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)
Just a little fun today… the R console isn’t the most interesting of things… text is typically either black or red (assuming default settings in RStudio). There’s a package called crayon, though, which allows one to change the style of text in terms of colour, background and some font-type settings. It could be an interesting way to spice up summaries (I’m not recommending it, but it’s a possibility; as far as packages are concerned, it’s just another dependency…)…
devtools::install_github("r-lib/crayon")
library(crayon)
cat(green(
  'I am a green line ' %+%
  blue$underline$bold('with a blue substring') %+%
  yellow$italic(' that becomes yellow and italicised!\n')
))
My upcoming conference talks & workshops: Mcubed, ML Summit & data2day
(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)
I’ll be giving talks and workshops at the following three upcoming conferences; hope to meet some of you there!
 From 15th to 17th October 2018, I’ll be in London for the Mcubed conference. My talk about Explaining complex machine learning models with LIME will take place on October 16.
Traditional machine learning workflows focus heavily on model training and optimization; the best model is usually chosen via performance measures like accuracy or error and we tend to assume that a model is good enough for deployment if it passes certain thresholds of these performance criteria. Why a model makes the predictions it makes, however, is generally neglected. But being able to understand and interpret such models can be immensely important for improving model quality, increasing trust and transparency and for reducing bias. Because complex machine learning models are essentially black boxes and too complicated to understand, we need to use approximations.
Required audience experience: Basic knowledge of machine learning
Objective of the talk: Listeners will get an overview of why understanding machine learning models is important, how it can help us improve models and help gain trust in their decisions. I will explain in detail how one popular approach to explaining complex models – LIME – works and show an example analysis.
 At the ML Summit held on October 1st and 2nd in Berlin, Germany, I’ll be giving a workshop about image classification with Keras (German language).
Bildklassifikation leicht gemacht – mit Keras und TensorFlow (Image classification made easy – with Keras and TensorFlow)
Tuesday, 2 October 2018, 10:00 – 13:00. For a long time, the automatic recognition of objects, people and scenes in images by computers was considered impossible: the complexity simply seemed too great to teach to an algorithm programmatically. But neural networks have changed that drastically! Image recognition is now a widespread application of machine learning, often using so-called “convolutional neural networks” (“ConvNets”). In this workshop I will show how easy it is to build such a neural network yourself, using Keras and TensorFlow. We will first train a complete network ourselves: from reading in the images, to defining the network, to evaluating it on test images. Then we will look at how, with transfer learning and pre-trained networks, you can quickly see results even with only a few of your own images. The last part of the workshop covers how to better understand these image recognition models, for example by visualizing the nodes in intermediate layers; this lets us find patterns and image regions that matter for the classification, and trace how the model classifies. Installation notes: we will work with Python 3 in Google Colaboratory.
 Together with my colleague Mark, I’ll be giving a workshop about “END2END VOM KERAS TENSORFLOWMODELL ZUR PRODUKTION” (end-to-end: from Keras TensorFlow model to production) at the data2day conference, which is being held from September 25th – 27th 2018 in Heidelberg, Germany (German language).
Due to rapidly growing data volumes, the role of data scientists has expanded. Instead of building machine learning models for one-off analyses, they increasingly work in concrete development projects in which prototypes are turned into productive applications. Keras is a high-level interface that enables fast, simple and flexible prototyping of neural networks with TensorFlow. Together with Luigi, arbitrarily complex data processing workflows can be built in Python. As a result, even non-developers can easily implement the end-to-end workflow from Keras TensorFlow model to production readiness.
phylogram: dendrograms for evolutionary analysis
(This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers)
Evolutionary biologists are increasingly using R for building,
editing and visualizing phylogenetic trees.
The reproducible codebased workflow and comprehensive array of tools
available in packages such as ape,
phangorn and
phytools make R an ideal platform for
phylogenetic analysis.
Yet the many different tree formats are not well integrated,
as pointed out in a recent
post.
The standard data structure for phylogenies in R is the “phylo”
object, a memory-efficient, matrix-based tree representation.
However, non-biologists have tended to use a tree structure
called the “dendrogram”, which is a deeply nested list with
node properties defined by various attributes stored at each level.
While certainly not as memory efficient as the matrix-based format,
dendrograms are versatile and intuitive to manipulate, and hence
a large number of analytical and visualization functions exist
for this object type. A good example is the
dendextend package,
which features an impressive range of options for editing dendrograms
and plotting publication-quality trees.
To better integrate the phylo and dendrogram object types,
and hence increase the options available for both camps,
we developed the phylogram
package, which is now a part of the rOpenSci
project.
This small package features a handful of functions for tree conversion,
importing and exporting trees as parenthetic text, and manipulating
dendrograms for phylogenetic applications.
The phylogram package draws heavily on ape,
but currently has no other non-standard dependencies.
To download phylogram from CRAN and load the package, run
install.packages("phylogram")
library(phylogram)

Alternatively, to download the latest development version from GitHub,
first ensure that the devtools,
kmer, and
dendextend
packages are installed,
then run:
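(The installation command itself appears to have been dropped in formatting; assuming the development repository is ropensci/phylogram, it would look something like:)

```r
devtools::install_github("ropensci/phylogram")
library(phylogram)
```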
A wide variety of tree formats can be parsed as phylo objects using either the
well-optimized ape::read.tree function
(for Newick
strings),
or the suite of specialized functions in the versatile
treeio package.
To convert a phylo object to a dendrogram, the phylogram package includes
the function as.dendrogram, which retains node height attributes and can handle
non-ultrametric trees.
For single-line parsing of dendrograms from Newick text,
the read.dendrogram function wraps ape::read.tree
and converts the resulting phylo class object to a dendrogram using as.dendrogram.
Similarly, the functions write.dendrogram and as.phylo are used to
export dendrogram objects to parenthetic text and phylo objects, respectively.
The phylogram package includes some new functions for manipulating
trees in dendrogram format.
Leaf nodes and internal branching nodes can be removed
using the function prune, which identifies and
recursively deletes nodes based on pattern
matching of “label” attributes.
This is slower than ape::drop.tip, but offers
the benefits of versatile string matching using regular expressions,
and the ability to remove inner nodes (and by extension all subnodes)
that feature matching “label” attributes.
To aid visualization, the function ladder rearranges
the tree, sorting nodes by the number of members
(analogous to ape::ladderize).
For more controlled subsetting or when creating trees from scratch
(e.g. from a standard nested list), the function remidpoint
recursively corrects all “midpoint”, “members” and “leaf” attributes.
Node heights can then be manipulated using either reposition, which
scales the heights of all nodes in a tree by a given constant, or
as.cladogram, which resets the “height” attributes of all terminal
leaf nodes to zero and progressively resets the heights of the inner nodes
in single incremental units.
As an example, a simple three-leaf dendrogram can be created from
a nested list as follows:
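(The code block appears to have been lost in formatting; the following is a minimal sketch consistent with the remidpoint/as.cladogram workflow described above, with leaf labels "a", "b" and "c" of my own choosing:)

```r
library(phylogram)  # assumed installed

# Start from a plain nested list with three leaves
x <- list(1, list(2, 3))
attr(x[[1]], "leaf") <- TRUE
attr(x[[2]][[1]], "leaf") <- TRUE
attr(x[[2]][[2]], "leaf") <- TRUE
attr(x[[1]], "label") <- "a"
attr(x[[2]][[1]], "label") <- "b"
attr(x[[2]][[2]], "label") <- "c"
class(x) <- "dendrogram"

x <- remidpoint(x)    # fix "midpoint", "members" and "leaf" attributes
x <- as.cladogram(x)  # set node heights in unit increments
plot(x)
```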
A nice feature of the dendrogram object type is that tree
editing operations can be carried out recursively
using fast inbuilt functions in the “apply” family such as dendrapply
and lapply.
For example, to label each leaf node of the tree alphabetically we can
create a simple labeling function and apply it to the tree nodes recursively using
dendrapply.
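(The example code seems to have been lost in formatting; here is a base-R sketch of the idea, where the dendrogram itself is my own toy example:)

```r
# Toy three-leaf dendrogram built with base R
d <- as.dendrogram(hclust(dist(c(1, 3, 9))))

# Relabel leaves alphabetically; a counter in the enclosing
# environment tracks how many leaves have been visited so far
i <- 0
set_label <- function(node) {
  if (is.leaf(node)) {
    i <<- i + 1
    attr(node, "label") <- LETTERS[i]
  }
  node
}
d <- dendrapply(d, set_label)
labels(d)  # the leaves are now labelled "A", "B", "C"
```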
One application motivating bidirectional conversion between phylo and
dendrogram objects involves creating publication-quality ‘tanglegrams’ using
the dendextend package.
For example, to see how well the fast, alignment-free k-mer distance
from the kmer package
performs in comparison to the standard Kimura 1980 distance measure,
we can create neighbor-joining trees using each method and plot them side by side
to check for incongruent nodes.
In this case, the trees are congruent and branch lengths are similar.
However, if we reduce the k-mer size from 7 to 6,
the accuracy of the tree reconstruction is affected, as shown by the
incongruence between the original K80 tree (left) and the tree derived
from the 6-mer distance matrix (right):
Hopefully users will find the package useful for a range of other applications.
Bug reports and other suggestions are welcomed, and can be directed to the
GitHub issues page
or the phylogram google group.
Thanks to Will Cornwell and
Ben J. Ward
for reviewing the code and suggesting improvements,
and to Scott Chamberlain
for handling the rOpenSci
onboarding process.
The phylogram package is available for download from
GitHub and
CRAN,
and a summary of the package is published in the
Journal of Open Source Software.
To leave a comment for the author, please follow the link and comment on their blog: rOpenSci  open tools for open science. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Four Ways to Write Better Stan Code
(This article was first published on R – Displayr, and kindly contributed to R-bloggers)
1. Improve sampler efficiency by picking the right model
We need to address how we specify our models before even discussing computationally optimized code. Make sure that your model is parameterized in such a way that Stan can easily sample it using its algorithms: the No-U-Turn Sampler (NUTS) or Hamiltonian Monte Carlo (HMC). Simply selecting the right model can make a big difference in terms of efficiency. Due to differences in implementations and algorithms, an efficient parameterization in Stan is not necessarily the one that was best in other software we’ve tried.
A poorly specified model will require more samples to reach convergence and adequately explore the posterior distribution, or may not converge at all. For some models, reparameterization can be an effective means of improving sampling efficiency: replace a complex distribution that Stan has difficulty sampling from with one from which it can draw more easily. One example, discussed in Section 28.6 of the Stan manual, involves reparameterizing the Cauchy distribution, a challenge for Stan to sample from because of its heavy tails. The difficulty can be fixed by instead sampling a uniform random variable and applying the probability integral transform.
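In Stan this reparameterization happens in the model block; the transform itself is easy to check in R (a standalone sketch, not code from the Stan manual):

```r
# probability integral transform for the standard Cauchy:
# if u ~ Uniform(0, 1), then tan(pi * (u - 0.5)) follows a Cauchy(0, 1),
# because tan(pi * (u - 0.5)) is the Cauchy quantile (inverse CDF) function
set.seed(1)
u <- runif(1e5)
x <- tan(pi * (u - 0.5))

# sanity check against the theoretical quartiles of Cauchy(0, 1): -1, 0, 1
quantile(x, c(0.25, 0.5, 0.75))
```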
2. Matrices or arrays: pick carefully!
It can also be confusing whether to use matrices or arrays when writing your code. There are actually four different ways to specify a two-dimensional collection of real numbers! But which one should you pick? This largely depends on the operations you need to perform. If you need to do matrix computations, you should be using a matrix. However, if you frequently need to index into the rows, it is more efficient to use arrays. In that situation, it will save you a headache to declare an array of type row_vector rather than to work with matrices.
Matrices and arrays should also be traversed in different order. Loops involving arrays should be traversed with the last index varying fastest, whereas the opposite is true for matrices. Additionally, traversing through matrices is slightly more efficient. If for example your code involves an I x J array of matrices each having dimension R x C, then the most efficient way to write a loop that traverses every element is:
matrix[R,C] a[I,J];
for (i in 1:I)
for (j in 1:J)
for (c in 1:C)
for (r in 1:R)
a[i,j,r,c] = ……
Stan has a number of optimized built-in functions for common computations such as dot products and special matrix multiplications. You should use these whenever possible to save yourself from having to write your own code for these calculations. There are also functions that will improve speed by vectorizing code. For example, this loop
matrix[R,C] m;
for(r in 1:R)
m[r] ~ normal(0, 1);
should be replaced with the faster, vectorized code:
matrix[R,C] m;
to_vector(m) ~ normal(0,1);
Because Stan relies on compiled C++ code, it may be possible for advanced users to further optimize by changing compiler flags in R’s version of a Makefile, known as a Makevars file. It is recommended to do this in a user-specific file located at ~/.R/Makevars. Be careful though: overly aggressive compiler options can result in code that is not portable across machines and architectures. The trade-off, however, is code that is as fast as possible on your own computer. The RStan installation guide recommends adding CXXFLAGS=-O3 to the Makevars file for the highest level of overall optimization, but be warned: this can result in increased memory usage and larger binary file sizes!
We hope these four tips help you improve the efficiency of your coded models! Check out more tips and tricks here.
To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.
John Mount speaking on rquery and rqdatatable
(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)
rquery and rqdatatable are new R packages for data wrangling, either at scale (in databases, or big data systems such as Apache Spark) or in-memory. They speed up both execution (through optimizations) and development (through a good mental model and up-front error checking) for data wrangling tasks.
Win-Vector LLC‘s John Mount will be speaking on the rquery and rqdatatable packages at the East Bay R Language Beginners Group on Tuesday, August 7, 2018 (Oakland, CA).
One may ask: why share new packages with an “R Language Beginners Group”? (Though I prefer the term “part-time R user.”) Because such packages can give these users easy, safe, consistent, and performant access to powerful data systems. rquery establishes a notation and links to external data providers such as PostgreSQL, Amazon Redshift, and Apache Spark. rqdatatable supplies an additional in-memory implementation of the rquery notation using the powerful data.table package.
rquery and rqdatatable can, if given a fair chance, be significant R tools. Users are that chance.
If you are around please come see the talk.
Some earlier articles on rquery and rqdatatable can be found here and here.
To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.
New Course: Python for R Users
(This article was first published on DataCamp Community – r programming, and kindly contributed to R-bloggers)
Here is the course link.
Course Description
Python and R have seen immense growth in popularity in the "Machine Learning Age". They are both high-level languages that are easy to learn and write. The language you use will depend on your background and field of study and work. R is a language made by and for statisticians, whereas Python is a more general-purpose programming language. Regardless of background, there will be times when a particular algorithm is implemented in one language and not the other, a feature is better documented, or, simply, the tutorial you found online uses Python instead of R. In either case, this requires the R user to work in Python to get the job done, or to understand how something is implemented in Python so it can be translated into R. This course helps you cross the R-Python language barrier.
Chapter 1: The Basics (Free)
Learn about some of the most important data types (integers, floats, strings, and booleans) and data structures (lists, dictionaries, numpy arrays, and pandas DataFrames) in Python, and how they compare to the ones in R.
Chapter 2: Control Flow, Loops, and Functions
This chapter covers control flow statements (if, else if, else) and for loops, and shows you how to write your own functions in Python!
Chapter 3: Pandas
In this chapter you will learn more about one of the most important Python libraries, pandas. In addition to DataFrames, pandas provides several data manipulation functions and methods.
Chapter 4: Plotting
You will learn about the rich ecosystem of visualization libraries in Python. This chapter covers matplotlib, the core visualization library in Python, along with the pandas and seaborn libraries.
Chapter 5: Capstone
As a final capstone, you will apply your Python skills to the NYC Flights 2013 dataset.
Prerequisites
To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community – r programming.
[ggplot2] Welcome viridis !
(This article was first published on (en) The R Task Force, and kindly contributed to R-bloggers)
Let’s welcome the viridis palette into the new version of {ggplot2}!
Viri-what?
viridis is one of the favorite color palettes of one of the members of the team (guess who).
The viridis palette was first developed for the Python package matplotlib, and has since been implemented in R.
The strengths of this palette are that:
 plots are beautiful (which is good enough a reason to use it)
 colors are perfectly perceptually uniform, even when printed in black and white (yes, people still print stuff in 2018…)
 colors are still distinguishable by people with the most common forms of color blindness (an important feature)
This palette is now the default colormap of the matplotlib Python package. It had been available as a color scale in R with the {viridisLite} package, then as a ggplot2 scale in {viridis}, and is now fully integrated into {ggplot2} with the latest release.
Why viridis? The story behind
As said in A Better Default Colormap for Matplotlib, “colormaps are interfaces between your data and your brain”, which means that the choice of colors can have a significant effect on the way you make decisions. We won’t go into the details of color theory (there is a lot of literature on that topic, and we are not the best placed to speak in depth about it), but it’s pretty obvious that colors influence the way you perceive things.
For example, in western Europe (and I guess in many countries around the globe), we are culturally conditioned to think of a red sign as an alert. Green, on the other hand, tends to indicate something safe. If today I had to design a color scheme for error / warning / success, I would undoubtedly go for a red / orange / green solution. I’m not the only one, and the first example that comes to my mind is the {shinyalert} package, which uses this exact model of colors. This seems simple (and it is), and there are a lot of articles and books about this paradigm applied to marketing and selling (I’ll let you Google Scholar that). For example, here is an “emotion color wheel” called Plutchik’s wheel of emotions, made by the psychologist of the same name:
NB: we do not share / endorse this simplified theory of color, it’s just used there as an example
Let’s take a more serious example, borrowed from the talk I linked just before. In an article (that sounds like a lot of fun) called “Evaluation of artery visualizations for heart disease diagnosis” (link), the authors showed how a color palette can play a critical role in diagnosing the presence of a heart disease. I’ll let you dig into the details if you want to, but long story short, the color palette of the software used to visualise the results affects the ability of professionals to read and detect a specific condition.
Why are we talking about that? Because as we want R to be used in production for visualising data, it’s important to keep in mind that the color scheme chosen for the end product is not without impact, whether simply on readability or on the emotions conveyed by the graph.
Here comes a new challenger
The standard color palette in a lot of tools is called the Jet palette. You might be familiar with it, as it is implemented in a lot of software. This color scheme appears to many as a good choice, since we tend to think this palette makes colors easy to distinguish (spoiler: it doesn’t).
It looks like this:
library(matlab)
## Attaching package: 'matlab'
## The following object is masked from 'package:stats': reshape
## The following objects are masked from 'package:utils': find, fix
## The following object is masked from 'package:base': sum
This example is taken from the viridis vignette, put into a function:
with_palette <- function(palette) {
  x <- y <- seq(-8 * pi, 8 * pi, len = 40)
  r <- sqrt(outer(x^2, y^2, "+"))
  filled.contour(cos(r^2) * exp(-r / (2 * pi)),
                 axes = FALSE,
                 color.palette = palette,
                 asp = 1)
}
with_palette(jet.colors)
Let’s just for a second compare to the same data with viridis:
library(viridis)
## Loading required package: viridisLite
with_palette(viridis)
If it seems obvious to you that the viridis palette is easier to read… it’s because it is.
How to implement a new color palette
With the viridis colormap, the purpose was to have something that is:
 colorful
 pretty
 sequential
 perceptually uniform
 working with black and white
 accessible to colorblind viewers
So how can one do that? How can one implement a new palette that meets these demands? Before getting into that, let’s analyse how data goes from your computer to your brain.
To be displayed, the data is transformed into RGB notation (for “Red Green Blue”), either as hex (#FFCC99), as percentages (rgb(100%,80%,60%): 100% red, 80% green, 60% blue), or as numbers between 0 and 255 (rgb(255,204,153)). All these notations are equivalent: the hexadecimal notation represents numbers from 0 to 255 with 16 symbols (0-9 & A-F), 16^2 being 256.
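These equivalences are easy to verify with the base R rgb() function:

```r
# the notations above describe the same color
rgb(255, 204, 153, maxColorValue = 255)  # "#FFCC99"
rgb(1.00, 0.80, 0.60)                    # also "#FFCC99" (percentages as fractions)
```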
In a lot of cases, two extra symbols are added to the end of the hex code to indicate transparency. For example:
viridis(2)
## [1] "#440154FF" "#FDE725FF"
Note: this is made possible by the fact that a pixel is coded using 32 bits, while a color is represented on only 24 of these 32 bits (3 * 8), leaving room for another 8 bits.
Note bis: remember that the hex notation allows 256 combinations per channel, which corresponds to the number of possible combinations of 8 bits (2^8).
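A quick base-R illustration of those trailing alpha digits:

```r
# rgb() appends the alpha channel when one is supplied
col <- rgb(1, 0.8, 0.6, alpha = 1)
col                # "#FFCC99FF": 24 bits of color plus 8 bits of alpha
substr(col, 8, 9)  # "FF" = 255, i.e. fully opaque
```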
So once you have your lowest color and your highest, you can use a color palette generator to get a range of colors from one to the other. It’s done, for example, by the colorRampPalette function from {grDevices}:
colfunc <- colorRampPalette(c("black", "white"))
colfunc(10)
[1] "#000000" "#1C1C1C" "#383838" "#555555" "#717171" "#8D8D8D" "#AAAAAA" "#C6C6C6"
[9] "#E2E2E2" "#FFFFFF"
The computation of this range is out of the scope of this article, but you get the idea. Once we have decided what our palette should look like, it is then possible to scale your data to fit the palette: as the palette goes from one point of the color scale to another through a series of intermediate tones, you can give each value of your data a spot in your colormap. The lowest value of your data is mapped to the lowest value of the palette, and the highest to the highest.
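A minimal base-R sketch of this mapping (the linear rescaling formula here is an illustration, not the code ggplot2 actually uses):

```r
pal <- colorRampPalette(c("black", "white"))(100)

# linearly rescale data values onto palette indices 1..100
x <- c(2, 5, 9)
idx <- 1 + round(99 * (x - min(x)) / (max(x) - min(x)))

pal[idx]  # lowest value -> darkest color, highest value -> lightest
```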
The viridis colormap
So now that we have the theory, let’s get to the big question: how can we choose a palette that is colorblind-friendly? The solution found by the viridis colormap is to avoid red and to go from blue to yellow, as these colors can be distinguished by people with the most common forms of colorblindness.
Let’s use the {dichromat} package to simulate that :
library(dichromat)
Let’s see what the jet palette looks like when you simulate several types of colorblindness:
library(purrr)
with_palette(compose(partial(dichromat, type = "deutan"), jet.colors))
with_palette(compose(partial(dichromat, type = "protan"), jet.colors))
with_palette(compose(partial(dichromat, type = "tritan"), jet.colors))
And with viridis:
with_palette( compose( partial(dichromat, type = "deutan"), viridis ) ) with_palette( compose( partial(dichromat, type = "protan"), viridis ) ) with_palette( compose( partial(dichromat, type = "tritan"), viridis ) )As you can see, the contrast stays readable in every situation.
Finally, to suit printing needs, the palette had to go from dark to light, and the widest such range runs from dark blue to bright yellow. Let’s compare two examples:
with_palette(compose(colorspace::desaturate, jet.colors))
with_palette(compose(colorspace::desaturate, viridis))
Back to R
So, enough theory, time to get back to R.
viridis as a color palette
Let’s imagine for a minute we are using the old version of ggplot2 (as many still do as we write these lines).
When using this (now old) version of {ggplot2}, we needed the {viridis} package to use this color palette.
library(ggplot2)
library(viridis)
ggplot(iris) +
  aes(Sepal.Length, Sepal.Width, color = Petal.Length) +
  geom_point() +
  scale_colour_viridis() # from the viridis package
Another way to do that was to get a vector of hex colors with the {viridisLite} package (note that the functions from {viridis} call functions from {viridisLite}).
pal <- viridisLite::viridis(3)
pal
## [1] "#440154FF" "#21908CFF" "#FDE725FF"
ggplot(iris) +
  aes(Sepal.Length, Sepal.Width, color = Species) +
  geom_point() +
  scale_color_manual(values = pal)
viridis as a ggplot scale
But now, thanks to the new ggplot2 (version 3.0.0), you can call the viridis palette directly with the ad hoc functions.
ggplot(iris) +
aes(Sepal.Length, Sepal.Width, color = Species) +
geom_point() +
scale_color_viridis_d()#From ggplot2
There are two variants of each scale: discrete (scale_color_viridis_d) for discrete inputs and continuous (scale_color_viridis_c) for numeric inputs.
And there is of course the counterpart with fill:
ggplot(faithfuld) +
  aes(waiting, eruptions, fill = density) +
  geom_tile() +
  scale_fill_viridis_c()
Note the difference in readability compared to the default palette:
ggplot(faithfuld) +
  aes(waiting, eruptions, fill = density) +
  geom_tile()
Also, viridis is the new default palette for ordered factors:
diamonds %>%
  dplyr::count(cut, color) %>%
  ggplot() +
  aes(cut, n, fill = color) +
  geom_col()
Finally, though I haven’t presented them so far, there are four other available colormaps: “magma” (or “A”), “inferno” (or “B”), “plasma” (or “C”), and “cividis” (or “E”), which can be passed to the option argument of the various scales:
ggplot(mtcars) + aes(mpg, disp, color = cyl) + geom_point() + scale_color_viridis_c(option = "A")
ggplot(mtcars) + aes(mpg, disp, color = cyl) + geom_point() + scale_color_viridis_c(option = "B")
ggplot(mtcars) + aes(mpg, disp, color = cyl) + geom_point() + scale_color_viridis_c(option = "C")
ggplot(mtcars) + aes(mpg, disp, color = cyl) + geom_point() + scale_color_viridis_c(option = "E")
So, to sum up:
And in conclusion, here is why the inclusion of viridis as a native scale in ggplot2 is good news:
 More beautiful plots (but in the end, that’s secondary)
 Better and more inclusive end products: easier to read for people with colorblindness or vision problems
NB: of course, this was already possible before, but the fact that it is now native will make it more widely used and better known.
The post [ggplot2] Welcome viridis ! appeared first on (en) The R Task Force.
To leave a comment for the author, please follow the link and comment on their blog: (en) The R Task Force.
Data Science For Business: 3 Reasons You Need To Learn The Expected Value Framework
(This article was first published on business-science.io – Articles, and kindly contributed to R-bloggers)
One of the most difficult and most critical parts of implementing data science in business is quantifying the return on investment, or ROI. As a data scientist in an organization, it’s of chief importance to show the value that your improvements bring. In this article, we highlight three reasons you need to learn the Expected Value Framework, a framework that connects a machine learning classification model to ROI. Further, we’ll point you to a new video we released, Expected Value Framework: Modeling Employee Churn With H2O, recently taught as part of our flagship course, Data Science For Business (DS4B 201). The video serves as an overview of the steps involved in calculating the ROI from reducing employee churn with H2O, tying the key H2O functions to the process. Last, we’ll go over some Expected Value Framework FAQs commonly asked in relation to applying expected value to machine learning classification problems in business.
Articles In This SeriesIf you’re interested in data science for business and / or the HR Analytics Employee Turnover problem and how it can be solved with H2O Automated Machine Learning, check out some of these tutorials.

HR Analytics: Using Machine Learning To Predict Employee Turnover

Data Science For Business Tutorial: Using Machine Learning With LIME To Understand Employee Churn
We’ll touch on the following Expected Value Framework topics in this article:
Alright, let’s get started!
Get The Best Resources In Data Science. Every Friday!
Sign up for our free “5 Topic Friday” Newsletter. Every week, I’ll send you the five coolest topics in data science for business that I’ve found that week. These could be new R packages, free books, or just some fun to end the week on.
Sign Up For FiveTopicFriday!
3 Reasons You Need To Learn The Expected Value Framework
Here are the 3 reasons you need to know about Expected Value if you want to tie data science to ROI for a machine learning classifier. We’ll use examples from the DS4B 201 course that are related to employee churn (also called employee turnover or employee attrition).
Reason #1: We Can Avoid The Problem With Max F1
The problem with machine learning classification is that most data scientists use the threshold that maximizes F1 as the classification threshold, because it is the default for most machine learning classification algorithms. For many problems outside of the business domain this is OK. However, for business problems, it is rarely the best choice.
For those wondering what the threshold at max F1 is: it’s the threshold that harmonically balances precision and recall (in other words, it aims to reduce both the false positives and the false negatives, finding a threshold that achieves a relative balance). The problem is that, in business, the costs associated with false positives (Type 1 errors) and false negatives (Type 2 errors) are rarely equal. In fact, in many cases false negatives are much more costly (by a factor of 3 to 1 or more!).
Example: Cost of Type 1 And Type 2 Errors For Employee Attrition
As an example, let’s focus on the cost difference between Type 1 (false positive) and Type 2 (false negative) errors in an employee attrition problem. Say we are considering a mandatory overtime reduction because we saw that employees flagged as working overtime are 5X more likely to quit. We develop a prediction algorithm using H2O Automated Machine Learning and then run our LIME algorithm to develop explanations at the employee level. The LIME results confirm our suspicion: overtime is a key feature supporting Employee Attrition.
Calculating Expected Attrition Cost From H2O + LIME Results
We develop a proposal to reduce overtime using our H2O classification model, which by default uses the threshold that maximizes F1 (treats Type 1 and Type 2 errors equally). We then begin targeting people for overtime reduction. We end up misclassifying people that leave as stay (Type 2 error) at roughly the same rate as we misclassify people that stay as leave (Type 1 error). The cost of overtime reduction for an employee is estimated at 30% of the lost productivity if the employee quits.
Here lies the problem: the cost of incorrectly reducing overtime for someone who stays is only 30% of the cost of missing the opportunity to reduce overtime for an employee incorrectly predicted to stay when they actually leave. In other words, a Type 2 error is 3X more costly than a Type 1 error, yet we are treating them the same!
“Type 2 Error is 3X more costly than Type 1 Error, yet we are treating them the same!”
Because of this, the optimal threshold for business problems is almost always less than the F1 threshold. This leads us to our second reason you need to know the Expected Value Framework.
Reason #2: We Can Maximize Expected Profit Using Threshold Optimization
When we have a calculation to determine the expected value using business costs, we can perform the calculation iteratively to find the optimal threshold that maximizes the expected profit or savings of the business problem. By iteratively calculating the savings generated at different thresholds, we can see which threshold optimizes the targeting approach.
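The iterative search can be sketched in a few lines of R. The probabilities, outcomes, and savings logic below are simulated stand-ins for illustration, not the course’s actual model:

```r
# hypothetical predicted attrition probabilities and actual outcomes
set.seed(42)
probs  <- runif(200)
actual <- rbinom(200, 1, probs)

attrition_cost <- 167000                 # cost when an employee leaves
overtime_cost  <- 0.30 * attrition_cost  # cost of the targeted policy per person

# expected savings if we target everyone at or above a given threshold
savings_at <- function(thr) {
  targeted <- probs >= thr
  sum(targeted & actual == 1) * (attrition_cost - overtime_cost) -
    sum(targeted & actual == 0) * overtime_cost
}

# sweep thresholds and pick the one that maximizes savings
thresholds <- seq(0, 1, by = 0.01)
savings <- vapply(thresholds, savings_at, numeric(1))
best <- thresholds[which.max(savings)]
```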
In the example below, we can see in the threshold optimization results that the maximum savings ($546K) occurs at a threshold of 0.149, which is 16% more savings than at the threshold that maximizes F1 ($470K). That F1-maximizing threshold was 0.280 and, for a test set containing 15% of the total population, being suboptimal cost $76K ($546K - $470K). Extending this inefficiency to the full population (train + test data), this is a missed opportunity of $500K annually!
Threshold Optimization Results in 16% Benefit for a Test Set Containing 15% of the Total Population; Extending to the Full Set Is $500K in Savings
The Expected Value Framework enables us to find the optimal threshold that accounts for the business costs, thus weighting Type 2 Errors appropriately. As shown, this can result in huge savings over the F1.
Ok, now we know we need to use Expected Value Framework. But it’s worth mentioning that the model we just built using H2O is not reality.
Wait, what?!?!
Here’s what I mean: the model is based on a number of assumptions, including the average overtime percentage, the anticipated net profit per employee, and so on. These assumptions are imperfect. This leads us to the third reason you need to learn the Expected Value Framework.
Reason #3: We Can Test Variability In Assumptions And Its Effect On Expected Profit
Let’s face it: we don’t know what the future will bring, and we need to put our assumptions in check. This is exactly what the third benefit enables: sensitivity analysis, or testing the effect of model assumptions on expected profit (or savings).
In the human resources example below, we tested a range of values for average overtime percentage and net revenue per employee, because our estimates of the future may be off. In the Sensitivity Analysis Results shown below, we can see in the profitability heat map that as long as the average overtime percentage is less than or equal to 25%, implementing a targeted overtime policy saves the organization money.
Sensitivity Analysis Results (Profitability Heat Map)
Wow! Not only can we test for the optimal threshold that maximizes the business case, we can use expected value to test for a range of inputs that are variable from year to year and person to person. This is huge for business analysis!
Interested in Learning The Expected Value Framework For Business?
If you’re interested in learning how to apply the Expected Value Framework using R programming, we teach it in Chapters 7 and 8 of Data Science For Business (DS4B 201) as part of our end-to-end Employee Churn business data science project.
YouTube Expected Value Framework Overview
Here’s a 20-minute tutorial on the Expected Value Framework that applies to the Data Science For Business (DS4B 201) course, where we use H2O Automated Machine Learning to develop high-performance models identifying those likely to leave, and then the Expected Value Framework to calculate the savings due to various policy changes.
FAQs: Expected Value Framework
The Expected Value Framework can be confusing and there’s a lot to learn. Here are a few of the most frequently asked questions related to applying the EVF to machine learning classification problems in business.
FAQ 1: What Is Expected Value?
Expected value is a way of assigning a value to a decision using the probability of its occurrence. We can do this for almost any decision.
Example: Is A Face-To-Face Meeting Warranted?
Suppose you work for a company that makes high-tech equipment. Your company has received a request to quote a project estimated at $1M in gross profit. The downside is that your competition is preferred and 99% likely to get the award. A face-to-face meeting will cost your organization $5,000 for travel costs, time to develop a professional quotation, and time and materials involved in the sales meeting. Should you take the meeting or no-quote the project?
High Tech Semiconductor Manufacturer
This problem is an expected value problem. We can assign a value to the meeting. We know that the competition is 99% likely to win, but we still have a 1% shot. Is it worth it?
Applying expected value, we multiply the probability of a win (1%) by the profit of the win ($1M). This totals $10,000, the expected benefit of the decision excluding any costs. If our total spend for the face-to-face meeting and quotation process is $5,000, then it makes sense to take the meeting, because that spend is lower than the $10,000 expected benefit.
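The arithmetic above in a few lines of R:

```r
p_win  <- 0.01    # probability of winning the award
profit <- 1e6     # gross profit if won
cost   <- 5000    # cost of the face-to-face meeting and quotation

expected_benefit <- p_win * profit  # $10,000
take_meeting <- expected_benefit > cost
take_meeting  # TRUE: the expected benefit exceeds the cost
```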
FAQ 2: What Is The Expected Value Framework?
The Expected Value Framework is a way to apply expected value to a classification model. Essentially, it connects a machine learning classification model to ROI for the business. It enables us to combine:
 The threshold,
 Knowledge of costs and benefits, and
 The confusion matrix converted to expected error rates to account for the presence of false positives and false negatives.
We can use this combination to target based on positive class probability (think employee flight risk quantified as a probability) to gain even greater expected savings than an “all-or-none” approach without the framework.
Source: Data Science for Business by Foster Provost & Tom Fawcett
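A toy numeric sketch of how these pieces combine, following Provost and Fawcett’s formulation. All rates, values, and the class prior below are hypothetical, not numbers from the course:

```r
# expected rates: rows = actual class (negative, positive),
# columns = predicted class (negative, positive); each row sums to 1
rates <- matrix(c(0.9, 0.1,
                  0.3, 0.7), nrow = 2, byrow = TRUE)

# cost-benefit matrix on the same layout: a false positive costs 50,
# a true positive is worth 200, correct negatives are worth nothing
cb <- matrix(c(0, -50,
               0, 200), nrow = 2, byrow = TRUE)

p_positive <- 0.15                       # class prior
priors <- c(1 - p_positive, p_positive)

# expected value: weight each actual class's rate-weighted value by its prior
ev <- sum(priors * rowSums(rates * cb))
ev  # 16.75
```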
FAQ 3: What Is The Threshold?
It’s the dividing line that we (the data scientists) select between which values of the positive class probability (Yes in the example shown below) are converted to a positive versus a negative. If we choose a threshold of, say, 0.30, only one observation meets that criterion.
Class Probabilities
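As a small illustration in R, applying a 0.30 threshold to some made-up positive-class probabilities (not the course data) looks like this:

```r
# Hypothetical positive-class ("Yes") probabilities for five observations
prob_yes  <- c(0.05, 0.12, 0.18, 0.25, 0.35)
threshold <- 0.30

# Observations at or above the threshold become positive predictions
predicted_class <- ifelse(prob_yes >= threshold, "Yes", "No")

predicted_class             # "No" "No" "No" "No" "Yes"
sum(prob_yes >= threshold)  # 1: only one observation meets the criterion
```

Moving the threshold up or down changes how many observations are classified as positive, which in turn changes the error rates and, through the cost/benefit matrix, the expected value.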
FAQ 4: What Are The Expected Rates?
The confusion matrix results can be converted to expected rates through a process called normalization. The "Actual Negative" and "Actual Positive" results are grouped and then converted to probabilities. The expected rates are then the probabilities of correctly predicting an actual value.
Expected Rates
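The normalization step can be sketched in R with `prop.table()`. The counts below are made up for illustration, not taken from the course data:

```r
# Hypothetical confusion matrix: rows = actual class, columns = predicted class
cm <- matrix(c(850, 50,
                60, 40),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("No", "Yes"),
                             predicted = c("No", "Yes")))

# Normalize each "actual" row so it sums to 1: these are the expected rates
expected_rates <- prop.table(cm, margin = 1)

expected_rates["No",  "No"]   # 850/900, the true negative rate
expected_rates["Yes", "Yes"]  #  40/100, the true positive rate
```

Each row of `expected_rates` sums to 1, so the off-diagonal entries are the expected false positive and false negative rates for that actual class.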
FAQ 5: What Is The Cost-Benefit Matrix?
The cost/benefit matrix is the final piece needed for the Expected Value Framework. We develop it using our intuition about the problem. This is the most difficult part because we need to apply critical thinking to the business problem and to the methodology we intend to use when coding the solution.
For the Employee Churn problem, one way to think of the cost benefit is analyzing two states: an initial state (baseline) and a new state (after a policy change to reduce overtime). We’ll use an employee named Maggie from the DS4B 201 course.
Initial State: Baseline Allowing Overtime For Maggie
First, we'll look at the initial state. In this scenario, Maggie's probability of leaving is 15%, and she has a cost of attrition of $167K. The cost of a policy change is zero because no policy change is implemented in the baseline case. There are four scenarios we need to account for, each with its own probability:
 The cases of true positives and true negatives when the algorithm predicts correctly
 And the cases of false positives and false negatives when the algorithm predicts incorrectly.
The cost of each scenario is what we are concerned with.
CostBenefit Matrix: Initial State (Baseline)
Going through a costbenefit analysis, we can address each of these costs:
 First is true negatives. If Maggie stays, the cost associated with her staying is nothing.
 Second is true positives. If Maggie leaves, the cost associated with her leaving is $167K, which is her attrition cost.
 Third is false positives. If Maggie was predicted to leave but actually stays, the cost associated with this scenario is nothing. We did not implement a policy change in the baseline, so we don't have any costs.
 Fourth is false negatives. If Maggie leaves but was predicted to stay, we lose her attrition cost.
The expected cost associated with Maggie in the baseline is about $25K.
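The baseline expected cost can be sketched in R directly from the figures above (variable names are ours, for illustration only):

```r
# Baseline (initial state): no policy change
p_leave        <- 0.15      # Maggie's probability of leaving
attrition_cost <- 167000    # cost if she leaves
policy_cost    <- 0         # no policy change in the baseline

cost_leave <- attrition_cost + policy_cost  # 167000 if she leaves
cost_stay  <- policy_cost                   # 0 if she stays

expected_cost_baseline <- p_leave * cost_leave + (1 - p_leave) * cost_stay
expected_cost_baseline  # 25050, roughly the $25K quoted above
```

This weights each outcome's cost by its probability, which is exactly the expected value calculation from the meeting example applied to the churn problem.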
New State: Targeting Maggie For Overtime Reduction
Let's propose a new state, one that eliminates overtime for Maggie.
In this scenario, Maggie's probability of leaving drops to 3%. Her cost of attrition is the same, but now there is a cost to the policy change. The cost of the policy change in this scenario is 20% of her attrition cost, meaning she's working approximately 20% overtime. Should we make the policy change?
Like the initial state, there are four scenarios we need to account for, each with its own probability:
 First is true negatives. If Maggie stays, the cost associated with her staying is 20% of her attrition cost, or $33K (the cost of the policy change).
 Second is true positives. If Maggie leaves, the cost associated with her leaving is her $167K attrition cost plus the $33K policy change cost for reducing her overtime. This totals $200K.
 Third is false positives. If Maggie was predicted to leave but actually stays, the cost associated with this scenario is $33K because she was targeted.
 Fourth is false negatives. If Maggie leaves but was predicted to stay, we lose her attrition cost plus the cost of targeting her, which totals $200K.
The expected cost is about $38K for this scenario.
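Under the same assumptions, the new-state expected cost and the savings versus the baseline can be sketched in R (again, illustrative variable names only):

```r
# New state: overtime reduced; policy cost = 20% of attrition cost
p_leave        <- 0.03
attrition_cost <- 167000
policy_cost    <- 0.20 * attrition_cost       # 33400

cost_leave <- attrition_cost + policy_cost    # ~$200K if she leaves anyway
cost_stay  <- policy_cost                     # ~$33K if she stays

expected_cost_new <- p_leave * cost_leave + (1 - p_leave) * cost_stay
expected_cost_new  # 38410, roughly the $38K quoted above

# Baseline expected cost from the initial state: 15% chance of a $167K loss
expected_cost_baseline <- 0.15 * 167000       # 25050

savings <- expected_cost_baseline - expected_cost_new
savings  # about -$13K: negative, so don't target Maggie
```

Because the savings are negative, the policy change costs more in expectation than it saves for an employee at only 15% flight risk.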
Expected Savings (Benefit Of The Expected Value Framework)
At an overtime percentage of 20%, the savings is (negative) $13K versus the baseline. Therefore, we should not reduce Maggie's overtime.
This is the benefit of using expected value. Since the model predicts Maggie having only a 15% risk of leaving when working overtime, she is not a high flight risk. We can use the model to target people at much higher risk instead.
Next Steps: Take The DS4B 201 Course!
If you're interested in learning more, definitely check out Data Science For Business (DS4B 201). In 10 weeks, the course covers all of the steps to solve the employee turnover problem with H2O in an integrated fashion.
The students love it. Here’s a comment we just received on Sunday morning from one of our students, Siddhartha Choudhury, Data Architect at Accenture.
“To be honest, this course is the best example of an end to end project I have seen from business understanding to communication.”
Siddhartha Choudhury, Data Architect at Accenture
See for yourself why our students have rated Data Science For Business (DS4B 201) a 9.0 of 10.0 for Course Satisfaction!
Learning More
Check out our other articles on Data Science For Business!
 HR Analytics: Using Machine Learning To Predict Employee Turnover
 Data Science For Business Tutorial: Using Machine Learning With LIME To Understand Employee Churn
 Customer Analytics: Using Deep Learning With Keras To Predict Customer Churn
 Sales Analytics: How To Use Machine Learning To Predict And Optimize Product Backorders
Business Science University is a revolutionary new online platform that gets you results fast.
Why learn from Business Science University? You could spend years trying to learn all of the skills required to confidently apply Data Science For Business (DS4B). Or you can take the first course in our Virtual Workshop, Data Science For Business (DS4B 201). In 10 weeks, you’ll learn:
 A 100% ROI-driven methodology – everything we teach is to maximize ROI.
 A clear, systematic plan that we've successfully used with clients.
 Critical thinking skills necessary to solve problems.
 Advanced technology: H2O Automated Machine Learning.
 95% of the skills you will need for wrangling data, investigating data, building high-performance models, explaining the models, evaluating the models, and building tools with the models.
You can spend years learning this material, or you can learn it in 10 weeks (at a one-chapter-per-week pace). Get started today!
DS4B Virtual Workshop: Predicting Employee Attrition
Did you know that an organization that loses 200 high-performing employees per year is essentially losing $15M/year in lost productivity? Many organizations don't realize this because it's an indirect cost. It goes unnoticed. What if you could use data science to predict and explain turnover in a way that managers could make better decisions and executives would see results? You will learn the tools to do so in our Virtual Workshop. Here's an example of a Shiny app you will create.
Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in HR 301
Our first Data Science For Business Virtual Workshop teaches you how to solve this employee attrition problem in four courses that are fully integrated:
 DS4B 201: Predicting Employee Attrition with h2o and lime
 DS4B 301 (Coming Soon): Building A Shiny Web Application
 DS4B 302 (Coming Soon): Data Communication With RMarkdown Reports and Presentations
 DS4B 303 (Coming Soon): Building An R Package For Your Organization, tidyattrition
The Virtual Workshop is code-intensive (like these articles) but also teaches you fundamentals of data science consulting, including CRISP-DM and the Business Science Problem Framework, and many data science tools in an integrated fashion. The content bridges the gap between data science and the business, making you even more effective and improving your organization in the process.
Here's what one of our students, Jason Aizkalns, Data Science Lead at Saint-Gobain, had to say:
"In an increasingly crowded data science education space, Matt and the Business Science University team have found a way to differentiate their product offering in a compelling way. BSU offers a unique perspective and supplies you with the materials, knowledge, and frameworks to close the gap between just "doing data science" and providing/creating value for the business. Students will learn how to formulate their insights with a value-creation / ROI-first mindset which is critical to the success of any data science project/initiative in the "real world". Not only do students work a business problem end-to-end, but the icing on the cake is "peer programming" with Matt, albeit virtually, who codes clean, leverages best practices + a good mix of packages, and talks you through the why behind his coding decisions – all of which lead to a solid foundation and better habit formation for the student."
Jason Aizkalns, Data Science Lead at Saint-Gobain
Don't Miss A Beat
 Sign up for the Business Science blog to stay updated
 Get started with Business Science University to learn how to solve realworld data science problems from Business Science
 Check out our Open Source Software
If you like our software (anomalize, tidyquant, tibbletime, timetk, and sweep), our courses, and our company, you can connect with us:
 business-science on GitHub
 Business Science, LLC on LinkedIn
 bizScienc on twitter
 Business Science, LLC on Facebook
To leave a comment for the author, please follow the link and comment on their blog: business-science.io - Articles. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
In case you missed it: June 2018 roundup
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
In case you missed them, here are some articles from June of particular interest to R users.
An animated visualization of global migration, created in R by Guy Abel.
My take on the question, Should you learn R or Python for data science?
The BBC and Financial Times use R — without post-processing — for publication graphics.
"Handling Strings in R", a free ebook by Gaston Sanchez, has been updated.
The AI, Machine Learning and Data Science roundup for June 2018.
The PYPL Popularity of Languages Index ranks R as the 7th most popular programming language.
The "lime" package for R provides tools for interpreting machine learning models.
An R vignette by Paige Bailey on detecting unconscious bias in predictive models.
Microsoft R Open 3.5.0 has been released (with a subsequent fix for Debian systems).
Slides from the webinar, What's New in Azure for Machine Learning and AI.
And some general interest stories (not necessarily related to R):
 The Curvature Blindness Illusion
 Lioness v Wrestlers in tug-of-war
 A comedian imagines an AI writing a TV commercial
 A fur seal, transcribed
 An architecture error and a neardisaster in NYC
As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.