R news and tutorials contributed by (750) R bloggers

Teaching Factory Physics Flow Benchmarking with R and Many-Objective Visuals

Sun, 04/28/2019 - 02:00

(This article was first published on R Blogs by Pedro N. de Lima, and kindly contributed to R-bloggers)

Teaching seasoned managers in MBE classes is challenging. While it’s important to bring in new thoughts and ideas and not sound repetitive, it is also necessary to provide a theoretical basis for experienced people with diverse backgrounds. One of the strategies I found to overcome these obstacles this week was to use a new analysis framework (in my case, Factory Physics concepts) to challenge their views about existing frames they already master. Using a combination of concepts, simulation models, many-objective tradeoff visuals (like the gif below) and tasks was a great way to challenge their intuition about manufacturing systems.

This post shares some of the R code I developed while putting together course materials. It is also an example of how to use simulation metamodeling and Arena to run Factory Physics’ Flow Benchmarking.

Flow Benchmarking

Flow Benchmarking is an absolute benchmarking technique used to determine how close a production flow is to its best possible performance. The technique was introduced in the award-winning Factory Physics (FP) book (Hopp and Spearman 2008), and is a key component of the science-based manufacturing management approach described in Pound, Bell, and Spearman (2014).

Defining Factory Physics Laws

The Flow Benchmarking analysis is grounded in Little’s Law (WIP = TH * CT), and uses three general cases as absolute benchmarks for any real manufacturing system: the Best Case, the Worst Case and the Practical Worst Case. Please refer to Hopp and Spearman (2008) and Pound, Bell, and Spearman (2014) for the rationale behind these laws and equations.
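As a quick numerical illustration of Little’s Law (with made-up numbers, not from the course materials):

```r
# Little's Law: WIP = TH * CT
th <- 0.5   # throughput: jobs per hour
ct <- 16    # average cycle time: hours
th * ct     # average WIP: 8 jobs
```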

I’ll define these equations as R functions:

calc_w0 = function(rb, t0) {rb * t0}

ct_best = function(t0, w, w0, rb) {ifelse(w <= w0, t0, w / rb)}
th_best = function(t0, w, w0, rb) {ifelse(w <= w0, w / t0, rb)}

ct_worst = function(w, t0) {w * t0}
th_worst = function(t0) {1 / t0}

ct_marginal = function(t0, w, rb) {t0 + (w - 1) / rb}
th_marginal = function(w0, w, rb) {rb * w / (w0 + w - 1)}

In summary, these equations provide a starting point to discuss how well a manufacturing system is doing at converting inventory to Throughput. The initial analysis requires two inputs. The first input is the Bottleneck rate (rb), which is the production rate (parts, orders / time) of the bottleneck (defined as the process center with the highest long-term utilization). The second is the Total Raw Processing Time (t0), which is the sum of the long-term average process times of each processing center. Based on these two parameters, it’s possible to draw benchmarking curves for the system’s Throughput and Cycle Time as a function of its Work in Process (WIP), assuming a CONWIP control system (Spearman, Woodruff, and Hopp 1990).
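As a sketch of the first step of this analysis: for the Penny Fab example used later in this post (rb = 0.5 parts/hour, t0 = 8 hours), the critical WIP w0 follows directly from the two inputs:

```r
rb <- 0.5      # bottleneck rate (parts/hour)
t0 <- 8        # total raw processing time (hours)
w0 <- rb * t0  # critical WIP
w0             # 4 jobs: above this WIP level, Best Case throughput flattens at rb
```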

Drawing Absolute Benchmarking Curves

Once I have the basic laws of manufacturing dynamics as R functions, I’ll create a benchmark_flow function to execute the analysis. This function accepts the rb and t0 parameters and calculates the system’s Throughput and Cycle Time as a function of WIP under different scenarios for benchmarking purposes.

## Defining Cycle time and Throughput functions
benchmark_flow = function(rb, t0, step = 1, wip_mult = 5) {
  # First, calculate wip_crit
  w0 = calc_w0(rb = rb, t0 = t0)
  # Then, define WIP Range to consider:
  wip = seq.int(from = 1, to = w0 * wip_mult, by = step)
  # Then, calculate the Best Case variables
  Best_Cycle_Time = ct_best(t0 = t0, w = wip, w0 = w0, rb = rb)
  Best_Throughput = th_best(t0 = t0, w = wip, w0 = w0, rb = rb)
  best_data = data.frame(WIP = wip, Throughput = Best_Throughput,
                         CycleTime = Best_Cycle_Time, Scenario = "Best Case")
  # Calculate the Marginal Case:
  Marginal_Cycle_Time = ct_marginal(t0 = t0, w = wip, rb = rb)
  Marginal_Throughput = th_marginal(w0 = w0, w = wip, rb = rb)
  marginal_data = data.frame(WIP = wip, Throughput = Marginal_Throughput,
                             CycleTime = Marginal_Cycle_Time, Scenario = "Marginal")
  # Calculate the Worst Case:
  worst_data = data.frame(
    WIP = wip,
    Throughput = th_worst(t0 = t0),
    CycleTime = ct_worst(w = wip, t0 = t0),
    Scenario = "Worst Case"
  )
  # Output a data frame with results:
  rbind(best_data, marginal_data, worst_data)
}

# The First Penny Fab Example:
data_benchmark = benchmark_flow(rb = 0.5, t0 = 8)
knitr::kable(head(data_benchmark))

WIP   Throughput   CycleTime   Scenario
  1        0.125           8   Best Case
  2        0.250           8   Best Case
  3        0.375           8   Best Case
  4        0.500           8   Best Case
  5        0.500          10   Best Case
  6        0.500          12   Best Case

How would the Actual System Behave if…

Ok, now I have a table with all the basic benchmarking results. But what if I want a better model of the system? We can accomplish this by building a discrete event simulation model of the actual system, and using a metamodel of this model to approximate its results (you can find the data from my penny fab model here). During my course, I used several Arena Simulation models to illustrate that adding variability to the system always degrades performance (as the variability law predicts!). Doing so allowed the students to build confidence in the model and the theory, which was great to see!

library(arena2r)
## Warning: package 'arena2r' was built under R version 3.5.2
library(tidyr)
library(splines)

arena_data = arena2r::get_simulation_results("2019-data/penny-fab")

# Filtering only Statistics of our Interest:
filtered_data = subset(arena_data, Statistic %in% c("w", "LeadTime", "Throughput"))

# Spreading and Data Wrangling
final_data = filtered_data %>%
  tidyr::spread(Statistic, Value) %>%
  dplyr::select(LeadTime, Throughput, w)

colnames(final_data) = c("CycleTime", "Throughput", "WIP")

# Now, build a spline metamodel for CycleTime and Throughput as a function of WIP.
th_model = lm(Throughput ~ splines::bs(WIP), data = final_data)
ct_model = lm(CycleTime ~ WIP, data = final_data)

# Put Together a Final DataFrame like the Benchmarking:
model_data = data.frame(
  WIP = unique(data_benchmark$WIP),
  Throughput = predict(th_model, subset(data_benchmark, Scenario == "Best Case")),
  CycleTime = predict(ct_model, subset(data_benchmark, Scenario == "Best Case")),
  Scenario = "DES Model"
)
## Warning in splines::bs(WIP, degree = 3L, knots = numeric(0), Boundary.knots
## = c(1, : some 'x' values beyond boundary knots may cause ill-conditioned
## bases

# Adding our Model's Data to the DataFrame:
data_benchmark = rbind(
  data_benchmark,
  model_data
)

Once we have data from the basic FP laws and from our model, let’s plot it!

library(tidyr)
library(ggplot2)
library(viridis)
## Loading required package: viridisLite

# Let's define a wrapper function for our plot:
plot_benchmarking = function(data) {
  data %>%
    gather(-WIP, -Scenario, key = "var", value = "Value") %>%
    ggplot(aes(x = WIP, y = Value, color = Scenario)) +
    geom_line(size = 1) +
    facet_wrap(~ var, scales = "free", nrow = 2, ncol = 1) +
    labs(title = "Flow Benchmarking Plot") +
    scale_color_viridis(discrete = TRUE, option = "D") +
    theme_bw()
}

# Then let's just benchmark and plot!
plot_benchmarking(data = data_benchmark)

Cool! My simulation metamodel is still quite close to the marginal case.

However, it has one advantage: I can now build a model that simulates arbitrarily complex scenarios (e.g., I can include different product routings, change the product mix, include non-stationary demand, simulate setup time reduction, maybe even use a multi-method model), and my model will be a better approximation of the actual system than any simple queueing network model. My model can also simulate detailed improvement what-if scenarios, which queueing network models won’t be able to simulate.

Wrapping up with Tradeoffs and Many-Objective Visuals

I also used simulation models to illustrate the tradeoffs implied by two simple decisions: how much WIP a manufacturing flow should have, and what the reorder level of a part should be. Unfortunately, trying to use R for this task wasn’t productive. I ended up using DiscoveryDV, which is a great tool for many-objective visualization.

For instance, plotting WIP, Throughput, Cycle Time and Utilization of the Practical Worst Case yields this:

And visualizing the tradeoffs implied by different reorder levels in a (Q,r) inventory system yields this:

At this point, many of the participants were excited to get to grips with models that illuminate tradeoffs they had been facing for years. Hopefully their intuition was sharpened by these exercises, and they will be better equipped to use these frontiers to promote productive and tradeoff-aware discussions.

References

Hopp, W.J., and M.L. Spearman. 2008. Factory Physics. Irwin/Mcgraw-Hill Series in Operations and Decision Sciences. McGraw-Hill. https://books.google.com.br/books?id=tEjkAAAACAAJ.

Pound, E.S., J.H. Bell, and M.L. Spearman. 2014. Factory Physics for Managers: How Leaders Improve Performance in a Post-Lean Six Sigma World. McGraw-Hill Education. https://books.google.com.br/books?id=B5sXAwAAQBAJ.

Spearman, Mark L., David L. Woodruff, and Wallace J. Hopp. 1990. “CONWIP: A Pull Alternative to Kanban.” International Journal of Production Research 28 (5). Taylor & Francis: 879–94. https://doi.org/10.1080/00207549008942761.


To leave a comment for the author, please follow the link and comment on their blog: R Blogs by Pedro N. de Lima. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Fast food, causality and R packages, part 1

Sun, 04/28/2019 - 02:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

I am currently working on a package for the R programming language; its initial goal was to simply
distribute the data used in the Card and Krueger 1994 paper that you can read
here (PDF warning).

The gist of the paper is to try to answer the following question: Do increases in minimum wages reduce employment?
According to Card and Krueger’s paper from 1994, no.
The authors studied a change in legislation in New Jersey which increased the minimum wage from $4.25 an hour to
$5.05 an hour. The neighbouring state of Pennsylvania did not introduce such an increase. The authors thus used
the State of Pennsylvania as a control for the State of New Jersey and studied how the increase in minimum wage impacted
employment in fast food restaurants; against what economic theory predicted, they found an
increase, not a decrease, in employment.
The authors used a method called difference-in-differences to assess the impact of the minimum wage increase.
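The difference-in-differences logic can be sketched in a few lines of R (the numbers below are purely illustrative, not the actual Card and Krueger estimates):

```r
# Mean employment per restaurant, before and after the NJ wage increase
# (illustrative numbers):
nj_before <- 20.4; nj_after <- 21.0   # treated state (New Jersey)
pa_before <- 23.3; pa_after <- 21.2   # control state (Pennsylvania)

# The DiD estimate is the change in the treated state minus
# the change in the control state:
(nj_after - nj_before) - (pa_after - pa_before)
```

A positive value here means employment rose in New Jersey relative to Pennsylvania, which is the direction of the effect the paper reports.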

This result was, and still is, controversial, with subsequent studies finding subtler results.
For instance, some show a reduction in employment following an increase in the minimum wage,
but only for large restaurants (see Ropponen and Olli, 2011).

Anyway, this blog post will discuss how to create a package to distribute the data. In a future
blog post, I will discuss preparing the data to make it available as a demo dataset inside the
package, and then writing and documenting functions.

The first step in creating a package is to create a new project:

Select “New Directory”:

Then “R package”:

and in the window that appears, you can choose the name of the package, as well as some
starting source files:

Also, I’d highly recommend you tick the “Create a git repository” box and use git within your
project for reproducibility and for sharing your code more easily. If you do not know git, there are a lot of
online resources to get you started. It’s not super difficult, but it does require forming some new
habits, which can take some time.

I called my package {diffindiff}, and clicked on “Create Project”. This opens up a new project
with a hello.R script, which gives you some pointers:

# Hello, world!
#
# This is an example function named 'hello'
# which prints 'Hello, world!'.
#
# You can learn more about package authoring with RStudio at:
#
#   http://r-pkgs.had.co.nz/
#
# Some useful keyboard shortcuts for package authoring:
#
#   Install Package: 'Ctrl + Shift + B'
#   Check Package:   'Ctrl + Shift + E'
#   Test Package:    'Ctrl + Shift + T'

hello <- function() {
  print("Hello, world!")
}

Now, to simplify the creation of your package, I highly recommend you use the {usethis} package.
{usethis} removes a lot of the pain involved in creating packages.

For instance, want to start by adding a README file? Simply run:
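Presumably, the call is {usethis}’s README helper:

```r
# use_readme_md() creates a README.md at the package root
usethis::use_readme_md()
```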

This creates a README.md file in the root directory of your package. Simply change that file, and that’s it.

The next step could be setting up your package to work with {roxygen2}, which is very useful for
writing documentation:

usethis::use_roxygen_md()
✔ Setting Roxygen field in DESCRIPTION to 'list(markdown = TRUE)'
✔ Setting RoxygenNote field in DESCRIPTION to '6.1.1'
● Run `devtools::document()`

See how the output tells you to run devtools::document()? This function will document your package,
transforming the comments you write to describe your functions into documentation, and managing the NAMESPACE
file. Let’s run this function too:

devtools::document()
Updating diffindiff documentation
First time using roxygen2. Upgrading automatically...
Loading diffindiff
Warning: The existing 'NAMESPACE' file was not generated by roxygen2, and will not be overwritten.

You might get a message similar to mine, telling you that the NAMESPACE file was not generated by
{roxygen2}, and will thus not be overwritten. Simply remove the file and run devtools::document()
again:

But what actually is the NAMESPACE file? This file is quite important, as it details where your
package’s functions have to look in order to use other functions. This means that if your package needs function
foo() from package {bar}, it will consistently look for foo() inside {bar} and not confuse
it with, say, the foo() function from the {barley} package, even if you load {barley} after
{bar} in your interactive session. This can seem confusing now, but in the next blog posts I will
detail it further, and you will see that it’s not that difficult. Just know that it is an important file,
and that you do not have to edit it by hand.
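For illustration, a roxygen2-generated NAMESPACE contains directives like the following (a hypothetical example, not this package’s actual file):

```r
# Generated by roxygen2: do not edit by hand

export(hello)
importFrom(magrittr, "%>%")
```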

Next, I like to run the following:

usethis::use_pipe()
✔ Adding 'magrittr' to Imports field in DESCRIPTION
✔ Writing 'R/utils-pipe.R'
● Run `devtools::document()`

This makes the now famous %>% function available internally to your package (so you can use it
to write the functions that will be included in your package), but also available to the users of your package.

Your package is still missing a license. If you plan on writing a package for your own personal use,
for instance a collection of functions, there is no need to think about licenses. But if you’re making
your package available through CRAN, then you definitely need to think about it. For this package,
I’ll be using the MIT license, because the package will distribute data which I do not own (I’ve got permission
from Card to re-distribute it), and so I think it is better to use a permissive license (I don’t know
whether the GPL, which is stricter in terms of redistribution, could be used in this case).

We’re almost done setting up the structure of the package. If we forget something, it’s not an issue;
we’ll just have to run the right use_* function later on. Let’s finish by preparing the folder
that will contain the script to prepare the data:

usethis::use_data_raw()
✔ Creating 'data-raw/'
✔ Adding '^data-raw$' to '.Rbuildignore'
✔ Writing 'data-raw/DATASET.R'
● Modify 'data-raw/DATASET.R'
● Finish the data preparation script in 'data-raw/DATASET.R'
● Use `usethis::use_data()` to add prepared data to package

This creates the data-raw folder with the DATASET.R script inside. This is the script that will
contain the code to download and prepare datasets that you want to include in your package. This will
be the subject of the next blog post.

Let’s now finish by documenting the package, and pushing everything to Github:

devtools::document()

The following lines will only work if you set up the Github repo:

git add .
git commit -am "first commit"
git push origin master

Hope you enjoyed! If you found this blog post useful, you might want to follow me, or
buy me an espresso or donate via paypal.me.



To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

11,264 Regressions in One Tidy Plot

Sun, 04/28/2019 - 02:00

(This article was first published on An Accounting and Data Science Nerd's Corner, and kindly contributed to R-bloggers)

Following up on a recent blog article that discussed how to use R to explore your researcher degrees of freedom, this post introduces a specification curve plot, as suggested by Simonsohn, Simmons and Nelson. With this plot, you can eyeball how various researcher degrees of freedom affect your main outcome of interest.

In a recent post, I introduced my in-development R package ‘rdfanalysis’, which provides a coding environment allowing researchers to specify their researcher degrees of freedom ex ante and to systematically explore their effects on their findings ex post. Using a prominent topic from health economics and demographics (the association of national income with life expectancy), I identified a set of seven research design choices that in combination describe 11,264 different model specifications, referred to as researcher degrees of freedom.

The ‘rdfanalysis’ package allows systematically exhausting all these choices and generates a large set of estimates. A key challenge is then to make all these related findings digestible for the reader. While the package offers plots to ‘drill down’ into the data, it did not provide a quick ‘one-stop’ visual.

In their working paper, Simonsohn, Simmons and Nelson suggest a specification curve plot that in my opinion provides an excellent visual to eyeball the variance of generated estimates. The newly implemented function plot_rdf_spec_curve() generates a ‘ggplot’-based variant of their approach. See below:

devtools::install_github("joachim-gassen/rdfanalysis")
library(rdfanalysis)
load(url("https://joachim-gassen.github.io/data/rdf_ests.RData"))
plot_rdf_spec_curve(ests, "est", "lb", "ub")

Figure 1: Effect of a 10 % increase in national income on life expectancy in years:
A specification curve visualizing your researcher degrees of freedom

The top part of the plot visualizes the range of the estimates and their confidence intervals. The estimates measure the association of a 10 % increase in national income with the nation’s average life expectancy at birth in years. You see a relatively wide range of estimates (from -0.22 to 0.73 years) with the majority (80.0 %, indicated in blue) being significantly positive. This finding is consistent with a positive effect of national income on life expectancy.

In the bottom part, you can get a quick idea about which design choices are influential for the magnitude of the result: It is the list of the included control variables, the model specification (in terms of log-transformation of dependent and independent variables), and the fixed effect structure. On the other hand, outliers and their treatments seem to have only a limited impact on the findings and, naturally, the clustering of standard errors only affects the confidence intervals and not the estimates. The fact that the fixed effect structure and the control variables are influential clearly hints at the endogenous nature of national income and should caution the reader against interpreting the findings of the analysis in a causal way.

See my former post for more detail on the case and on how to drill deeper into the findings. Feel free to use the in-development ‘rdfanalysis’ package to exhaust the researcher degrees of freedoms in your own projects. If you have remarks about this project, I would love to hear from you. Use the comment section below or reach out via email or twitter.

Enjoy!


To leave a comment for the author, please follow the link and comment on their blog: An Accounting and Data Science Nerd's Corner.

Le Monde puzzle [#1099]

Sun, 04/28/2019 - 00:19

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A simple 2×2 Le Monde mathematical puzzle:

Arielle and Brandwein play a game starting from two distinct even integers x and y between 1500 and 2500. They provide one another with either the pair (x/2, y+x/2) or the pair (x+y/2, y/2) until they run out of even possibilities or exceed 6 rounds. When x=2304, what is the value of y that makes Brandwein win?

Which I solved by a recursive function (under the constraint of a maximum of 11 levels of recursion):

nezt = function(x, y, i = 1) {
  if ((i > 11) || (is.odd(x) & is.odd(y))) {
    return(-1)
  } else {
    z = -1
    if (is.even(x)) z = -nezt(x / 2, y + x / 2, i + 1)
    if (is.even(y)) z = max(z, -nezt(y / 2, x + y / 2, i + 1))
    return(z)
  }
}

and checking all values of y between 1500 and 2500 when x=2304, which produces y=1792 as the only value for which Arielle loses. The reason behind this (?) is that both 2304 and 1792 are divisible by 2⁸, which means no strategy avoids reaching a stalemate after 8 steps, when it is Arielle’s turn to play.
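The scan over y can be sketched as follows (a hedged reconstruction: it assumes the nezt() function above, plus simple parity helpers, since is.odd()/is.even() come from a helper package in the original):

```r
# Parity helpers (the original presumably used a package providing these):
is.even <- function(n) n %% 2 == 0
is.odd  <- function(n) !is.even(n)

# Game value is computed from the first mover's (Arielle's) perspective;
# she loses exactly when the value is -1:
ys <- seq(1500, 2500, by = 2)
losing <- ys[sapply(ys, function(y) y != 2304 && nezt(2304, y) == -1)]
losing
```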


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

More General Weighted Binning

Sat, 04/27/2019 - 21:44

You might be wondering what motivates me spending countless weekend hours on the MOB package. The answer is plain and simple. It is users that are driving the development work.

After I published the wts_bin() function last week showing the impact of two-value weights on the monotonic binning outcome (https://statcompute.wordpress.com/2019/04/21/binning-with-weights), a question was asked if I can write a more general weighted binning function with weights being any positive value. The function wqtl_bin() is my answer (https://github.com/statcompute/MonotonicBinning/blob/master/code/wqtl_bin.R).

Below is an example demonstrating how to use the wqtl_bin() function. First of all, let’s apply the function to the case with two-value weights that was illustrated last week. As expected, statistics from both approaches are identical. In the second use case, let’s assume that weights can be any positive value drawn from the Uniform distribution between 0 and 10. With positive random weights, all statistics change.

It is worth mentioning that, while binning rules can be the same with or without weights in some cases, it is not necessarily true in all situations, depending on the distribution of weights across the data sample. As shown in binning outcomes for “ltv” below, there are 7 bins without weights but only 5 with weights.
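To see why weights can move the cut points, here is a toy weighted-quantile sketch (just an illustration of the idea, not the wqtl_bin() implementation):

```r
set.seed(2019)
x <- rnorm(1000)
w <- runif(1000, 0, 10)   # positive random weights, Uniform(0, 10)

# Unweighted quartile cuts:
quantile(x, probs = c(0.25, 0.50, 0.75))

# A crude weighted analogue: replicate each observation by its rounded weight.
xw <- rep(x, times = pmax(1, round(w)))
quantile(xw, probs = c(0.25, 0.50, 0.75))
```

With weights that vary across observations, the weighted cuts generally differ from the unweighted ones, which is why the number of bins can change.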


Quick Hit: Scraping javascript-“enabled” Sites with {htmlunit}

Sat, 04/27/2019 - 20:20

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I’ve mentioned {htmlunit} in passing before, but did not put any code in the blog post. Since I just updated {htmlunitjars} to the latest and greatest version, now might be a good time to do a quick demo of it.

The {htmlunit}/{htmlunitjars} packages make the functionality of the HtmlUnit Java library available to R. The TLDR on HtmlUnit is that it can help you scrape a site that uses javascript to create DOM elements. Normally, you’d have to use Selenium/{Rselenium}, Splash/{splashr} or Chrome/{decapitated} to try to work with sites that generate the content you need with javascript. Those are fairly big external dependencies that you need to trudge around with you, especially if all you need is a quick way of getting dynamic content. While {htmlunit} does have an {rJava} dependency, I haven’t had any issues getting Java working with R on Windows, Ubuntu/Debian or macOS in a very long while, even on freshly minted systems, so that should not be a show stopper for folks (Java+R guaranteed ease of installation is still far from perfect, though).

To demonstrate the capabilities of {htmlunit} we’ll work with a site that’s dedicated to practicing web scraping—toscrape.com—and, specifically, the javascript generated sandbox site. It looks like this:

Now bring up both the “view source” version of the page in your browser and the developer tools “elements” panel, and you’ll see that the content is in javascript right there in the page source, but the source has no quote elements because they’re generated dynamically after the page loads.

The critical difference between those two views is one reason I consider the use of tools like “Selector Gadget” to be more harmful than helpful. You’re really better off learning the basics of HTML and dynamic pages than relying on that crutch (for scraping), as it’ll definitely come back to bite you some day.

Let’s try to grab that first page of quotes. Note that to run all the code you’ll need to install both {htmlunitjars} and {htmlunit} which can be done via: install.packages(c("htmlunitjars", "htmlunit"), repos = "https://cinc.rud.is").

First, we’ll try just plain ol’ {rvest}:

library(rvest)

pg <- read_html("http://quotes.toscrape.com/js/")

html_nodes(pg, "div.quote")
## {xml_nodeset (0)}

Getting no content back is to be expected since no javascript is executed. Now, we’ll use {htmlunit} to see if we can get to the actual content:

library(htmlunit)
library(rvest)
library(purrr)
library(tibble)

js_pg <- hu_read_html("http://quotes.toscrape.com/js/")

html_nodes(js_pg, "div.quote")
## {xml_nodeset (10)}
##  [1] \r\n \r\n “The world as we h ...
##  [2] \r\n \r\n “It is our choices ...
##  [3] \r\n \r\n “There are only tw ...
##  [4] \r\n \r\n “The person, be it ...
##  [5] \r\n \r\n “Imperfection is b ...
##  [6] \r\n \r\n “Try not to become ...
##  [7] \r\n \r\n “It is better to b ...
##  [8] \r\n \r\n “I have not failed ...
##  [9] \r\n \r\n “A woman is like a ...
## [10] \r\n \r\n “A day without sun ...

I loaded up {purrr} and {tibble} for a reason so let’s use them to make a nice data frame from the content:

tibble(
  quote = html_nodes(js_pg, "div.quote > span.text") %>% html_text(trim = TRUE),
  author = html_nodes(js_pg, "div.quote > span > small.author") %>% html_text(trim = TRUE),
  tags = html_nodes(js_pg, "div.quote") %>%
    map(~html_nodes(.x, "div.tags > a.tag") %>% html_text(trim = TRUE))
)
## # A tibble: 10 x 3
##    quote                                                            author         tags
##  1 “The world as we have created it is a process of our thinking. … Albert Einste…

To be fair, we didn’t really need {htmlunit} for this site. The javascript data comes along with the page and it’s in a decent form so we could also use {V8}:

library(V8)
library(stringi)

ctx <- v8()

html_node(pg, xpath = ".//script[contains(., 'data')]") %>% # target the tag with the data
  html_text() %>% # get the text of the tag body
  stri_replace_all_regex("for \\(var[[:print:][:space:]]*", "", multiline = TRUE) %>% # delete everything after the `var data=` content
  ctx$eval() # pass it to V8

ctx$get("data") %>%        # get the data from V8
  as_tibble() %>%          # tibbles rock
  janitor::clean_names()   # the names do not so make them better
## # A tibble: 10 x 3
##   tags   author$name   $goodreads_link   $slug   text
## 1

But, the {htmlunit} code is (IMO) a bit more straightforward and is designed to work on sites that use post-load resource fetching as well as those that use inline javascript (like this one).

FIN

While {htmlunit} is great, it won’t work on super complex sites as it’s not trying to be a 100% complete browser implementation. It works amazingly well on a ton of sites, though, so give it a try the next time you need to scrape dynamic content. The package also contains a min-DSL if you need to perform more complex page scraping tasks as well.

You can find both {htmlunit} and {htmlunitjars} at:


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Data Layout Exercises

Sat, 04/27/2019 - 18:16

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

John Mount, Nina Zumel; Win-Vector LLC 2019-04-27

In this note we will use five real life examples to demonstrate data layout transforms using the cdata R package. The examples for this note are all demo-examples from tidyr/demo/, and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, so they are a good set of examples or exercises to work through.

For each of these examples we will show how to re-layout data using cdata.

Introduction

Each of these five problems will be solved using the same steps:

• Examine example input data and desired result data.
• Check if either the incoming or outgoing data format is in "row-record" format: is all the data for a single record contained in one row? This determines which data layout transform specification you will use:
• rowrecs_to_blocks_spec(): To specify data moving from single rows to arbitrary "block-shaped" records
• blocks_to_rowrecs_spec(): To specify data moving from block records to single rows
• layout_specification(): To specify data moving from one shape of general block record to another block record.
• Identify which columns form the record ids (group sets of rows into records), which we call the "record keys."
• Draw the shape of the incoming record without the record key columns.
• Draw the shape of the outgoing record without the record key columns.
• Combine the above information as one of the above data layout transform specifications.
• Print the layout transform to confirm it is what you want.
• Apply the layout transform.

This may seem like a lot of steps, but it is only because we are taking the problems very slowly. The important point is that we want to minimize additional problem solving when applying the cdata methodology. Usually when you need to transform data you are in the middle of some other more important task, so you want to delegate the details of how the layout transform is implemented. With cdata the user is not asked to perform additional puzzle solving to guess a sequence of operators that may implement the desired data layout transform. The cdata solution pattern is always the same, which can help in mastering it.

With cdata, record layout transforms are simple R objects with detailed print() methods, so they are convenient to alter, save, and re-use later. The record layout transform also documents the expected columns and constants of the incoming data.

We will work some examples with the hope that practice brings familiarity. We have some notes on how to graphically solve exercises like this here and here, but let’s dive into working the exercises.

Example 1

From https://stats.idre.ucla.edu/stata/modules/reshaping-data-wide-to-long/ we can get a copy of the data and the question, or "transform ask":

# convert from this format
dadmomw <- wrapr::build_frame(
   "famid" , "named", "incd", "namem", "incm" |
   1       , "Bill" , 30000 , "Bess" , 15000  |
   2       , "Art"  , 22000 , "Amy"  , 18000  |
   3       , "Paul" , 25000 , "Pat"  , 50000  )

# to this format
dadmomt <- wrapr::build_frame(
   "famid" , "dadmom", "name", "inc" |
   1       , "d"     , "Bill", 30000 |
   1       , "m"     , "Bess", 15000 |
   2       , "d"     , "Art" , 22000 |
   2       , "m"     , "Amy" , 18000 |
   3       , "d"     , "Paul", 25000 |
   3       , "m"     , "Pat" , 50000 )

Each incoming record represents a family, and is designated by the record key famid. The data starts with each record in a single row (a row record):

famid named incd namem incm 1 Bill 30000 Bess 15000

So we are going from a row record to a general block record: this means we want to use rowrecs_to_blocks_spec(), and we only have to describe the outgoing record shape.

library("cdata")

# identify the record key
recordKeys <- "famid"

# specify the outgoing record shape
outgoing_record <- wrapr::qchar_frame(
   "dadmom" , "name", "inc" |
   "d"      , named , incd  |
   "m"      , namem , incm  )

Notice we take the column names from the incoming row-record and use them as cell-names in the outgoing record; this is how we show where the data goes. In specifying the record with wrapr::qchar_frame(), we use the convention that quoted entities are values we know (values that specify column names, or keys that describe the interior of the block record structure), and un-quoted entities are values we expect to be in the record.

outgoing_record is just a data.frame; you can create it however you like, so you don’t need to use qchar_frame().

Now create the layout specification, and print it.

# put it all together into a layout
layout <- rowrecs_to_blocks_spec(
  outgoing_record,
  recordKeys = recordKeys)

# confirm we have the right layout
print(layout)
#> {
#> row_record <- wrapr::qchar_frame(
#>   "famid" , "named", "namem", "incd", "incm" |
#>   .       , named  , namem  , incd  , incm   )
#> row_keys <- c('famid')
#>
#> # becomes
#>
#> block_record <- wrapr::qchar_frame(
#>   "famid" , "dadmom", "name", "inc" |
#>   .       , "d"     , named , incd  |
#>   .       , "m"     , namem , incm  )
#> block_keys <- c('famid', 'dadmom')
#>
#> # args: c(checkNames = TRUE, checkKeys = FALSE, strict = FALSE, allow_rqdatatable = TRUE)
#> }

The print() method fully documents what columns are expected and the intent of the data layout transform. It uses the same quoted/unquoted convention that we used in specifying outgoing_record above.

The block_keys of the outgoing record shape specify the unique identifier of each row of the transformed data: that is, each row of dadmomt will be uniquely identified by the values of the columns famid and dadmom (which family, which parent). One of the block keys is always the record key; by default, rowrecs_to_blocks_spec() takes the other one from the first column of the outgoing_record shape. You can specify the block key (or keys) explicitly with the controlTableKeys argument:

# this is equivalent to the above call
rowrecs_to_blocks_spec(
  outgoing_record,
  recordKeys = recordKeys,
  controlTableKeys = 'dadmom')

Now apply the layout to get the new data shape:

# apply the layout
dadmomw %.>%
  layout %.>%
  knitr::kable(.)

famid dadmom name inc
1 d Bill 30000
1 m Bess 15000
2 d Art 22000
2 m Amy 18000
3 d Paul 25000
3 m Pat 50000

Example 2

The original question was:

I want to reshape a wide format dataset that has multiple tests which are measured at 3 time points: ID Test Year Fall Spring Winter 1 1 2008 15 16 19 1 1 2009 12 13 27 1 2 2008 22 22 24 1 2 2009 10 14 20 ... into a data set that separates the tests by column but converts the measurement time into long format, for each of the new columns like this: ID Year Time Test1 Test2 1 2008 Fall 15 22 1 2008 Spring 16 22 1 2008 Winter 19 24 1 2009 Fall 12 10 1 2009 Spring 13 14 1 2009 Winter 27 20 ... I have unsuccessfully tried to use reshape and melt. Existing posts address transforming to single column outcome.

First, notice that neither the incoming nor outgoing forms are single-row records; a single record corresponds to a single ID and Year, and has three measurements (Fall, Spring, Winter) of two tests (1 and 2). So an example single row record would look something like:

ID Year Fall1 Fall2 Spring1 Spring2 Winter1 Winter2 1 2008 15 22 16 22 19 24

and the record key is formed from the ID and the Year (sometimes what the record keys are is not obvious, and is in fact domain knowledge).

Since neither the incoming nor outgoing shapes are row records, we use the general layout_specification().

library("cdata") # identify the record keys recordKeys <- c("ID", "Year") # specify the incoming record shape incoming_record <- wrapr::qchar_frame( "Test" , "Fall" , "Spring" , "Winter" | "1" , Fall1 , Spring1 , Winter1 | "2" , Fall2 , Spring2 , Winter2 ) # specify the outgoing record shape outgoing_record <- wrapr::qchar_frame( "Time" , "Test1" , "Test2" | "Fall" , Fall1 , Fall2 | "Spring" , Spring1 , Spring2 | "Winter" , Winter1 , Winter2 ) # put it all together into a layout layout <- layout_specification( incoming_shape = incoming_record, outgoing_shape = outgoing_record, recordKeys = recordKeys) # confirm we have the right layout print(layout) #> { #> in_record <- wrapr::qchar_frame( #> "ID" , "Year", "Test", "Fall", "Spring", "Winter" | #> . , . , "1" , Fall1 , Spring1 , Winter1 | #> . , . , "2" , Fall2 , Spring2 , Winter2 ) #> in_keys <- c('ID', 'Year', 'Test') #> #> # becomes #> #> out_record <- wrapr::qchar_frame( #> "ID" , "Year", "Time" , "Test1", "Test2" | #> . , . , "Fall" , Fall1 , Fall2 | #> . , . , "Spring", Spring1, Spring2 | #> . , . , "Winter", Winter1, Winter2 ) #> out_keys <- c('ID', 'Year', 'Time') #> #> # args: c(checkNames = TRUE, checkKeys = TRUE, strict = FALSE, allow_rqdatatable = FALSE) #> } # example data grades <- wrapr::build_frame( "ID" , "Test", "Year", "Fall", "Spring", "Winter" | 1 , 1 , 2008 , 15 , 16 , 19 | 1 , 1 , 2009 , 12 , 13 , 27 | 1 , 2 , 2008 , 22 , 22 , 24 | 1 , 2 , 2009 , 10 , 14 , 20 | 2 , 1 , 2008 , 12 , 13 , 25 | 2 , 1 , 2009 , 16 , 14 , 21 | 2 , 2 , 2008 , 13 , 11 , 29 | 2 , 2 , 2009 , 23 , 20 , 26 | 3 , 1 , 2008 , 11 , 12 , 22 | 3 , 1 , 2009 , 13 , 11 , 27 | 3 , 2 , 2008 , 17 , 12 , 23 | 3 , 2 , 2009 , 14 , 9 , 31 ) # apply the layout grades %.>% layout %.>% knitr::kable(.) 
ID Year Time Test1 Test2
1 2008 Fall 15 22
1 2008 Spring 16 22
1 2008 Winter 19 24
1 2009 Fall 12 10
1 2009 Spring 13 14
1 2009 Winter 27 20
2 2008 Fall 12 13
2 2008 Spring 13 11
2 2008 Winter 25 29
2 2009 Fall 16 23
2 2009 Spring 14 20
2 2009 Winter 21 26
3 2008 Fall 11 17
3 2008 Spring 12 12
3 2008 Winter 22 23
3 2009 Fall 13 14
3 2009 Spring 11 9
3 2009 Winter 27 31

Example 3

Question: given data such as below how does one move treatment and control values for each individual into columns? Or how does one take a to b?

a <- wrapr::build_frame( "Ind" , "Treatment", "value" | "Ind1", "Treat" , 1 | "Ind2", "Treat" , 2 | "Ind1", "Cont" , 3 | "Ind2", "Cont" , 4 ) b <- wrapr::build_frame( "Ind" , "Treat" , "Cont"| "Ind1", 1 , 3 | "Ind2", 2 , 4 )

In this case, a record corresponds to an individual, and the outgoing data is in row record form:

Ind Treat Cont Ind1 1 3

That means we will use blocks_to_rowrecs_spec().

The cdata solution is as follows.

library("cdata")

# identify the record key
recordKeys <- "Ind"

# specify the incoming record shape
incoming_record <- wrapr::qchar_frame(
   "Treatment" , "value" |
   "Treat"     , Treat   |
   "Cont"      , Cont    )

# put it all together into a layout
layout <- blocks_to_rowrecs_spec(
  incoming_record,
  recordKeys = recordKeys)

# confirm we have the right layout
print(layout)
#> {
#> block_record <- wrapr::qchar_frame(
#>   "Ind" , "Treatment", "value" |
#>   .     , "Treat"    , Treat   |
#>   .     , "Cont"     , Cont    )
#> block_keys <- c('Ind', 'Treatment')
#>
#> # becomes
#>
#> row_record <- wrapr::qchar_frame(
#>   "Ind" , "Treat", "Cont" |
#>   .     , Treat  , Cont   )
#> row_keys <- c('Ind')
#>
#> # args: c(checkNames = TRUE, checkKeys = TRUE, strict = FALSE, allow_rqdatatable = FALSE)
#> }

# apply the layout
a %.>%
  layout %.>%
  knitr::kable(.)

Ind Treat Cont
Ind1 1 3
Ind2 2 4

This particular transform, from a block consisting of a single column of values (and the rest of the columns being keys) to a row record, is the transform typically referred to as spread, dcast, or pivot. The tidyr package has a convenient call for this transform: spread(); cdata also has a similar convenience call: pivot_to_rowrecs().
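As a sketch, the convenience route for this example could look as follows (the argument names follow cdata’s pivot_to_rowrecs() interface as documented; verify against your installed version):

```r
library("cdata")

# the same "a" as above
a <- wrapr::build_frame(
   "Ind" , "Treatment", "value" |
   "Ind1", "Treat"    , 1       |
   "Ind2", "Treat"    , 2       |
   "Ind1", "Cont"     , 3       |
   "Ind2", "Cont"     , 4       )

# one-call spread/pivot: Treatment values become columns, filled from value
pivot_to_rowrecs(
  a,
  columnToTakeKeysFrom   = "Treatment",
  columnToTakeValuesFrom = "value",
  rowKeyColumns          = "Ind")
```

This should produce the same row-record form as the layout-based solution above.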

Don’t worry if you didn’t notice that this example is a spread; one of the values of cdata is that you shouldn’t have to think about it. Most of the examples we show here are neither a simple spread/pivot nor a simple gather/unpivot.

By now you should be able to see the cdata solution always follows a very similar path. We try not to let the nature of the data layout transform ("easy" versus "hard") dictate the solution method. Always slow down and draw out the "before" and "after" shapes before attempting to solve the problem.

Example 4

Convert data that has one different observation for each column to a data that has all observations in rows. That is take a to b in the following.

a <- wrapr::build_frame( "Name" , "50", "100", "150", "200", "250", "300", "350" | "Carla", 1.2 , 1.8 , 2.2 , 2.3 , 3 , 2.5 , 1.8 | "Mace" , 1.5 , 1.1 , 1.9 , 2 , 3.6 , 3 , 2.5 ) b <- wrapr::build_frame( "Name" , "Time", "Score" | "Carla", 50 , 1.2 | "Carla", 100 , 1.8 | "Carla", 150 , 2.2 | "Carla", 200 , 2.3 | "Carla", 250 , 3 | "Carla", 300 , 2.5 | "Carla", 350 , 1.8 | "Mace" , 50 , 1.5 | "Mace" , 100 , 1.1 | "Mace" , 150 , 1.9 | "Mace" , 200 , 2 | "Mace" , 250 , 3.6 | "Mace" , 300 , 3 | "Mace" , 350 , 2.5 )

Here a record corresponds to a single observation (keyed by Name), and the incoming data is arranged in row records:

Name 50 100 150 200 250 300 350 Carla 1.2 1.8 2.2 2.3 3 2.5 1.8

This particular transformation, from a single row of values to a single column of values (with multiple key columns), is the transform commonly called gather, melt, or unpivot. This is a very common transformation, probably the most common one by far. Again, cdata has a convenience function, pivot_to_blocks() (or its alias unpivot_to_blocks()).
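A quick sketch of that convenience route (the argument names follow cdata’s unpivot_to_blocks() interface as documented; verify against your installed version):

```r
library("cdata")

a <- wrapr::build_frame(
   "Name" , "50", "100", "150", "200", "250", "300", "350" |
   "Carla", 1.2 , 1.8  , 2.2  , 2.3  , 3    , 2.5  , 1.8   |
   "Mace" , 1.5 , 1.1  , 1.9  , 2    , 3.6  , 3    , 2.5   )

# one-call gather/unpivot: measurement columns become (Time, Score) rows
unpivot_to_blocks(
  a,
  nameForNewKeyColumn   = "Time",
  nameForNewValueColumn = "Score",
  columnsToTakeFrom     = setdiff(colnames(a), "Name"))
```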

Here, we will do the transform "the long way" with rowrecs_to_blocks_spec(). As we have a large number of columns we will use a helper function to specify the data layout transform.

library("cdata")

# how to find records
recordKeys <- "Name"

# specify the outgoing record shape, using a helper function
# (and print it -- notice that it's a data frame)
( outgoing_record <- build_unpivot_control(
    nameForNewKeyColumn = "Time",
    nameForNewValueColumn = "Score",
    columnsToTakeFrom = setdiff(colnames(a), recordKeys)) )
#>   Time Score
#> 1   50    50
#> 2  100   100
#> 3  150   150
#> 4  200   200
#> 5  250   250
#> 6  300   300
#> 7  350   350

# put it all together into a layout
layout <- rowrecs_to_blocks_spec(
  outgoing_record,
  recordKeys = recordKeys)

# confirm we have the right layout
print(layout)
#> {
#> row_record <- wrapr::qchar_frame(
#>   "Name" , "50", "100", "150", "200", "250", "300", "350" |
#>   .      , 50  , 100  , 150  , 200  , 250  , 300  , 350   )
#> row_keys <- c('Name')
#>
#> # becomes
#>
#> block_record <- wrapr::qchar_frame(
#>   "Name" , "Time", "Score" |
#>   .      , "50"  , 50      |
#>   .      , "100" , 100     |
#>   .      , "150" , 150     |
#>   .      , "200" , 200     |
#>   .      , "250" , 250     |
#>   .      , "300" , 300     |
#>   .      , "350" , 350     )
#> block_keys <- c('Name', 'Time')
#>
#> # args: c(checkNames = TRUE, checkKeys = FALSE, strict = FALSE, allow_rqdatatable = TRUE)
#> }

# apply the layout
a %.>%
  layout %.>%
  transform(., Time = as.numeric(Time)) %.>%
  # sort the data frame by Name and then Time
  .[order(.$Name, .$Time), , drop = FALSE] %.>%
  knitr::kable(., row.names = FALSE)

Name Time Score
Carla 50 1.2
Carla 100 1.8
Carla 150 2.2
Carla 200 2.3
Carla 250 3.0
Carla 300 2.5
Carla 350 1.8
Mace 50 1.5
Mace 100 1.1
Mace 150 1.9
Mace 200 2.0
Mace 250 3.6
Mace 300 3.0
Mace 350 2.5

Example 5

Convert from a to b.

a <- wrapr::build_frame( "id" , "trt", "work.T1", "play.T1", "talk.T1", "total.T1", "work.T2", "play.T2", "talk.T2", "total.T2" | "x1.1", "cnt", 0.3443 , 0.7842 , 0.1079 , 0.888 , 0.6484 , 0.8795 , 0.7234 , 0.5631 | "x1.2", "tr" , 0.06132 , 0.8427 , 0.3339 , 0.04686 , 0.2348 , 0.1971 , 0.5164 , 0.7618 ) b <- wrapr::build_frame( "id" , "trt", "time", "work" , "play", "talk", "total" | "x1.1", "cnt", "T1" , 0.3443 , 0.7842, 0.1079, 0.888 | "x1.1", "cnt", "T2" , 0.6484 , 0.8795, 0.7234, 0.5631 | "x1.2", "tr" , "T1" , 0.06132, 0.8427, 0.3339, 0.04686 | "x1.2", "tr" , "T2" , 0.2348 , 0.1971, 0.5164, 0.7618 )

A record is an observation, keyed by id (plus trt, which is a function of id).

id trt work.T1 play.T1 talk.T1 total.T1 work.T2 play.T2 talk.T2 total.T2 x1.1 cnt 0.3443 0.7842 0.1079 0.888 0.6484 0.8795 0.7234 0.5631

The incoming data is in row record format, so we can use rowrecs_to_blocks_spec().

library("cdata") # identify the record keys recordKeys <- c("id", "trt") # specify the outgoing record shape outgoing_record <- wrapr::qchar_frame( "time" , "work" , "play" , "talk" , "total" | "T1" , work.T1, play.T1, talk.T1, total.T1 | "T2" , work.T2, play.T2, talk.T2, total.T2 ) # put it all together into a layout layout <- rowrecs_to_blocks_spec( outgoing_record, recordKeys = recordKeys) # confirm we have the right layout print(layout) #> { #> row_record <- wrapr::qchar_frame( #> "id" , "trt", "work.T1", "work.T2", "play.T1", "play.T2", "talk.T1", "talk.T2", "total.T1", "total.T2" | #> . , . , work.T1 , work.T2 , play.T1 , play.T2 , talk.T1 , talk.T2 , total.T1 , total.T2 ) #> row_keys <- c('id', 'trt') #> #> # becomes #> #> block_record <- wrapr::qchar_frame( #> "id" , "trt", "time", "work" , "play" , "talk" , "total" | #> . , . , "T1" , work.T1, play.T1, talk.T1, total.T1 | #> . , . , "T2" , work.T2, play.T2, talk.T2, total.T2 ) #> block_keys <- c('id', 'trt', 'time') #> #> # args: c(checkNames = TRUE, checkKeys = FALSE, strict = FALSE, allow_rqdatatable = TRUE) #> } # apply the layout a %.>% layout %.>% # reorder the frame by the record keys plus time .[wrapr::orderv(.[ , c(recordKeys, "time"), drop = FALSE]), , drop = FALSE] %.>% knitr::kable(., row.names = FALSE) id trt time work play talk total x1.1 cnt T1 0.34430 0.7842 0.1079 0.88800 x1.1 cnt T2 0.64840 0.8795 0.7234 0.56310 x1.2 tr T1 0.06132 0.8427 0.3339 0.04686 x1.2 tr T2 0.23480 0.1971 0.5164 0.76180 Reversing Transforms

cdata transform specifications are usually reversible or invertible (and this can be enforced). So in solving any one of the above problems the user has complete freedom to try and solve "moving from a to b" or "moving from b to a" (and can pick whichever they find easier).

For example continuing with example 5, we can reverse the data layout transform using the t() function.

inv_layout <- t(layout) print(inv_layout) #> { #> block_record <- wrapr::qchar_frame( #> "id" , "trt", "time", "work" , "play" , "talk" , "total" | #> . , . , "T1" , work.T1, play.T1, talk.T1, total.T1 | #> . , . , "T2" , work.T2, play.T2, talk.T2, total.T2 ) #> block_keys <- c('id', 'trt', 'time') #> #> # becomes #> #> row_record <- wrapr::qchar_frame( #> "id" , "trt", "work.T1", "work.T2", "play.T1", "play.T2", "talk.T1", "talk.T2", "total.T1", "total.T2" | #> . , . , work.T1 , work.T2 , play.T1 , play.T2 , talk.T1 , talk.T2 , total.T1 , total.T2 ) #> row_keys <- c('id', 'trt') #> #> # args: c(checkNames = TRUE, checkKeys = FALSE, strict = FALSE, allow_rqdatatable = TRUE) #> } # apply the inverse layout b %.>% inv_layout %.>% knitr::kable(.) id trt work.T1 work.T2 play.T1 play.T2 talk.T1 talk.T2 total.T1 total.T2 x1.1 cnt 0.34430 0.6484 0.7842 0.8795 0.1079 0.7234 0.88800 0.5631 x1.2 tr 0.06132 0.2348 0.8427 0.1971 0.3339 0.5164 0.04686 0.7618

In this case, the inverse transform recovered the original row and column order of a, but this is not guaranteed.

Package entry points

The main cdata interfaces are given by the following set of methods:

Some convenience functions include:

• pivot_to_rowrecs(), for moving data from multi-row block records with one value per row (a single column of values) to single-row records [spread or dcast].
• pivot_to_blocks()/unpivot_to_blocks(), for moving data from single-row records to possibly multi row block records with one row per value (a single column of values) [gather or melt].
• wrapr::qchar_frame() a helper function for specifying record control table layout specifications.
• wrapr::build_frame() a helper function for specifying data frames.

The package vignettes can be found in the "Articles" tab of the cdata documentation site.

Conclusion

The key step in using cdata is to understand the record structure: what constitutes a record, what it would look like in a single row, and how the records are keyed. This is not always easy. However, understanding your data record layout is worth the effort. Once you understand the record structure of your data, the rest is relatively straightforward. Really all one is doing when using cdata is formalizing the transform "ask" into a machine readable example.

To make your own solutions, we suggest trying one of the above example solutions as a template. The idea of having the data layout transform be simple data (a list of a couple of data.frames) means one can use the full power of R and other R packages to build the data layout transform specification (one isn’t limited to some interface grammar specified by the data layout transform package). The idea of using arbitrary code to build up a data layout transform was used to good end in the grid scatter-plot example here.

We also note the value of being able to print and review the bulk of data layout transform, as it documents expected incoming data columns and interior block record key values.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Setting up continuous multi-platform R package building, checking and testing with R-Hub, Docker and GitLab CI/CD for free, with a working example

Sat, 04/27/2019 - 14:00

(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)

Introduction

In the previous post, we looked at how to easily automate R analysis, modeling, and development work for free using GitLab’s CI/CD. Together with the fantastic R-hub project, we can use GitLab CI/CD to do much more.

In this post, we will take it to the next level by using R-hub to test our development work on many different platforms such as multiple Linux setups, MS Windows and MacOS. We will also show how to automate and continuously execute those multiplatform checks using GitLab CI/CD integration and Docker images.

For those too busy to read, we also provide a working example implementation in a public GitLab repository.

Using R-hub to build, check and test our R package on many platforms

R-hub is a project supported by the R Consortium that offers free R CMD check as a service on different platforms. This enables us to quickly and efficiently check the R package we are developing and make sure it passes all necessary checks on several platforms. As an added bonus, the checks seem to run in a very short time span, so we can have our results at hand within a few minutes.

CI/CD running checks on multiple platforms with R-hub

Getting started with R-hub

Getting started with R-hub is also very simple and can be achieved in 3 lines of code, from a package directory or an RStudio project for a package:

# Install the package
install.packages("rhub")

# Validate your e-mail address
# Provide the email argument if not detected automatically
rhub::validate_email()

# In an interactive session,
# this will offer a list of platforms to choose from
cr <- rhub::check()

Your validated_emails.csv should be saved into rappdirs::user_data_dir("rhub", "rhub") directory once validate_email() was run successfully.

For more details on getting started, the Get started with rhub post has you covered in detail.

Using and evaluating R-hub check results via R scripts

For continuous integration purposes, we may want to evaluate the results of the check based on the number of errors, warnings, and notes that the check gives for each platform. To achieve this goal, we need to tackle 2 issues:

Getting the results in a non-interactive context

In a non-interactive session, R-hub runs the check asynchronously and ends the process used to request the service, to free up resources. This is great, but it can pose some challenges in the CI context, as we would have to keep a job around to repeatedly query the R-hub job’s status and process the results once done, or implement a much smarter reporting solution.

Luckily, since maximizing efficiency is not our top concern for this purpose, the simple workaround is to execute the check as if in an interactive session via the CI tool. This provides us with the actual results of the check as soon as it is done and also writes the log into our CI’s run log, at the obvious cost of having the process blocked while waiting for the check to finish on R-hub’s servers.

Processing the check results

The public methods for an rhub_check object currently seem to provide only side-effecting results such as printing them in various levels of detail and returning self, so investigating results via code may be challenging.

The simplest current solution is to use the object’s private fields to access the results in the desired format. The example below looks at the status_ private field and returns a data frame with the number of errors, warnings, and notes for each platform. For an object containing only one check result, it can look as follows:

statuses <- cr[[".__enclos_env__"]][["private"]][["status_"]]
res <- do.call(rbind, lapply(statuses, function(thisStatus) {
  data.frame(
    platform = thisStatus[["platform"]][["name"]],
    errors = length(thisStatus[["result"]][["errors"]]),
    warnings = length(thisStatus[["result"]][["warnings"]]),
    notes = length(thisStatus[["result"]][["notes"]]),
    stringsAsFactors = FALSE
  )
}))
res
## platform errors warnings notes
## 1 debian-gcc-release 0 0 0

Now we have a data frame which we can use to signal the CI/CD job to succeed or fail based on our wishes. For example, if we want to fail if the check discovered any notes, warnings or errors, a simple statement like the following will suffice:

if (any(colSums(res[2L:4L]) > 0)) {
  stop("Some checks resulted in errors, warnings or notes.")
}

Putting it together into a script

Now that we have solved the above challenges, we can put it all together into a script that can be later used in the context of a CI/CD job:

# Retrieve passed command line arguments
args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 1L) {
  stop("Incorrect number of args, needs 1: platform (string)")
}
platform <- args[[1L]]

# Check if passed platform is valid
if (!is.element(platform, rhub::platforms()[[1L]])) {
  stop(paste(
    "Given platform not in rhub::platforms()[[1L]]:",
    platform
  ))
}

# Run the check on the selected platform
# Use show_status = TRUE to wait for results
cr <- rhub::check(platform = platform, show_status = TRUE)

# Get the statuses from private field status_
statuses <- cr[[".__enclos_env__"]][["private"]][["status_"]]

# Create and print a data frame with results
res <- do.call(rbind, lapply(statuses, function(thisStatus) {
  data.frame(
    platform = thisStatus[["platform"]][["name"]],
    errors = length(thisStatus[["result"]][["errors"]]),
    warnings = length(thisStatus[["result"]][["warnings"]]),
    notes = length(thisStatus[["result"]][["notes"]]),
    stringsAsFactors = FALSE
  )
}))
print(res)

# Fail if any errors, warnings or notes found
if (any(colSums(res[2L:4L]) > 0)) {
  stop("Some checks had errors, warnings or notes. See above for details.")
}

Preparing a private docker image to use with R-hub

If you are new to Docker, Colin Fay has you covered with his Introduction to Docker for R Users blog post.

Creating and testing an image

Thanks to all the hard work done by the maintainers of the Rocker images, creating an image suitable for use with R-hub is very simple. Essentially we only need two additions to the r-base image:

1. The rhub package and a few system dependencies
2. A validated_emails.csv file placed into the correct directory, providing R-hub with the information on validated e-mail to use for the checks

The following Dockerfile can be used to create such an image for yourself. Just make sure you have your validated_emails.csv file present in the resources folder when running docker build.
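A minimal sketch of such a Dockerfile (the system libraries listed and the copy destination are assumptions; rappdirs::user_data_dir("rhub", "rhub") typically resolves to /root/.local/share/rhub when running as root on Linux):

```dockerfile
# start from the same base image maintained by the Rocker project
FROM r-base

# system libraries often needed to build rhub's dependencies (assumed set; adjust as needed)
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
      libcurl4-openssl-dev libssl-dev libxml2-dev && \
    rm -rf /var/lib/apt/lists/*

# install the rhub package itself
RUN Rscript -e 'install.packages("rhub")'

# place the validated e-mail token where rhub looks for it
COPY resources/validated_emails.csv /root/.local/share/rhub/validated_emails.csv
```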

To test our docker image, we can use a command like the following to create a container and run R within it in an interactive session:

docker run --rm -it <username>/<image>:<tag> R

Now we can see the list of validated e-mails in that R session:

rhub::list_validated_emails()
## email token
## 1 myemail@somemail.com 00000000000000000000

Pushing the image into a private repository

Now that we have our image created, we need to push it to a repository for GitLab CI to be able to use it. Normally this is very simple:

docker push <username>/<image>:<tag>

However, as we are storing some relatively sensitive data in our image, namely our R-hub token, we should probably make this image private. Thanks to Dockerhub, this process is very easy: just click the proper buttons as shown in this post in the Dockerhub docs. Note that a free Dockerhub user has only one private repository available.

Creating a GitLab CI/CD pipeline

For an introduction to using GitLab CI/CD for R work, look at the previous post on How to easily automate R analysis, modeling and development work using CI/CD, with working examples

Setting up a pipeline with .gitlab-ci.yml

Now we are ready with our private Docker image and the script to run and evaluate our R-hub checks; all that is left is to create and set up a CI/CD pipeline. For GitLab CI/CD, this means creating a .gitlab-ci.yml file in the root of our GitLab repository directory. Without much extra talk, that file can look as follows:

image: index.docker.io/jozefhajnala/rhub:rbase

stages:
  - check

variables:
  _R_CHECK_CRAN_INCOMING_: "false"
  _R_CHECK_FORCE_SUGGESTS_: "true"

before_script:
  - apt-get update

check_ubuntu:
  stage: check
  script:
    - Rscript inst/rhubcheck.R "ubuntu-gcc-release"

check_fedora:
  stage: check
  script:
    - Rscript inst/rhubcheck.R "fedora-clang-devel"

check_mswin:
  stage: check
  script:
    - Rscript inst/rhubcheck.R "windows-x86_64-devel"

check_macos:
  stage: check
  script:
    - Rscript inst/rhubcheck.R "macos-elcapitan-release"

This file will make sure that:

1. The CI/CD jobs start from the image we have created
2. Will have one stage named check
3. Set a couple of environment variables for R
4. Run four jobs, check_ubuntu, check_fedora, check_mswin, and check_macos, each using Rscript to execute the R script stored under inst/rhubcheck.R with a different argument specifying the platform to check on
Authenticating to use a private repository

Since we have made our Docker image private, GitLab will not be able to use it out of the box; we need to tell it how to authenticate against Dockerhub to pull the private image. There are a few ways to reach this goal; I set up a variable via the Settings -> CI/CD -> Variables option in GitLab’s web UI:

Creating CI/CD variable with GitLab

The variable name should be DOCKER_AUTH_CONFIG and the value:

{
  "auths": {
    "registry.example.com:5000": {
      "auth": "bXlfdXNlcm5hbWU6bXlfcGFzc3dvcmQ="
    }
  }
}

Where

• "registry.example.com:5000" is replaced by our registry, for example "index.docker.io"
• the value for "auth" is replaced by a base64-encoded version of our "<username>:<password>", which we can retrieve for example using R:
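For example, with the base64enc package (an assumption on my part; jsonlite::base64_enc or any other base64 encoder works just as well):

```r
# encode "<username>:<password>" for the "auth" field;
# the dummy credentials here match the example JSON above
base64enc::base64encode(charToRaw("my_username:my_password"))
## [1] "bXlfdXNlcm5hbWU6bXlfcGFzc3dvcmQ="
```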

And that is all! We are now ready to run our checks using a Docker image stored in a private repository. Once we push the .gitlab-ci.yml and inst/rhubcheck.R files to a GitLab repository, the pipeline will be automatically executed every time we push a commit to that repository.

TL;DR: Just show it to me in action

In case you are only interested in seeing the CI/CD pipeline with R-hub implemented for an R package, look at:

References

• R-hub
• R work and GitLab
• R work and Docker

To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

A new image on DigitalOcean to start using RStudio Server without waiting more than 2 minutes

Sat, 04/27/2019 - 02:00

(This article was first published on Pachá, and kindly contributed to R-bloggers)

Motivation

I initially made a closed-access image for DigitalOcean because I wanted to spend my lectures and workshops giving useful examples, not solving software installation issues. Now you can use my RStudio image, which is available on the DigitalOcean Marketplace; it costs $0 beyond the cost of running the server, and it’s also scalable.

For novice users, installing R can be hard. That’s why I wrote a guide with videos in English here and translated it to Spanish here. The best solution I found was to provide my students a ready-to-use droplet, so that during the term the TAs or I can help them install the software.

Description

This is a pre-configured image with open source editions of RStudio Server 1.1 and Shiny Server 1.5. All dependencies are solved, so you can go ahead and use the following compiled R packages (the setup already includes R 3.5, TeX Live 2018 and OpenBLAS):

• Development: devtools, testthat
• Documentation: bookdown, pkgdown, rmarkdown, roxygen2
• Parallelization: doParallel
• Package management: pacman
• Manipulate data: dplyr, janitor, lubridate, purrr, tidyr
• Plots: ggplot2
• Reproducibility: crul, packrat, vcr
• Web applications: shiny
Software included

Package         Version     License
R               3.5.3       GPL 3
RStudio Server  1.1.463     GPL 3
Shiny Server    1.5.9.923   GPL 3
OpenBLAS        0.2.20      BSD 3
TeX Live        2018        GPL 2

Results

If you start with a fresh droplet with Ubuntu Server, you have to add CRAN sources to apt, then install the Tidyverse and follow a few steps to solve dependencies and compile R packages.

I completed all of those steps for you, and I documented them here. For example, RPostgreSQL (amazing package!) requires installing libpq-dev, which can be done using apt-get; all of that and more is already solved.

If you decide to build your own RStudio Server instance, it can take up to 45 minutes considering compilation time. This image gives you the same result in around 2 minutes.

Getting started after deploy Creating an administrator account

It is highly recommended that you create an administrator account separate from root:

adduser paul
usermod -aG sudo paul

Now the user “paul” will be able to update the system and install software.

For the full reference please check this tutorial.

Let’s create three users that will only be able to install R packages to their personal directory (and of course to use R, RStudio and Shiny):
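The post's user-creation commands are not reproduced above; a sketch of what they might look like (the usernames and password below are hypothetical placeholders), mirroring the adduser call used earlier:

```shell
# Create three regular users (placeholder names); since they are not in
# the sudo group, they can use R/RStudio/Shiny and install packages only
# into their personal library, not system-wide
for u in student1 student2 student3; do
  adduser --disabled-password --gecos "" "$u"
  echo "$u:ChangeMe123" | chpasswd
done
```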

From any modern browser such as Firefox, type http://serveripaddress:8787 in the address bar and then log in with any of the users you created before.

Please notice that the droplet already includes different R packages and full LaTeX installation. The next packages are ready to use and you don’t need to re-install them:

• Data visualization: ggplot2 and shiny
• Data wrangling: data.table, dplyr, haven, janitor, jsonlite, lubridate, purrr, readr and tidyr
• Development: crul, devtools, pacman, packrat and testthat
• Documentation: bookdown, pkgdown, rmarkdown, roxygen2
• Parallelization: doParallel
API Creation

In addition to creating a Droplet from the RStudio 1-Click App via the control panel, you can also use the DigitalOcean API.

You can list all 1-Click Apps using the API. As an example, to create a 4GB RStudio Droplet in the SFO2 region, you can use the following curl command. You’ll need to either save your API access token to an environment variable or substitute it into the command below.

curl -X POST -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer '$TOKEN'' -d \
    '{"name":"choose_a_name","region":"sfo2","size":"s-2vcpu-4gb","image":"simplystatistics-rstudio-18-04"}' \
    "https://api.digitalocean.com/v2/droplets"

Get free DigitalOcean credits

My DigitalOcean Referral Link: https://m.do.co/c/6119f0430dad

To leave a comment for the author, please follow the link and comment on their blog: Pachá.

Robust measurement from a 2-way table

Fri, 04/26/2019 - 23:04

(This article was first published on R – Let's Look at the Figures, and kindly contributed to R-bloggers)

I work in a university.  My department runs degree courses that allow students a lot of flexibility in their choice of course “modules”.  (A typical student takes 8 modules per year, and is assessed separately on each module).

After the exams are finished each year, we promise our students to look carefully at the exam marks for each module — to ensure that students taking a “hard” module are not penalized for doing that, and that students taking an “easy” module are not unduly advantaged.

The challenge in this is to separate module difficulty from student ability: we need to be able to tell the difference between (for example) a hard module and a module that was chosen by weaker-than-average students.  This necessitates analysis of the exam marks for all modules together, rather than separately.

The data to be analysed are each student’s score (expressed as a percentage) in each module they took.  It is convenient to arrange those scores in a 2-way table, whose rows are indexed by student IDs, and whose columns correspond to all the different possible modules that were taken.  The task is then to analyse the (typically incomplete) 2-way table, to determine a numerical “module effect” for each module (a relatively high number for each module that was found relatively “easy”, and lower numbers for modules that were relatively “hard”).

A standard method for doing this robustly (i.e., in such a way that the analysis is not influenced too strongly by the performance of a small number of students) is the clever median polish method due to J W Tukey.  My university department has been using median polish now for several years, to identify any strong “module effects” that ought to be taken into account when assessing each student’s overall performance in their degree course.

Median polish works mostly OK, it seems: it gives answers that broadly make sense.  But there are some well known problems, including that it matters which way round the table is presented (i.e., “rows are students”, versus “rows are modules”) — the answer will depend on that.  So median polish is actually not just one method, but two.

When my university department asked me recently to implement its annual median-polish exercise in R, I could not resist thinking a bit about whether there might be something even better than median polish, for this specific purpose of identifying the column effects (module effects) robustly.  This led me to look at some simple “toy” examples, to help understand the principles.  I’ll just show one such example here, to illustrate how it’s possible to do better than median polish in this particular context.

Example: 5 modules, 3 students

> x
       module
student  A  B  C  D  E
      i NA NA NA 45 60
      j NA NA NA 55 60
      k 10 20 30 NA 50

There were five modules (labelled A,B,C,D,E).  Students i, j and k each took a selection of those modules.  It’s a small dataset, but that is deliberate: we can see easily what’s going on in a table this small.  Module E was easier than the others, for example; and student k looks to be the weakest student (since k was outperformed by the other two students in module E, the only one that they all took).

I will call the above table perfect, as far as the measurement of module effects is concerned.  If we assign module effects (−20, −10, 0, 10, 20) to the five modules A,B,C,D,E respectively, then for every pair of modules the observed within-student differences are centered upon the relevant difference in those module effects.  For example, look at modules D and E: student i scores 15 points more in E, while j scores 5 points more in E, and the median of those two differences is 10 — the same as the difference between the proposed “perfect” module effects for D and E.
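The “perfect” property is easy to verify directly in R; a small sketch reconstructing the toy table and checking the D–E pair discussed above:

```r
# Reconstruct the toy table (rows = students, columns = modules)
x <- matrix(c(NA, NA, NA, 45, 60,
              NA, NA, NA, 55, 60,
              10, 20, 30, NA, 50),
            nrow = 3, byrow = TRUE,
            dimnames = list(student = c("i", "j", "k"),
                            module  = c("A", "B", "C", "D", "E")))

# Within-student differences E - D: 15 for i, 5 for j, NA for k
median(x[, "E"] - x[, "D"], na.rm = TRUE)
# [1] 10   (equal to the difference of the proposed module effects, 20 - 10)
```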

When we perform median polish on this table, we get different answers depending on whether we apply the method to the table directly, or to its transpose:

> medpolish(x, na.rm = TRUE, maxiter = 20)
...
Median Polish Results (Dataset: "x")

Overall: 38.75

Row Effects:
    i     j     k
 0.00  5.00 -8.75

Column Effects:
     A      B      C      D      E
-20.00 -10.00   0.00   8.75  20.00

Residuals:
       module
student  A  B  C    D     E
      i NA NA NA -2.5  1.25
      j NA NA NA  2.5 -3.75
      k  0  0  0   NA  0.00

> medpolish(t(x), na.rm = TRUE, maxiter = 20)
...
Median Polish Results (Dataset: "t(x)")

Overall: 36.25

Row Effects:
     A      B      C      D      E
-20.00 -10.00   0.00  11.25  20.00

Column Effects:
     i      j      k
 0.625  5.625 -6.250

Residuals:
      student
module      i      j  k
     A     NA     NA  0
     B     NA     NA  0
     C     NA     NA  0
     D -3.125  1.875 NA
     E  3.125 -1.875  0

Neither of those answers is the same as the “perfect” module-effect measurement that was mentioned above.  The module effect for D as computed by median polish is either 8.75 or 11.25, depending on the orientation of the input table — but not the “perfect 10”.

A better method: Median difference analysis

I decided to implement, in place of median polish, a simple non-iterative method that targets directly the notion of “perfect” measurement that is mentioned above.

The method is in two stages.

Stage 1 computes within-student differences and takes the median of those, for each possible module pair.  For our toy example:

> md <- meddiff(x)
> md
   A   B   C  D   E
A NA -10 -20 NA -40
B  1  NA -10 NA -30
C  1   1  NA NA -20
D  0   0   0 NA -10
E  1   1   1  2  NA

The result here has all of the available median-difference values above the diagonal.  Below the diagonal is the count of how many differences were used in computing each one of those medians.  So, for example, the median difference between modules  D and E is −10; and that was computed from 2 students’ exam scores.

Stage 2 then fits a linear model to the median-difference values, using weighted least squares.  The linear model finds the vector of module effects that most closely approximates the available median differences (i.e., best approximates the numbers above the diagonal).  The weights are simply the counts from the lower triangle of the above matrix.

In this “perfect” example, we achieve the desired perfect answer (which here is presented with E as the “reference” module):

> fit(md)$coefficients
  A   B   C   D   E
-40 -30 -20 -10   0

My plan now is to make these simple R functions robust enough to use for our students’ actual exam marks, and to add also inference on the module-effect values (via a suitably designed bootstrap calculation).

For now, here are my prototype functions in case anyone else wants to play with them:

meddiff <- function(xmat) {
  ## rows are students, columns are modules
  S <- nrow(xmat)
  M <- ncol(xmat)
  result <- matrix(NA, M, M)
  rownames(result) <- colnames(result) <- colnames(xmat)
  for (m in 1:(M-1)) {
    for (mm in (m+1):M) {
      diffs <- xmat[, m] - xmat[, mm]
      ## upper triangle
      result[m, mm] <- median(diffs, na.rm = TRUE)
      ## lower triangle
      result[mm, m] <- sum(!is.na(diffs))
    }
  }
  return(result)
}

fit <- function(m) {
  ## matrix m needs to be fully connected above the diagonal
  upper <- upper.tri(m)
  diffs <- m[upper]
  weights <- t(m)[upper]
  rows <- factor(row(m)[upper])
  cols <- factor(col(m)[upper])
  X <- cbind(model.matrix(~ rows - 1), 0) - cbind(0, model.matrix(~ cols - 1))
  colnames(X) <- colnames(m)
  rownames(X) <- paste0(colnames(m)[rows], "-", colnames(m)[cols])
  result <- lm.wfit(X, diffs, weights)
  result$coefficients[is.na(result$coefficients)] <- 0
  class(result) <- c("meddiff_fit", "list")
  return(result)
}

To cite this entry:
Firth, D (2019). Robust measurement from a 2-way table. Weblog entry at URL https://statgeek.net/2019/04/26/robust-measurement-from-a-2-way-table/


To leave a comment for the author, please follow the link and comment on their blog: R – Let's Look at the Figures.

Frankfurt Data Science Meetup: Opening the Black Box

Fri, 04/26/2019 - 15:40

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

The place to be: Frankfurt Data Science Meetup

We at STATWORX love meetups: mingle with other data scientists, gain insights into their newest projects and approaches while having a beer… what’s not to like?

That’s why STATWORX supports the Frankfurt Data Science Meetup. However, while STATWORX chips in some beer money, the real work is done by the organizers of the meetup, who are doing a terrific job. A big thanks to everyone!

Yesterday’s meetup was extra special for us since our Head of Data Science Fabian Müller gave a talk on explainable machine learning. With data science growing more and more important in almost all fields of business, “cracking open the black box” is a matter of growing relevance. So, do yourself a favor and take a look at Fabian’s slides and code on GitHub. (Even if you’re not that interested in the content, the memes and cute animal pictures are definitely worth it, believe me!)

The takeaways: How and why to open the black box

As an appetizer, here are three of my most important takeaways:

1. It’s not only cool to understand your model but it’s also the responsible thing to do. Because our models are trained on historical data, they mirror historical circumstances. Thus, model-based decisions might perpetuate discriminatory (social) structures. On top, understanding our models helps us to make them more performant.
2. If we use model agnostic methods, which only use the model’s predictions, we can separate learning from explaining. Such model independent methods even enable comparisons across different classes of models. Also, there are many possible angles: We can focus on a single prediction and, e.g. explore what-if scenarios, decompose predictions or identify key features. Alternatively, we can focus on the model as a whole and analyze its structure or performance.
3. Many methods to open the black box even apply to extremely complex models, e.g. CNN’s. The best thing: no matter how complex the model, most methods are intuitively understandable. You might need the help of a neat plot or two, but even data science non-specialists will get the gist.

I hope to have motivated you to check out more talks on the FFM Data Science YouTube channel or, even better, come to the next Meetup in May. We’d love to see you!

Lea Waniek

I am a data scientist at STATWORX. Apart from machine learning, I love to play around with RMarkdown and ggplot2, making data science beautiful inside and out.

STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI.


The post Frankfurt Data Science Meetup: Opening the Black Box appeared first on STATWORX.


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX.

Checking reverse dependencies: the tiny way

Fri, 04/26/2019 - 11:31

(This article was first published on R – Mark van der Loo, and kindly contributed to R-bloggers)

The tools package that comes with base R makes checking reverse dependencies super easy.

1. Build your package tarball (the pkg_x.y.z.tar.gz file).

R CMD build /your/package/location

It is a good idea to make sure that the tarball is in a dedicated directory, because the next step will download and install reverse dependencies in the directory where the tarball resides.

2. In an R terminal type
library(tools)
result <- check_packages_in_dir("/directory/containing/tarball",
                                reverse = list())

The result can be printed and summarized and analyzed further if there is any breakage. Here’s an example of output when I ran this on my gower package today.

> result
Check results for packages in dir '/home/mark/projects/gower/output':
Package sources: 1, Reverse depends: 5
Use summary() for more information.
> summary(result)
Check results for packages in dir '/home/mark/projects/gower/output':
Check status summary:
                  ERROR NOTE OK
  Source packages     0    0  1
  Reverse depends     1    3  1
Check results summary:
gower ... OK
rdepends_ceterisParibus ... NOTE
* checking dependencies in R code ... NOTE
rdepends_lime ... ERROR
* checking tests ... ERROR
* checking re-building of vignette outputs ... WARNING
rdepends_live ... NOTE
* checking dependencies in R code ... NOTE
rdepends_recipes ... NOTE
* checking dependencies in R code ... NOTE
rdepends_simputation ... OK

(Checking the logs in output/rdepends_lime.Rcheck/00check.log shows that lime fails because of a missing JAVA engine [I just updated my OS and have no JAVA installed yet].)

Notes

1. Checking reverse dependencies can be done in parallel by setting the Ncpus argument larger than one.
2. Be aware that the documentation states that (R 3.5.2) This functionality is still experimental: interfaces may change in future versions. Nevertheless, it has worked fine for me so far.
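Putting both notes together, a sketch of a parallel run (the directory path and the Ncpus value are illustrative placeholders; argument names checked against R 3.5.x):

```r
library(tools)

# Check the tarball and its reverse dependencies in parallel;
# reverse = list() uses the default repositories and strong dependencies
result <- check_packages_in_dir("/directory/containing/tarball",
                                reverse = list(),
                                Ncpus = 4)
summary(result)
```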

To leave a comment for the author, please follow the link and comment on their blog: R – Mark van der Loo.

DALEX for keras and parsnip

Fri, 04/26/2019 - 11:00

(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)

DALEX is a set of tools for explanation, exploration and debugging of predictive models. The nice thing about it is that it can be easily connected to different model factories.

Recently Michal Maj wrote a nice vignette on how to use DALEX with models created in keras (an open-source neural-network library in Python with an R interface created by RStudio). Find the vignette here.
Michal compared a keras model against deeplearning from h2o package, so you can check which model won on the Titanic dataset.

Next nice vignette was created by Szymon Maksymiuk. In this vignette Szymon shows how to use DALEX with parsnip models (parsnip is a part of the tidymodels ecosystem, created by Max Kuhn and Davis Vaughan). Models like boost_tree, mlp and svm_rbf are competing on the Titanic data.

These two new vignettes add to our collection of examples showing how to use DALEX with mlr, caret, h2o and other model factories.


To leave a comment for the author, please follow the link and comment on their blog: English – SmarterPoland.pl.

Shiny v1.3.2

Fri, 04/26/2019 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re excited to announce the release of Shiny v1.3.2. This release has two main features: a new reactivity debugging tool we call reactlog, and much faster serving of static file assets.

Introducing reactlog: Visually debug your reactivity issues

Debugging faulty reactive logic can be challenging, as we’ve written and talked about in the past. In particular, some of the most difficult Shiny app bugs to track down are when reactive expressions and observers re-execute either too often (i.e. plots that render multiple times in succession after a single change), or not often enough (i.e. outputs that don’t update when you expected them to).

This release has an important new addition to the Shiny debugging toolbox: reactlog! To use reactlog, execute this line before running your Shiny app:

options(shiny.reactlog = TRUE)

This will instruct Shiny to keep a record of all the interactions between reactive objects.

Then, use your app, reproducing the problematic symptoms. Once you have done that, press Ctrl+F3 (Mac users: Cmd+F3) from within your browser, and you’ll see something like this:

This screen lets you interactively explore the reactive history of your Shiny session. You can step forwards and backwards through time, watching as reactive objects execute, create and sever relationships, invalidate, etc.

Filtering the reactlog

For medium and large Shiny apps, the reactive graph may be pretty crowded when visualized in two dimensions. Two reactlog features help you separate the signal from the noise.

• First, you can use the search field in the upper-right corner to filter by name (such as input or output ID, or the variable name of a reactive expression).

• Second, you can double-click a node or edge in the graph to focus in on it, which will remove all unrelated reactive elements. Double-click on the background to restore the original view.

Together, these features make it easy to find and focus on the relevant objects in your app.

You can find out more in this rstudio::conf talk by Barret Schloerke, or read the docs at the reactlog website.

Improved performance for serving JavaScript and CSS files

In previous versions of Shiny, every HTTP request was handled by R, including requests for static JavaScript and CSS files. For apps that have many add-on interactive components, there could be a dozen or more of these requests. As an R process becomes heavily loaded with long-running computations, the requests for these static files have to fight for a slice of R’s attention.

This is most noticeable when one user’s session affects the startup of another user’s session. A single R process can serve multiple Shiny user sessions, and in previous versions of Shiny, a user’s session could be blocked from loading startup-related JavaScript and CSS files because another user happened to be doing an intensive computation at that moment.

With the new version of Shiny, static files are always served up at lightning speed, no matter what’s going on in R. We accomplished this by adding new static-file serving options to httpuv, using dedicated C++ code paths running on a background thread. This means that computations in R won’t affect the serving of static files, and serving static files won’t affect computations in R. The experience for users of heavily-loaded Shiny applications should be noticeably better. Note that it has always been possible with RStudio Connect and Shiny Server Pro to improve performance by increasing the number of R processes serving an application, but now Shiny itself is more efficient and multithreaded, so each R process can effectively handle more user sessions.

The best part is that you don’t need to do anything to take advantage of these speed improvements—just upgrading Shiny to v1.3.2 will do it!

See the full list of v1.3.0 changes (and v1.3.1, v1.3.2) to learn about minor bug fixes and improvements we’ve made in this release.

Note: A number of users have reported that upgrading to Shiny v1.3.0 (or higher) breaks their apps when running behind an Nginx proxy: the HTML loads, but none of the styles are applied and none of the calculations run. This occurs when Nginx is subtly misconfigured. We’ve posted details and a fix in this RStudio Community post.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

March 2019: “Top 40” New CRAN Packages

Fri, 04/26/2019 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

By my count, two hundred and thirty-three packages stuck to CRAN last month. I have tried to capture something of the diversity of the offerings by selecting packages in ten categories: Computational Methods, Data, Machine Learning, Medicine, Science, Shiny, Statistics, Time Series, Utilities, and Visualization. The Shiny category contains packages that expand on Shiny capabilities, not just packages that implement a Shiny application. It is not clear whether this is going to be a new cottage industry or not.

Computational Methods

DistributionOptimization v1.2.1: Fits Gaussian mixtures by applying genetic algorithms from the GA package; the Gaussian mixture logic stems from AdaptGauss.

latte v0.2.1: Implements connections to LattE for counting lattice points and integration inside convex polytopes, and 4ti2 for algebraic, geometric, and combinatorial problems on linear spaces and front-end tools facilitating their use in the ‘R’ ecosystem. Look here for an example.

nlrx v0.2.0: Provides tools to set up, run, and analyze NetLogo model simulations in R. There is a Getting Started Guide, vignettes for Advanced Configuration, and Examples.

nvctr v0.1.1: Implements the n-vector approach to calculating geographical positions using an ellipsoidal model of the Earth. This package is a translation of the FFI Matlab library described in Gade (2010). The vignette provides examples.

Data

EHRtemporalVariability v1.0: Provides functions to delineate reference changes over time in Electronic Health Records through the projection and visualization of dissimilarities among data temporal batches, and explore results through data temporal heat maps, information geometric temporal (IGT) plots, and a Shiny app. The vignette shows how to use the package.

kayadata v0.4.0: Provides data for Kaya identity variables (population, gross domestic product, primary energy consumption, and energy-related CO2 emissions), and includes utility functions for exploring and plotting fuel mix for a given country or region. See the vignette for examples.

newsanchor v0.1.0: Implements an interface to gather news from the News API. A personal API key is required. The vignette shows how to scrape New York Times online articles.

raustats v0.1.0: Provides functions for downloading Australian economic statistics from the Australian Bureau of Statistics and Reserve Bank of Australia websites. The vignette shows how to use the package.

Machine Learning

akmedoids v0.1.2: Advances a set of R-functions for longitudinal clustering of long-term trajectories, and determines the optimal solution based on the Caliński-Harabasz criterion ( Caliński and Harabasz (1974) ). The vignette works through an extended example.

shapper v0.1.0: Implements a wrapper for the Python shap library that provides SHapley Additive exPlanations (SHAP) for the variables that influence particular observations in machine learning models. There are vignettes for classification and regression.

sparkxgb v0.1.0: Implements a sparklyr extension that provides an interface for XGBoost on Apache Spark. See the README for a brief overview.

xgb2sql v0.1.2: Enables in-database scoring of XGBoost models built in R, by translating trained model objects into SQL queries. See Chen & Guestrin (2016) for details on XGBoost, and the vignette for an overview of the package.

Medicine

ctrdata v0.18: Provides functions for querying, retrieving, and analyzing protocol- and results-related information on clinical trials from two public registers, the European Union Clinical Trials Register and ClinicalTrials.gov. There is a Getting Started Guide and a vignette with examples.

pubtatordb v0.1.3: Provides functions to download PubTator (National Center for Biotechnology Information) annotations, and then create and query a local version of the database. There is a vignette.

tacmagic v0.2.1: Provides functions to facilitate the analysis of positron emission tomography (PET) time activity curve (TAC) data. See Logan et al. (1996) and Aizenstein et al. (2008) for use cases, and the vignette for a detailed overview.

Science

bulletcp v1.0.0: Provides functions to automatically detect groove locations via a Bayesian changepoint detection method, to be used in the data pre-processing step of forensic bullet matching algorithms. See Stephens (1994) for reference, the vignette for the theory, and Mejia et al. in the most recent issue of Significance for the big picture.

earthtide v0.0.5: Ports the Fortran ETERNA 3.4 program by H.G. Wenzel for calculating synthetic Earth tides using the Hartmann and Wenzel (1994) or Kudryavtsev (2004) tidal catalogs. See the vignette for an introduction.

steps v0.2.1: Implements functions to simulate population dynamics across space and time. The Eastern Grey Kangaroo vignette offers an extended example.

Shiny

reactlog v1.0.0: Provides visual insight into that black box of shiny reactivity by constructing a directed dependency graph of the application’s reactive state at any point in a reactive recording. See the vignette for an introduction.

shinyhttr v1.0.0: Modifies the progress() function from the httr package to let it send output to progressBar() function from the shinyWidgets package.

Statistics

CoopGame v0.2.1: Provides a comprehensive set of tools for cooperative game theory with transferable utility, enabling users to create special families of cooperative games, such as bankruptcy games, cost-sharing games, and weighted-voting games. The vignette offers theory and examples.

discfrail v0.1: Provides functions for fitting Cox proportional hazards models for grouped time-to-event data, where the shared group-specific frailties have a discrete non-parametric distribution. See Gasperoni et al. (2018). The vignette shows the math.

fastglm v0.1.1: Provides functions to fit generalized linear models efficiently using RcppEigen. The iteratively reweighted least squares implementation utilizes the step-halving approach of Marschner (2011). There is a vignette.

hettx v0.1.1: Implements methods developed by Ding, Feller, and Miratrix (2016), and Ding, Feller, and Miratrix (2018) for testing whether there is unexplained variation in treatment effects across observations, and for characterizing the extent of the explained and unexplained variation in treatment effects. There are vignettes on heterogeneous treatment effects and systematic variation estimation.

mcmcabn v0.1: Implements a structural MCMC sampler for Directed Acyclic Graphs (DAGs). It supports the new edge reversal move from Grzegorczyk and Husmeier (2008) and the Markov blanket resampling from Su and Borsuk (2016), and three priors: a prior controlling for structure complexity from Koivisto and Sood (2004), an uninformative prior, and a user-defined prior. The vignette provides an overview of the package.

networkABC v0.5-3: Implements a new multi-level approximate Bayesian computation (ABC) algorithm to decipher network data and assess the strength of the inferred links between the network's actors. The vignette provides an example.

retrodesign v0.1.0: Provides tools for working with Type S (Sign) and Type M (Magnitude) errors, as proposed in Gelman and Tuerlinckx (2000) and Gelman & Carlin (2014), using the closed forms solutions for the probability of a Type S/M error from Lu, Qiu, and Deng (2018). The vignette shows how to use Type S and M errors in hypothesis testing.

senssobol v0.1.1: Enables users to compute, bootstrap, and plot up to third-order Sobol indices using the estimators by Saltelli et al. (2010) and Jansen (1999), and calculate the approximation error in the computation of Sobol first and total indices using the algorithm of Khorashadi Zadeh et al. (2017). The vignette provides an overview.

Time Series

DTSg v0.1.2: Provides a class for working with time series data based on data.table and R6 with reference semantics. There are vignettes for Basic and Advanced usage.

RJDemetra v0.1.2: Implements an interface to JDemetra+, the seasonal adjustment software officially recommended to the members of the European Statistical System (ESS) and the European System of Central Banks.

runstats v1.0.1: Provides methods for quickly computing time series sample statistics, including: (1) mean, (2) standard deviation, and (3) variance over a fixed-length window of time-series, (4) correlation, (5) covariance, and (6) Euclidean distance (L2 norm) between short-time pattern and time-series. See the vignette for examples.

Utilities

aweek v0.2.0: Converts dates to arbitrary week definitions. The vignette provides examples.

credentials v1.1: Provides tools for managing SSH and git credentials. See the vignette for details.

cyphr v1.0.1: Implements wrappers using low-level support from sodium and OpenSSL to facilitate using encryption for data analysis. There is an Introduction and a vignette on Data Encryption.

encryptr v0.1.2: Provides functions to encrypt data frame or tibble columns using strong RSA encryption. See README for examples.

lenses v0.0.3: Provides tools for creating and using lenses to simplify data manipulation. Lenses are composable getter/setter pairs for working with data in a purely functional way, which were inspired by the Haskell library lens (Kmett (2012)). For a comprehensive history of lenses, see the lens wiki and look here for examples.

yum v0.0.1: Provides functions to facilitate extracting information in YAML fragments from one or multiple files, optionally structuring the information in a data.tree. See the README file.

Visualization

ggasym v0.1.1: Provides functions for asymmetric matrix plotting with ggplot2. See the vignette for examples.

predict3d v0.1.0: Provides functions for 2- and 3-dimensional plots for multiple regression models using packages ggplot2 and rgl. It supports linear models (lm), generalized linear models (glm), and local polynomial regression fittings (loess). There is a vignette.



To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Counting Pairs

Fri, 04/26/2019 - 02:00

(This article was first published on Analysis of AFL, and kindly contributed to R-bloggers)

Saw a tweet from [@matt_pavlich](https://twitter.com/matt_pavlich) asking twitter roughly how many games he and David Mundy have played together.

Thankfully, you don’t have to wonder anymore and you can reproduce the results yourself and do running counts for your favourite players!

library(tidyverse)
## ── Attaching packages ─────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.1     ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(fitzRoy)
df <- fitzRoy::get_afltables_stats(start_date = "1990-01-01", end_date = Sys.Date())
## Returning data from 1990-01-01 to 2019-04-26
## Downloading data
## Finished downloading data. Processing XMLs
## Warning: Detecting old grouped_df format, replacing `vars` attribute by
## `groups`
## Finished getting afltables data

df$matchID <- paste(df$Season, df$Round, df$Home.team, df$Away.team)
df$name <- paste(df$First.name, df$Surname)

df_data <- df %>%
  filter(Playing.for == "Fremantle") %>%
  select(name, matchID, Playing.for)

df_data %>%
  mutate(n = 1) %>%
  spread(name, n, fill = 0) %>%
  select(-Playing.for, -matchID) %>%
  {crossprod(as.matrix(.))} %>%
  replace(lower.tri(., diag = TRUE), NA) %>%
  reshape2::melt(na.rm = TRUE) %>%
  unite('Pair', c('Var1', 'Var2'), sep = ", ") %>%
  filter(value > 150) %>%
  arrange(desc(value))
##                                  Pair value
## 1   Aaron Sandilands, Matthew Pavlich   232
## 2        David Mundy, Matthew Pavlich   226
## 3     Luke McPharlin, Matthew Pavlich   224
## 4         David Mundy, Michael Johnson  222
## 5       Aaron Sandilands, David Mundy   214
## 6       Matthew Pavlich, Paul Hasleby   200
## 7      Antoni Grover, Matthew Pavlich   191
## 8    Matthew Pavlich, Michael Johnson   187
## 9          David Mundy, Luke McPharlin  186
## 10   Aaron Sandilands, Luke McPharlin   183
## 11          David Mundy, Stephen Hill   183
## 12          David Mundy, Ryan Crowley   175
## 13  Aaron Sandilands, Michael Johnson   173
## 14    Luke McPharlin, Michael Johnson   171
## 15        Shane Parker, Shaun McManus   171
## 16      Matthew Pavlich, Ryan Crowley   167
## 17     Matthew Pavlich, Shaun McManus   164
## 18      Michael Johnson, Ryan Crowley   163
## 19        Matthew Pavlich, Peter Bell   160
## 20       David Mundy, Garrick Ibbotson  158
## 21            Chris Mayne, David Mundy  154
## 22      David Mundy, Hayden Ballantyne  154
## 23          David Mundy, Paul Duffield  153
## 24         Antoni Grover, Paul Hasleby  153
## 25      Matthew Pavlich, Stephen Hill   151
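The crossprod() step is the heart of this script: given a 0/1 match-by-player indicator matrix, crossprod() counts, for every pair of players, the matches in which both appeared. A minimal sketch on made-up toy data (player and match names are hypothetical):

```r
# Toy appearance data: one row per player per match (hypothetical names)
appearances <- data.frame(
  name    = c("A", "B", "A", "B", "C", "A", "C"),
  matchID = c("m1", "m1", "m2", "m2", "m2", "m3", "m3")
)

# Build a match-by-player 0/1 indicator matrix
m <- table(appearances$matchID, appearances$name)

# crossprod(m) = t(m) %*% m: entry [i, j] counts the matches
# in which players i and j both appeared
pairs <- crossprod(m)
pairs["A", "B"]  # A and B appeared together in m1 and m2 -> 2
```

The lower.tri()/melt() steps in the full script then simply drop the duplicate half of this symmetric matrix and reshape it into a long pair-count table.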

Some interesting things you might want to do now you have the script and data, you might want to see which pair has played the most together for each team.

Something you might also want to do is look at the games that they played together in.

df_data %>%
  filter(name %in% c("David Mundy", "Matthew Pavlich")) %>%
  group_by(matchID) %>%
  count(n = n()) %>%
  filter(n == 2)
## # A tibble: 226 x 3
## # Groups:   matchID [226]
##    matchID                            n    nn
##  1 2005 10 Geelong Fremantle          2     2
##  2 2005 11 Fremantle Brisbane Lions   2     2
##  3 2005 12 Sydney Fremantle           2     2
##  4 2005 13 Fremantle North Melbourne  2     2
##  5 2005 14 Adelaide Fremantle         2     2
##  6 2005 15 Fremantle Western Bulldogs 2     2
##  7 2005 16 Carlton Fremantle          2     2
##  8 2005 17 Fremantle Melbourne        2     2
##  9 2005 18 Collingwood Fremantle      2     2
## 10 2005 19 Fremantle Richmond         2     2
## # … with 216 more rows

To leave a comment for the author, please follow the link and comment on their blog: Analysis of AFL.

modelplotr v1.0 now on CRAN: Visualize the Business Value of your Predictive Models

Thu, 04/25/2019 - 23:55

(This article was first published on Modelplot[R/py] introduction, and kindly contributed to R-bloggers)

Summary

In this blog we explain the most valuable evaluation plots to assess the business value of a predictive model. Since these visualisations are not included in most popular model building packages or modules in R and Python, we show how you can easily create these plots for your own predictive models with our modelplotr R package and our modelplotpy Python module (Prefer Python? Read all about modelplotpy here!). This will help you to explain your model's business value in layman's terms to non-techies.

Intro

‘…And as we can see clearly on this ROC plot, the sensitivity of the model at the value of 0.2 on one minus the specificity is quite high! Right?…’.

If your fellow business colleagues didn’t already wander away during your presentation about your fantastic predictive model, it will definitely push them over the edge when you start talking like this. Why? Because the ROC curve is not easy to quickly explain and also difficult to translate into answers on the business questions your spectators have. And these business questions were the reason you’ve built a model in the first place!

What business questions? We build models for all kinds of supervised classification problems. Such as predictive models to select the best records in a dataset, which can be customers, leads, items, events… For instance: You want to know which of your active customers have the highest probability to churn; you need to select those prospects that are most likely to respond to an offer; you have to identify transactions that have a high risk to be fraudulent. During your presentation, your audience is therefore mainly focused on answering questions like: Does your model enable us to select our target audience? How much better are we off, using your model? What will the expected response on our campaign be? What is the financial impact of using your model?

During our model building efforts, we should already be focused on verifying how well the model performs. Often, we do so by training the model parameters on a selection or subset of records and testing the performance on a holdout set or external validation set. We look at a set of performance measures like the ROC curve and the AUC value. These plots and statistics are very helpful to check during model building and optimization whether your model is under- or overfitting and what set of parameters performs best on test data. However, these statistics are not that valuable in assessing the business value of the model you developed.

One reason that the ROC curve is not that useful in explaining the business value of your model, is because it’s quite hard to explain the interpretation of ‘area under the curve’, ‘specificity’ or ‘sensitivity’ to business people. Another important reason that these statistics and plots are useless in your business meetings is that they don’t help in determining how to apply your predictive model: What percentage of records should we select based on the model? Should we select only the best 10% of cases? Or should we stop at 30%? Or go on until we have selected 70%?… This is something you want to decide together with your business colleague to best match the business plans and campaign targets they have to meet. The four plots – the cumulative gains, cumulative lift, response and cumulative response – and three financial plots – costs & revenues, profit and return on investment – we are about to introduce are in our view the best ones for that cause.

Installing modelplotr

Before we start exploring the plots, let’s install our modelplotr package. Since it is available on CRAN it can easily be installed. All source code and latest developments are also available on github.

install.packages('modelplotr')

Now we’re ready to use modelplotr. In the package vignette (also available here) we go into much more detail on our modelplotr package in R and all its functionalities. Here, we’ll focus on using modelplotr with a business example.

Example: Predictive models from caret, mlr, h2o and keras on the Bank Marketing Data Set

Let’s get down to business! Our example is based on a publicly available dataset, called the Bank Marketing Data Set. It is one of the most popular datasets made available on the UCI Machine Learning Repository. The data set comes from a Portuguese bank and deals with a frequently-posed marketing question: whether a customer did or did not acquire a term deposit, a financial product. There are 4 datasets available and the bank-additional-full.csv is the one we use. It contains the information of 41,188 customers and 21 columns of information.

To illustrate how to use modelplotr, let’s say that we work as a data scientist for this bank and our marketing colleagues have asked us to help to select the customers that are most likely to respond to a term deposit offer. For that purpose, we will develop a predictive model and create the plots to discuss the results with our marketing colleagues. Since we want to show you how to build the plots, not how to build a perfect model, we’ll only use six of the available columns as features in our example. Here’s a short description on the data we use:

1. y: has the client subscribed a term deposit?
2. duration: last contact duration, in seconds (numeric)
3. campaign: number of contacts performed during this campaign and for this client
4. pdays: number of days that passed by after the client was last contacted from a previous campaign
5. previous: number of contacts performed before this campaign and for this client (numeric)
6. euribor3m: euribor 3 month rate

Let’s load the data and have a quick look at it:

On this data, we’ve applied some predictive modeling techniques. To show modelplotr can be used for any kind of model, built with numerous packages, we’ve created some models with the caret package, the mlr package, the h2o package and the keras package. These four are very popular R packages to build models with many predictive modeling techniques, such as logistic regression, random forest, XGBoost, SVM, neural nets and many, many others. When you’ve built your model with one of these packages, using modelplotr is super simple.

mlr, caret, h2o and keras provide numerous different algorithms for binary classification. It should be noted that, to use our modelplotr package, you don’t have to use one of these packages to build your models. It’s just a tiny bit more work to use modelplotr for models created differently, using R or even outside of R! More on this in the vignette (see vignette('modelplotr')) or go here. For now, to have a few models to evaluate with our plots, we do take advantage of mlr’s/caret’s/h2o’s/keras’ greatness.

Ok, we’ve generated some predictive models. Now, let’s prepare the data for plotting! We will focus on explaining to our marketing colleagues how good our predictive model works and how it can help them select customers for their term deposit campaign.

library(modelplotr)
# transform datasets and model objects into scored data and calculate ntiles
# preparation steps
scores_and_ntiles <- prepare_scores_and_ntiles(
  datasets = list("train", "test"),
  dataset_labels = list("train data", "test data"),
  models = list("rf", "mnl", "gbm", "nn"),
  model_labels = list("random forest", "multinomial logit",
                      "gradient boosted trees", "artificial neural network"),
  target_column = "y",
  ntiles = 100)
## ... scoring mlr model "rf" on dataset "train".
## ... scoring caret model "mnl" on dataset "train".
## ... scoring h2o model "gbm" on dataset "train".
## ... scoring keras model "nn" on dataset "train".
## ... scoring mlr model "rf" on dataset "test".
## ... scoring caret model "mnl" on dataset "test".
## ... scoring h2o model "gbm" on dataset "test".
## ... scoring keras model "nn" on dataset "test".
## Data preparation step 1 succeeded! Dataframe created.

What just happened? In the prepare_scores_and_ntiles function, we’ve scored the customers in the train dataset and test dataset with their probability to acquire a term deposit, according to the predictive models we’ve just built with caret, mlr, h2o and keras. Aside from the datasets and model objects, we had to specify the name of the target variable and for our convenience, we gave our datasets and models some useful labels. These labels are used in the plots. And we could specify how many ntiles we want – bear with us for a few seconds, we’ll get into the details of ntiles in a while. Let’s have a look at some lines from the dataframe scores_and_ntiles we just created:

model_label                dataset_label  y_true        prob_no    prob_term.deposit  ntl_no  ntl_term.deposit
random forest              train data     no            1.0000000  0.0000000          44      63
multinomial logit          test data      no            0.9818192  0.0181808          33      68
random forest              test data      no            1.0000000  0.0000000           6      85
gradient boosted trees     test data      no            0.9874957  0.0125043          52      49
artificial neural network  train data     no            0.9848772  0.0151228          52      49
artificial neural network  train data     no            0.8355713  0.1644287          85      16
random forest              train data     term.deposit  0.2040000  0.7960000          97       4
gradient boosted trees     test data      no            0.9891243  0.0108757          49      52
multinomial logit          train data     no            0.9089722  0.0910278          68      33
gradient boosted trees     train data     no            0.9943736  0.0056264          36      65

The function created a dataframe with, per model and per record in each dataset, the actual target value (y), the predicted probability per target class (‘no’ and ‘term.deposit’) and the bucketing in ntiles (1 to 100; we’ll explain what they mean shortly).

Now that we have scored all models we want on all datasets we want for the target we want, we can take the last step before plotting. In this second step, we specify the scope of the analysis. For this, we use the plotting_scope function and use the dataframe as our prepared_input. There are some other parameters you might want to tweak. An important parameter is scope, enabling you to set the analysis perspective you want to see in the plots. You can use modelplotr to evaluate your model(s) from several perspectives:

• Interpret just one model (the default)
• Compare the model performance across different datasets
• Compare the performance across different models
• Compare the performance across different target classes

We don’t set the “scope” parameter now, therefore the default – no comparison – is chosen.
We will keep it simple and evaluate – from a business perspective – how well a selected model will perform in a selected dataset for one target class. We can use other parameters to focus on a specific model (let’s say the gradient boosted trees model) and on a specific dataset (let’s pick the test data). When not specified, modelplotr will set a value for you. The default value for the target class is the smallest target category, in our case term.deposit ; since we want to focus on customers that do take term deposits, this default is perfect so we don’t specify it explicitly here:

# transform data generated with prepare_scores_and_ntiles into aggregated data for chosen plotting scope
plot_input <- plotting_scope(
  prepared_input = scores_and_ntiles,
  select_model_label = "gradient boosted trees",
  select_dataset_label = "test data")
## Data preparation step 2 succeeded! Dataframe created.
## "prepared_input" aggregated...
## Data preparation step 3 succeeded! Dataframe created.
##
## No comparison specified, default values are used.
##
## Single evaluation line will be plotted: Target value "term.deposit" plotted for dataset "test data" and model "gradient boosted trees".
##
## -> To compare models, specify: scope = "compare_models"
## -> To compare datasets, specify: scope = "compare_datasets"
## -> To compare target classes, specify: scope = "compare_targetclasses"
## -> To plot one line, do not specify scope or specify scope = "no_comparison".

As you can see in the output above, modelplotr prints some guiding information about the options chosen and unused options to the console. We have generated an aggregated dataframe, with the scope and selections we want and many metrics used for our plots:

scope  model_label  dataset_label  target_class  ntile  neg  pos  tot  pct  negtot  postot  tottot  pcttot  cumneg  cumpos  cumtot  cumpct  gain  cumgain  gain_ref  gain_opt  lift  cumlift  cumlift_ref  legend
no_comparison  gradient boosted trees  test data  term.deposit  0  0  0  0  NA  NA  NA  NA  NA  0  0  0  NA  0.0000000  0.0000000  0.00  0.0000000  NA  NA  1  term.deposit
no_comparison  gradient boosted trees  test data  term.deposit  1  27  97  124  0.7822581  11014  1343  12357  0.1086833  27  97  124  0.7822581  0.0722264  0.0722264  0.01  0.0923306  7.197590  7.197590  1  term.deposit
no_comparison  gradient boosted trees  test data  term.deposit  2  22  102  124  0.8225806  11014  1343  12357  0.1086833  49  199  248  0.8024194  0.0759494  0.1481757  0.02  0.1846612  7.568599  7.383095  1  term.deposit
no_comparison  gradient boosted trees  test data  term.deposit  3  45  78  123  0.6341463  11014  1343  12357  0.1086833  94  277  371  0.7466307  0.0580789  0.2062547  0.03  0.2762472  5.834807  6.869781  1  term.deposit
no_comparison  gradient boosted trees  test data  term.deposit  4  48  76  124  0.6129032  11014  1343  12357  0.1086833  142  353  495  0.7131313  0.0565897  0.2628444  0.04  0.3685778  5.639349  6.561552  1  term.deposit
no_comparison  gradient boosted trees  test data  term.deposit  5  46  77  123  0.6260163  11014  1343  12357  0.1086833  188  430  618  0.6957929  0.0573343  0.3201787  0.05  0.4601638  5.760002  6.402020  1  term.deposit

(plot_input has 101 rows and 25 columns)

We’re ready to start plotting now!

Let’s introduce the Gains, Lift and (cumulative) Response plots.

Before we throw more code and output at you, let’s get you familiar with the plots we so strongly advocate to use to assess a predictive model’s business value. Although each plot sheds light on the business value of your model from a different angle, they all use the same data:

• Predicted probability for the target class
• Equally sized groups based on this predicted probability, named ntiles
• Actual number of observed target class observations in these groups (ntiles)

Regarding the ntiles: It’s common practice to split the data to score into 10 equally large groups and call these groups deciles. Observations that belong to the top-10% with highest model probability in a set, are in decile 1 of that set; the next group of 10% with high model probability are decile 2 and finally the 10% observations with the lowest model probability on the target class belong to decile 10.

Notice that modelplotr lets you specify the number of equally sized groups with the parameter ntiles. Hence, ntiles=100 results in 100 equally sized groups, with the 1% with the highest model probability in the first group and the 1% with the lowest model probability in group 100. These groups are often referred to as percentiles; modelplotr will also label them as such. Any value between 4 and 100 can be specified for ntiles.
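The ntile bucketing itself can be sketched in a few lines of base R (a toy example; modelplotr handles this internally via its preparation functions):

```r
# Assigning ntiles by hand (a sketch; scores stand in for model probabilities).
# Observations are ranked by predicted probability, highest first,
# so ntile 1 holds the cases the model is most confident about.
set.seed(1)
scores   <- runif(1000)   # stand-in for predicted probabilities
n_groups <- 10            # 10 ntiles = deciles
ntile <- ceiling(rank(-scores, ties.method = "first") / (length(scores) / n_groups))
table(ntile)              # ten equally sized groups of 100 observations each
```

The minus sign in rank(-scores, ...) is what puts the highest probabilities in group 1, matching the left-to-right ordering of the plots.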

Each of the plots in modelplotr places the ntiles on the x axis and another measure on the y axis. The ntiles are plotted from left to right so the observations with the highest model probability are on the left side of the plot. This results in plots like this:

Now that it’s clear what is on the horizontal axis of each of the plots, we can go into more detail on the metrics for each plot on the vertical axis. For each plot, we’ll start with a brief explanation what insight you gain with the plot from a business perspective. After that, we apply it to our banking data and show some neat features of modelplotr to help you explain the value of your wonderful predictive models to others.

1. Cumulative gains plot

The cumulative gains plot – often named ‘gains plot’ – helps you answer the question:

When we apply the model and select the best X ntiles, what % of the actual target class observations can we expect to target?

Hence, the cumulative gains plot visualises the percentage of the target class members you have selected if you would decide to select up until ntile X. This is a very important business question, because in most cases, you want to use a predictive model to target a subset of observations – customers, prospects, cases,… – instead of targeting all cases. And since we won’t build perfect models all the time, we will miss some potential. And that’s perfectly all right, because if we are not willing to accept that, we should not use a model in the first place. Or build only perfect models that score all actual target class members with a 100% probability and all the cases that do not belong to the target class with a 0% probability. However, if you’re such a wizard, you don’t need these plots anyway, or you should have a careful look at your model – maybe you’re cheating?….

So, we’ll have to accept we will lose some. What percentage of the actual target class members you do select with your model at a given ntile, that’s what the cumulative gains plot tells you. The plot comes with two reference lines to tell you how good/bad your model is doing: The random model line and the wizard model line. The random model line tells you what proportion of the actual target class you would expect to select when no model is used at all. This diagonal line runs from the origin (with 0% of cases, you can only have 0% of the actual target class members) to the upper right corner (with 100% of the cases, you have 100% of the target class members). It’s the rock bottom of how your model can perform; are you close to this, then your model is not much better than a coin flip. The wizard model is the upper bound of what your model can do. It starts in the origin and rises as steep as possible towards 100%. Say, exactly 10% of all cases belong to the target category. This means that this line goes steep up from the origin to the value of decile 1 (or percentile 10 in case ntiles=100) and cumulative gains of 100% and remains there for all other ntiles as it is a cumulative measure. Your model will always move between these two reference lines – closer to a wizard is always better – and looks like this:
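Under the hood, the cumulative gains curve is just a cumulative sum over the ranked outcomes. A sketch with simulated data (not the banking example):

```r
# Computing cumulative gains by hand (simulated data)
set.seed(7)
prob   <- runif(2000)               # stand-in model scores
actual <- rbinom(2000, 1, prob)     # outcomes correlated with the scores
ord    <- order(prob, decreasing = TRUE)         # best scores first

cumgain      <- cumsum(actual[ord]) / sum(actual)  # % of targets captured so far
pct_selected <- seq_along(actual) / length(actual)

# Random-model reference: gains equal to % selected (the diagonal);
# a useful model's cumgain curve lies above it
cumgain[400]  # share of all targets captured when selecting the top 20% (400/2000)
```

The wizard reference line would instead rise with slope 1/mean(actual) until it hits 100%.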

Back to our business example. How many of the term deposit buyers can we select with the top-20% of our predictive models? Let’s find out! To generate the cumulative gains plot, we can simply call the function plot_cumgains():

# plot the cumulative gains plot plot_cumgains(data = plot_input)

We don’t need to specify any other parameters than the data to use, plot_input, which is generated with the plotting_scope() function we ran previously. There are several other parameters available to customize the plot, though. If we want to emphasize the model performance at a specific ntile, we can add both highlighting to the plot and add some explanatory text below the plot. Both are optional, though:

# plot the cumulative gains plot and annotate the plot at percentile = 20 plot_cumgains(data = plot_input,highlight_ntile = 20)

Our highlight_ntile parameter adds some guiding elements to the plot at ntile 20 as well as a text box below the graph with the interpretation of the plot at ntile 20 in words. This interpretation is also printed to the console. Our simple model – only 6 predictors were used – seems to do a nice job selecting the customers interested in buying term deposits. When we select 20% with the highest probability according to gradient boosted trees, this selection holds 87% of all term deposit cases in test data. With a perfect model, we would have selected 100%, since less than 20% of all customers in the test set buy term deposits. A random pick would only hold 20% of the customers with term deposits. How much better than random do we do? This brings us to plot number two!

2. Cumulative lift plot

The cumulative lift plot, often referred to as lift plot or index plot, helps you answer the question:

When we apply the model and select the best X ntiles, how many times better is that than using no model at all?

The lift plot helps you in explaining how much better selecting based on your model is compared to taking random selections instead. Especially when models are not yet used that often within your organisation or domain, this really helps business understand what selecting based on models can do for them.

The lift plot only has one reference line: the ‘random model’. With a random model we mean that each observation gets a random number and all cases are divided into ntiles based on these random numbers. When we would do that, the % of actual target category observations in each ntile would be equal to the overall % of actual target category observations in the total set. Since the lift is calculated as the ratio of these two numbers, we get a horizontal line at the value of 1. Your model should however be able to do better, resulting in a high ratio for ntile 1. How high the lift can get, depends on the quality of your model, but also on the % of target class observations in the data: If 50% of your data belongs to the target class of interest, a perfect model would ‘only’ do twice as good (lift: 2) as a random selection. With a smaller target class value, say 10%, the model can potentially be 10 times better (lift: 10) than a random selection. Therefore, no general guideline of a ‘good’ lift can be specified. Towards the last ntile, since the plot is cumulative, with 100% of cases, we have the whole set again and therefore the cumulative lift will always end up at a value of 1. It looks like this:
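The lift calculation itself is just the cumulative response rate divided by the overall rate, which is why it always ends at 1; a sketch with simulated data:

```r
# Cumulative lift by hand (simulated data)
set.seed(7)
prob   <- runif(2000)              # stand-in model scores
actual <- rbinom(2000, 1, prob)    # outcomes correlated with the scores
ord    <- order(prob, decreasing = TRUE)

cum_rate <- cumsum(actual[ord]) / seq_along(actual)  # response rate so far
overall  <- mean(actual)                             # random-model baseline
lift     <- cum_rate / overall

tail(lift, 1)  # exactly 1: selecting everyone equals the baseline
```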

To generate the cumulative lift plot for our gradient boosted trees model predicting term deposit buying, we call the function plot_cumlift(). Let’s add some highlighting to see how much better a selection based on our model containing the best 20 (perce)ntiles would be, compared to a random selection of 20% of all customers:

# plot the cumulative lift plot and annotate the plot at percentile = 20 plot_cumlift(data = plot_input,highlight_ntile = 20) ## ## Plot annotation for plot: Cumulative lift ## - When we select 20% with the highest probability according to model gradient boosted trees in test data, this selection for term.deposit cases is 4.3 times better than selecting without a model. ## ##

A term deposit campaign targeted at a selection of 20% of all customers based on our gradient boosted trees model can be expected to have a more than 4 times higher response (434%) compared to a random sample of customers. Not bad, right? The cumulative lift really helps in getting a positive return on marketing investments. It should be noted, though, that since the cumulative lift plot is relative, it doesn’t tell us how high the actual response to our campaign will be…
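To see where those lift numbers come from, here is a minimal base-R sketch of the lift calculation. The data and column names below are made up for illustration; they are not modelplotr internals:

```r
# Hypothetical scored data: model probability and actual outcome (1 = target class)
set.seed(42)
df <- data.frame(prob = runif(1000))
df$actual <- rbinom(1000, 1, df$prob)  # higher scores -> more target cases

overall_rate <- mean(df$actual)

# Assign each observation to one of 10 ntiles (ntile 1 = highest scores)
df$ntile <- cut(rank(-df$prob, ties.method = "first"),
                breaks = 10, labels = FALSE)

# Lift per ntile: response rate in the ntile / overall response rate
lift <- tapply(df$actual, df$ntile, mean) / overall_rate
round(lift, 2)  # ntile 1 well above 1, last ntiles well below 1
```

With a random model, every ntile would have a lift of about 1; a useful model concentrates targets in the first ntiles.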

3. Response plot

One of the easiest to explain evaluation plots is the response plot. It simply plots the percentage of target class observations per ntile. It can be used to answer the following business question:

When we apply the model and select ntile X, what is the expected % of target class observations in that ntile?

The plot has one reference line: The % of target class cases in the total set. It looks like this:

A good model starts with a high response value in the first ntile(s) and then drops quickly towards 0 for later ntiles. This indicates good differentiation between target class members – which get high model scores – and all other cases. An interesting point in the plot is where your model’s line intersects the random model line. From that ntile onwards, the % of target class cases is lower than a random selection of cases would hold.

To generate the response plot for our term deposit model, we can simply call the function plot_response(). Let’s immediately highlight the plot to have the interpretation of the response plot at (perce)ntile 10 added to the plot:

# plot the response plot and annotate the plot at ntile = 10
plot_response(data = plot_input, highlight_ntile = 10)
## Plot annotation for plot: Response
## - When we select ntile 10 according to model gradient boosted trees in dataset test data the % of term.deposit cases in the selection is 50.4%.

As the plot shows and the text below it states: when we select decile 1 (ntile 10) according to model gradient boosted trees in dataset test data, the % of term deposit cases in the selection is 50.4%. This is quite good, especially when compared to the overall likelihood of 12%. The response in the 20th ntile is much lower, about 10%. From ntile 22 onwards, the expected response is lower than the overall likelihood of 12%. However, most of the time our model will be used to select from the top ntile down to some cutoff. That makes it even more relevant to have a look at the cumulative version of the response plot. And guess what, that’s our final plot!

4. Cumulative response plot

Finally, one of the most used plots: the cumulative response plot. It answers the question burning on every business rep’s lips:

When we apply the model and select up until ntile X, what is the expected % of target class observations in the selection?

The reference line in this plot is the same as in the response plot: the % of target class cases in the total set.

Whereas the response plot crosses the reference line, the cumulative response plot never crosses it; it only meets it at the last ntile: selecting all cases up until ntile 100 is the same as selecting all cases, hence the % of target class cases will be exactly the same. This plot is most often used to decide – together with business colleagues – up until what decile to select for a campaign.
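The arithmetic linking the response and cumulative response plots is simple; here is a sketch with made-up per-ntile counts (the numbers are illustrative, not from the banking data):

```r
# Per-ntile observation counts and target counts (hypothetical, 10 ntiles)
n_per_ntile <- rep(100, 10)
targets     <- c(51, 40, 25, 15, 10, 7, 5, 3, 2, 2)

response     <- targets / n_per_ntile                  # response plot values
cum_response <- cumsum(targets) / cumsum(n_per_ntile)  # cumulative response values

round(cum_response, 3)
# The value at the last ntile equals the overall target rate,
# which is why the cumulative response plot always ends on the reference line
cum_response[10] == sum(targets) / sum(n_per_ntile)
```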

Back to our banking business case. To generate the cumulative response plot, we call the function plot_cumresponse(). Let’s highlight it at percentile 30 to see what the overall expected response will be if we select prospects for a term deposit offer based on our gradient boosted trees model:

# plot the cumulative response plot and annotate the plot at percentile = 30
plot_cumresponse(data = plot_input, highlight_ntile = 30)
## Plot annotation for plot: Cumulative response
## - When we select ntiles 1 until 30 according to model gradient boosted trees in dataset test data the % of term.deposit cases in the selection is 34.9%.

When we select ntiles 1 until 30 according to model gradient boosted trees in dataset test data, the % of term deposit cases in the selection is about 35%. Since the test data is an independent set, not used to train the model, we can expect the response on the term deposit campaign to be around 35%.

The cumulative response percentage at a given decile is a number your business colleagues can really work with: is that response big enough to have a successful campaign, given costs and other expectations? Will the absolute number of sold term deposits meet the targets? Or do we lose too much of all potential term deposit buyers by only selecting the top 30%? To answer that question, we can go back to the cumulative gains plot. And that’s why there’s no absolute winner among these plots and we advise using them all. To make that happen, there’s also a function to easily combine all four plots.

All four plots together

With the function call plot_multiplot we get the previous four plots on one grid. We can easily save it to a file to include it in a presentation or share it with colleagues.

# plot all four evaluation plots and save to file, highlight decile 2
plot_multiplot(data = plot_input, highlight_ntile = 2,
               save_fig = TRUE, save_fig_filename = 'Selection model Term Deposits')
## Plot annotation for plot: Cumulative gains
## - When we select 20% with the highest probability according to model gradient boosted trees, this selection holds 87% of all term.deposit cases in test data.
##
## Plot annotation for plot: Cumulative lift
## - When we select 20% with the highest probability according to model gradient boosted trees in test data, this selection for term.deposit cases is 4.3 times better than selecting without a model.
##
## Plot annotation for plot: Response
## - When we select ntile 2 according to model gradient boosted trees in dataset test data the % of term.deposit cases in the selection is 32.2%.
##
## Plot annotation for plot: Cumulative response
## - When we select ntiles 1 until 2 according to model gradient boosted trees in dataset test data the % of term.deposit cases in the selection is 47.2%.
##
## Warning: No location for saved plot specified! Plot is saved to tempdir(). Specify 'save_fig_filename' to customize location and name.
## Plot is saved as: C:\Users\J9AF3~1.NAG\AppData\Local\Temp\Rtmpc5waFv/Selection model Term Deposits.png

And there’s more! To plot the financial implications of implementing a predictive model, modelplotr provides three additional plots: the Costs & revenues plot, the Profit plot and the ROI plot. So, when you know what the fixed costs, variable costs and revenues per sale associated with a campaign based on your model are, you can use these to visualize financial consequences of using your model. Here, we’ll just show the Profit plot. See the package vignette for more details on the Costs & Revenues plot and the Return on Investment plot.

The profit plot visualizes the cumulative profit up until each ntile when the model is used for campaign selection. It can be used to answer the following business question:

When we apply the model and select up until ntile X, what is the expected profit of the campaign?

From this plot, it can quickly be spotted at what selection size the campaign profit is maximized. However, this does not mean that this is the best option from an investment point of view.

Business colleagues should be able to tell you the expected costs and revenues regarding the campaign. Let’s assume they told us fixed costs for the campaign (a tv commercial and some glossy print material) are in total € 75,000 and each targeted customer costs another € 50 (prospects are called and receive an incentive) and the expected revenue per new term deposit customer is € 250 according to the business case.

# Profit plot - automatically highlighting the ntile where profit is maximized!
plot_profit(data = plot_input, fixed_costs = 75000,
            variable_costs_per_unit = 50, profit_per_unit = 250)
## Plot annotation for plot: Profit
## - When we select ntiles 1 until 19 in dataset test data using model gradient boosted trees to target term.deposit cases the expected profit is 93,850

Using this plot, we can decide to select the top 19 ntiles according to our model to maximize profit and earn about € 94,000 with this campaign. Both decreasing and increasing the selection based on the model would harm profits, as the plot clearly shows.
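The profit curve behind this plot is simple arithmetic. Below is a sketch using the same fixed costs, cost per target and revenue per sale as in the business case; the customer count and the cumulative response curve are made-up assumptions, not modelplotr output:

```r
# Business case figures from the text; the rest is illustrative
total_customers  <- 30000
fixed_costs      <- 75000
cost_per_target  <- 50
revenue_per_sale <- 250

ntile <- 1:100
cum_selected <- ntile / 100 * total_customers
# Hypothetical cumulative response per percentile, decaying from 50% to 12%
cum_response <- 0.12 + 0.38 * exp(-(ntile - 1) / 15)

cum_sales <- cum_selected * cum_response
profit <- revenue_per_sale * cum_sales -
          cost_per_target * cum_selected - fixed_costs

ntile[which.max(profit)]  # selection size that maximizes expected profit
```

Selecting too few customers leaves the fixed costs uncovered; selecting too many adds targets that cost more than the revenue they bring in, which is why the curve has an interior maximum.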

Neat! With these plots, we are able to talk with business colleagues about the actual value of our predictive model, without having to bore them with technicalities or nitty-gritty details. We’ve translated our model into business terms and visualised it to simplify interpretation and communication. Hopefully, this helps many of you in discussing how to optimally take advantage of your predictive model building efforts.

Get more out of modelplotr: using different scopes, highlighting ntiles, customizing text & colors.

As we mentioned earlier, the modelplotr package also enables you to make interesting comparisons, using the scope parameter: comparisons between different models, between different datasets and (in case of a multiclass target) between different target classes. Also, modelplotr provides a lot of customization options: you can highlight plots and add annotation texts to explain them, change all textual elements in the plots and customize line colors. Curious? Please have a look at the package documentation or read our other posts on modelplotr.

However, to give one example, we could compare whether gradient boosted trees was indeed the best choice to select the top-30% customers for a term deposit offer:

# set plotting scope to model comparison
plot_input <- plotting_scope(prepared_input = scores_and_ntiles, scope = "compare_models")
## Data preparation step 2 succeeded! Dataframe created.
## "prepared_input" aggregated...
## Data preparation step 3 succeeded! Dataframe created.
##
## Models "random forest", "multinomial logit", "gradient boosted trees", "artificial neural network" compared for dataset "test data" and target value "term.deposit".

# plot the cumulative response plot and annotate the plot at ntile 20
plot_cumresponse(data = plot_input, highlight_ntile = 20)
## Plot annotation for plot: Cumulative response
## - When we select ntiles 1 until 20 according to model random forest in dataset test data the % of term.deposit cases in the selection is 46.5%.
## - When we select ntiles 1 until 20 according to model multinomial logit in dataset test data the % of term.deposit cases in the selection is 42.1%.
## - When we select ntiles 1 until 20 according to model gradient boosted trees in dataset test data the % of term.deposit cases in the selection is 47.2%.
## - When we select ntiles 1 until 20 according to model artificial neural network in dataset test data the % of term.deposit cases in the selection is 40.7%.

Seems like the algorithm used will not make a big difference in this case. Hopefully you agree by now that using these plots really can make a difference in explaining the business value of your predictive models!

You can read more on how to use modelplotr here. In case you experience issues when using modelplotr, please let us know via the issues section on Github. Any other feedback or suggestions, please let us know via jurriaan.nagelkerke@gmail.com or pb.marcus@hotmail.com. Happy modelplotting!


To leave a comment for the author, please follow the link and comment on their blog: Modelplot[R/py] introduction. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Encryptr now makes it easy to encrypt and decrypt files

Thu, 04/25/2019 - 10:03

(This article was first published on R – DataSurg, and kindly contributed to R-bloggers)

Data security is paramount and encryptr was written to make this easier for non-experts. Columns of data can be encrypted with a couple of lines of R code, and single cells decrypted as required.

But what was missing was an easy way to encrypt the file source of that data.

Now files can be encrypted with a couple of lines of R code.

Encryption and decryption with asymmetric keys is computationally expensive, and this is how encrypt works for data columns. It makes it possible to decrypt each piece of data in a data frame without compromising the whole data frame, on the presumption that each cell contains fewer than 245 bytes of data.

File encryption requires a different approach as files are larger in size. encrypt_file encrypts a file using a symmetric “session” key and the AES-256 cipher. This key is itself then encrypted using a public key generated using genkeys. In OpenSSL this combination is referred to as an envelope.
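The envelope idea can be sketched directly with the openssl package. This is an illustration of the concept using openssl's own functions, not encryptr's actual implementation:

```r
# Hybrid ("envelope") encryption sketch: AES-256 for the data,
# RSA for the symmetric session key
library(openssl)

keypair <- rsa_keygen()            # encryptr generates these with genkeys()
payload <- serialize(iris, NULL)   # stand-in for a file's raw bytes

session_key <- rand_bytes(32)      # random AES-256 session key
iv          <- rand_bytes(16)
ciphertext  <- aes_cbc_encrypt(payload, key = session_key, iv = iv)
locked_key  <- rsa_encrypt(session_key, keypair$pubkey)  # lock key with public key

# Decryption: recover the session key with the private key, then the data
recovered_key <- rsa_decrypt(locked_key, keypair)
plaintext     <- aes_cbc_decrypt(ciphertext, key = recovered_key, iv = iv)
identical(unserialize(plaintext), iris)
```

Only the small session key goes through the expensive RSA step; the bulk of the file is handled by the fast symmetric cipher.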

It should work with any type of single file but not folders.

Documentation is maintained at encrypt-r.org

Generate keys

genkeys()
#> Private key written with name 'id_rsa'
#> Public key written with name 'id_rsa.pub'

Encrypt file

To demonstrate, the included dataset is written as a .csv file.

write.csv(gp, "gp.csv")
encrypt_file("gp.csv")
#> Encrypted file written with name 'gp.csv.encryptr.bin'

Important: check that the file can be decrypted prior to removing the original file from your system.

Warning: it is strongly suggested that the original unencrypted data file is securely stored elsewhere as a back-up in case decryption is not possible, e.g. if the private key file or password is lost.

Decrypt file

The decrypt_file function will not allow the original file to be overwritten, therefore if it is still present, use the option to specify a new name for the unencrypted file.

decrypt_file("gp.csv.encryptr.bin", file_name = "gp2.csv")
#> Decrypted file written with name 'gp2.csv'

Support / bugs

The new version 0.1.3 is on its way to CRAN today or you can install from github:

github.com/SurgicalInformatics/encryptr


To leave a comment for the author, please follow the link and comment on their blog: R – DataSurg.

Probability of winning a best-of-7-series (part 2)

Thu, 04/25/2019 - 09:46

(This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers)

In this previous post, I explored the probability that a team wins a best-of-n series, given that its win probability for any one game is some constant p. As one commenter pointed out, most sports models consider the home team to have an advantage, and this home advantage should affect the probability of winning a series. In this post, I will explore this question, limiting myself (most of the time) to the case of n = 7.

To be as general as possible, instead of giving a team just one parameter p, I will give a team two parameters: p_H for the probability of winning at home, and p_A for winning away. In general we will have p_H > p_A, although that is not always the case. (For example, in the NBA 2018/19 regular season, all but 2 teams had more home wins than road wins, the 2 teams being the Miami Heat and the Chicago Bulls.) In my simulations, I will allow p_H and p_A to take on any values between 0 and 1.

The random variable denoting the number of wins in a best-of-7 series will depend on whether the team starts at home or not. (We will assume our team alternates between playing at home and playing away.) If starting at home, the number of wins has distribution Binom(4, p_H) + Binom(3, p_A); if starting away, the distribution would be Binom(3, p_H) + Binom(4, p_A).

We can modify the sim_fn function from the previous post easily so that it now takes in more parameters (home_p, away_p, and start indicating whether the first game is at “home” or “away”):

sim_fn <- function(simN, n, home_p, away_p, start) {
  if (n %% 2 == 0) {
    stop("n should be an odd number")
  }
  if (start == "home") {
    more_p <- home_p
    less_p <- away_p
  } else {
    more_p <- away_p
    less_p <- home_p
  }
  mean(rbinom(simN, (n+1)/2, more_p) + rbinom(simN, (n-1)/2, less_p) > n / 2)
}
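As a sanity check on the simulation, the best-of-7 series win probability can also be computed exactly by convolving the two independent binomials (this helper is my addition, not from the original post):

```r
# Exact P(win a best-of-7 series) when starting at home:
# wins ~ Binom(4, home_p) + Binom(3, away_p), and we need at least 4 wins
series_win_exact <- function(home_p, away_p) {
  p_home <- dbinom(0:4, 4, home_p)   # wins in the 4 home games
  p_away <- dbinom(0:3, 3, away_p)   # wins in the 3 away games
  total  <- outer(0:4, 0:3, "+")     # all combinations of win counts
  prob   <- outer(p_home, p_away)    # joint probabilities (independence)
  sum(prob[total >= 4])
}

series_win_exact(0.6, 0.6)  # constant-p case, equals 1 - pbinom(3, 7, 0.6)
```

When home_p equals away_p, this reduces to the single-parameter binomial tail from the previous post, which makes it easy to verify.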

The following code sets up a grid of home_p,away_p and start values, then runs 50,000 simulation runs for each combination of those values:

home_p <- seq(0, 1, length.out = 51)
away_p <- seq(0, 1, length.out = 51)
df <- expand.grid(home_p = home_p, away_p = away_p,
                  start = c("home", "away"), stringsAsFactors = FALSE)
set.seed(111)
simN <- 50000
n <- 7
df$win_prob <- apply(df, 1, function(x)
  sim_fn(simN, n, as.numeric(x[1]), as.numeric(x[2]), x[3]))

How can we plot our results? For each combination of (home_p, away_p, start) values, we want to plot the series win probability. A decent 2D representation would be a heatmap:

library(tidyverse)
ggplot(df, aes(x = home_p, y = away_p)) +
  geom_raster(aes(fill = win_prob)) +
  labs(title = paste("Win probability for best of", n, "series"),
       x = "Home win prob", y = "Away win prob") +
  scale_fill_distiller(palette = "Spectral", direction = 1) +
  facet_wrap(~ start, labeller = label_both) +
  coord_fixed() +
  theme(legend.position = "bottom")

While the colors give us a general sense of the trends, we could overlay the heatmap with contours to make comparisons between different points easier. Here is the code; the geom_text_contour function from the metR package labels the contours.

library(metR)
ggplot(df, aes(x = home_p, y = away_p, z = win_prob)) +
  geom_raster(aes(fill = win_prob)) +
  geom_contour(col = "white", breaks = 1:9 / 10) +
  geom_text_contour(col = "#595959") +
  labs(title = paste("Win probability for best of", n, "series"),
       x = "Home win prob", y = "Away win prob") +
  scale_fill_distiller(palette = "Spectral", direction = 1) +
  facet_wrap(~ start, labeller = label_both) +
  coord_fixed() +
  theme(legend.position = "bottom")

For a full 3D surface plot, I like to use the plotly package. This is because it is hard to get information from a static 3D plot: I want to be able to turn the plot around the axes to see it from different angles. plotly allows you to do that, and it also gives you information about the data point that you are hovering over. The code below gives us a 3D surface plot for the start = "home" data. (Unfortunately WordPress.com does not allow me to embed the chart directly in the post, so I am just showing a screenshot to give you a sense of what is possible. Run the code in your own R environment and play with the graph!)

library(plotly)
df_home <- df %>% filter(start == "home")
win_prob_matrix <- matrix(df_home$win_prob, nrow = length(away_p), byrow = TRUE)
plot_ly(x = home_p, y = away_p, z = win_prob_matrix) %>%
  add_surface(contours = list(
    z = list(
      show = TRUE,
      usecolormap = TRUE,
      highlightcolor = "#ff0000",
      project = list(z = TRUE)
    )
  )) %>%
  layout(
    title = paste("Win probability for best of", n, "series (start at home)"),
    scene = list(
      xaxis = list(title = "Home win prob"),
      yaxis = list(title = "Away win prob"),
      zaxis = list(title = "Series win prob")
    )
  )

How much better is it to start at home than away?

To answer this question, we will subtract the probability of winning a series when starting away from that when starting at home. The more positive the result, the more “important” starting at home is.

# difference in starting at home vs. away
df2 <- df %>%
  spread(start, value = win_prob) %>%
  mutate(home_adv = home - away)

# raster plot
ggplot(df2, aes(x = home_p, y = away_p)) +
  geom_raster(aes(fill = home_adv)) +
  labs(title = "Difference in series win prob (starting at home vs. away)",
       x = "Home win prob", y = "Away win prob") +
  scale_fill_distiller(palette = "Spectral", direction = 1) +
  coord_fixed() +
  theme(legend.position = "bottom")

It looks like most of the variation is in the corners. Drawing contours at manually-defined levels helps to give the center portion of the plot more definition (note the “wigglyness” of the contours: this is an artifact of Monte Carlo simulation):

breaks <- c(-0.5, -0.2, -0.1, -0.05, -0.02, 0, 0.02, 0.05, 0.1, 0.2, 0.5)
ggplot(df2, aes(x = home_p, y = away_p, z = home_adv)) +
  geom_raster(aes(fill = home_adv)) +
  geom_contour(col = "white", breaks = breaks) +
  geom_text_contour(breaks = breaks, col = "#595959", check_overlap = TRUE) +
  labs(title = "Difference in series win prob",
       subtitle = "Starting at home vs. away",
       x = "Home win prob", y = "Away win prob") +
  scale_fill_distiller(palette = "Spectral", direction = 1) +
  coord_fixed() +
  theme(legend.position = "bottom")

When does being at home tip you over (or back under) the 50% threshold?

That is, when does starting at home make you more likely to win the series than your opponent, but starting away makes your opponent the favorite? This can be answered easily by performing some data manipulation. In the plot, “Better” means that you are the favorite at home but the underdog away, while “Worse” means the opposite.

threshold <- 0.5
df2$home_impt <- ifelse(df2$home > threshold & df2$away < threshold, 1,
                 ifelse(df2$home < threshold & df2$away > threshold, -1, NA))
df2$home_impt <- factor(df2$home_impt)
df2$home_impt <- fct_recode(df2$home_impt, "Better" = "1", "Worse" = "-1")
df2 %>%
  filter(!is.na(home_impt)) %>%
  ggplot(aes(x = home_p, y = away_p)) +
  geom_raster(aes(fill = home_impt)) +
  labs(title = paste("Does starting at home push you \nover the series win threshold", threshold, "?"),
       x = "Home win prob", y = "Away win prob") +
  scale_fill_manual(values = c("#ff0000", "#0000ff")) +
  coord_fixed() +
  theme(legend.position = "bottom")

The blue area might not look very big, but sometimes teams do fall in it. The plot below is the same as the above, but overlaid with the 30 NBA teams (home/away win probabilities based on the regular season win percentages). The team squarely in the blue area? The Detroit Pistons. The two teams on the edge? The Orlando Magic and the Charlotte Hornets.

How does the length of the series affect the series win probability?

In the analysis above, we focused on best-of-seven series. What happens to the series win probabilities if we change the length of the series? Modifying the code for the first plot above slightly (and doing just 10,000 simulations for each parameter setting), we can get the following plot:

There doesn’t seem to be much difference between best-of-7 and best-of-9 series. What happens as n grows large? The plot for a best-of-1001 series is below; try to figure out why this is the case!

Code for this post (in one file) can be found here.


To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.

Meta analysis of multiple multi-omics data… Oy Vey

Thu, 04/25/2019 - 09:00

(This article was first published on Dror Berel's R Blog, and kindly contributed to R-bloggers)

TL;DR

tidy tibbles can contain non-atomic classes. This is a proof of concept demonstration for such implementation with S4 object-oriented classes, for meta-analysis of complex genomic data.

Motivation

In my previous post I reviewed the evolution of Bioconductor S4 classes for omics data. The most recent extended class is the MultiAssayExperiment for multi-assay experiments, where each (single) assay stores data for thousands of analytes/genes for many subjects.

What if… I want to run a meta-analysis of the multiple multi-omic data?

For example, suppose I want to find a genetic marker/signature (a set of genes) that is unique to a subset of subjects with a specific phenotype (e.g. a disease), and then compare it to another cohort of subjects with a different phenotype. Is that genetic marker unique to a specific cohort of subjects, or is it also shared across different sub-types of the disease?

Public repositories of multiple multi-omic datasets

Below are some of common big public repositories for multi-omics experiments.

ImmuneSpace: a web-portal for integrative modeling of human immunological data. Currently, it contains data from 72 human immunology studies, covering a total of 5,355 participants. Disclaimer: I am a proud member of one of the groups that develop and maintain this repository.

The Cancer Genome Atlas (TCGA)

Cancer Cell Line Encyclopedia (CCLE)

These repositories collect and annotate an enormous amount of genomics data in a unified way that allows proper downstream analysis of the data with minimal additional handling. For simplicity, I will refer to a collection of subjects with the same phenotype (disease type) as a ‘study’.

Each ‘study’ is stored in a single S4 object that contains an entire “universe” of multi-assay data sets, each with multiple genes, for many subjects. This is a collection of very rich data sets, well condensed in a systematic way into a single object in R. And… each of these repositories has many of these ‘universes’.

Image credit: the movie ‘Men in Black’, Columbia Pictures. Now I understand why purrr has a cat on its hexagon logo.

Where should I store my multiple S4 classes?

In a meta-analysis pipeline, I need the flexibility of accessing each ‘Study’ and play with it, that is, get/extract specific data of interest, e.g. an entire group of analytes within a specific assay, a gene, or a phenotype (clinical feature), and run some analytical procedures (normalization, scaling etc) on it. I also need to do it systematically across all of the studies, within each study across multiple assays, within each assay across all subjects and all genes… Oy Vey.

In my early training as a statistician, I was taught that in most data analyses you have a data set with rows for the sampling units and columns for the variables/features. Therefore, it is no surprise that the tidyverse paradigm makes total sense to me. In a meta-analysis the sampling unit is each of the studies. But my data in each cell (unit in my tidy dataset) is not so simple… it is NOT an atomic vector! (i.e. not a number or a string). It is an S4 class. What the #$%& should I do?

Where should I store my multiple S4 objects? I wish I had some sort of data frame that could store non-atomic objects like my S4 classes.

OK, prepare yourself for a scary thought. Is there such a thing in R that can act like a data frame of (nested) data frames? Something like a list of lists, but for data frames?

Luckily, yes, there is! It is called a tibble. It looks like a data frame, and you can access the rows and columns just like any data frame, but in fact each column is a list, and the rows are the elements in that list. And because it is a list, it can also contain non-atomic objects.
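Here is a tiny sketch of those list-column mechanics; lm fits stand in for the (much richer) S4 MultiAssayExperiment objects, and the column names are my own:

```r
library(tibble)
library(purrr)

# A list-column tibble: each row is one "study", the column holds whole objects
studies <- tibble(
  study = c("study_A", "study_B"),
  fit   = list(lm(mpg ~ wt, data = mtcars),
               lm(mpg ~ hp, data = mtcars))
)

# Reach into each nested object and pull a scalar out into a new column
studies$slope <- map_dbl(studies$fit, ~ coef(.x)[2])
studies
```

The same pattern scales up: swap the lm fits for MultiAssayExperiment objects and the coefficient extraction for any accessor into the S4 class.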

For the meta-analysis, all I had to do was create a tibble with a single column, where each row holds a single study. From that column I extracted embedded/nested ‘layers’ of the S4 objects into new columns in my tibble (more details below). I then calculated an association statistic (odds ratios from a logistic regression). And finally, the last step was plotting a forest plot summarizing the association in each study.

How to extract the embedded layers from each of the tibble columns?

Arranging the S4 objects in a tibble column, for the sake of structural simplicity, does not change the fact that it is still a list. And handling a list is sometimes (OK, many times) a lot of pain!

I used to handle such nested lists with lapply, sapply, tapply and similar functions, but it never felt quite stable. I was never confident that the input to these functions would produce the expected output, or what the format of the returned output would be. Is it going to be a list, a number, a character, a table? What if I have multiple lists that I wish to go through simultaneously? It was a mess!

Luckily the purrr package brings some order to that chaos. The map() function unifies the syntax for the returned object, and it goes along nicely with the tidyverse and the pipe operator. It also has an elegant implementation for going through each element simultaneously across multiple lists (map2, pmap). It is kind of like writing as many lapply calls as needed, on the fly, which allowed me to send a (very) long hand into the Kishkas of the S4 class, get/extract what I need from there, and run some analysis on it. It definitely wasn’t trivial at the beginning, but after some time I could finally see the benefit of using it. It eased my coding so much that it felt like typing code with an extra finger.
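A minimal sketch of that simultaneous iteration; the data here is made up, standing in for per-study assay matrices:

```r
library(purrr)

set.seed(1)
assays  <- list(matrix(rnorm(20), nrow = 4),  # two hypothetical assay matrices
                matrix(rnorm(30), nrow = 5))
cutoffs <- list(0.5, 1.0)                     # a per-study threshold

# map2() walks both lists in parallel; here: count values above each cutoff
map2_int(assays, cutoffs, ~ sum(.x > .y))

# pmap() generalizes to any number of parallel lists
pmap_dbl(list(assays, cutoffs), function(a, k) mean(a > k))
```

The typed variants (map2_int, pmap_dbl) are what give the stability lapply lacked: the output length and type are guaranteed up front.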

credit: Disney XD

So finally it all made sense! I had rich, well-annotated omic data with tremendous clinical interpretation potential, and a good hypothesis. I stored the data in a designated S4 container that supports my routine analysis tasks. For the meta-analysis, where my row unit is a single study, I stored my S4 objects in a tibble, and extracted the layers of interest using purrr.

I even had a poster to present at the annual Bioconductor conference.

To sum it up, no need to reinvent the wheel. The secret recipe is very simple: MultiAssayExperiment+tibble+purrr

Let the designated packages do the heavy lifting. This is what they are good at, well, because this is what they were designed to do.
What’s next?

Should there be a Bioconductor package for meta-analysis of multi-omic data? Maybe. Hopefully someone is already working on that right now. It could have designated methods for forest plots and other meta-analysis techniques.

What I find really cool about this tidyverse approach is that it allows one to extend the idea of nested objects and go beyond the current resolution: for example, what about a meta-meta-analysis (this is not a typo)? OK, maybe I got carried away. This can be a really scary idea.

How would you do it with other tools? What could be an interesting next analytical step to follow and extend this idea?

Check more related topics here: https://drorberel.github.io/

first published at my Medium blog


To leave a comment for the author, please follow the link and comment on their blog: Dror Berel's R Blog.