R-bloggers: R news and tutorials contributed by hundreds of R bloggers

An introduction to Monte Carlo Tree Search

Wed, 11/29/2017 - 01:00

(This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers)


We recently witnessed one of the biggest game AI events in history – AlphaGo became the first computer program to beat a world champion in a game of Go. The publication can be found here. Developers from DeepMind combined several techniques from machine learning and tree search to achieve this result. One of them is the Monte Carlo Tree Search (MCTS) algorithm. This algorithm is fairly simple to understand and, interestingly, has applications outside of game AI. Below, I will explain the concept behind the MCTS algorithm and briefly describe how it was used at the European Space Agency for planning interplanetary flights.

Perfect Information Games

Monte Carlo Tree Search is an algorithm used for playing so-called perfect information games. In short, perfect information games are games in which, at any point in time, each player has complete knowledge of all actions that have previously taken place. Examples of such games are Chess, Go and Tic-Tac-Toe. But just because every move is known doesn’t mean that every possible outcome can be calculated and extrapolated. For example, the number of possible legal game positions in Go is over 10^{170}.
Source

Every perfect information game can be represented as a tree data structure in the following way. At first, you have a root which encapsulates the beginning state of the game. For Chess that would be the 16 white and 16 black pieces placed in their proper places on the chessboard. For Tic-Tac-Toe it would simply be an empty 3×3 grid. The first player has some number n_1 of possible moves to make. In the case of Tic-Tac-Toe this would be 9 possible places to draw a circle. Each such move changes the state of the game. These outcome states are the children of the root node. Then, for each of the n_1 children, the next player has n_2 possible moves to consider, each of them generating another state of the game – generating a child node. Note that n_2 might differ between the n_1 nodes. For instance, in Chess you might make a move which forces your opponent to move their king, or consider another move which leaves your opponent with many more options.
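To make this tree representation concrete, here is a small illustrative sketch in Python (rather than R) for Tic-Tac-Toe. The class and method names (GameNode, legal_moves, expand) are my own, not from the original post.

```python
# A game tree node for Tic-Tac-Toe: the root is the empty board,
# and each legal move produces one child state.

class GameNode:
    def __init__(self, board, player):
        self.board = board        # tuple of 9 cells: 'X', 'O' or None
        self.player = player      # the player to move at this node
        self.children = []

    def legal_moves(self):
        # every empty cell is a legal move
        return [i for i, cell in enumerate(self.board) if cell is None]

    def expand(self):
        # each legal move generates a child node with the updated state
        for move in self.legal_moves():
            board = list(self.board)
            board[move] = self.player
            nxt = 'O' if self.player == 'X' else 'X'
            self.children.append(GameNode(tuple(board), nxt))

root = GameNode((None,) * 9, 'X')
root.expand()
print(len(root.children))  # 9 possible first moves, matching n_1 = 9 above
```

After one expansion the root has 9 children, and each child in turn offers the opponent 8 moves, illustrating how n_2 depends on the path taken.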

An outcome of a play is a path from the root node to one of the leaves. Each leaf carries definite information about which player (or players) won and which lost the game.

Making a decision based on a tree

There are two main problems we face when making a decision in a perfect information game. The first, and main one, is the size of the tree.

This doesn’t bother us with very limited games such as Tic-Tac-Toe. We have at most 9 child nodes (at the beginning) and this number gets smaller as we continue playing. It’s a completely different story with Chess or Go. There the corresponding tree is so huge that you cannot hope to search it entirely. The way to approach this is to do a random walk on the tree for some time and obtain a subtree of the original decision tree.

This, however, creates a second problem. If every time we play we just walk randomly down the tree, we don’t care at all about the quality of our moves and do not learn from our previous games. Anyone who has played Chess knows that making random moves on a chessboard won’t get them far. It might be good for a beginner to get an understanding of how the pieces move, but game after game it’s better to learn how to distinguish good moves from bad ones.

So, is there a way to somehow use the facts contained in the previously built decision trees to reason about our next move? As it turns out, there is.

Multi-Armed Bandit Problem

Imagine that you are at a casino and would like to play a slot machine. You can choose one randomly and enjoy your game. Later that night, another gambler sits next to you and wins more in 10 minutes than you have during the last few hours. You shouldn’t compare yourself to the other guy, it’s just luck. But still, it’s normal to ask whether the next time you can do better. Which slot machine should you choose to win the most? Maybe you should play more than one machine at a time?

The problem you are facing is the Multi-Armed Bandit Problem. It was already known during World War II, but the most commonly known version today was formulated by Herbert Robbins in 1952. There are N slot machines, each with a different expected return value (what you expect to net from a given machine). You don’t know the expected return value of any machine. You are allowed to change machines at any time and play each machine as many times as you’d like. What is the optimal strategy for playing?

What does “optimal” mean in this scenario? Clearly your best option would be to play only on the machine with the highest expected return value. An optimal strategy is one under which you do as well as possible compared to that best machine.

It has actually been proven that, on average, you cannot do better than O( \ln n ) regret – the amount you lose compared to always playing the best machine. So that’s the best you can hope for. Luckily, it was also proven that this bound can be achieved (again, on average). One way to do this is the following.

Read this paper if you are interested in the proof.

For each machine i we keep track of two things: how many times we have played this machine (n_i) and its mean return value so far (x_i). We also keep track of how many times (n) we have played in total. Then for each i we compute the confidence interval around x_i:

x_i \pm \sqrt{ 2 \cdot \frac{\ln n}{n_i} }

At every step we choose to play on the machine with the highest upper bound for x_i (so the “+” in the formula above).

This is a solution to the Multi-Armed Bandit Problem. Now note that we can use it for our perfect information game. Just treat each possible next move (child node) as a slot machine. Each time we choose to play a move we end up winning, losing or drawing. This is our pay-out. For simplicity, I will assume that we are only interested in winning, so the pay-out is 1 if we have won and 0 otherwise.
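The strategy above (usually called UCB1) can be sketched in a few lines of Python. This is purely illustrative: the function name and the machine win probabilities are invented for the example, and the pay-out is 0/1 as assumed in the text.

```python
import math
import random

def ucb1_play(true_means, rounds, rng):
    """Repeatedly play the machine with the highest x_i + sqrt(2*ln(n)/n_i)."""
    k = len(true_means)
    plays = [0] * k        # n_i: times machine i was played
    means = [0.0] * k      # x_i: mean pay-out of machine i so far
    total = 0              # n: total number of plays
    for _ in range(rounds):
        if 0 in plays:
            i = plays.index(0)   # play every machine once before using the bound
        else:
            i = max(range(k),
                    key=lambda j: means[j] + math.sqrt(2 * math.log(total) / plays[j]))
        reward = 1.0 if rng.random() < true_means[i] else 0.0   # pay-out: 1 = win
        plays[i] += 1
        total += 1
        means[i] += (reward - means[i]) / plays[i]   # update the running mean
    return plays

rng = random.Random(42)
plays = ucb1_play([0.2, 0.5, 0.8], 5000, rng)
# plays should concentrate on the best machine (index 2)
```

Because the exploration bonus shrinks as a machine accumulates plays, the strategy keeps sampling weak machines occasionally while spending most rounds on the best one.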

Real world application example

MAB algorithms have multiple practical applications in the real world, for example price engine optimization or finding the best online campaign. Let’s focus on the first one and see how we can implement it in R. Imagine you are selling your products online and want to introduce a new one, but are not sure how to price it. You came up with 4 price candidates based on your expert knowledge and experience: $99, $100, $115 and $120. Now you want to test how those prices perform and which one to choose eventually.
During the first day of your experiment 4000 people visited your shop while the first price ($99) was tested, and 368 bought the product.
For the rest of the prices we have the following outcomes:

  • $100: 4060 visits and 355 purchases,
  • $115: 4011 visits and 373 purchases,
  • $120: 4007 visits and 230 purchases.

Now let’s look at the calculations in R and check which price performed best during the first day of our experiment.

library(bandit)
set.seed(123)
visits_day1 <- c(4000, 4060, 4011, 4007)
purchase_day1 <- c(368, 355, 373, 230)
prices <- c(99, 100, 115, 120)
post_distribution = sim_post(purchase_day1, visits_day1, ndraws = 10000)
probability_winning <- prob_winner(post_distribution)
names(probability_winning) <- prices
probability_winning

##     99    100    115    120
## 0.3960 0.0936 0.5104 0.0000

We calculated the Bayesian probability that each price performs best and can see that the price $115 has the highest probability (0.51). On the other hand, $120 seems a bit too much for the customers.
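A rough sketch of what sim_post and prob_winner compute under the hood, written here in Python for illustration: assuming a uniform prior, each price’s conversion rate has a Beta(purchases + 1, visits - purchases + 1) posterior; we sample each posterior many times and count how often each price produces the highest draw. The helper name is my own.

```python
import random

visits_day1 = [4000, 4060, 4011, 4007]
purchases_day1 = [368, 355, 373, 230]
prices = [99, 100, 115, 120]

def prob_winner_sketch(purchases, visits, ndraws, rng):
    """Monte Carlo estimate of P(price i has the highest conversion rate)."""
    wins = [0] * len(visits)
    for _ in range(ndraws):
        # one draw from each price's Beta posterior
        draws = [rng.betavariate(s + 1, n - s + 1)
                 for s, n in zip(purchases, visits)]
        wins[draws.index(max(draws))] += 1
    return [w / ndraws for w in wins]

rng = random.Random(123)
probs = prob_winner_sketch(purchases_day1, visits_day1, 10000, rng)
# $115 should come out clearly ahead and $120 near zero, as in the R output
```

The exact numbers will differ from the R output because the random draws differ, but the ranking of the prices should match.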

The experiment continues for a few more days.

Day 2 results:

visits_day2 <- c(8030, 8060, 8027, 8037)
purchase_day2 <- c(769, 735, 786, 420)
post_distribution = sim_post(purchase_day2, visits_day2, ndraws = 1000000)
probability_winning <- prob_winner(post_distribution)
names(probability_winning) <- prices
probability_winning

##       99      100      115      120
## 0.308623 0.034632 0.656745 0.000000

After the second day the price $115 still shows the best results, with $99 and $100 performing very similarly.

Using the bandit package we can also perform significance analysis, which is handy for overall proportion comparison using prop.test.

significance_analysis(purchase_day2, visits_day2)

##   successes totals estimated_proportion        lower      upper
## 1       769   8030           0.09576588 -0.004545319 0.01369494
## 2       735   8060           0.09119107  0.030860453 0.04700507
## 3       786   8027           0.09791952 -0.007119595 0.01142688
## 4       420   8037           0.05225831           NA         NA
##   significance rank best       p_best
## 1 3.322143e-01    2    1 3.086709e-01
## 2 1.437347e-21    3    1 2.340515e-06
## 3 6.637812e-01    1    1 6.564434e-01
## 4           NA    4    0 1.548068e-39

At this point we can see that $120 is still performing badly, so we drop it from the experiment and continue to the next day. According to p_best, the chances that this alternative is the best are negligible.

Day 3 results:

visits_day3 <- c(15684, 15690, 15672, 8037)
purchase_day3 <- c(1433, 1440, 1495, 420)
post_distribution = sim_post(purchase_day3, visits_day3, ndraws = 1000000)
probability_winning <- prob_winner(post_distribution)
names(probability_winning) <- prices
probability_winning

##       99      100      115      120
## 0.087200 0.115522 0.797278 0.000000

value_remaining = value_remaining(purchase_day3, visits_day3)
potential_value = quantile(value_remaining, 0.95)
potential_value

##        95%
## 0.02670002

Day 3 results lead us to conclude that $115 will generate the highest conversion rate and revenue.
We are still unsure about the conversion probability for the best price $115, but whatever it is, one of the other prices might beat it by as much as 2.67% (the 95% quantile of the value remaining).

The histograms below show what happens to the value-remaining distribution (the distribution of improvement amounts that another price might have over the current best price) as the experiment continues. With a larger sample we are much more confident about the conversion rate. Over time, the other prices have lower chances of beating price $115.

If this example was interesting to you, check out our other post about dynamic pricing.

Monte Carlo Tree Search

We are ready to learn how the Monte Carlo Tree Search algorithm works.

As long as we have enough information to treat child nodes as slot machines, we choose the next node (move) as we would have when solving Multi-Armed Bandit Problem. This can be done when we have some information about the pay-outs for each child node.

At some point we reach a node where we can no longer proceed in this manner, because at least one child node has no statistics yet. It’s time to explore the tree to get new information. This can be done either completely randomly or by applying heuristics when choosing child nodes (in practice this might be necessary for games with a high branching factor – like Chess or Go – if we want to achieve good results).

Finally, we arrive at a leaf. Here we can check whether we have won or lost.

It’s time to update the nodes we visited on our way to the leaf. If the player making a move at the corresponding node turned out to be the winner, we increase the number of wins by one; otherwise we keep it the same. Whether we have won or not, we always increase the number of times the node was played (in the corresponding picture we can deduce it from the number of losses and wins).

That’s it! We repeat this process until some condition is met: a timeout is reached or the confidence intervals mentioned earlier stabilize (convergence). Then we make our final decision based on the information gathered during the search. We can choose the node with the highest upper bound for the pay-out (as we would in each iteration of the Multi-Armed Bandit). Or, if you prefer, choose the one with the highest mean pay-out.
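The whole loop (select by the bandit rule, explore an untried child, play out to a leaf, update statistics) can be sketched compactly. The example below, in Python, uses a toy Nim game (players alternately take 1-3 stones; whoever takes the last stone wins) instead of Chess or Go; all names and the game itself are my own choices for illustration, not from the post.

```python
import math
import random

class Node:
    def __init__(self, stones, player, parent=None):
        self.stones = stones          # game state: stones left
        self.player = player          # player (1 or 2) to move at this node
        self.parent = parent
        self.children = {}            # move -> child Node
        self.wins = 0.0               # wins for the player who moved into this node
        self.plays = 0

def moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]

def select(node):
    # walk down while every child has statistics, using the UCB rule
    while node.stones > 0 and len(node.children) == len(moves(node.stones)):
        node = max(node.children.values(),
                   key=lambda c: c.wins / c.plays
                   + math.sqrt(2 * math.log(node.plays) / c.plays))
    return node

def expand(node, rng):
    # explore: add one child with no statistics yet (terminal nodes stay as-is)
    if node.stones == 0:
        return node
    m = rng.choice([m for m in moves(node.stones) if m not in node.children])
    child = Node(node.stones - m, 3 - node.player, parent=node)
    node.children[m] = child
    return child

def rollout(node, rng):
    # play randomly to the end; whoever takes the last stone wins
    stones, player = node.stones, node.player
    while stones > 0:
        stones -= rng.choice(moves(stones))
        player = 3 - player
    return 3 - player                 # the player who moved last

def backprop(node, winner):
    # update every node on the path from the leaf back to the root
    while node is not None:
        node.plays += 1
        if node.parent is not None and node.parent.player == winner:
            node.wins += 1            # credit the player who chose this move
        node = node.parent

def mcts_best_move(stones, player, iters, rng):
    root = Node(stones, player)
    for _ in range(iters):
        leaf = expand(select(root), rng)
        backprop(leaf, rollout(leaf, rng))
    return max(root.children.items(), key=lambda kv: kv[1].plays)[0]

rng = random.Random(1)
best_move = mcts_best_move(3, 1, 500, rng)
# with 3 stones left, taking all 3 wins immediately
```

Here the final decision is the most-played move, one of the two options mentioned above; choosing the child with the highest mean pay-out would only change the last line.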

The decision is made and a move is chosen. Now it’s time for our opponent (or opponents). When they’ve finished, we arrive at a new node, somewhere deeper in the tree, and the story repeats.

Not just for games

As you might have noticed, Monte Carlo Tree Search can be considered a general technique for making decisions in perfect information scenarios. Therefore its use does not have to be restricted to games. The most amazing use case I have heard of is using it for planning interplanetary flights. You can read about it at this website, but I will summarize it briefly.

Think of an interplanetary flight as a trip during which you would like to visit more than one planet. For instance, you are flying from Earth to Jupiter via Mars.

An efficient way to do this is to make use of these planets’ gravitational fields (as they did in ‘The Martian’ movie) so you can take less fuel with you. The question is when the best time is to arrive at and leave each planet’s surface or orbit (for the first and last planet it’s only leaving and arriving, respectively).

You can treat this problem as a decision tree. If you divide time into intervals, at each planet you make a decision: in which time slot to arrive and in which to leave. Each such choice constrains the next. First of all, you cannot leave before you arrive. Second, your previous choices determine how much fuel you have left and, consequently, which places in the universe you can reach.
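The branching at a single planet can be sketched in a couple of lines (Python, purely illustrative; the number of time slots is made up):

```python
# Enumerate the (arrive, leave) decisions at one planet: time is divided
# into slots, and you cannot leave before you arrive.
slots = range(6)   # 6 hypothetical time intervals
choices = [(arrive, leave)
           for arrive in slots
           for leave in slots
           if leave > arrive]
print(len(choices))  # 15 valid arrive/leave pairs out of 36 combinations
```

In the real problem each pair would be further pruned by the remaining fuel, which is exactly what makes the tree a natural fit for MCTS.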

Each such set of consecutive choices determines where you arrive at the end. If you visited all the required checkpoints – you’ve won. Otherwise, you’ve lost. It’s just like a perfect information game: there is no opponent, and you make a move by choosing the time slot for leaving or arriving. This can be handled by the Monte Carlo Tree Search described above. As you can read here, it fares quite well in comparison with other known approaches to this problem. And at the same time, it does not require any domain-specific knowledge to implement.

Write your question and comments below. We’d love to hear what you think.


To leave a comment for the author, please follow the link and comment on their blog: Appsilon Data Science Blog. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Announcing a New rOpenSci Software Review Collaboration

Wed, 11/29/2017 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

rOpenSci is pleased to announce a new collaboration with Methods in Ecology and Evolution (MEE), a journal of the British Ecological Society, published by Wiley press 1. Publications destined for MEE that include the development of a scientific R package will now have the option of a joint review process whereby the R package is reviewed by rOpenSci, followed by fast-tracked review of the manuscript by MEE. Authors opting for this process will be recognized via a mark on both web and print versions of their paper.

We are very excited for this partnership to improve the rigor of both scientific software and software publications and to provide greater recognition to developers in the fields of ecology and evolution. It is a natural outgrowth of our interest in supporting scientists in developing and maintaining software, and of MEE’s mission of vetting and disseminating tools and methods for the research community. The collaboration formalizes and eases a path already pursued by researchers: The rotl, RNexML, and treebase packages were all developed or reviewed by rOpenSci and subsequently had associated manuscripts published in MEE.

About rOpenSci software review

rOpenSci is a diverse community of researchers from academia, non-profit, government, and industry who collaborate to develop and maintain tools and practices around open data and reproducible research. The rOpenSci suite of tools is made of core infrastructure software developed and maintained by the project staff. The suite also contains numerous packages that are contributed by members of the broader R community. The volume of community submissions has grown considerably over the years necessitating a formal system of review quite analogous to that of a peer reviewed academic journal.

rOpenSci welcomes full software submissions that fit within our aims and scope, with the option of a pre-submission inquiry in cases when the scope of a submission is not immediately obvious. This software peer review framework, known as the rOpenSci Onboarding process, operates with three editors and one editor in chief who carefully vet all incoming submissions. After an editorial review, editors solicit detailed, public and signed reviews from two reviewers, and the path to acceptance from then on is similar to a standard journal review process. Details about the system are described in various blog posts by the editorial team.

Collaboration with journals

This is our second collaboration with a journal. Since late 2015, rOpenSci has partnered with the Journal of Open Source Software (JOSS), an open access journal that publishes brief articles on research software. Packages accepted to rOpenSci can be submitted for fast-track publication at JOSS, in which JOSS editors may evaluate based on rOpenSci’s reviews alone. As rOpenSci’s review criteria are significantly more stringent and designed to be compatible with JOSS, these packages are generally accepted without additional review. We have had great success with this partnership, providing rOpenSci authors with an additional venue to publicize and archive their work. Given this success, we are keen on expanding to other journals and fields where there is potential for software reviewed and created by rOpenSci to play a significant role in supporting scientific findings.

The details

Our new partnership with MEE broadly resembles that with JOSS, with the major difference that MEE, rather than rOpenSci, leads review of the manuscript component. Authors with R packages and associated manuscripts that fit the Aims and Scope for both rOpenSci and MEE are encouraged to first submit to rOpenSci. The rotl, RNexML, and treebase packages are all great examples of such packages. MEE editors may also refer authors to this option if authors submit an appropriate manuscript to MEE first.

On submission to rOpenSci, authors can use our updated submission template to choose MEE as a publication venue. Following acceptance by rOpenSci, the associated manuscript will be reviewed by an expedited process at MEE, with reviewers and editors having the knowledge that the software has already been reviewed and the public reviews available to them.

Should the manuscript be accepted, a footnote will appear in the web version and the first page of the print version of the MEE article indicating that the software as well as the manuscript has been peer-reviewed, with a link to the rOpenSci open reviews.

As with any collaboration, there may be a few hiccups early on and we welcome ideas to make the process more streamlined and efficient. We look forward to the community’s submissions and to your participation in this process.

Many thanks to MEE’s Assistant Editor Chris Grieves and Senior Editor Bob O’Hara for working with us on this collaboration.

  1. See also MEE’s post from today at

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

Some new time series packages

Wed, 11/29/2017 - 01:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

This week I have finished preliminary versions of two new R packages for time series analysis. The first (tscompdata) contains several large collections of time series that have been used in forecasting competitions; the second (tsfeatures) is designed to compute features from univariate time series data. For now, both are only on github. I will probably submit them to CRAN after they’ve been tested by a few more people.
tscompdata There are already two packages containing forecasting competition data: Mcomp (containing the M and M3 competition data) and Tcomp (containing the tourism competition data).


To leave a comment for the author, please follow the link and comment on their blog: R on Rob J Hyndman.

Competition: StanCon 2018 ticket

Wed, 11/29/2017 - 01:00

(This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers)

Today we are happy to announce our Stan contest. Something we feel very strongly about at Jumping Rivers is giving back to the community. We have benefited immensely from the hard work of numerous people, so when possible, we try to give something back.

This year we’re sponsoring StanCon 2018. If you don’t know,

Stan is freedom-respecting, open-source software for facilitating statistical inference at the frontiers of applied statistics.

Or to put it another way, it makes Bayesian inference fast and (a bit) easier. StanCon is the premier conference for all things Stan related and this year it will take place at the Asilomar Conference Grounds, a National Historic Landmark on the Monterey Peninsula right on the beach.

Our interaction with Stan is via the excellent R package, rstan and the introduction to Rstan course we run.

The prize

One of the benefits of sponsoring a conference is free tickets. Unfortunately, we can’t make StanCon this year, so we’ve decided to give away the ticket via a competition.

How do I enter?

Entering the contest is straightforward. We want a nice image to use in our introduction to Rstan course. This can be a cartoon, some neat looking densities, whatever you think!

For example, my first attempt at a logo resulted in

Pretty enough, but lacking something.

Once you’ve come up with your awesome image, simply email us with a link to the image. If the image was generated in R (or another language), then the associated code would be nice. The deadline is the 6th of December.

  • Can I submit more than one image? Feel free.
  • Who owns the copyright? Our intention is to make the image CC0. However we’re happy to change the copyright to suit. The only proviso is that we can use the image for our rstan course.
  • Another question. Contact us via the contact page.


To leave a comment for the author, please follow the link and comment on their blog: R on The Jumping Rivers Blog.

Cyber Week Only: Save 50% on DataCamp!

Tue, 11/28/2017 - 23:59

(This article was first published on, and kindly contributed to R-bloggers)

For Cyber Week only, DataCamp offers the readers of R-bloggers over $150 off for unlimited access to its data science library. That’s over 90 courses and 5,200 exercises, of which a large chunk are R-focused, plus access to the mobile app, Practice Challenges, and hands-on Projects. All by expert instructors such as Hadley Wickham (RStudio), Matt Dowle (data.table), Garrett Grolemund (RStudio), and Max Kuhn (caret)! Claim your offer now! Offer ends 12/5!

Skills track:

  • Data Analyst with R

    • Learn how to translate numbers into plain English for businesses – whether it’s sales figures, market research, logistics, or transportation costs – and get the most out of your data.

  • Data Scientist with R

    • Learn how to combine statistical and machine learning techniques with R programming to analyze and interpret complex data.

  • Quantitative Analyst with R

    • Learn how to ensure portfolios are risk balanced, help find new trading opportunities, and evaluate asset prices using mathematical models.

Skills track:

  • Data Visualization in R

    • Communicate the most important features of your data by creating beautiful visualizations using ggplot2 and base R graphics.

  • Importing and Cleaning Data

    • Learn how to parse data in any format. Whether it’s flat files, statistical software, databases, or web data, you’ll learn to handle it all.

  • Statistics with R

    • Learn key statistical concepts and techniques like exploratory data analysis, correlation, regression, and inference.

  • Applied Finance

    • Apply your R skills to financial data, including bond valuation, financial trading, and portfolio analysis.

Individual courses:

A word about DataCamp: for those of you who don’t know DataCamp, it’s the most intuitive way out there to learn data science, thanks to its combination of short expert videos and immediate hands-on-the-keyboard exercises, as well as its instant personalized feedback on every exercise. They focus only on data science to offer the best learning experience possible and rely on expert instructors to teach.

To leave a comment for the author, please follow the link and comment on their blog.

How to make Python easier for the R user: revoscalepy

Tue, 11/28/2017 - 17:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Siddarth Ramesh, Data Scientist, Microsoft

I’m an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualizations. However, there is a huge community of Data Scientists and Analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organizations, and it is important for both languages to coexist.

Many times, this means that R coders will develop a workflow in R but then must redesign and recode it in Python for their production systems. If the coder is lucky, this is easy, and the R model can be exported as a serialized object and read into Python. There are packages that do this, such as pmml. Unfortunately, many times this is more challenging, because the production system might demand that the entire end-to-end workflow is built exclusively in Python. That’s sometimes tough, because there are aspects of statistical model building in R which are more intuitive than in Python.

Python has many strengths, such as its robust data structures (such as dictionaries), its compatibility with Deep Learning and Spark, and its ability to serve as a multipurpose language. However, many scenarios in enterprise analytics call for basic statistics and machine learning, for which the classic data science packages in Python are not as intuitive as R. The key difference is that many statistical methods are built into R natively. As a result, there is a gap when R users must build workflows in Python. To try to bridge this gap, this post will discuss a relatively new package developed by Microsoft: revoscalepy.

Why revoscalepy?

Revoscalepy is the Python implementation of the R package RevoScaleR included with Microsoft Machine Learning Server.

The methods in ‘revoscalepy’ are the same and, more importantly, the way the R user can view data is the same. This is so important because, for an R programmer, understanding the shape and structure of the data is one of the challenges of getting used to Python. In Python, the data types are different, preprocessing the data is different, and the criteria for feeding the processed dataset into a model are different.

To understand how revoscalepy eases the transition from R to Python, the following section will compare building a decision tree using revoscalepy with building a decision tree using sklearn. The Titanic dataset from Kaggle will be used for this example. To be clear, this post is written from an R user’s perspective, as many of the challenges this post will outline are standard practices for native Python users.

revoscalepy versus sklearn

revoscalepy works on Python 3.5, and can be downloaded as a part of Microsoft Machine Learning Server. Once downloaded, set the Python environment path to the python executable in the MML directory, and then import the packages.

The first chunk of code imports the revoscalepy, numpy, pandas, and sklearn packages, and loads the Titanic data. Pandas has some R roots in that it has its own implementation of DataFrames, as well as methods that resemble R’s exploratory methods.

import revoscalepy as rp
import numpy as np
import pandas as pd
import sklearn as sk

titanic_data = pd.read_csv('titanic.csv')
titanic_data.head()

Preprocessing with sklearn

One of the challenges for an R user working with sklearn is that sklearn’s decision tree model can only handle numeric datatypes. Pandas has a categorical type that looks like factors in R, but sklearn’s decision tree does not integrate with it. As a result, numerically encoding the categorical data becomes a mandatory step. This example will use a one-hot encoder to shape the categories in a way that sklearn’s decision tree understands.

A side effect of having to one-hot encode the variables is that if the dataset contains high-cardinality features, the encoding can be memory-intensive and computationally expensive, because each category becomes its own binary column. While implementing one-hot encoding itself is not a difficult transformation in Python and produces good results, it is still an extra step for an R programmer to implement manually. The following chunk of code detaches the categorical columns, label- and one-hot encodes them, and then reattaches the encoded columns to the rest of the dataset.

from sklearn import tree

le = sk.preprocessing.LabelEncoder()
x = titanic_data.select_dtypes(include=[object])
x = x.drop(['Name', 'Ticket', 'Cabin'], 1)
x = pd.concat([x, titanic_data['Pclass']], axis = 1)
x['Pclass'] = x['Pclass'].astype('object')
x = pd.DataFrame(x)
x = x.fillna('Missing')
x_cats = x.apply(le.fit_transform)
enc = sk.preprocessing.OneHotEncoder()
onehotlabels = enc.fit_transform(x_cats).toarray()
encoded_titanic_data = pd.concat([pd.DataFrame(titanic_data.select_dtypes(include=[np.number])),
                                  pd.DataFrame(onehotlabels)], axis = 1)

At this point, there are more columns than before, and the columns no longer have semantic names (they have been enumerated). This means that if a decision tree is visualized, it will be difficult to understand without going through the extra step of renaming these columns. There are techniques in Python to help with this, but it is still an extra step that must be considered.
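
One such technique is pandas’ get_dummies(), which generates readable column names like ‘Embarked_S’. The idea behind name-preserving one-hot encoding can be sketched in a few lines of plain Python (a dependency-free illustration, not the sklearn pipeline used in this post; the function name is made up):

```python
def one_hot_named(column_name, values):
    # One-hot encode while keeping semantic names such as 'Embarked_S',
    # instead of the enumerated columns that OneHotEncoder produces.
    categories = sorted(set(values))
    names = ['%s_%s' % (column_name, c) for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return names, rows

names, rows = one_hot_named('Embarked', ['S', 'C', 'Q', 'S'])
# names -> ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```

Each distinct category still becomes its own binary column, so the memory cost for high-cardinality features remains, but a visualized tree stays interpretable.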

Preprocessing with revoscalepy

Unlike sklearn, revoscalepy reads pandas’ ‘category’ type like factors in R. This section of code iterates through the DataFrame, finds the string types, and converts those types to ‘category’. In pandas, the ordered argument can be set to False to prevent ordered factors.

titanic_data_object_types = titanic_data.select_dtypes(include = ['object'])
titanic_data_object_types_columns = np.array(titanic_data_object_types.columns)
for column in titanic_data_object_types_columns:
    titanic_data[column] = titanic_data[column].astype('category', ordered = False)
titanic_data['Pclass'] = titanic_data['Pclass'].astype('category', ordered = False)

This dataset is already ready to be fed into the revoscalepy model.

Training models with sklearn

One difference between implementing a model in R and in sklearn in Python is that sklearn does not use formulas.

Formulas are important and useful for modeling because they provide a consistent framework to develop models with varying degrees of complexity. With formulas, users can easily apply different types of variable cases, such as ‘+’ for separate independent variables, ‘:’ for interaction terms, and ‘*’ to include both the variable and its interaction terms, along with many other convenient calculations. Within a formula, users can do mathematical calculations, create factors, and include more complex entities such as third-order interactions. Furthermore, formulas allow for building highly complex models such as mixed effect models, which are next to impossible to build without them. In Python, there are packages such as ‘statsmodels’ which have more intuitive ways to build certain statistical models. However, statsmodels has a limited selection of models, and does not include tree based models.

sklearn expects the independent and dependent terms to be columns from the DataFrame. Interactions must be created manually as a preprocessing step for more complex examples. The code below trains the decision tree:

model = tree.DecisionTreeClassifier(max_depth = 50)
x = encoded_titanic_data.drop(['Survived'], 1)
x = x.fillna(-1)
y = encoded_titanic_data['Survived']
model =, y)

Training models with revoscalepy

revoscalepy brings back formulas. Granted, users cannot view the formula the same way as they can in R, because formulas are strings in Python. However, importing code from R to Python is an easy transition because formulas are read the same way in the revoscalepy functions as the model fit functions in R. The below code fits the Decision Tree in revoscalepy.

# rx_dtree works with formulas, just like models in R
form = 'Survived ~ Pclass + Sex + Age + Parch + Fare + Embarked'
titanic_data_tree = rp.rx_dtree(form, titanic_data, max_depth = 50)
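
Since the formula is an ordinary Python string, its structure is easy to inspect with plain string operations (a toy parse for illustration only — revoscalepy’s real formula handling is far richer than this):

```python
# Split an R-style formula string into its response and predictor terms.
# This is NOT how revoscalepy parses formulas; it only shows the structure
# that rx_dtree() receives as a string.
form = 'Survived ~ Pclass + Sex + Age + Parch + Fare + Embarked'
response, rhs = [part.strip() for part in form.split('~')]
predictors = [term.strip() for term in rhs.split('+')]
# response -> 'Survived'
# predictors -> ['Pclass', 'Sex', 'Age', 'Parch', 'Fare', 'Embarked']
```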

The resulting object, titanic_data_tree, is the same structural object that rxDTree() would create in R. Because the individual elements that make up the rx_dtree() object are the same as those from rxDTree(), R users can easily understand the decision tree without having to translate between two object structures.


From the workflow, it should be clear how revoscalepy can help with transliteration between R and Python. Sklearn has different preprocessing considerations because the data must be fed into the model differently. The advantage to revoscalepy is that R programmers can easily convert their R code to Python without thinking too much about the ‘Pythonic way’ of implementing their R code. Categories replace factors, rx_dtree() reads the R-like formula, and the arguments are similar to the R equivalent. Looking at the big picture, revoscalepy is one way to ease Python for the R user and future posts will cover other ways to transition between R and Python.

Microsoft Docs: Introducing revoscalepy


To leave a comment for the author, please follow the link and comment on their blog: Revolutions. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Quick Round-Up – Visualising Flows Using Network and Sankey Diagrams in Python and R

Tue, 11/28/2017 - 12:33

(This article was first published on Rstats – OUseful.Info, the blog…, and kindly contributed to R-bloggers)

Got some data, relating to how students move from one module to another. Rows are student ID, module code, presentation date. The flow is not well-behaved. Students may take multiple modules in one presentation, and modules may be taken in any order (which means there are loops…).

My first take on the data was just to treat it as a graph and chart flows without paying attention to time other than to direct the edges (module A taken directly after module B; if multiple modules are taken by the same student in the same presentation, they both have the same precursor(s) and follow on module(s), if any) – the following (dummy) data shows the sort of thing we can get out using networkx and the pygraphviz output:

The data can also be filtered to show just the modules taken leading up to a particular module, or taken following a particular module.

The R diagram package has a couple of functions that can generate similar sorts of network diagram, such as its plotweb() function. For example, a simple flow graph:

Or something that looks a bit more like a finite state machine diagram:

(In passing, I also note the R diagram package can be used to draw electrical circuit diagrams/schematics.)

Another way of visualising this blocked data might be to use a simple two dimensional flow diagram, such as a transition plot from the R Gmisc package.

For example, the following could describe total flow from one module to another over a given period of time:

If there are a limited number of presentations (or modules) of interest, we could further break down each category to show the count of students taking a module in a particular presentation (or going directly on to / having directly come from a particular module; in this case, we may want an “other” group to act as a catch all for modules outside a set of modules we are interested in; getting the proportions right might also be a fudge).

Another way we might be able to look at the data “out of time” to show flow between modules is to use a Sankey diagram that allows for the possibility of feedback loops.

The Python sankeyview package (described in Hybrid Sankey diagrams: Visual analysis of multidimensional data for understanding resource use) looks like it could be useful here, if I can work out how to do the set-up correctly!

Again, it may be appropriate to introduce a catch-all category to group modules into a generic Other bin where there is only a small flow to/from that module to reduce clutter in the diagram.

The sankeyview package is actually part of a family of packages that includes the d3-sankey-diagram and the ipysankeywidget.

We can use the ipysankeywidget to render a simple graph data structure of the sort that can be generated by networkx.

One big problem with the view I took of the data is that it doesn’t respect time, or the numbers of students taking a particular presentation of a course. This detail could help tell a story about the evolving curriculum as new modules come on stream, for example, and perhaps change student behaviour about the module they take next from a particular module. So how could we capture it?

If we can linearise the flow by using module_presentation keyed nodes, rather than just module identified nodes, and limit the data to just show students progressing from one presentation to the next, we should be able to use something like a categorical parallel co-ordinates plot, such as an alluvial diagram from the R alluvial package.

With time indexed modules, we can also explore richer Sankey style diagrams that require a one way flow (no loops).

So for example, here are a few more packages that might be able to help with that, as well as the aforementioned Python sankeyview and ipysankeywidget  packages.

First up, the R networkD3 package includes support for Sankey diagrams. Data can be sourced from igraph and then exported into the JSON format expected by the package:

If you prefer Google charts, the googleVis R package has a gvisSankey function (that I’ve used elsewhere).

The R riverplot package also supports Sankey diagrams – and the gallery includes a demo of how to recreate Minard’s visualisation of Napoleon’s 1812 march.

The R sankey package generates a simple Sankey diagram from a simple data table:

Back in the Python world, the pySankey package can generate a simple Sankey diagram from a pandas dataframe.

matplotlib also has support for Sankey diagrams via matplotlib.sankey() (see also this tutorial):

What I really need to do now is set up a Binder demo of each of them… but that will have to wait till another day…

If you know of any other R or Python packages / demos that might be useful for visualising flows, please let me know via the comments and I’ll add them to the post.


To leave a comment for the author, please follow the link and comment on their blog: Rstats – OUseful.Info, the blog….

Secret Santa is a graph traversal problem

Tue, 11/28/2017 - 07:00

(This article was first published on Higher Order Functions, and kindly contributed to R-bloggers)

Last week at Thanksgiving, my family drew names from a hat for our annual game
of Secret Santa. Actually, it wasn’t a hat but you know what I mean. (Now that I
think about it, I don’t think I’ve ever seen names drawn from a literal hat
before!) In our family, the rules of Secret Santa are pretty simple:

  • The players’ names are put in “a hat”.
  • Players randomly draw a name from a hat, become that person’s Secret Santa,
    and get them a gift.
  • If a player draws their own name, they draw again.

Once again this year, somebody asked if we could just use an app or a website to
handle the drawing for Secret Santa. Or I could write a script to do it I
thought to myself. The problem nagged at the back of my mind for the past few
days. You could just shuffle the names… no, no, no. It’s trickier than that.

In this post, I describe a couple of algorithms for Secret Santa sampling using
R and directed graphs. I use the
DiagrammeR package which creates
graphs from dataframes of nodes and edges, and I liberally use
dplyr verbs to manipulate tables of edges.

If you would like a more practical way to use R for Secret Santa, including
automating the process of drawing names and emailing players, see
this blog post.

Making a graph, connecting nodes twice

Let’s start with a subset of my family’s game of just five players. I assign
each name a unique ID number.

library(DiagrammeR)
library(magrittr)
library(dplyr, warn.conflicts = FALSE)

players <- tibble::tribble(
  ~ Name,     ~ Number,
  "Jeremy",   1,
  "TJ",       2,
  "Jonathan", 3,
  "Alex",     4,
  "Marissa",  5)

We can think of the players as nodes in a directed graph. An edge connecting two
players indicates a “gives-to” (Secret Santa) relationship. Suppose I drew
Marissa’s name. Then the graph will have an edge connecting me to her. In the
code below, I use DiagrammeR to create a graph by combining a node dataframe
(create_node_df()) and an edge dataframe (create_edge_df()).

nodes <- create_node_df(
  n = nrow(players),
  type = players$Name,
  label = players$Name)

tj_drew_marissa <- create_edge_df(
  from = 2, to = 5,
  rel = "gives-to",
  color = "#FF4136",
  penwidth = 1)

create_graph(nodes, tj_drew_marissa) %>% render_graph()

Before the game starts, anyone could draw anyone else’s name, so let’s visualize
all possible gives-to relations. We can do this by using combn(n, 2) to
generate all n-choose-2 pairs and create two edges for each pair.

combn(players$Name, 2)
#>      [,1]     [,2]       [,3]     [,4]      [,5]       [,6]   [,7]
#> [1,] "Jeremy" "Jeremy"   "Jeremy" "Jeremy"  "TJ"       "TJ"   "TJ"
#> [2,] "TJ"     "Jonathan" "Alex"   "Marissa" "Jonathan" "Alex" "Marissa"
#>      [,8]       [,9]       [,10]
#> [1,] "Jonathan" "Jonathan" "Alex"
#> [2,] "Alex"     "Marissa"  "Marissa"

# All the edge-manipulating functions in this post take an optional `...`
# argument for setting the style of edges.
create_all_giving_edges <- function(xs, ...) {
  aes_options <- quos(...)
  pairs <- combn(seq_along(xs), 2)

  # Each column from `combn()` is a pair. We make an edge moving down the
  # column and another edge up the column by having each row as a `from`
  # index and a `to` index.
  from <- c(pairs[1, ], pairs[2, ])
  to <- c(pairs[2, ], pairs[1, ])

  create_edge_df(from = from, to = to) %>%
    mutate(!!! aes_options) %>%
    as_tibble()
}

all_possible_edges <- create_all_giving_edges(
  players$Name,
  rel = "potential-gift",
  penwidth = .5,
  color = "#CCCCCC90")

create_graph(nodes, all_possible_edges) %>% render_graph()
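
As a quick cross-check in Python, the bookkeeping is the same as taking both orderings of every unordered pair:

```python
from itertools import combinations, permutations

players = ['Jeremy', 'TJ', 'Jonathan', 'Alex', 'Marissa']
unordered = list(combinations(players, 2))   # the 10 columns of combn()
directed = list(permutations(players, 2))    # both directions of each pair
# 5 players: 10 unordered pairs, 20 directed "gives-to" edges
```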

A fast, simple solution is a Hamiltonian path

Do you need to organize a gift-giving drawing for a group of people? The easiest
solution is to shuffle the names and have the first name give to the second name,
the second to the third, and so on, with the last name looping back around to give
to the first name. This solution is equivalent to walking through the graph and
visiting every node just once. Such a path is called a
Hamiltonian path.
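
The shuffle-and-loop idea is easy to sketch outside of R as well; here is a minimal Python version (a sketch with an invented helper name, not the DiagrammeR code used below):

```python
import random

def hamiltonian_assignment(names):
    # Shuffle, then have each name give to the next one in the shuffled
    # order; the last name wraps around to the first.
    order = names[:]
    random.shuffle(order)
    return {giver: order[(i + 1) % len(order)]
            for i, giver in enumerate(order)}

pairs = hamiltonian_assignment(['Jeremy', 'TJ', 'Jonathan', 'Alex', 'Marissa'])
# no one draws themselves; everyone gives and receives exactly once
```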

Here we find a Hamiltonian path and create a helper function overwrite_edges()
to update the edges that fall on the path.

overwrite_edges <- function(old_df, new_df) {
  old_df %>%
    anti_join(new_df, by = c("from", "to")) %>%
    bind_rows(new_df)
}

create_hamiltonian_gift_edges <- function(xs, ...) {
  loop_from <- sample(seq_along(xs))
  # last name gives to first
  loop_to <- c(loop_from[-1], loop_from[1])
  create_edge_df(from = loop_from, to = loop_to, ...)
}

# For reproducible blogging
set.seed(11282017)

hamiltonian_edges <- create_hamiltonian_gift_edges(
  players$Name,
  rel = "gives-to",
  color = "#FF4136",
  penwidth = 1
)

all_possible_edges %>%
  overwrite_edges(hamiltonian_edges) %>%
  create_graph(nodes, .) %>%
  render_graph()

As promised, the red paths loop through all nodes exactly once. No one is their
own gift giver :white_check_mark:, and everyone has an incoming red path
:white_check_mark: and an outgoing red path :white_check_mark:. Very nice.
Actually… let’s put that checklist into a validation function.

# Check for valid gift-giving edges
has_valid_gift_edges <- function(edge_df, indices) {
  indices <- sort(unique(indices))
  pairs <- edge_df %>% filter(rel == "gives-to")

  no_self_loop <- !any(pairs$from == pairs$to)
  exhaustive_from <- isTRUE(all.equal(sort(pairs$from), indices))
  exhaustive_to <- isTRUE(all.equal(sort(pairs$to), indices))

  all(no_self_loop, exhaustive_from, exhaustive_to)
}

has_valid_gift_edges(hamiltonian_edges, all_possible_edges$from)
#> [1] TRUE

Despite its elegance, this solution does not simulate drawing names from a
hat! Because each node is visited only once, there is no backtracking, so there
is no reciprocal gift-giving or other sub-circuits in the graph.

Whether you think this is a bad thing is a matter of preference. Personally, I
would like all remaining pairs to be equally probable at each step of the
drawing. This is not the case when backtracking is not allowed. (For example, if
I draw Marissa, then all of the remaining edges are not equally likely because
P(Marissa draws TJ | TJ draws Marissa) = 0.)

Okay, do the hat-drawing thing already

Let’s think about what happens when I draw Marissa’s name from a nice big red
Santa hat.

  • The edge from TJ to Marissa is fixed. (I drew her name.)
  • All other edges from TJ become illegal. (I can’t draw any more names.)
  • All other edges onto Marissa become illegal. (No one else can draw her name.)

To simulate a single hat-drawing, we randomly select a legal edge, fix it, and
delete all illegal edges. Let’s work through a couple of examples.
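
Stripped of the graph plumbing, the whole draw–fix–delete loop (including the retry on a dead end) can be sketched in plain Python (a hypothetical helper, not the post’s R code):

```python
import random

def draw_from_hat(names, forbidden=frozenset()):
    # Simulate drawing names from the hat: pick a random legal
    # (giver, receiver) edge, fix it, then delete every other edge that
    # shares that giver or that receiver. Retry the whole drawing if we
    # reach a dead end (e.g. the last person left must draw themselves).
    while True:
        legal = [(g, r) for g in names for r in names
                 if g != r and (g, r) not in forbidden]
        picked = {}
        while legal:
            g, r = random.choice(legal)
            picked[g] = r
            legal = [(a, b) for a, b in legal if a != g and b != r]
        if len(picked) == len(names):
            return picked
```

A forbidden set is included so the same sketch also covers the extra constraints discussed later, such as “no drawing your partner or sibling”. Unlike the Hamiltonian shuffle, reciprocal gift-giving is possible here, just as it is when drawing from a real hat.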

First, we need some helper functions.

draw_secret_santa_edge <- function(edge_df, ...) {
  aes_options <- quos(...)

  edge_df %>%
    filter(rel != "gives-to") %>%
    sample_n(1) %>%
    mutate(!!! aes_options)
}

find_illegal_edges <- function(edge_df, edge, ...) {
  aes_options <- quos(...)

  outgoing <- edge_df %>% filter(from %in% edge$from)
  incoming <- edge_df %>% filter(to %in% edge$to)

  # The one edge that is not illegal is in both
  # the incoming and outgoing sets
  to_keep <- dplyr::intersect(outgoing, incoming)

  outgoing %>%
    bind_rows(incoming) %>%
    anti_join(to_keep, by = c("from", "to")) %>%
    mutate(!!! aes_options)
}

Here we draw a single edge (red with fat arrow). All of the other edges that
point to the same node are illegal (navy) as are all of the edges that have the
same origin as the drawn edge.

current_pick <- draw_secret_santa_edge(
  all_possible_edges,
  rel = "gives-to",
  color = "#FF4136",
  penwidth = 1,
  arrowsize = 1)

current_illegal_edges <- all_possible_edges %>%
  find_illegal_edges(current_pick, color = "#001f3f", penwidth = .5)

all_possible_edges %>%
  overwrite_edges(current_pick) %>%
  overwrite_edges(current_illegal_edges) %>%
  create_graph(nodes, .) %>%
  render_graph(title = "Selected vs. illegal")

[Rendered graph: Selected vs. illegal]

We delete those illegal edges, leaving us with the following graph.

edges_after_pick1 <- all_possible_edges %>%
  overwrite_edges(current_pick %>% mutate(arrowsize = NULL)) %>%
  anti_join(current_illegal_edges, by = "id")

create_graph(nodes, edges_after_pick1) %>%
  render_graph(title = "After one draw")

[Rendered graph: After one draw]

The name has been removed from the hat, and the graph is simpler now!

Let’s do it again. Draw a random legal edge (fat arrow) and identify all the
illegal paths (navy).

current_pick <- edges_after_pick1 %>%
  draw_secret_santa_edge(
    rel = "gives-to",
    color = "#FF4136",
    penwidth = 1,
    arrowsize = 1)

current_illegal_edges <- edges_after_pick1 %>%
  find_illegal_edges(edge = current_pick, color = "#001f3f", penwidth = .5)

edges_after_pick1 %>%
  overwrite_edges(current_pick) %>%
  overwrite_edges(current_illegal_edges) %>%
  create_graph(nodes, .) %>%
  render_graph(title = "Selected vs. illegal")

[Rendered graph: Selected vs. illegal]

After deleting illegal edges, the problem simplifies further.

edges_after_pick2 <- edges_after_pick1 %>%
  overwrite_edges(current_pick %>% mutate(arrowsize = NULL)) %>%
  anti_join(current_illegal_edges, by = "id")

create_graph(nodes, edges_after_pick2) %>%
  render_graph(title = "After two draws")

[Rendered graph: After two draws]

You can tell where this is going… Loop Town!

To finish up, we are going to repeat this process until there are only
gift-giving edges left. We will control the loop with this helper function
which tells us if there are any free edges remaining.

has_free_edge <- function(edge_df) {
  edges_left <- edge_df %>% filter(rel != "gives-to") %>% nrow()
  edges_left != 0
}

In the function below, the while-loop does the same steps as above: Randomly
selecting a free edge and removing illegal edges.

draw_edges_from_hat <- function(edge_df, ...) {
  aes_options <- quos(...)
  raw_edge_df <- edge_df
  indices <- unique(c(raw_edge_df$from, raw_edge_df$to))

  while (has_free_edge(edge_df)) {
    pick <- edge_df %>%
      draw_secret_santa_edge(!!! aes_options) %>%
      mutate(rel = "gives-to")

    illegal_edges <- edge_df %>%
      find_illegal_edges(pick)

    edge_df <- edge_df %>%
      overwrite_edges(pick) %>%
      anti_join(illegal_edges, by = "id")
  }

  if (!has_valid_gift_edges(edge_df, indices)) {
    warning(call. = FALSE, "Invalid drawing. Trying again.")
    edge_df <- Recall(raw_edge_df, !!! aes_options)
  }

  edge_df
}

After the while-loop, the function checks if it has a valid set of gift edges,
and if it doesn’t, the function calls itself again. This bit is intended to
handle more constrained situations. Such as…

The nibling gift exchange

I lied above. Secret Santa is not so simple in my family. For my generation (with
me, my siblings and our partners), there’s a rule that a player can’t draw their
partner’s name. Similarly, my nieces and nephews (and now also my child
:sparkling_heart:) have
their own gift exchange with an added constraint: A player can’t give their
sibling a gift. The elegant and simple Hamiltonian solution fails under these
constraints unless you write a special shuffling algorithm. Our hat-drawing
approach, however, handles this situation with minimal effort. Let’s work
through an example with my nieces and nephews (and :baby:!). To protect the very
young, I have replaced their names with Pokemon names.

Below we define the children and their families and do some data-wrangling so
that we have columns for the family at the start and end of each edge.

niblings <- tibble::tribble(
  ~ Family,   ~ Name,      ~ Number,
  "Water",    "Squirtle",  1,
  "Water",    "Wartortle", 2,
  "Electric", "Pikachu",   3,
  "Plant",    "Bulbasaur", 4,
  "Plant",    "Ivysaur",   5,
  "Plant",    "Venusaur",  6,
  "Fighting", "Machamp",   7,
  "Fighting", "Machoke",   8,
  "Normal",   "Ratata",    9,
  "Normal",   "Raticate",  10,
  "Psychic",  "Mew",       11,
  "Psychic",  "Mewtwo",    12)

nibling_edges <- create_all_giving_edges(
  niblings$Name,
  rel = "potential-gift",
  penwidth = .5,
  color = "#CCCCCC90") %>%
  left_join(niblings, by = c("from" = "Number")) %>%
  rename(from_fam = Family) %>%
  select(-Name) %>%
  left_join(niblings, by = c("to" = "Number")) %>%
  rename(to_fam = Family) %>%
  select(-Name) %>%
  select(id, from, to, rel, from_fam, to_fam, everything())

nibling_edges
#> # A tibble: 132 x 8
#>       id  from    to            rel from_fam   to_fam penwidth     color
#>  1     1     1     2 potential-gift    Water    Water      0.5 #CCCCCC90
#>  2     2     1     3 potential-gift    Water Electric      0.5 #CCCCCC90
#>  3     3     1     4 potential-gift    Water    Plant      0.5 #CCCCCC90
#>  4     4     1     5 potential-gift    Water    Plant      0.5 #CCCCCC90
#>  5     5     1     6 potential-gift    Water    Plant      0.5 #CCCCCC90
#>  6     6     1     7 potential-gift    Water Fighting      0.5 #CCCCCC90
#>  7     7     1     8 potential-gift    Water Fighting      0.5 #CCCCCC90
#>  8     8     1     9 potential-gift    Water   Normal      0.5 #CCCCCC90
#>  9     9     1    10 potential-gift    Water   Normal      0.5 #CCCCCC90
#> 10    10     1    11 potential-gift    Water  Psychic      0.5 #CCCCCC90
#> # ... with 122 more rows

In the graph below, we draw an olive edge to connect each pair of siblings.
These edges are illegal before we even start drawing names.

sibling_edges <- nibling_edges %>%
  filter(from_fam == to_fam) %>%
  mutate(
    rel = "sibling",
    color = "#3D9970",
    penwidth = 1)

# Update edges that represent siblings
nibling_edges <- nibling_edges %>%
  overwrite_edges(sibling_edges)

nibling_nodes <- create_node_df(
  n = nrow(niblings),
  type = niblings$Name,
  label = niblings$Name)

nibling_edges %>%
  overwrite_edges(sibling_edges) %>%
  create_graph(nibling_nodes, .) %>%
  render_graph(height = 400)

The solution for this trickier, more constrained version of the problem is to
delete the illegal edges beforehand so that they can never be drawn from the
hat. After that, the same algorithm works as before.

nibling_edges %>%
  filter(rel != "sibling") %>%
  draw_edges_from_hat(color = "#FF4136") %>%
  create_graph(nibling_nodes, .) %>%
  render_graph(height = 500)
#> Warning: Invalid drawing. Trying again.
#> Warning: Invalid drawing. Trying again.

Like usual, I’m not sure how to close one of these blog posts. I guess: When a
problem involves relations among individuals, always consider whether there is a
graph hiding in the background. Even the simple process of drawing names from a
hat for Secret Santa describes a graph: a graph of gift-giving relations among
individuals. This graph wasn’t obvious to me until I thought about the Hamiltonian
path solution. Hey, wait a minute, that’s a big gift-giving circle! It’s like
some kind of network… ooooooh.


To leave a comment for the author, please follow the link and comment on their blog: Higher Order Functions.

Vectorized Block ifelse in R

Tue, 11/28/2017 - 06:33

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Win-Vector LLC has been working on porting some significant large scale production systems from SAS to R.

From this experience we want to share how to simulate, in R with Apache Spark (via Sparklyr), a nifty SAS feature: the vectorized “block if(){}else{}” structure.

When porting code from one language to another you hope the expressive power and style of the languages are similar.

  • If the source language is too weak then the original code will be very long (and essentially over specified), meaning a direct transliteration will be unlikely to be efficient, as you are not using the higher order operators of the target language.
  • If the source language is too strong you will have operators that don’t have direct analogues in the target language.

SAS has some strong and powerful operators. One such is what I am calling “the vectorized block if(){}else{}”. From SAS documentation:

The subsetting IF statement causes the DATA step to continue processing only those raw data records or those observations from a SAS data set that meet the condition of the expression that is specified in the IF statement.

That is a really wonderful operator!

R has some available related operators: base::ifelse(), dplyr::if_else(), and dplyr::mutate_if(). However, none of these has the full expressive power of the SAS operator, which can per data row:

  • Conditionally choose where different assignments are made to (not just choose conditionally which values are taken).
  • Conditionally specify blocks of assignments that happen together.
  • Be efficiently nested and chained with other IF statements.
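
For a sense of the first of these capabilities — one condition driving a block of assignments to several targets at once — here is a minimal vectorized sketch in NumPy (Python purely for illustration; seplyr itself is R, and this does not reproduce its nesting or chaining):

```python
import numpy as np

x = np.array([1, 2, 3, 4])
cond = x > 2

# One condition controls a *block* of assignments: both `a` and `b`
# are set together, row by row, depending on the same condition.
a = np.where(cond, x * 10, -1)    # "then" block assigns to a ...
b = np.where(cond, 0, x + 100)    # ... and to b, in lockstep per row
```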

To help achieve such expressive power in R Win-Vector is introducing seplyr::if_else_device(). When combined with seplyr::partition_mutate_se() you get a good high performance simulation of the SAS power in R. These are now available in the open source R package seplyr.

For more information please reach out to us here at Win-Vector or try help(if_else_device).

Also, we will publicize more documentation and examples shortly (especially showing big data scale use with Apache Spark via Sparklyr).


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

#11: (Much) Faster Package (Re-)Installation via Caching

Tue, 11/28/2017 - 03:19

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Welcome to the eleventh post in the rarely rued R rants series, or R4 for short. Time clearly flies as it has been three months since our last post on significantly reducing library size via stripping. I had been meaning to post on today’s topic for quite some time, but somehow something (working on a paper, releasing a package, …) got in the way.

Just a few days ago Colin (of Efficient R Programming fame) posted about speed(ing up) package installation. His recommendation? Remember that we (usually) have multiple cores and use several of them via options(Ncpus = XX). It is an excellent point, and it bears repeating.

But it turns out I have not one but two salient recommendations too. Today covers the first; we should hopefully get to the second pretty soon. Both have one thing in common: you will be fastest if you avoid doing the work in the first place.


One truly outstanding tool for this in the context of the installation of compiled packages is ccache. It is actually a pretty old tool that has been around for well over a decade, and it comes from the folks that gave us the Samba filesystem.

What does it do? Well, in a nutshell, it "hashes" a checksum of a source file once the preprocessor has operated on it and stores the resulting object file. In the case of a rebuild with unchanged code you get the object code back pretty much immediately. The idea is very similar to memoisation (as implemented in R, for example, in the excellent little memoise package by Hadley, Jim, Kirill and Daniel). The idea is the same: if you have to do something even moderately expensive a few times, do it once and then recall it the other times.
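
The memoisation idea in one small Python-flavoured sketch (an analogy only — ccache itself works on preprocessed C/C++ sources, not function calls):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def compile_unit(source):
    # Stand-in for an expensive compilation step; with unchanged input
    # the cached result comes back immediately, just like ccache.
    return hash(source)

compile_unit('int main() { return 0; }')   # computed
compile_unit('int main() { return 0; }')   # served from the cache
```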

This happens (at least to me) more often than not in package development. Maybe you change just one of several source files. Maybe you just change the R code, the Rd documentation or a test file—yet still need a full reinstallation. In all these cases, ccache can help tremendously as illustrated below.


Because essentially all our access to compilation happens through R, we need to set this in a file read by R. I use ~/.R/Makevars for this and have something like these lines on my machines:

VER=
CCACHE=ccache
CC=$(CCACHE) gcc$(VER)
CXX=$(CCACHE) g++$(VER)
CXX11=$(CCACHE) g++$(VER)
CXX14=$(CCACHE) g++$(VER)
FC=$(CCACHE) gfortran$(VER)
F77=$(CCACHE) gfortran$(VER)

That way, when R calls the compiler(s) it will prefix with ccache. And ccache will then speed up.

There is an additional issue due to R use. Often we install from a .tar.gz. These will be freshly unpackaged, and hence have "new" timestamps. This would usually lead ccache to skip the file (fear of "false positives") so we have to override this. Similarly, the tarball is usually unpacked in a temporary directory with an ephemeral name, creating a unique path. That too needs to be overwritten. So in my ~/.ccache/ccache.conf I have this:

max_size = 5.0G
# important for R CMD INSTALL *.tar.gz as tarballs are expanded freshly -> fresh ctime
sloppiness = include_file_ctime
# also important as the (temp.) directory name will differ
hash_dir = false

Show Me

A quick illustration will round out the post. Some packages are meatier than others. More C++ with more templates usually means longer build times. Below is a quick chart comparing times for a few such packages (i.e. RQuantLib, dplyr, rstan) as well as igraph ("merely" a large C package), lme4 and Rcpp. The worst among these is still my own RQuantLib package, wrapping (still just parts of) the ginormous and Boost-heavy QuantLib library.

Pretty dramatic gains. Best of all, we can of course combine these with other methods such as Colin's use of multiple CPUs, or even a simple MAKE=make -j4 to have multiple compilation units compiled in parallel. So maybe we all get to spend less time on social media and other timewasters as we spend less time waiting for our builds. Or maybe that is too much to hope for…

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Rule Your Data with Tidy Validation Reports. Design

Tue, 11/28/2017 - 01:00

(This article was first published on QuestionFlow , and kindly contributed to R-bloggers)

The story about design of ruler package: dplyr-style exploration and validation of data frame like objects.


Some time ago I had a task to write data validation code. As for most R practitioners, this led to exploration of present solutions. I was looking for a package with the following features:

  • A relatively small amount of time should be spent learning it before comfortable usage. Preferably, it should be built with the tidyverse in mind.
  • It should be quite flexible in terms of types of validation rules.
  • The package should offer functionality for both validations (with relatively simple output format) and assertions (with relatively flexible behaviour).
  • Pipe-friendliness.
  • Validating only data frames would be enough.

After devoting a couple of days to research, I didn't find any package that fully (subjectively) met my needs (for a composed list look here). So I decided to write one myself. More precisely, it turned into not one but two packages: ruler and keyholder, which powers some of ruler's functionality.

This post is a rather long story about key moments in the journey of ruler’s design process. To learn other aspects see its README (for relatively brief introduction) or vignettes (for more thorough description of package capabilities).


In my mind, the whole process of data validation should be performed in the following steps:

  • Create conditions (rules) for data to meet.
  • Expose data to them and obtain some kind of unified report as a result.
  • Act based on the report.
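Anticipating the API described later in this post, the three steps map onto a short ruler pipeline (function names here are the ones introduced below; treat this as a sketch rather than canonical usage):

```r
library(dplyr)
library(ruler)

mtcars %>%
  as_tibble() %>%
  # Step 1: create a rule for the data as a whole
  # Step 2: expose data to it, storing a unified report in an attribute
  expose(data_packs(check_dims = . %>% summarise(nrow_up = n() <= 20))) %>%
  # Step 3: act based on the report, here by raising an error on any breaker
  assert_any_breaker()
```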

The design process went through a slightly different sequence of definition steps:

Of course, there was switching between these items in order to ensure they would work well together, but I feel this order was decisive for the end result.

suppressMessages(library(dplyr))
suppressMessages(library(purrr))
library(ruler)

Validation result

Dplyr data units

I started with an attempt at a simple and clear formulation of validation: it is a process of checking whether something satisfies certain conditions. As it was enough to validate only data frames, "something" should be thought of as parts of a data frame, which I will call data units. Certain conditions might be represented as functions, which I will call rules, each associated with some data unit and returning TRUE if the condition is satisfied and FALSE otherwise.

I decided to make the dplyr package the default tool for creating rules. The reason is, basically, that it satisfies most conditions I had in mind. Also I tend to use it for interactive validation of data frames, as, I am sure, do many other R users. Its pipe-friendliness creates another important feature: interactive code can be transformed into a function just by replacing the initial data frame variable with a dot .. This creates a functional sequence, "a function which applies the entire chain of right-hand sides in turn to its input."
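For example, the same interactive check becomes a reusable function once the data frame variable is replaced by the dot (this relies on magrittr's functional sequences, re-exported by dplyr):

```r
library(dplyr)

# Interactive exploration on a concrete data frame...
mtcars %>% filter(mpg > 25) %>% nrow()

# ...becomes a functional sequence by swapping `mtcars` for `.`
count_high_mpg <- . %>% filter(mpg > 25) %>% nrow()
count_high_mpg(mtcars)  # same result as above
```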

dplyr offers a set of tools for operating with the following data units (see comments):

is_integerish <- function(x) {all(x == as.integer(x))}
z_score <- function(x) {abs(x - mean(x)) / sd(x)}
mtcars_tbl <- mtcars %>% as_tibble()

# Data frame as a whole
validate_data <- . %>%
  summarise(nrow_low = n() >= 15, nrow_up = n() <= 20)
mtcars_tbl %>% validate_data()
## # A tibble: 1 x 2
##   nrow_low nrow_up
## 1     TRUE   FALSE

# Group as a whole
validate_groups <- . %>%
  group_by(vs, am) %>%
  summarise(vs_am_low = n() >= 7) %>%
  ungroup()
mtcars_tbl %>% validate_groups()
## # A tibble: 4 x 3
##      vs    am vs_am_low
## 1     0     0      TRUE
## 2     0     1     FALSE
## 3     1     0      TRUE
## 4     1     1      TRUE

# Column as a whole
validate_columns <- . %>%
  summarise_if(is_integerish, funs(is_enough_sum = sum(.) >= 14))
mtcars_tbl %>% validate_columns()
## # A tibble: 1 x 6
##   cyl_is_enough_sum hp_is_enough_sum vs_is_enough_sum am_is_enough_sum
## 1              TRUE             TRUE             TRUE            FALSE
## # ... with 2 more variables: gear_is_enough_sum, carb_is_enough_sum

# Row as a whole
validate_rows <- . %>%
  filter(vs == 1) %>%
  transmute(is_enough_sum = rowSums(.) >= 200)
mtcars_tbl %>% validate_rows()
## # A tibble: 14 x 1
##   is_enough_sum
## 1          TRUE
## 2          TRUE
## 3          TRUE
## 4          TRUE
## 5          TRUE
## # ... with 9 more rows

# Cell
validate_cells <- . %>%
  transmute_if(is.numeric, funs(is_out = z_score(.) > 1)) %>%
  slice(-(1:5))
mtcars_tbl %>% validate_cells()
## # A tibble: 27 x 11
##   mpg_is_out cyl_is_out disp_is_out hp_is_out drat_is_out wt_is_out
## 1      FALSE      FALSE       FALSE     FALSE        TRUE     FALSE
## 2      FALSE       TRUE        TRUE      TRUE       FALSE     FALSE
## 3      FALSE       TRUE       FALSE      TRUE       FALSE     FALSE
## 4      FALSE       TRUE       FALSE     FALSE       FALSE     FALSE
## 5      FALSE      FALSE       FALSE     FALSE       FALSE     FALSE
## # ... with 22 more rows, and 5 more variables: qsec_is_out,
## #   vs_is_out, am_is_out, gear_is_out, carb_is_out

Tidy data validation report

After studying these types of dplyr output, I noticed the following points.

In order to use dplyr as a tool for creating rules, there should be one extra level of abstraction for the whole functional sequence. It is not a rule but rather several rules. In other words, it is a function that answers multiple questions about one type of data unit. I decided to call this a rule pack or simply pack.

In order to identify whether some data unit obeys some rule, one needs to describe the data unit, the rule and the result of validation. Descriptions of the last two are simple: for a rule it is a combination of pack and rule names (which should always be defined) and for a validation result it is the value TRUE or FALSE.

Description of a data unit is trickier. After some thought, I decided that the most balanced way to do it is with two variables:

  • var (character) which represents the variable name of data unit:
    • Value “.all” is reserved for “all columns as a whole”.
    • Value equal to some column name indicates column of data unit.
    • Value not equal to some column name indicates the name of group: it is created by uniting (with delimiter) group levels of grouping columns.
  • id (integer) which represents the row index of data unit:
    • Value 0 is reserved for “all rows as a whole”.
    • Value not equal to 0 indicates the row index of data unit.

Combinations of these variables describe all mentioned data units:

  • var == '.all' and id == 0: Data as a whole.
  • var != '.all' and id == 0: Group (var shouldn’t be an actual column name) or column (var should be an actual column name) as a whole.
  • var == '.all' and id != 0: Row as a whole.
  • var != '.all' and id != 0: Described cell.
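A tiny helper (hypothetical, not part of ruler) makes the four combinations concrete:

```r
# Decode the (var, id) pair into the data unit it describes
describe_unit <- function(var, id) {
  if (var == ".all" && id == 0) "data as a whole"
  else if (var != ".all" && id == 0) "group or column as a whole"
  else if (var == ".all" && id != 0) "row as a whole"
  else "cell"
}

describe_unit(".all", 0)  # "data as a whole"
describe_unit("mpg", 3)   # "cell"
```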

With this knowledge in mind, I decided that the tidy data validation report should be a tibble with the following columns:

  • pack : Pack name.
  • rule : Rule name inside pack.
  • var : Variable name of data unit.
  • id : Row index of data unit.
  • value : Whether the described data unit obeys the rule.

Using only the described report as validation output is possible if only information about breakers (data units which do not obey their respective rules) is of interest. However, reproducibility is a big deal in the R community, and keeping information about the call can be helpful for future use.

This idea led to the creation of another object in ruler called packs info. It is also a tibble which contains all information about the exposure call:

  • name : Name of the rule pack. This column is used to match column pack in tidy report.
  • type : Name of pack type. Indicates which data unit pack checks.
  • fun : List of actually used rule pack functions.
  • remove_obeyers : Value of convenience argument of the future expose function. It indicates whether rows about obeyers (data units that obey certain rule) were removed from report after applying pack.

To fully represent validation, the two described tibbles should be returned together. So the actual validation result was decided to be an exposure, which is basically an S3-class list with two tibbles: packs_info and report. This data structure is fairly easy to understand and use. For example, exposures can be bound together, which is useful for combining several validation results. Also its elements are regular tibbles which can be filtered, summarised, joined, etc.

Rules definition

Interpretation of dplyr output

I was willing to use pure dplyr for creating rule packs, i.e. without extra knowledge of the data unit to be validated. However, I found it impossible to do without running into annoying edge cases. The problem with this approach is that all dplyr outputs are tibbles with similar structures. The only differentiating features are:

  • summarise without grouping returns a tibble with one row and user-defined column names.
  • summarise with grouping returns a tibble with as many rows as there are summarised groups. Columns consist of grouping and user-defined ones.
  • transmute returns a tibble with the same number of rows as the input data frame and user-defined column names.
  • Scoped variants of summarise and transmute differ from regular ones in their mechanism of creating columns: they apply all supplied functions to all chosen columns. Resulting names are "the shortest … needed to uniquely identify the output". It means that:
    • In case of one function they are column names.
    • In case of more than one function and one column they are function names.
    • In case of more than one column and more than one function they are combinations of column and function names, pasted with the character _ (which, unfortunately, is hardcoded). To force this behaviour in the previous cases, both columns and functions should be named inside the helper functions vars and funs. To match output columns with a combination of validated column and rule, this option is preferred. However, there is a need for a different separator between column and function names, as the character _ is frequently used in column names.
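The naming behaviour of scoped verbs is easy to see directly (dplyr 0.7-era funs() syntax, matching the rest of this post):

```r
library(dplyr)

# More than one column and more than one function: names are pasted
# with the hardcoded "_", e.g. mpg_low, hp_low, mpg_up, hp_up
mtcars %>% summarise_at(vars(mpg, hp), funs(low = min(.), up = max(.)))

# One unnamed function: output names collapse to just the column names,
# losing the rule name entirely
mtcars %>% summarise_at(vars(mpg, hp), funs(min(.)))
```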

The first attempt was to use the following algorithm to interpret (identify validated data unit) the output:

  • If there is at least one non-logical column then groups are validated. The reason is that in most cases grouping columns are character or factor ones. This already introduces edge case with logical grouping columns.
  • Combination of whether number of rows equals 1 (n_rows_one) and presence of name separator in all column names (all_contain_sep) is used to make interpretation:
    • If n_rows_one == TRUE and all_contain_sep == FALSE then data is validated.
    • If n_rows_one == TRUE and all_contain_sep == TRUE then columns are validated.
    • If n_rows_one == FALSE and all_contain_sep == FALSE then rows are validated. This introduces an edge case when output has one row which is intended to be validated. It will be interpreted as ‘data as a whole’.
    • If n_rows_one == FALSE and all_contain_sep == TRUE then cells are validated. This also has edge case when output has one row in which cells are intended to be validated. It will be interpreted as ‘columns as a whole’.
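As a sketch (this is not ruler's actual internal code), the interpretation algorithm above could look like:

```r
guess_data_unit <- function(out, sep = "\\._\\.") {
  # Non-logical columns are assumed to be grouping columns
  if (any(!vapply(out, is.logical, logical(1)))) return("group")
  n_rows_one <- nrow(out) == 1
  all_contain_sep <- all(grepl(sep, colnames(out)))
  if (n_rows_one && !all_contain_sep) "data"
  else if (n_rows_one && all_contain_sep) "column"
  else if (!n_rows_one && !all_contain_sep) "row"
  else "cell"
}
```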

Despite having edge cases, this algorithm is good at guessing the validated data unit, which can be useful for interactive use. Its important prerequisite is to have a simple way of forcing extended naming in scoped dplyr verbs with a custom, rarely used separator.

Pack creation

Research into a pure dplyr-style way of creating rule packs left no choice but to create a mechanism for supplying information about the data unit of interest along with pack functions. It consists of the following important principles.

Use ruler's function rules() instead of funs(). Its goals are to force usage of full naming in scoped dplyr verbs as much as possible and to impute missing rule names (as every rule should be named for the validation report). rules() is just a wrapper around funs() with the extra functionality of naming every output element and adding a prefix to those names (which will be used as part of the separator between column and rule names). By default the prefix is the string ._., chosen for its (hopefully) rare usage inside column names and for its symbolism (it is the Morse code for the letter 'R').

funs(mean, sd)
## $ mean: mean(.)
## $ sd  : sd(.)

rules(mean, sd)
## $ ._.rule..1: mean(.)
## $ ._.rule..2: sd(.)

rules(mean, sd, .prefix = "___")
## $ ___rule..1: mean(.)
## $ ___rule..2: sd(.)

rules(fn_1 = mean, fn_2 = sd)
## $ ._.fn_1: mean(.)
## $ ._.fn_2: sd(.)

Note that in case of using only one column in scoped verb it should be named within dplyr::vars in order to force full naming.

Use functions supported by keyholder to build rule packs. One of the main features I wanted to implement is the possibility of validating only a subset of all possible data units, for example only the last two rows (or columns) of a data frame. There is no problem with columns: they can be specified with summarise_at. However, the default way of specifying rows is by subsetting the data frame, after which all information about original row positions is lost. To solve this, I needed a mechanism for tracking rows as invisibly for the user as possible. This led to the creation of the keyholder package (which is also on CRAN now). To learn details about it go to its site or read my previous post.

Use specific rule pack wrappers for certain data units. Their goal is to create S3 classes for rule packs in order to carry information about the data unit of interest through the exposing process. All of them return a list with the supplied functions but with a changed class attribute (plus additional group_vars and group_sep attributes for group_packs()). Note that packs might be named inside these functions, which is recommended; if not, names will be imputed during the exposing process. Also note that supplied functions are not checked to be correct in terms of validating the specified data unit. This is done during exposure (the exposing process).

# Data unit. Rule pack is manually named 'my_data'
my_data_packs <- data_packs(my_data = validate_data)
map(my_data_packs, class)
## $my_data
## [1] "data_pack" "rule_pack" "fseq"      "function"

# Group unit. Need to supply grouping variables explicitly
my_group_packs <- group_packs(validate_groups, .group_vars = c("vs", "am"))
map(my_group_packs, class)
## [[1]]
## [1] "group_pack" "rule_pack" "fseq"      "function"

# Column unit. Needs to be rewritten using `rules`
my_col_packs <- col_packs(
  my_col = . %>% summarise_if(is_integerish, rules(is_enough_sum = sum(.) >= 14))
)
map(my_col_packs, class)
## $my_col
## [1] "col_pack"  "rule_pack" "fseq"      "function"

# Row unit. One can supply several rule packs
my_row_packs <- row_packs(
  my_row_1 = validate_rows,
  my_row_2 = . %>% transmute(is_vs_one = vs == 1)
)
map(my_row_packs, class)
## $my_row_1
## [1] "row_pack"  "rule_pack" "fseq"      "function"
##
## $my_row_2
## [1] "row_pack"  "rule_pack" "fseq"      "function"

# Cell unit. Also needs to be rewritten using `rules`
my_cell_packs <- cell_packs(
  my_cell = . %>%
    transmute_if(is.numeric, rules(is_out = z_score(.) > 1)) %>%
    slice(-(1:5))
)
map(my_cell_packs, class)
## $my_cell
## [1] "cell_pack" "rule_pack" "fseq"      "function"

Exposing process

After sorting out the formats of validation results and rule packs, it was time to combine them in ruler's main function: expose(). I had the following requirements:

  • It should be insertable inside common %>% pipelines as smoothly and flexibly as possible. Two main examples are validating a data frame before performing some operations with it, and actually obtaining the results of validation.
  • There should be a possibility of applying expose sequentially with different rule packs. In this case the exposure (validation report) after the first call should be updated with the new exposure. In other words, the result should be as if both rule packs were supplied in one expose call.

These requirements led to the following main design property of expose: it never modifies the content of the input data frame but possibly creates or updates the attribute exposure with a validation report. To access validation data there are wrappers get_exposure(), get_report() and get_packs_info(). The whole exposing process can be described as follows:

  • Apply all supplied rule packs to a keyed (with keyholder::use_id) version of the input data frame.
  • Impute names of rule packs based on a possibly present exposure (from a previous use of expose) and the validated data units.
  • Bind the possibly present exposure with the new one and create/update the attribute exposure with the result.

It was also decided (for flexibility and convenience) to add the following arguments to expose:

  • .rule_sep. A regular expression used to delimit column and function names in the output of scoped dplyr verbs. By default it is the string ._. possibly surrounded by punctuation characters. This is done to account for dplyr's hardcoded use of _ in scoped verbs. Note that .rule_sep should take into account the separator used in rules().
  • .remove_obeyers. A logical argument indicating whether to automatically remove elements which obey rules from the tidy validation report. It can be very useful because the usual result of validation is a handful of rule breakers. Without the possibility of setting .remove_obeyers to TRUE (which is the default) the validation report would grow unnecessarily big.
  • .guess. By default expose guesses the type of an unsupported rule pack with the algorithm described before. In order to write strict and robust code this can be set to FALSE, in which case an error will be thrown after detecting an unfamiliar pack type.

Some examples:

mtcars_tbl %>%
  expose(my_data_packs, my_col_packs) %>%
  get_exposure()
## Exposure
##
## Packs info:
## # A tibble: 2 x 4
##      name      type   fun remove_obeyers
## 1 my_data data_pack                 TRUE
## 2  my_col  col_pack                 TRUE
##
## Tidy data validation report:
## # A tibble: 2 x 5
##      pack          rule   var    id value
## 1 my_data       nrow_up  .all     0 FALSE
## 2  my_col is_enough_sum    am     0 FALSE

# Note that `id` starts from 6 as rows 1:5 were removed from validating
mtcars_tbl %>%
  expose(my_cell_packs, .remove_obeyers = FALSE) %>%
  get_exposure()
## Exposure
##
## Packs info:
## # A tibble: 1 x 4
##      name      type   fun remove_obeyers
## 1 my_cell cell_pack                FALSE
##
## Tidy data validation report:
## # A tibble: 297 x 5
##      pack   rule   var    id value
## 1 my_cell is_out   mpg     6 FALSE
## 2 my_cell is_out   mpg     7 FALSE
## 3 my_cell is_out   mpg     8 FALSE
## 4 my_cell is_out   mpg     9 FALSE
## 5 my_cell is_out   mpg    10 FALSE
## # ... with 292 more rows

# Note name imputation and guessing
mtcars_tbl %>%
  expose(my_data_packs, .remove_obeyers = FALSE) %>%
  expose(validate_rows) %>%
  get_exposure()
## Exposure
##
## Packs info:
## # A tibble: 2 x 4
##          name      type   fun remove_obeyers
## 1     my_data data_pack                FALSE
## 2 row_pack..1  row_pack                 TRUE
##
## Tidy data validation report:
## # A tibble: 3 x 5
##          pack          rule   var    id value
## 1     my_data      nrow_low  .all     0  TRUE
## 2     my_data       nrow_up  .all     0 FALSE
## 3 row_pack..1 is_enough_sum  .all    19 FALSE

Act after exposure

After creating a data frame with the attribute exposure, it is pretty straightforward to design how to perform any action. This is implemented in the function act_after_exposure with the following arguments:

  • .tbl which should be the result of using expose().
  • .trigger: a function which takes .tbl as argument and returns TRUE if some action needs to be performed.
  • .actor: a function which takes .tbl as argument and performs the action.

Basically act_after_exposure() does the following:

  • Checks that .tbl has a proper exposure attribute.
  • Computes whether to perform the intended action by evaluating .trigger(.tbl).
  • If the trigger returns TRUE then .actor(.tbl) is returned; otherwise .tbl is returned.
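Stripped of most input checking, the behaviour above can be sketched as follows (ruler's real implementation is more careful):

```r
act_after_exposure_sketch <- function(.tbl, .trigger, .actor) {
  # A proper `exposure` attribute must be present
  stopifnot(!is.null(attr(.tbl, "exposure")))
  # Perform the action only if the trigger fires; otherwise pass .tbl through
  if (isTRUE(.trigger(.tbl))) .actor(.tbl) else .tbl
}
```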

It is a good idea for .actor to do one of two things:

  • Make side effects, for example throw an error (if the condition in .trigger is met), print some information and so on. In this case it should return .tbl to be used properly inside a pipe.
  • Change .tbl based on exposure information. In this case it should return the imputed version of .tbl.

As the main use case, ruler has the function assert_any_breaker. It is a wrapper around act_after_exposure with .trigger checking for the presence of any breaker in the exposure and .actor notifying about it.

mtcars_tbl %>%
  expose(my_data_packs) %>%
  assert_any_breaker()
## Breakers report
## Tidy data validation report:
## # A tibble: 1 x 5
##      pack    rule   var    id value
## 1 my_data nrow_up  .all     0 FALSE
## Error: assert_any_breaker: Some breakers found in exposure.

Conclusions
  • Design process of a package deserves its own story.
  • The package ruler offers tools for dplyr-style exploration and validation of data frame like objects. With its help, validation is done with 3 commands/steps, each designed for a specific purpose.


sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/openblas-base/
## LAPACK: /usr/lib/
##
## locale:
##  [1] LC_CTYPE=ru_UA.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=ru_UA.UTF-8        LC_COLLATE=ru_UA.UTF-8
##  [5] LC_MONETARY=ru_UA.UTF-8    LC_MESSAGES=ru_UA.UTF-8
##  [7] LC_PAPER=ru_UA.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base
##
## other attached packages:
## [1] bindrcpp_0.2 ruler_0.1.0  purrr_0.2.4  dplyr_0.7.4
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13     knitr_1.17       bindr_0.1        magrittr_1.5
##  [5] tidyselect_0.2.3 R6_2.2.2         rlang_0.1.4      stringr_1.2.0
##  [9] tools_3.4.2      htmltools_0.3.6  yaml_2.1.14      rprojroot_1.2
## [13] digest_0.6.12    assertthat_0.2.0 tibble_1.3.4     bookdown_0.5
## [17] tidyr_0.7.2      glue_1.2.0       evaluate_0.10.1  rmarkdown_1.7
## [21] blogdown_0.2     stringi_1.1.5    compiler_3.4.2   keyholder_0.1.1
## [25] backports_1.1.1  pkgconfig_2.0.1

To leave a comment for the author, please follow the link and comment on their blog: QuestionFlow.

MICE (Multiple Imputation by Chained Equations) in R – sketchnotes from MünsteR Meetup

Tue, 11/28/2017 - 01:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

Last night, the MünsteR R user-group had another great meetup:

Karin Groothuis-Oudshoorn, Assistant Professor at the University of Twente, presented her R package mice about Multivariate Imputation by Chained Equations.

It was a very interesting talk and here are my sketchnotes that I took during it:

MICE talk sketchnotes

Here is the link to the paper referenced in my notes:

"The mice package implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model. The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. In addition, MICE can impute continuous two-level data, and maintain consistency between imputations by means of passive imputation. Many diagnostic plots are implemented to inspect the quality of the imputations."

For more information on the package go to


To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

Building the oomsifyer

Tue, 11/28/2017 - 01:00

(This article was first published on Clean Code, and kindly contributed to R-bloggers)

Today I will show you a quick hack (OK, it took me about 4 hours during my travels yesterday and today) on how to add a dancing banana to any picture.

Now, you might be thinking… really, why would you add a dancing banana to a picture? But I don't have time for that kind of negativity in my life.

Why oomsifier?

Jeroen Ooms is one of those crazy productive people in the R world. The way he wraps C++ libraries into packages makes you think his middle name is C-plus-plus.

At the satRday in Budapest in 2016 (?) Jeroen demonstrated the magick package. You can now control images from inside R using dark magic and the bindings to the magick library. In honor of this work, and because I needed a cool name, I will demonstrate THE OOMSIFYER.

Solving the lifelong dream of people all around the world… of adding dancing banana gifs to pictures…

If you are like me, you would have thought this would be really easy, but you would be wrong.

First the function then the explanation and some stuff that took me waaay too long
to find out.

The function

banana <- image_read("images/banana.gif") # this assumes you have a project with the folder /images/ inside

add_banana <- function(image, offset = NULL, debug = FALSE){
  image_in <- magick::image_read(image)
  banana <- image_read("images/banana.gif") # 365w 360h
  image_info <- image_info(image_in)
  if("gif" %in% image_info$format){stop("gifs are too difficult for me now")}
  stopifnot(nrow(image_info) == 1)
  # scale banana to correct size: take smallest dimension
  target_height <- min(image_info$width, image_info$height)
  # scale banana to 1/3 of the size
  scaling <- (target_height / 3)
  front <- image_scale(banana, scaling)
  # place in lower right corner:
  # offset is width and height minus the dimensions of the picture
  scaled_dims <- image_info(front)
  x_c <- image_info$width - scaled_dims$width
  y_c <- image_info$height - scaled_dims$height
  offset_value <- ifelse(is.null(offset), paste0("+", x_c, "+", y_c), offset)
  if(debug) print(offset_value)
  frames <- lapply(as.list(front), function(x) image_composite(image_in, x, offset = offset_value))
  result <- image_animate(image_join(frames), fps = 10)
  result
}

So what can it do?

This function takes an image, e.g. a ggplot2 plot that you saved to disk, and adds the dancing banana gif to the bottom right corner.

I had to combine information from the magick vignette and an earlier blogpost about magick in R.

Things I learned:

  • the magick package returns image information as a data frame
  • a gif is a succession of images (frames)
  • a normal picture is a single frame
  • to combine a gif and a single frame you have to create exactly as many frames of your original picture as there are frames in the gif
  • for every frame you have to merge the gif and image with each other into a composite image
  • the offset is the number of pixels from the left of the frame and from the top of the frame. I wanted to put the dancing banana at the bottom right of the picture AND I wanted to scale the banana so that it would not take over the entire image, so the offset needed to be responsive to both the scaling and the input dimensions
  • the practical dev has many silly o-reilly like book covers that I find hilarious
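The offset arithmetic from the list above can be isolated into a small helper (hypothetical names; magick offsets are "+x+y" strings measured from the top-left corner):

```r
bottom_right_offset <- function(bg_width, bg_height, fg_width, fg_height) {
  paste0("+", bg_width - fg_width, "+", bg_height - fg_height)
}

bottom_right_offset(600, 400, 120, 120)  # "+480+280"
```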

In the following posts I might turn this function into an API, so stay tuned!

Building the oomsifyer was originally published by at Clean Code on November 28, 2017.


To leave a comment for the author, please follow the link and comment on their blog: Clean Code.

Customer Analytics: Using Deep Learning With Keras To Predict Customer Churn

Tue, 11/28/2017 - 01:00

(This article was first published on - Articles, and kindly contributed to R-bloggers)

Customer churn is a problem that all companies need to monitor, especially those that depend on subscription-based revenue streams. The simple fact is that most organizations have data that can be used to target these individuals and to understand the key drivers of churn, and we now have Keras for Deep Learning available in R (Yes, in R!!), which predicted customer churn with 82% accuracy. We’re super excited for this article because we are using the new keras package to produce an Artificial Neural Network (ANN) model on the IBM Watson Telco Customer Churn Data Set! As for most business problems, it’s equally important to explain what features drive the model, which is why we’ll use the lime package for explainability. We cross-checked the LIME results with a Correlation Analysis using the corrr package. We’re not done yet. In addition, we use three new packages to assist with Machine Learning (ML): recipes for preprocessing, rsample for sampling data and yardstick for model metrics. These are relatively new additions to CRAN developed by Max Kuhn at RStudio (creator of the caret package). It seems that R is quickly developing ML tools that rival Python. Good news if you’re interested in applying Deep Learning in R! We are so let’s get going!!

Customer Churn: Hurts Sales, Hurts Company

Customer churn refers to the situation when a customer ends their relationship with a company, and it’s a costly problem. Customers are the fuel that powers a business. Loss of customers impacts sales. Further, it’s much more difficult and costly to gain new customers than it is to retain existing customers. As a result, organizations need to focus on reducing customer churn.

The good news is that machine learning can help. For many businesses that offer subscription based services, it’s critical to both predict customer churn and explain what features relate to customer churn. Older techniques such as logistic regression can be less accurate than newer techniques such as deep learning, which is why we are going to show you how to model an ANN in R with the keras package.

Churn Modeling With Artificial Neural Networks (Keras)

Artificial Neural Networks (ANN) are now a staple within the sub-field of Machine Learning called Deep Learning. Deep learning algorithms can be vastly superior to traditional regression and classification methods (e.g. linear and logistic regression) because of the ability to model interactions between features that would otherwise go undetected. The challenge becomes explainability, which is often needed to support the business case. The good news is we get the best of both worlds with keras and lime.

IBM Watson Dataset (Where We Got The Data)

The dataset used for this tutorial is the IBM Watson Telco Dataset. According to IBM, the business challenge is…

A telecommunications company [Telco] is concerned about the number of customers leaving their landline business for cable competitors. They need to understand who is leaving. Imagine that you’re an analyst at this company and you have to find out who is leaving and why.

The dataset includes information about:

  • Customers who left within the last month: The column is called Churn
  • Services that each customer has signed up for: phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
  • Customer account information: how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
  • Demographic info about customers: gender, age range, and if they have partners and dependents
Deep Learning With Keras (What We Did With The Data)

In this example we show you how to use keras to develop a sophisticated and highly accurate deep learning model in R. We walk you through the preprocessing steps, investing time into how to format the data for Keras. We inspect the various classification metrics, and show that an un-tuned ANN model can easily get 82% accuracy on the unseen data. Here’s the deep learning training history visualization.

We have some fun with preprocessing the data (yes, preprocessing can actually be fun and easy!). We use the new recipes package to simplify the preprocessing workflow.

We end by showing you how to explain the ANN with the lime package. Neural networks used to be frowned upon because of their “black box” nature: these sophisticated models (ANNs are highly accurate) are difficult to explain using traditional methods. Not anymore, thanks to LIME! Here’s the feature importance visualization.

We also cross-checked the LIME results with a Correlation Analysis using the corrr package. Here’s the correlation visualization.

We even built an ML-Powered Interactive PowerBI Web Application with a Customer Scorecard to monitor customer churn risk and to make recommendations on how to improve customer health! Feel free to take it for a spin.

View in Full Screen Mode for best experience


We saw that just last week the same Telco customer churn dataset was used in the article, Predict Customer Churn – Logistic Regression, Decision Tree and Random Forest. We thought the article was excellent.

This article takes a different approach with Keras, LIME, Correlation Analysis, and a few other cutting edge packages. We encourage the readers to check out both articles because, although the problem is the same, both solutions are beneficial to those learning data science and advanced modeling.


We use the following libraries in this tutorial:

Install the following packages with install.packages().

pkgs <- c("keras", "lime", "tidyquant", "rsample", "recipes", "yardstick", "corrr")
install.packages(pkgs)

Load Libraries

Load the libraries.

# Load libraries
library(keras)
library(lime)
library(tidyquant)
library(rsample)
library(recipes)
library(yardstick)
library(corrr)

If you have not previously run Keras in R, you will need to install Keras using the install_keras() function.

# Install Keras if you have not installed before
install_keras()

Import Data

Download the IBM Watson Telco Data Set here. Next, use read_csv() to import the data into a nice tidy data frame. We use the glimpse() function to quickly inspect the data. We have the target “Churn” and all other variables are potential predictors. The raw data set needs to be cleaned and preprocessed for ML.

# Import data
churn_data_raw <- read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

glimpse(churn_data_raw)
## Observations: 7,043
## Variables: 21
## $ customerID "7590-VHVEG", "5575-GNVDE", "3668-QPYBK"...
## $ gender "Female", "Male", "Male", "Male", "Femal...
## $ SeniorCitizen 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Partner "Yes", "No", "No", "No", "No", "No", "No...
## $ Dependents "No", "No", "No", "No", "No", "No", "Yes...
## $ tenure 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, ...
## $ PhoneService "No", "Yes", "Yes", "No", "Yes", "Yes", ...
## $ MultipleLines "No phone service", "No", "No", "No phon...
## $ InternetService "DSL", "DSL", "DSL", "DSL", "Fiber optic...
## $ OnlineSecurity "No", "Yes", "Yes", "Yes", "No", "No", "...
## $ OnlineBackup "Yes", "No", "Yes", "No", "No", "No", "Y...
## $ DeviceProtection "No", "Yes", "No", "Yes", "No", "Yes", "...
## $ TechSupport "No", "No", "No", "Yes", "No", "No", "No...
## $ StreamingTV "No", "No", "No", "No", "No", "Yes", "Ye...
## $ StreamingMovies "No", "No", "No", "No", "No", "Yes", "No...
## $ Contract "Month-to-month", "One year", "Month-to-...
## $ PaperlessBilling "Yes", "No", "Yes", "No", "Yes", "Yes", ...
## $ PaymentMethod "Electronic check", "Mailed check", "Mai...
## $ MonthlyCharges 29.85, 56.95, 53.85, 42.30, 70.70, 99.65...
## $ TotalCharges 29.85, 1889.50, 108.15, 1840.75, 151.65,...
## $ Churn "No", "No", "Yes", "No", "Yes", "Yes", "...

Preprocess Data

We’ll go through a few steps to preprocess the data for ML. First, we “prune” the data, which is nothing more than removing unnecessary columns and rows. Then we split into training and testing sets. After that we explore the training set to uncover transformations that will be needed for deep learning. We save the best for last. We end by preprocessing the data with the new recipes package.

Prune The Data

The data has a few columns and rows we’d like to remove:

  • The “customerID” column is a unique identifier for each observation that isn’t needed for modeling. We can de-select this column.
  • The data has 11 NA values all in the “TotalCharges” column. Because it’s such a small percentage of the total population (99.8% complete cases), we can drop these observations with the drop_na() function from tidyr. Note that these may be customers that have not yet been charged, and therefore an alternative is to replace with zero or -99 to segregate this population from the rest.
  • My preference is to have the target in the first column so we’ll include a final select operation to do so.

We’ll perform the cleaning operation with one tidyverse pipe (%>%) chain.

# Remove unnecessary data
churn_data_tbl <- churn_data_raw %>%
  select(-customerID) %>%
  drop_na() %>%
  select(Churn, everything())

glimpse(churn_data_tbl)
## Observations: 7,032
## Variables: 20
## $ Churn "No", "No", "Yes", "No", "Yes", "Yes", "...
## $ gender "Female", "Male", "Male", "Male", "Femal...
## $ SeniorCitizen 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Partner "Yes", "No", "No", "No", "No", "No", "No...
## $ Dependents "No", "No", "No", "No", "No", "No", "Yes...
## $ tenure 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, ...
## $ PhoneService "No", "Yes", "Yes", "No", "Yes", "Yes", ...
## $ MultipleLines "No phone service", "No", "No", "No phon...
## $ InternetService "DSL", "DSL", "DSL", "DSL", "Fiber optic...
## $ OnlineSecurity "No", "Yes", "Yes", "Yes", "No", "No", "...
## $ OnlineBackup "Yes", "No", "Yes", "No", "No", "No", "Y...
## $ DeviceProtection "No", "Yes", "No", "Yes", "No", "Yes", "...
## $ TechSupport "No", "No", "No", "Yes", "No", "No", "No...
## $ StreamingTV "No", "No", "No", "No", "No", "Yes", "Ye...
## $ StreamingMovies "No", "No", "No", "No", "No", "Yes", "No...
## $ Contract "Month-to-month", "One year", "Month-to-...
## $ PaperlessBilling "Yes", "No", "Yes", "No", "Yes", "Yes", ...
## $ PaymentMethod "Electronic check", "Mailed check", "Mai...
## $ MonthlyCharges 29.85, 56.95, 53.85, 42.30, 70.70, 99.65...
## $ TotalCharges 29.85, 1889.50, 108.15, 1840.75, 151.65,...

Split Into Train/Test Sets

We have a new package, rsample, which is very useful for sampling methods. It has the initial_split() function for splitting data sets into training and testing sets. The return is a special rsplit object.

# Split test/training sets
set.seed(100)
train_test_split <- initial_split(churn_data_tbl, prop = 0.8)
train_test_split
## <5626/1406/7032>

We can retrieve our training and testing sets using training() and testing() functions.

# Retrieve train and test sets
train_tbl <- training(train_test_split)
test_tbl  <- testing(train_test_split)

Exploration: What Transformation Steps Are Needed For ML?

This phase of the analysis is often called exploratory analysis, but basically we are trying to answer the question, “What steps are needed to prepare for ML?” The key concept is knowing what transformations are needed to run the algorithm most effectively. Artificial Neural Networks are best when the data is one-hot encoded, scaled and centered. In addition, other transformations may be beneficial as well to make relationships easier for the algorithm to identify. A full exploratory analysis is not practical in this article. With that said we’ll cover a few tips on transformations that can help as they relate to this dataset. In the next section, we will implement the preprocessing techniques.

Discretize The “tenure” Feature

Numeric features like age, years worked, length of time in a position can generalize a group (or cohort). We see this in marketing a lot (think “millennials”, which identifies a group born in a certain timeframe). The “tenure” feature falls into this category of numeric features that can be discretized into groups.

We can split into six cohorts that divide up the user base by tenure in roughly one year (12 month) increments. This should help the ML algorithm detect if a group is more/less susceptible to customer churn.
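Outside the recipes workflow we use below, the same idea can be sketched with base R’s cut(); the sample tenure values here are purely illustrative:

```r
# Illustrative only: bucket tenure (months as a customer) into six 12-month cohorts
tenure <- c(1, 34, 2, 45, 8, 22, 62, 13, 70, 5)
tenure_cohort <- cut(tenure, breaks = seq(0, 72, by = 12),
                     labels = paste0("bin", 1:6))
table(tenure_cohort)
```

Each customer now carries a cohort label instead of a raw month count, which is exactly the kind of grouping the recipes step will produce for us automatically.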

Transform The “TotalCharges” Feature

What we don’t like to see is when a lot of observations are bunched within a small part of the range.

We can use a log transformation to even out the data into more of a normal distribution. It’s not perfect, but it’s quick and easy to get our data spread out a bit more.

Pro Tip: A quick test is to see if the log transformation increases the magnitude of the correlation between “TotalCharges” and “Churn”. We’ll use a few dplyr operations along with the corrr package to perform a quick correlation.

  • correlate(): Performs tidy correlations on numeric data
  • focus(): Similar to select(). Takes columns and focuses on only the rows/columns of importance.
  • fashion(): Makes the formatting aesthetically easier to read.
# Determine if log transformation improves correlation
# between TotalCharges and Churn
train_tbl %>%
  select(Churn, TotalCharges) %>%
  mutate(
    Churn = Churn %>% as.factor() %>% as.numeric(),
    LogTotalCharges = log(TotalCharges)
  ) %>%
  correlate() %>%
  focus(Churn) %>%
  fashion()
## rowname Churn
## 1 TotalCharges -.20
## 2 LogTotalCharges -.25

The correlation between “Churn” and “LogTotalCharges” is greatest in magnitude indicating the log transformation should improve the accuracy of the ANN model we build. Therefore, we should perform the log transformation.

One-Hot Encoding

One-hot encoding is the process of converting categorical data into a sparse representation with columns containing only zeros and ones (this is also called creating “dummy variables” or a “design matrix”). All non-numeric data will need to be converted to dummy variables. This is simple for binary Yes/No data because we can simply convert to 1s and 0s. It becomes slightly more complicated with multiple categories, which requires creating a new column of 1s and 0s for each category (actually one fewer than the number of categories). We have four features that are multi-category: Contract, Internet Service, Multiple Lines, and Payment Method.
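As a quick sketch of what this expansion looks like (using base R’s model.matrix() here rather than the recipes step we apply later), a three-level factor becomes two 0/1 columns once the intercept column is dropped:

```r
# Illustrative: a 3-level factor expands to 2 dummy columns (one fewer than levels)
df <- data.frame(Contract = factor(c("Month-to-month", "One year", "Two year")))
dummies <- model.matrix(~ Contract, data = df)[, -1]  # drop the intercept
dummies
```

The “Month-to-month” level is encoded implicitly as zeros in both columns, which is why one fewer column than the number of categories is enough.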

Feature Scaling

ANNs typically train faster and often achieve higher accuracy when the features are scaled and/or normalized (i.e. centered and scaled, also known as standardizing). Because ANNs use gradient descent, weights tend to update faster. According to Sebastian Raschka, an expert in the field of Deep Learning, several examples when feature scaling is important are:

  • k-nearest neighbors with an Euclidean distance measure if you want all features to contribute equally
  • k-means (see k-nearest neighbors)
  • logistic regression, SVMs, perceptrons, neural networks etc. if you are using gradient descent/ascent-based optimization, otherwise some weights will update much faster than others
  • linear discriminant analysis, principal component analysis, kernel principal component analysis since you want to find directions of maximizing the variance (under the constraints that those directions/eigenvectors/principal components are orthogonal); you want to have features on the same scale since you’d emphasize variables on “larger measurement scales” more. There are many more cases than I can possibly list here … I always recommend you to think about the algorithm and what it’s doing, and then it typically becomes obvious whether we want to scale your features or not.

The interested reader can read Sebastian Raschka’s article for a full discussion on the scaling/normalization topic. Pro Tip: When in doubt, standardize the data.
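Centering and scaling is simple arithmetic: subtract the mean, then divide by the standard deviation. A minimal sketch of what step_center() and step_scale() will do for each numeric feature:

```r
# Illustrative: standardize a feature to mean 0 and standard deviation 1
x <- c(10, 20, 30, 40)
x_std <- (x - mean(x)) / sd(x)
mean(x_std)  # ~0
sd(x_std)    # 1
```

After this transformation, every numeric feature lives on the same scale, so no single variable dominates the gradient updates.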

Preprocessing With Recipes

Let’s implement the preprocessing steps/transformations uncovered during our exploration. Max Kuhn (creator of caret) has been putting some work into Rlang ML tools lately, and the payoff is beginning to take shape. A new package, recipes, makes creating ML data preprocessing workflows a breeze! It takes a little getting used to, but I’ve found that it really helps manage the preprocessing steps. We’ll go over the nitty gritty as it applies to this problem.

Step 1: Create A Recipe

A “recipe” is nothing more than a series of steps you would like to perform on the training, testing and/or validation sets. Think of preprocessing data like baking a cake (I’m not a baker but stay with me). The recipe is our steps to make the cake. It doesn’t do anything other than create the playbook for baking.

We use the recipe() function to implement our preprocessing steps. The function takes a familiar object argument, which is a modeling formula such as object = Churn ~ . meaning “Churn” is the outcome (aka response, target) and all other features are predictors. The function also takes the data argument, which gives the “recipe steps” perspective on how to apply them during baking (next).

A recipe is not very useful until we add “steps”, which are used to transform the data during baking. The package contains a number of useful “step functions” that can be applied. The entire list of Step Functions can be viewed here. For our model, we use:

  1. step_discretize() with options = list(cuts = 6) to cut the continuous variable “tenure” (number of months as a customer) to group customers into cohorts.
  2. step_log() to log transform “TotalCharges”.
  3. step_dummy() to one-hot encode the categorical data. Note that this adds columns of one/zero for categorical data with three or more categories.
  4. step_center() to mean-center the data.
  5. step_scale() to scale the data.

The last step is to prepare the recipe with the prep() function. This step is used to “estimate the required parameters from a training set that can later be applied to other data sets”. This is important for centering and scaling and other functions that use parameters defined from the training set.

Here’s how simple it is to implement the preprocessing steps that we went over!

# Create recipe
rec_obj <- recipe(Churn ~ ., data = train_tbl) %>%
  step_discretize(tenure, options = list(cuts = 6)) %>%
  step_log(TotalCharges) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_center(all_predictors(), -all_outcomes()) %>%
  step_scale(all_predictors(), -all_outcomes()) %>%
  prep(data = train_tbl)
## step 1 discretize training
## step 2 log training
## step 3 dummy training
## step 4 center training
## step 5 scale training

We can print the recipe object if we ever forget what steps were used to prepare the data. Pro Tip: We can save the recipe object as an RDS file using saveRDS(), and then use it to bake() (discussed next) future raw data into ML-ready data in production!
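A minimal sketch of that production pattern (the file name is just an example):

```r
# Save the prepped recipe, then reload it later to bake new raw data
saveRDS(rec_obj, "rec_obj.rds")          # rec_obj from the prep() step above
rec_obj_loaded <- readRDS("rec_obj.rds")
# new_data_ready <- bake(rec_obj_loaded, newdata = new_raw_data)
```

Because the recipe stores the training-set means, standard deviations, and cut points, baking new data with the reloaded object applies exactly the same transformations as during training.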

# Print the recipe object
rec_obj
## Data Recipe
##
## Inputs:
##
##       role #variables
##    outcome          1
##  predictor         19
##
## Training data contained 5626 data points and no missing data.
##
## Steps:
##
## Dummy variables from tenure [trained]
## Log transformation on TotalCharges [trained]
## Dummy variables from ~gender, ~Partner, ... [trained]
## Centering for SeniorCitizen, ... [trained]
## Scaling for SeniorCitizen, ... [trained]

Step 2: Baking With Your Recipe

Now for the fun part! We can apply the “recipe” to any data set with the bake() function, and it processes the data following our recipe steps. We’ll apply to our training and testing data to convert from raw data to a machine learning dataset. Check our training set out with glimpse(). Now that’s an ML-ready dataset prepared for ANN modeling!!

# Predictors
x_train_tbl <- bake(rec_obj, newdata = train_tbl)
x_test_tbl  <- bake(rec_obj, newdata = test_tbl)

glimpse(x_train_tbl)
## Observations: 5,626
## Variables: 35
## $ SeniorCitizen -0.4351959, -0.4351...
## $ MonthlyCharges -1.1575972, -0.2601...
## $ TotalCharges -2.275819130, 0.389...
## $ gender_Male -1.0016900, 0.99813...
## $ Partner_Yes 1.0262054, -0.97429...
## $ Dependents_Yes -0.6507747, -0.6507...
## $ tenure_bin1 2.1677790, -0.46121...
## $ tenure_bin2 -0.4389453, -0.4389...
## $ tenure_bin3 -0.4481273, -0.4481...
## $ tenure_bin4 -0.4509837, 2.21698...
## $ tenure_bin5 -0.4498419, -0.4498...
## $ tenure_bin6 -0.4337508, -0.4337...
## $ PhoneService_Yes -3.0407367, 0.32880...
## $ 3.0407367, -0.32880...
## $ MultipleLines_Yes -0.8571364, -0.8571...
## $ InternetService_Fiber.optic -0.8884255, -0.8884...
## $ InternetService_No -0.5272627, -0.5272...
## $ OnlineSecurity_No.internet.service -0.5272627, -0.5272...
## $ OnlineSecurity_Yes -0.6369654, 1.56966...
## $ OnlineBackup_No.internet.service -0.5272627, -0.5272...
## $ OnlineBackup_Yes 1.3771987, -0.72598...
## $ DeviceProtection_No.internet.service -0.5272627, -0.5272...
## $ DeviceProtection_Yes -0.7259826, 1.37719...
## $ TechSupport_No.internet.service -0.5272627, -0.5272...
## $ TechSupport_Yes -0.6358628, -0.6358...
## $ StreamingTV_No.internet.service -0.5272627, -0.5272...
## $ StreamingTV_Yes -0.7917326, -0.7917...
## $ StreamingMovies_No.internet.service -0.5272627, -0.5272...
## $ StreamingMovies_Yes -0.797388, -0.79738...
## $ Contract_One.year -0.5156834, 1.93882...
## $ Contract_Two.year -0.5618358, -0.5618...
## $ PaperlessBilling_Yes 0.8330334, -1.20021...
## $ PaymentMethod_Credit.card..automatic. -0.5231315, -0.5231...
## $ PaymentMethod_Electronic.check 1.4154085, -0.70638...
## $ PaymentMethod_Mailed.check -0.5517013, 1.81225...

Step 3: Don’t Forget The Target

One last step, we need to store the actual values (truth) as y_train_vec and y_test_vec, which are needed for modeling our ANN. We convert to a series of numeric ones and zeros which can be accepted by the Keras ANN modeling functions. We add “vec” to the name so we can easily remember the class of the object (it’s easy to get confused when working with tibbles, vectors, and matrix data types).

# Response variables for training and testing sets
y_train_vec <- ifelse(pull(train_tbl, Churn) == "Yes", 1, 0)
y_test_vec  <- ifelse(pull(test_tbl, Churn) == "Yes", 1, 0)

Model Customer Churn With Keras (Deep Learning)

This is super exciting!! Finally, Deep Learning with Keras in R! The team at RStudio has done fantastic work recently to create the keras package, which implements Keras in R. Very cool!

Background On Artificial Neural Networks

For those unfamiliar with Neural Networks (and those that need a refresher), read this article. It’s very comprehensive, and you’ll leave with a general understanding of the types of deep learning and how they work.

Source: Xenon Stack

Deep Learning has been available in R for some time, but the primary packages used in the wild have not been until recently (this includes Keras, TensorFlow, Theano, etc., which are all Python libraries). It’s worth mentioning that a number of other Deep Learning packages exist in R including h2o, mxnet, and others. The interested reader can check out this blog post for a comparison of deep learning packages in R.

Building A Deep Learning Model

We’re going to build a special class of ANN called a Multi-Layer Perceptron (MLP). MLPs are one of the simplest forms of deep learning, but they are both highly accurate and serve as a jumping-off point for more complex algorithms. MLPs are quite versatile as they can be used for regression, binary classification and multi-class classification (and are typically quite good at classification problems).

We’ll build a three-layer MLP with Keras. Let’s walk through the steps before we implement it in R.

  1. Initialize a sequential model: The first step is to initialize a sequential model with keras_model_sequential(), which is the beginning of our Keras model. The sequential model is composed of a linear stack of layers.

  2. Apply layers to the sequential model: Layers consist of the input layer, hidden layers and an output layer. The input layer is the data and, provided it’s formatted correctly, there’s nothing more to discuss. The hidden layers and output layer control the ANN’s inner workings.

    • Hidden Layers: Hidden layers form the neural network nodes that enable non-linear activation using weights. The hidden layers are created using layer_dense(). We’ll add two hidden layers. We’ll apply units = 16, which is the number of nodes. We’ll select kernel_initializer = "uniform" and activation = "relu" for both layers. The first layer needs to have the input_shape = 35, which is the number of columns in the training set. Key Point: While we are arbitrarily selecting the number of hidden layers, units, kernel initializers and activation functions, these parameters can be optimized through a process called hyperparameter tuning that is discussed in Next Steps.

    • Dropout Layers: Dropout layers are used to control overfitting. During training, a dropout layer randomly sets a fraction of the previous layer’s outputs to zero, which prevents the network from relying too heavily on any individual node. We use the layer_dropout() function to add two dropout layers with rate = 0.10, randomly dropping 10% of the units.

    • Output Layer: The output layer specifies the shape of the output and the method of assimilating the learned information. The output layer is applied using the layer_dense(). For binary values, the shape should be units = 1. For multi-classification, the units should correspond to the number of classes. We set the kernel_initializer = "uniform" and the activation = "sigmoid" (common for binary classification).

  3. Compile the model: The last step is to compile the model with compile(). We’ll use optimizer = "adam", which is one of the most popular optimization algorithms. We select loss = "binary_crossentropy" since this is a binary classification problem. We’ll select metrics = c("accuracy") to be evaluated during training and testing. Key Point: The optimizer is often included in the tuning process.

Let’s codify the discussion above to build our Keras MLP-flavored ANN model.

# Building our Artificial Neural Network
model_keras <- keras_model_sequential()

model_keras %>%
  # First hidden layer
  layer_dense(
    units              = 16,
    kernel_initializer = "uniform",
    activation         = "relu",
    input_shape        = ncol(x_train_tbl)) %>%
  # Dropout to prevent overfitting
  layer_dropout(rate = 0.1) %>%
  # Second hidden layer
  layer_dense(
    units              = 16,
    kernel_initializer = "uniform",
    activation         = "relu") %>%
  # Dropout to prevent overfitting
  layer_dropout(rate = 0.1) %>%
  # Output layer
  layer_dense(
    units              = 1,
    kernel_initializer = "uniform",
    activation         = "sigmoid") %>%
  # Compile ANN
  compile(
    optimizer = 'adam',
    loss      = 'binary_crossentropy',
    metrics   = c('accuracy')
  )

model_keras
## Model
## ______________________________________________________________________
## Layer (type)                    Output Shape               Param #
## ======================================================================
## dense_1 (Dense)                 (None, 16)                 576
## ______________________________________________________________________
## dropout_1 (Dropout)             (None, 16)                 0
## ______________________________________________________________________
## dense_2 (Dense)                 (None, 16)                 272
## ______________________________________________________________________
## dropout_2 (Dropout)             (None, 16)                 0
## ______________________________________________________________________
## dense_3 (Dense)                 (None, 1)                  17
## ======================================================================
## Total params: 865
## Trainable params: 865
## Non-trainable params: 0
## ______________________________________________________________________

We use the fit() function to run the ANN on our training data. The object is our model, and x and y are our training data in matrix and numeric vector forms, respectively. The batch_size = 50 sets the number of samples per gradient update within each epoch. We set epochs = 35 to control the number of training cycles. Typically we want to keep the batch size high since this decreases the error within each training cycle (epoch). We also want epochs to be large, which is important in visualizing the training history (discussed below). We set validation_split = 0.30 to hold out 30% of the data for model validation, which helps us detect overfitting. The training process should complete in 15 seconds or so.

# Fit the keras model to the training data
fit_keras <- fit(
  object           = model_keras,
  x                = as.matrix(x_train_tbl),
  y                = y_train_vec,
  batch_size       = 50,
  epochs           = 35,
  validation_split = 0.30
)

We can inspect the final model. We want to make sure there is minimal difference between the validation accuracy and the training accuracy.

# Print the final model
fit_keras
## Trained on 3,938 samples, validated on 1,688 samples (batch_size=50, epochs=35)
## Final epoch (plot to see history):
## val_loss: 0.4215
## val_acc: 0.8057
## loss: 0.399
## acc: 0.8101

We can visualize the Keras training history using the plot() function. What we want to see is the validation accuracy and loss leveling off, which means the model has completed training. We see that there is some divergence between training loss/accuracy and validation loss/accuracy. This model indicates we can possibly stop training at an earlier epoch. Pro Tip: Only use enough epochs to get a high validation accuracy. Once the validation accuracy curve begins to flatten or decrease, it’s time to stop training.

# Plot the training/validation history of our Keras model
plot(fit_keras) +
  theme_tq() +
  scale_color_tq() +
  scale_fill_tq() +
  labs(title = "Deep Learning Training Results")
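If you’d rather not eyeball the stopping point, keras also supports early stopping via a callback. A sketch of that approach (the patience value is an assumption to tune, not part of the original analysis):

```r
# Illustrative: stop training automatically when validation loss stops improving
fit_keras_es <- fit(
  object           = model_keras,
  x                = as.matrix(x_train_tbl),
  y                = y_train_vec,
  batch_size       = 50,
  epochs           = 100,  # upper bound; early stopping will usually end sooner
  validation_split = 0.30,
  callbacks        = list(
    callback_early_stopping(monitor = "val_loss", patience = 3)
  )
)
```

With patience = 3, training halts once the validation loss fails to improve for three consecutive epochs, which automates the "stop when the curve flattens" rule of thumb.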

Making Predictions

We’ve got a good model based on the validation accuracy. Now let’s make some predictions from our keras model on the test data set, which was unseen during modeling (we use this for the true performance assessment). We have two functions to generate predictions:

  • predict_classes: Generates class values as a matrix of ones and zeros. Since we are dealing with binary classification, we’ll convert the output to a vector.
  • predict_proba: Generates the class probabilities as a numeric matrix indicating the probability of being a class. Again, we convert to a numeric vector because there is only one column output.
# Predicted Class
yhat_keras_class_vec <- predict_classes(object = model_keras, x = as.matrix(x_test_tbl)) %>%
  as.vector()

# Predicted Class Probability
yhat_keras_prob_vec <- predict_proba(object = model_keras, x = as.matrix(x_test_tbl)) %>%
  as.vector()

Inspect Performance With Yardstick

The yardstick package has a collection of handy functions for measuring performance of machine learning models. We’ll overview some metrics we can use to understand the performance of our model.

First, let’s get the data formatted for yardstick. We create a data frame with the truth (actual values as factors), estimate (predicted values as factors), and the class probability (probability of yes as numeric). We use the fct_recode() function from the forcats package to assist with recoding as Yes/No values.

# Format test data and predictions for yardstick metrics
estimates_keras_tbl <- tibble(
  truth      = as.factor(y_test_vec) %>% fct_recode(yes = "1", no = "0"),
  estimate   = as.factor(yhat_keras_class_vec) %>% fct_recode(yes = "1", no = "0"),
  class_prob = yhat_keras_prob_vec
)

estimates_keras_tbl
## # A tibble: 1,406 x 3
##    truth estimate  class_prob
##
##  1   yes       no 0.328355074
##  2   yes      yes 0.633630514
##  3    no       no 0.004589651
##  4    no       no 0.007402068
##  5    no       no 0.049968336
##  6    no       no 0.116824441
##  7    no      yes 0.775479317
##  8    no       no 0.492996633
##  9    no       no 0.011550998
## 10    no       no 0.004276015
## # ... with 1,396 more rows

Now that we have the data formatted, we can take advantage of the yardstick package.

Confusion Table

We can use the conf_mat() function to get the confusion table. We see that the model was by no means perfect, but it did a decent job of identifying customers likely to churn.

# Confusion Table
estimates_keras_tbl %>% conf_mat(truth, estimate)
##           Truth
## Prediction  no yes
##        no  950 161
##        yes  99 196

Accuracy
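The accuracy reported next follows directly from the confusion table: correct predictions divided by total observations.

```r
# Accuracy by hand from the confusion table above
correct <- 950 + 196                 # correctly predicted "no" + "yes"
total   <- 950 + 161 + 99 + 196      # all test observations
correct / total                      # ~0.815
```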

We can use the metrics() function to get an accuracy measurement from the test set. We are getting roughly 82% accuracy.

# Accuracy
estimates_keras_tbl %>% metrics(truth, estimate)
## # A tibble: 1 x 1
##    accuracy
##
## 1 0.8150782

AUC

We can also get the ROC Area Under the Curve (AUC) measurement. AUC is often a good metric used to compare different classifiers and to compare to randomly guessing (AUC_random = 0.50). Our model has AUC = 0.85, which is much better than randomly guessing. Tuning and testing different classification algorithms may yield even better results.

# AUC
estimates_keras_tbl %>% roc_auc(truth, class_prob)
## [1] 0.8523951

Precision And Recall

Precision is, when the model predicts “yes”, how often it is actually “yes”. Recall (also known as sensitivity or the true positive rate) is, when the actual value is “yes”, how often the model is correct. We can get precision() and recall() measurements using yardstick.

# Precision
tibble(
  precision = estimates_keras_tbl %>% precision(truth, estimate),
  recall    = estimates_keras_tbl %>% recall(truth, estimate)
)
## # A tibble: 1 x 2
##   precision    recall
##
## 1 0.8550855 0.9056244

Precision and recall are very important to the business case: The organization is concerned with balancing the cost of targeting and retaining customers at risk of leaving with the cost of inadvertently targeting customers that are not planning to leave (and potentially decreasing revenue from this group). The threshold above which to predict Churn = “Yes” can be adjusted to optimize for the business problem. This becomes a Customer Lifetime Value optimization problem that is discussed further in Next Steps.

F1 Score

We can also get the F1-score, which is the harmonic mean of precision and recall. Machine learning classifier thresholds are often adjusted to maximize the F1-score. However, this is often not the optimal solution to the business problem.
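Since F1 is the harmonic mean of precision and recall, we can sanity-check the value below by hand using the precision and recall computed earlier:

```r
# F1 as the harmonic mean of the precision and recall found above
precision <- 0.8550855
recall    <- 0.9056244
f1 <- 2 * precision * recall / (precision + recall)
f1  # ~0.8796, matching f_meas()
```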

# F1-Statistic
estimates_keras_tbl %>% f_meas(truth, estimate, beta = 1)
## [1] 0.8796296

Explain The Model With LIME

LIME stands for Local Interpretable Model-agnostic Explanations, and is a method for explaining black-box machine learning model classifiers. For those new to LIME, this YouTube video does a really nice job explaining how LIME helps to identify feature importance with black box machine learning models (e.g. deep learning, stacked ensembles, random forest).


The lime package implements LIME in R. One thing to note is that it's not set up out-of-the-box to work with keras. The good news is that with a few functions we can get everything working properly. We'll need to make two custom functions:

  • model_type: Used to tell lime what type of model we are dealing with. It could be classification, regression, survival, etc.

  • predict_model: Used to allow lime to perform predictions that its algorithm can interpret.

The first thing we need to do is identify the class of our model object. We do this with the class() function.

class(model_keras)
## [1] "keras.models.Sequential"
## [2] ""
## [3] "keras.engine.topology.Container"
## [4] "keras.engine.topology.Layer"
## [5] "python.builtin.object"

Next we create our model_type() function. Its only input is x, the keras model. The function simply returns "classification", which tells LIME we are classifying.

# Setup lime::model_type() function for keras
model_type.keras.models.Sequential <- function(x, ...) {
    return("classification")
}

Now we can create our predict_model() function, which wraps keras::predict_proba(). The trick here is to realize that its inputs must be x, a model; newdata, a data frame object (this is important); and type, which is not used here but can be used to switch the output type. The output is also a little tricky because it must be in the format of probabilities by classification (this is important; shown next).

# Setup lime::predict_model() function for keras
predict_model.keras.models.Sequential <- function(x, newdata, type, ...) {
    pred <- predict_proba(object = x, x = as.matrix(newdata))
    return(data.frame(Yes = pred, No = 1 - pred))
}

Run this next script to test our predict_model() function and see what the output looks like. Note that it returns the probabilities by classification. It must be in this form for model_type = "classification".

# Test our predict_model() function
predict_model(x = model_keras, newdata = x_test_tbl, type = 'raw') %>%
    tibble::as_tibble()
## # A tibble: 1,406 x 2
##            Yes        No
##          <dbl>     <dbl>
##  1 0.328355074 0.6716449
##  2 0.633630514 0.3663695
##  3 0.004589651 0.9954103
##  4 0.007402068 0.9925979
##  5 0.049968336 0.9500317
##  6 0.116824441 0.8831756
##  7 0.775479317 0.2245207
##  8 0.492996633 0.5070034
##  9 0.011550998 0.9884490
## 10 0.004276015 0.9957240
## # ... with 1,396 more rows

Now the fun part, we create an explainer using the lime() function. Just pass the training data set without the response ("Churn") column. The form must be a data frame, which is OK since our predict_model() function will switch it to a keras object. Set model = model_keras, our model, and bin_continuous = FALSE. We could tell the algorithm to bin continuous variables, but this may not make sense for categorical numeric data that we didn't change to factors.

# Run lime() on training set
explainer <- lime::lime(
    x              = x_train_tbl,
    model          = model_keras,
    bin_continuous = FALSE)

Now we run the explain() function, which returns our explanation. This can take a minute to run so we limit it to just the first ten rows of the test data set. We set n_labels = 1 because we care about explaining a single class. Setting n_features = 4 returns the top four features that are critical to each case. Finally, setting kernel_width = 0.5 allows us to increase the “model_r2” value by shrinking the localized evaluation.

# Run explain() on explainer
explanation <- lime::explain(
    x_test_tbl[1:10, ],
    explainer    = explainer,
    n_labels     = 1,
    n_features   = 4,
    kernel_width = 0.5)

Feature Importance Visualization

The payoff for the work we put in using LIME is this feature importance plot. It allows us to visualize each of the first ten cases (observations) from the test data. The top four features for each case are shown. Note that they are not the same for each case. The green bars mean that the feature supports the model conclusion, and the red bars contradict it. A few important features, based on frequency in the first ten cases:

  • Tenure (7 cases)
  • Senior Citizen (5 cases)
  • Online Security (4 cases)
plot_features(explanation) +
    labs(title = "LIME Feature Importance Visualization",
         subtitle = "Hold Out (Test) Set, First 10 Cases Shown")

Another excellent visualization can be performed using plot_explanations(), which produces a facetted heatmap of all case/label/feature combinations. It's a more condensed version of plot_features(), but we need to be careful because it does not provide exact statistics and it makes binned features harder to investigate (notice that "tenure" would not be identified as a contributor even though it shows up as a top feature in 7 of 10 cases).

plot_explanations(explanation) +
    labs(title = "LIME Feature Importance Heatmap",
         subtitle = "Hold Out (Test) Set, First 10 Cases Shown")

Check Explanations With Correlation Analysis

One thing we need to be careful with in the LIME visualization is that we are only using a sample of the data, in our case the first 10 test observations. Therefore, we are gaining a very localized understanding of how the ANN works. However, we also want to know, from a global perspective, what drives feature importance.

We can perform a correlation analysis on the training set as well, to help glean which features correlate globally with "Churn". We'll use the corrr package, which performs tidy correlations with the function correlate(). We can get the correlations as follows.

# Feature correlations to Churn
corrr_analysis <- x_train_tbl %>%
    mutate(Churn = y_train_vec) %>%
    correlate() %>%
    focus(Churn) %>%
    rename(feature = rowname) %>%
    arrange(abs(Churn)) %>%
    mutate(feature = as_factor(feature))
corrr_analysis
## # A tibble: 35 x 2
##                       feature        Churn
##  1                gender_Male -0.006690899
##  2                tenure_bin3 -0.009557165
##  3                            -0.016950072
##  4           PhoneService_Yes  0.016950072
##  5          MultipleLines_Yes  0.032103354
##  6            StreamingTV_Yes  0.066192594
##  7        StreamingMovies_Yes  0.067643871
##  8       DeviceProtection_Yes -0.073301197
##  9                tenure_bin4 -0.073371838
## 10 PaymentMethod_Mailed.check -0.080451164
## # ... with 25 more rows

The correlation visualization helps in distinguishing which features are relevant to Churn.

# Correlation visualization
corrr_analysis %>%
    ggplot(aes(x = Churn, y = fct_reorder(feature, desc(Churn)))) +
    geom_point() +
    # Positive Correlations - Contribute to churn
    geom_segment(aes(xend = 0, yend = feature),
                 color = palette_light()[[2]],
                 data = corrr_analysis %>% filter(Churn > 0)) +
    geom_point(color = palette_light()[[2]],
               data = corrr_analysis %>% filter(Churn > 0)) +
    # Negative Correlations - Prevent churn
    geom_segment(aes(xend = 0, yend = feature),
                 color = palette_light()[[1]],
                 data = corrr_analysis %>% filter(Churn < 0)) +
    geom_point(color = palette_light()[[1]],
               data = corrr_analysis %>% filter(Churn < 0)) +
    # Vertical lines
    geom_vline(xintercept = 0, color = palette_light()[[5]],
               size = 1, linetype = 2) +
    geom_vline(xintercept = -0.25, color = palette_light()[[5]],
               size = 1, linetype = 2) +
    geom_vline(xintercept = 0.25, color = palette_light()[[5]],
               size = 1, linetype = 2) +
    # Aesthetics
    theme_tq() +
    labs(title = "Churn Correlation Analysis",
         subtitle = "Positive Correlations (contribute to churn), Negative Correlations (prevent churn)",
         y = "Feature Importance")

The correlation analysis helps us quickly discern which features the LIME analysis may be excluding. We can see that the following features are highly correlated (magnitude > 0.25):

Increases Likelihood of Churn (Red):

  • Tenure = Bin 1 (<12 Months)
  • Internet Service = “Fiber Optic”
  • Payment Method = “Electronic Check”

Decreases Likelihood of Churn (Blue):

  • Contract = “Two Year”
  • Total Charges (Note that this may be a byproduct of additional services such as Online Security)
Feature Investigation

We can investigate features that are most frequent in the LIME feature importance visualization along with those that the correlation analysis shows an above normal magnitude. We’ll investigate:

  • Tenure (7/10 LIME Cases, Highly Correlated)
  • Contract (Highly Correlated)
  • Internet Service (Highly Correlated)
  • Payment Method (Highly Correlated)
  • Senior Citizen (5/10 LIME Cases)
  • Online Security (4/10 LIME Cases)
Tenure (7/10 LIME Cases, Highly Correlated)

LIME cases indicate that the ANN model is using this feature frequently and high correlation agrees that this is important. Investigating the feature distribution, it appears that customers with lower tenure (bin 1) are more likely to leave. Opportunity: Target customers with less than 12 month tenure.

Contract (Highly Correlated)

While LIME did not indicate this as a primary feature in the first 10 cases, the feature is clearly correlated with those electing to stay. Customers with one and two year contracts are much less likely to churn. Opportunity: Offer promotion to switch to long term contracts.

Internet Service (Highly Correlated)

While LIME did not indicate this as a primary feature in the first 10 cases, the feature is clearly correlated with those electing to stay. Customers with fiber optic service are more likely to churn while those with no internet service are less likely to churn. Improvement Area: Customers may be dissatisfied with fiber optic service.

Payment Method (Highly Correlated)

While LIME did not indicate this as a primary feature in the first 10 cases, the feature is clearly correlated with those electing to stay. Customers with electronic check are more likely to leave. Opportunity: Offer customers a promotion to switch to automatic payments.

Senior Citizen (5/10 LIME Cases)

Senior citizen appeared in several of the LIME cases, indicating it was important to the ANN for the 10 samples. However, it was not highly correlated with Churn, which may indicate that the ANN is using it in a more sophisticated manner (e.g. as an interaction). It's difficult to say that senior citizens are more likely to leave, but non-senior citizens appear less at risk of churning. Opportunity: Target users in the lower age demographic.

Online Security (4/10 LIME Cases)

Customers that did not sign up for online security were more likely to leave while customers with no internet service or online security were less likely to leave. Opportunity: Promote online security and other packages that increase retention rates.

Next Steps: Business Science University

We’ve just scratched the surface with the solution to this problem, but unfortunately there’s only so much ground we can cover in an article. Here are a few next steps that I’m pleased to announce will be covered in a Business Science University course coming in 2018!

Customer Lifetime Value

Your organization needs to see the financial benefit, so always tie your analysis to sales, profitability or ROI. Customer Lifetime Value (CLV) is a methodology that ties business profitability to the retention rate. While we did not implement the CLV methodology herein, a full customer churn analysis would tie the churn predictions to a classification cutoff (threshold) optimization to maximize the CLV with the predictive ANN model.

The simplified CLV model is:

CLV = GC * r / (1 + d - r), where:

  • GC is the gross contribution per customer
  • d is the annual discount rate
  • r is the retention rate
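Assuming the common simplified form CLV = GC * r / (1 + d - r), the sensitivity of CLV to the retention rate is easy to sketch; the input values below are made up for illustration:

```r
# Simplified Customer Lifetime Value: gross contribution GC, annual
# discount rate d, retention rate r
clv <- function(GC, d, r) GC * r / (1 + d - r)

clv(GC = 500, d = 0.10, r = 0.80)  # $500 yearly contribution -> ~ $1333
clv(GC = 500, d = 0.10, r = 0.90)  # higher retention -> ~ $2250
```

Even a modest lift in retention compounds into a large lift in CLV, which is why threshold optimization is framed around CLV rather than raw accuracy.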
ANN Performance Evaluation and Improvement

The ANN model we built is good, but it could be better. How we understand our model accuracy and improve on it is through the combination of two techniques:

  • K-Fold Cross Validation: Used to obtain bounds on the accuracy estimates.
  • Hyper Parameter Tuning: Used to improve model performance by searching for the best parameters possible.

We need to implement K-Fold Cross Validation and Hyper Parameter Tuning if we want a best-in-class model.
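As a rough sketch of what K-Fold Cross Validation buys us, here is a self-contained base-R example; the toy data and glm model are stand-ins for the real data and the keras ANN, and base R stands in for the rsample tooling:

```r
# 5-fold cross validation: fit on 4 folds, score on the held-out fold,
# and use the fold-to-fold spread as bounds on the accuracy estimate
set.seed(100)
n     <- 200
x     <- runif(n)
y     <- as.numeric(x + rnorm(n, sd = 0.3) > 0.5)
folds <- sample(rep(1:5, length.out = n))  # assign each row to a fold

accuracy <- sapply(1:5, function(k) {
    fit  <- glm(y[folds != k] ~ x[folds != k], family = binomial)
    prob <- plogis(coef(fit)[1] + coef(fit)[2] * x[folds == k])
    mean((prob > 0.5) == (y[folds == k] == 1))
})

mean(accuracy)  # point estimate of out-of-sample accuracy
sd(accuracy)    # fold-to-fold variability -> accuracy bounds
```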

Distributing Analytics

It’s critical to communicate data science insights to decision makers in the organization. Most decision makers in organizations are not data scientists, but these individuals make important decisions on a day-to-day basis. The PowerBI application below includes a Customer Scorecard to monitor customer health (risk of churn). The application walks the user through the machine learning journey for how the model was developed, what it means to stakeholders, and how it can be used in production.

View in Full Screen Mode for best experience

For those seeking options for distributing analytics, two good options are:

  • Shiny Apps for rapid prototyping: Shiny web applications offer the maximum flexibility with R algorithms built in. Shiny is more complex to learn, but Shiny applications are incredible / limitless.

  • Microsoft PowerBI and Tableau for Visualization: Enable distributed analytics with the advantage of an intuitive structure, but with some flexibility sacrificed. Can be difficult to build ML into.

Business Science University

You’re probably wondering why we are going into so much detail on next steps. We are happy to announce a new project for 2018: Business Science University, an online school dedicated to helping data science learners improve in the areas of:

  • Advanced machine learning techniques
  • Developing interactive data products and applications using Shiny and other tools
  • Distributing data science within an organization

Learning paths will be focused on business and financial applications. We’ll keep you posted via social media and our blog (please follow us / subscribe to stay updated).

Please let us know if you are interested in joining Business Science University. Let us know what you think in the Disqus Comments below.


Customer churn is a costly problem. The good news is that machine learning can solve churn problems, making the organization more profitable in the process. In this article, we saw how Deep Learning can be used to predict customer churn. We built an ANN model using the new keras package that achieved 82% predictive accuracy (without tuning)! We used three new machine learning packages to help with preprocessing and measuring performance: recipes, rsample and yardstick. Finally, we used lime to explain the Deep Learning model, which traditionally was impossible! We checked the LIME results with a Correlation Analysis, which brought to light other features to investigate. For the IBM Telco dataset, tenure, contract type, internet service type, payment method, senior citizen status, and online security status were useful in diagnosing customer churn. We hope you enjoyed this article!

About Business Science

Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business and financial applications. We help businesses that seek to add this competitive advantage but may not have the resources currently to implement predictive analytics. Business Science can help to expand into predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!

Follow Business Science on Social Media

To leave a comment for the author, please follow the link and comment on their blog: - Articles. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

changes: easy Git-based version control from R

Tue, 11/28/2017 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Are you new to version control and always running into trouble with Git?
Or are you a seasoned user, haunted by the traumas of learning Git and reliving them whilst trying to teach it to others?
Yeah, us too.

Git is a version control tool designed for software development, and it is extraordinarily powerful. It didn’t actually dawn on me quite how amazing Git is until I spent a weekend in Melbourne with a group of Git whizzes using Git to write a package targeted toward Git beginners. Whew, talk about total Git immersion! I was taking part in the 2017 rOpenSci ozunconf, in which forty-odd developers, scientists, researchers, nerds, teachers, starving students, cat ladies, and R users of all descriptions form teams to create new R packages fulfilling some new and useful function. Many of the groups used Git for their collaborative workflows all weekend.

Unfortunately, just like many a programming framework, Git can often be a teensy bit (read: extremely, prohibitively) intimidating, especially for beginners who don’t need all of Git’s numerous and baffling features.
It’s one of those platforms that makes your life a million times better once you know how to use it, but if you’re trying to teach yourself the basics using the internet, or—heaven forbid—trying to untangle yourself from some Git-branch tangle that you’ve unwittingly become snarled in… (definitely done that one…) well, let’s just say using your knuckles to break a brick wall can sometimes seem preferable.
Just ask the Git whizzes.
They laugh, because they’ve been there, done that.

The funny thing is, doing basic version control in Git only requires a few commands.
After browsing through the available project ideas and settling into teams, a group of eight of us made a list of the commands that we use on a daily basis, and the list came to about a dozen.
We looked up our Git histories and compiled a Git vocabulary, which came out to less than 50 commands, including combination commands.

As Nick Golding so shrewdly recognized in the lead up to this year’s unconference, the real obstacle for new Git users is not the syntax, it’s actually (a) the scary, scary terminal window and (b) the fact that Git terminology was apparently chosen by randomly opening a verb dictionary and blindly pointing to a spot on the page.
(Ok, I’m exaggerating, but the point is that the terminology is pretty confusing).
We decided to address these two problems by making a package that uses the R console and reimagining the version control vocabulary and workflow for people who are new to version control and only need some of its many features.

Somewhat ironically, nine people worked for two days on a dozen branches, using Git and GitHub to seamlessly merge our workflows.
It was wonderful to see how so many people’s various talents can be combined to make something that no group members could have done all on their own.

Enter changes (repo, website – made using pkgdown), our new R package to do version control with a few simple commands.
It uses Git and Git2r under the hood, but new users don’t need to know any Git to begin using version control with changes.
Best of all, it works seamlessly with regular Git. So if a user thinks they’re ready to expand their horizons they can start using git commands via the Githug package, RStudio’s git interface, or on the command line.

Here is an overview of some of the ways we’ve made simple version control easy with changes:

Simple terminology

It uses simple and deliberately un-git-like terminology:

  • You start a new version control project with create_repo(), which is like git init but it can set up a nice project directory structure for you, automatically ignoring things like output folders.
  • All of the steps involved in committing edits have been compressed into one function: record(). All files that aren't ignored will be committed, so users don't need to know the difference between tracking, staging and committing files.
  • It’s easy to set which files to omit from version control with ignore(), and to change your mind with unignore().
  • changes() lets you know which files have changed since the last record, like a hybrid of git status and git diff.
  • You can look back in history with timeline() (a simplified version of git log), go_to() a previous record (like git checkout), and scrub() any unwanted changes since the last record (like git reset --hard).
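Putting those commands together, a first session with changes might look something like this (a hypothetical sketch: the file edits and exact argument forms are illustrative guesses, not the package's documented API):

```r
library(changes)

create_repo("my_analysis")      # like git init, plus a tidy project structure
# ... edit some files under my_analysis/ ...
changes()                       # what changed since the last record?
record("cleaned the raw data")  # track + stage + commit, all in one step
timeline()                      # linear history with simple version numbers
go_to(2)                        # revisit record 2, no SHAs required
scrub()                         # discard any unrecorded changes
```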
It’s linear

After a long discussion, we decided that changes won’t provide an interface to Git branches (at least not yet), as the merge conflicts it leads to are one of the scariest things about version control for beginners.
With linear version control, users can easily go_to() a past record using a version number, rather than unfamiliar SHAs. These numbers appear in a lovely visual representation of their timeline():

(1) initial commit
 |      2017-11-18 02:55
(2) set up project structure
 |      2017-11-18 02:55
(3) added stuff to readme
        2017-11-18 02:55

If you want to roll your project back to a previous record, you can retrieve() it, and changes will simply append that record at the top of your timeline (storing all the later records, just in case).

Readable messages and automatic reminders

Some of Git’s messages and helpfiles are totally cryptic to all but the most hardened computer scientists.
Having been confronted with our fair share of detached HEADs and offers to update remote refs along with associated objects, we were keen to make sure all the error messages and helpfiles in changes are as intuitive and understandable as possible.

It can also be hard to get into the swing of recording edits, so changes will give you reminders to encourage you to use record() regularly. You can change the time interval for reminders, or switch them off, using remind_me().

Coming soon

We made a lot of progress in two days, but there’s plenty more we’re planning to add soon:

  1. Simplified access to GitHub with a sync() command to automagically handle most uses of git fetch, git merge, and git push.
  2. A Git training-wheels mode, so that people who want to move to using Git can view the Git commands changes is using under the hood.
  3. Added flexibility – we are working on adding functionality to handle simple deviations from the defaults, such as recording changes only to named files, or to all except some excluded files.

We’d be really keen to hear your suggestions too, so please let us know your ideas via the changes issue tracker!

I have only recently started using Git and GitHub, and this year’s rOpenSci ozunconf was a big eye-opener for me, in several ways.
Beyond finally understanding the power of proper version control, I met a group of wonderful people dedicated to participating in the R community.
Now as it turns out, R users take the word “community” very seriously.
Each and every person I met during the event was open and friendly.
Each person had ideas for attracting new users to R, making it easier to learn, making methods and data more readily available, and creating innovative new functionality.
Even before the workshop began, dozens of ideas for advancement circulated on GitHub Issues.
Throughout the conference, it was a pleasure to be a part of the ongoing conversation and dialogue about growing and improving the R community.
That’s right, you can delete any lingering ‘introverted computer geek’ stereotypes you might still be harbouring in a cobwebbed attic of your mind.
In today’s day and age, programming is as much about helping each other, communicating, learning, and networking as it is about solving problems.
And building the community is a group effort.

R users come from all sorts of backgrounds, but I was gratified to see scientists and researchers well-represented at the unconference.
Gone are the days when I need to feel like the ugly duckling for being the only R user in my biology lab!
If you still find yourself isolated, join the blooming online R users community, or any one of a number of meetups and clubs that are popping up everywhere.
I have dipped my toe in those waters, and boy am I glad I did!


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

sliced Wasserstein estimation of mixtures

Tue, 11/28/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A paper by Soheil Kolouri and co-authors was arXived last week about using Wasserstein distance for inference on multivariate Gaussian mixtures. The basic concept is that the parameter is estimated by minimising the p-Wasserstein distance to the empirical distribution, smoothed by a Normal kernel. As the general Wasserstein distance is quite costly to compute, the approach relies on a sliced version, which means computing the Wasserstein distance between one-dimensional projections of the distributions. Optimising over the directions is an additional computational constraint.
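To make the sliced construction concrete, here is a small self-contained sketch (my own code, not the paper's): project both samples onto random directions, use the closed-form one-dimensional Wasserstein distance between sorted samples, and average over directions.

```r
set.seed(100)
x <- matrix(rnorm(2000), ncol = 2)            # 1000 draws from N(0, I)
y <- matrix(rnorm(2000, mean = 2), ncol = 2)  # 1000 draws from N((2,2), I)

# 1-D p-Wasserstein distance between equal-sized samples is closed-form:
# match the sorted values
wasser1d <- function(a, b, p = 2) mean(abs(sort(a) - sort(b))^p)^(1 / p)

# sliced p-Wasserstein: average W_p^p over random projection directions
sliced <- function(x, y, ndir = 100, p = 2) {
    d <- replicate(ndir, {
        theta <- runif(1, 0, 2 * pi)
        u <- c(cos(theta), sin(theta))        # random direction on the circle
        wasser1d(x %*% u, y %*% u, p)^p
    })
    mean(d)^(1 / p)
}

sliced(x, y)  # close to 2 for these two Gaussians
```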

“To fit a finite GMM to the observed data, one is required to answer the following questions: 1) how to estimate the number of mixture components needed to represent the data, and 2) how to estimate the parameters of the mixture components.”

The paper contains a most puzzling comment opposing maximum likelihood estimation to minimum Wasserstein distance estimation on the basis that the latter would not suffer from multimodality. This sounds incorrect, as the multimodality of a mixture model (likelihood) stems from the lack of identifiability of the parameters. If all permutations of these parameters induce exactly the same distribution, they all stand at the same distance from the data distribution, whatever the distance is. Furthermore, the above tartan-like picture clashes with the representation of the log-likelihood of a Normal mixture, as exemplified by the picture below based on a 150-point sample with means 0 and 2, same unit variance, and weights 0.3 and 0.7, which shows a smooth if bimodal structure.

And for the same dataset, my attempt at producing a Wasserstein “energy landscape” does return a multimodal structure (this is the surface of minus the logarithm of the 2-Wasserstein distance):

“Jin et al. proved that with random initialization, the EM algorithm will converge to a bad critical point with high probability.”

This statement is most curious in that the “probability” in the assessment must depend on the choice of the random initialisation, hence on a sort of prior distribution that is not made explicit in the paper. Which remains blissfully unaware of Bayesian approaches.

Another [minor mode] puzzling statement is that the p-Wasserstein distance is defined on the space of probability measures with finite p-th moment, which does not make much sense when what matters is rather the finiteness of the expectation of the distance d(X,Y) raised to the power p. A lot of the maths details either do not make sense or seem superfluous.
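For readers who want to reproduce the smooth if bimodal log-likelihood surface mentioned above, a minimal sketch (my own simulation with the stated weights, means and unit variances, not the exact code behind the pictures):

```r
set.seed(100)
n <- 150
# sample from the mixture 0.3 N(0,1) + 0.7 N(2,1)
x <- ifelse(runif(n) < 0.3, rnorm(n), rnorm(n, mean = 2))

# log-likelihood in the two means, weights and variances held fixed
loglik <- function(mu1, mu2)
    sum(log(0.3 * dnorm(x, mu1) + 0.7 * dnorm(x, mu2)))

mu <- seq(-2, 4, length.out = 100)
ll <- outer(mu, mu, Vectorize(loglik))
image(mu, mu, ll, xlab = expression(mu[1]), ylab = expression(mu[2]))
# two smooth modes, near (0, 2) and (2, 0), reflecting label switching
```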

Filed under: Books, pictures, R, Statistics Tagged: arXiv, EM algorithm, finite mixtures, label switching, log-likelihood, multimodality, Wasserstein distance


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

A two-stage workflow for data science projects

Mon, 11/27/2017 - 22:27

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

If you are a data scientist who primarily works with R, chances are you had no formal training in software development. I certainly did not pick up many skills in that direction during my statistics masters. For years my workflow was basically: load a dataset and hack away on it. In the best case my R script came to some kind of conclusion or final data set, but usually it abruptly ended. Complex projects could result in a great number of scripts and data exports. Needless to say, reproducibility was typically low and stress could run high when some delivery went wrong.

Picking up software skills

Colleagues pointed me to some customs in software design, such as creating functions and classes (and documenting them), writing unit tests, and packaging up your work. I started to read about these concepts and was soon working my way through Hadley's R packages to see how it was done in R. I came to adopt a rigorous system of solely writing functions, and thoroughly documenting and unit testing them. No matter the nature of the project I was working on, I treated it as a package that should be published on CRAN. This slowed me down considerably, though. I needed increasing amounts of time to answer relatively simple questions. Maybe now I was over-engineering, restricting myself in such a way that I was losing productivity.

Data science product

In my opinion, the tension between efficiency and rigor in data analysis exists because the data scientist's product usually is not software, but insight. This goal is not always served best by writing code that meets the highest software standards, because this can make an insight more expensive than necessary, time-wise. A software engineer's job is much more clear-cut. His product is software, and the better his software, the better he does his job. However, when a data scientist is asked if she can deliver an exploratory analysis on a new data set, should she create a full software stack for it right away? I would say not. Of course clean, well-documented scripts will lead to greater reproducibility of, and confidence in, her results. However, solely writing functions and classes and putting a load of unit tests on them is overkill at this stage.

A two-stage workflow

I came to a workflow, with which I am quite happy. It has a hard split between the exploratory stage and the product stage of a project.

Exploratory stage

Starting an analysis, I want to get to first results as quickly as possible. R markdown files are great for this; in them you can take tons of notes about the quality of the data, assumptions tested, relations found, and things that are not yet clear. Creating functions at this stage can already be worthwhile: functions to do transformations, create specific plots or test for specific relationships. I aim to create functions that are as generic as is feasible from the start. It is my experience that data is often refreshed or adjusted during a project, even when this was not anticipated at the start of the project. Don't hardcode names of data frames and columns, but rather give them as arguments to functions, optionally with the current names as their default values. These functions can be quickly lifted to the product stage. Functions are only written, though, if they facilitate the exploration. If not, I am happy to analyse the data with ordinary R code. Typically, the exploratory stage is wrapped up with a report in which the major insights and next steps are described. More often than not, the exploratory stage is the only part of a project, as it is concluded that no further action should be taken on the topic researched. The quicker we can come to such a conclusion, the fewer resources are wasted.
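To illustrate the point about passing names as arguments rather than hardcoding them, a made-up example:

```r
# Take the data frame and column names as arguments, with the current
# names as defaults, so a refreshed or renamed dataset only changes the call
summarise_by_group <- function(data, value_col = "revenue",
                               group_col = "region") {
    tapply(data[[value_col]], data[[group_col]], mean)
}

sales <- data.frame(revenue = c(10, 20, 30, 40),
                    region  = c("N", "S", "N", "S"))
summarise_by_group(sales)                       # uses the defaults
summarise_by_group(sales, "revenue", "region")  # explicit, survives renames
```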

Product stage

A product, to me, is everything designed to share insights, model predictions, and data with others. Whenever a product gets built, whether it’s a Shiny app, an export of model predictions, or a daily updated report, the product stage is entered. In more ignorant days I used to slide gradually from the exploratory stage into the product stage, in the worst cases commenting out the parts of the code that were not needed for building the product. This got increasingly messy, of course, yielding non-reproducible results and projects that ground to a halt. Nowadays, as soon as a product gets built, I start over completely. Here the software rigor kicks in. All code written in this stage serves only the final product; no explorations or asides allowed. I use R’s package structure, writing functions that work together to create the product, including documentation and unit tests. When functions were already written in the exploratory stage, many components of the product are available at the start of this stage.
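As a hedged illustration of that package-stage rigor (clean_income() is a made-up example function, not from the author’s code), a unit test with testthat might look like:

```r
library(testthat)

# Hypothetical product function: parse raw income strings into numbers.
clean_income <- function(x) {
  as.numeric(gsub("[^0-9.]", "", x))
}

test_that("clean_income strips currency symbols and separators", {
  expect_equal(clean_income(c("$1,200", "950")), c(1200, 950))
})
```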

Further cycles

When, during the product stage, it becomes evident that further research is needed, switch back to the exploratory code base. I think it is crucial to be very strict about this: no mess in the product repository! The product directory contains only the code needed for the product to work. When new insight requires the product to change, maybe updating a model or adding an extra tab to a dashboard, adjust the product; again, only code necessary for the product to function is allowed. Legacy code that is no longer needed should be removed, not commented out. This workflow prevents a large set of scripts with a great number of models, many exported data sets, several versions of a Shiny app, and so on. The product repository always contains the latest version and nothing more.

I hope these hard-learned lessons might be of some value to some of you. I am very curious what systems others have designed. Please share them in the comments of this blog, on Twitter, or in your own blog and notify me.


To leave a comment for the author, please follow the link and comment on their blog: That’s so Random. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

voteogram Is Now On CRAN

Mon, 11/27/2017 - 21:38

(This article was first published on R –, and kindly contributed to R-bloggers)

Earlier this year, I made a package that riffed off of ProPublica’s really neat voting cartograms (maps) for the U.S. House and Senate. You can see one for disaster relief spending in the House and one for the ACA “Skinny Repeal” in the Senate.

We can replicate both here with the voteogram package (minus the interactivity, for now):

library(voteogram)
library(ggplot2)

hr_566 <- roll_call(critter="house", number=115, session=1, rcall=566)

house_carto(hr_566) +
  coord_equal() +
  theme_voteogram()

sen_179 <- roll_call(critter="senate", number=115, session=1, rcall=179)

senate_carto(sen_179) +
  coord_equal() +
  theme_voteogram()

I think folks might have more fun with the roll_call() objects though:

str(hr_566)
## List of 29
##  $ vote_id              : chr "H_115_1_566"
##  $ chamber              : chr "House"
##  $ year                 : int 2017
##  $ congress             : chr "115"
##  $ session              : chr "1"
##  $ roll_call            : int 566
##  $ needed_to_pass       : int 282
##  $ date_of_vote         : chr "October 12, 2017"
##  $ time_of_vote         : chr "03:23 PM"
##  $ result               : chr "Passed"
##  $ vote_type            : chr "2/3 YEA-AND-NAY"
##  $ question             : chr "On Motion to Suspend the Rules and Agree"
##  $ description          : chr "Providing for the concurrence by the House in the Senate amendment to H.R. 2266, with an amendment"
##  $ nyt_title            : chr "On Motion to Suspend the Rules and Agree"
##  $ total_yes            : int 353
##  $ total_no             : int 69
##  $ total_not_voting     : int 11
##  $ gop_yes              : int 164
##  $ gop_no               : int 69
##  $ gop_not_voting       : int 7
##  $ dem_yes              : int 189
##  $ dem_no               : int 0
##  $ dem_not_voting       : int 5
##  $ ind_yes              : int 0
##  $ ind_no               : int 0
##  $ ind_not_voting       : int 0
##  $ dem_majority_position: chr "Yes"
##  $ gop_majority_position: chr "Yes"
##  $ votes                :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 435 obs. of 11 variables:
##   ..$ bioguide_id         : chr [1:435] "A000374" "A000370" "A000055" "A000371" ...
##   ..$ role_id             : int [1:435] 274 294 224 427 268 131 388 320 590 206 ...
##   ..$ member_name         : chr [1:435] "Ralph Abraham" "Alma Adams" "Robert B. Aderholt" "Pete Aguilar" ...
##   ..$ sort_name           : chr [1:435] "Abraham" "Adams" "Aderholt" "Aguilar" ...
##   ..$ party               : chr [1:435] "R" "D" "R" "D" ...
##   ..$ state_abbrev        : chr [1:435] "LA" "NC" "AL" "CA" ...
##   ..$ display_state_abbrev: chr [1:435] "La." "N.C." "Ala." "Calif." ...
##   ..$ district            : int [1:435] 5 12 4 31 12 3 2 19 36 2 ...
##   ..$ position            : chr [1:435] "Yes" "Yes" "Yes" "Yes" ...
##   ..$ dw_nominate         : num [1:435] 0.493 -0.462 0.36 -0.273 0.614 0.684 0.388 NA 0.716 NA ...
##   ..$ pp_id               : chr [1:435] "LA_5" "NC_12" "AL_4" "CA_31" ...
##  - attr(*, "class")= chr [1:2] "pprc" "list"

as they hold tons of info on the votes.

I need to explore the following a bit more but there are some definite “patterns” in the way the 115th Senate has voted this year:

library(hrbrthemes)
library(dplyr)
library(purrr)
library(readr)

# I made a mistake in how I exposed these that I'll correct next month
# but we need to munge it a bit anyway for this view
fills <- voteogram:::vote_carto_fill
names(fills) <- tolower(names(fills))

rcalls <- map(1:280, ~voteogram::roll_call(critter="senate", session=1, number=115, rcall=.x))

# save it off so you don't have to waste those calls again
write_rds(rcalls, "2017-115-1-sen-280-roll-calls.rds")

# do a bit of wrangling
map_df(rcalls, ~{
  mutate(.x$votes, vote_id = .x$vote_id) %>%
    arrange(party, position) %>%
    mutate(fill = tolower(sprintf("%s-%s", party, position))) %>%
    mutate(ques = .x$question) %>%
    mutate(x = 1:n())
}) -> votes_df

# plot it
ggplot(votes_df, aes(x=x, y=vote_id, fill=fill)) +
  geom_tile() +
  scale_x_discrete(expand=c(0,0)) +
  scale_y_discrete(expand=c(0,0)) +
  scale_fill_manual(name=NULL, values=fills) +
  labs(x=NULL, y=NULL,
       title="Senate Roll Call Votes",
       subtitle="2017 / 115th, Session 1, Votes 1-280",
       caption="Note free-Y scales") +
  facet_wrap(~ques, scales="free_y", ncol=3) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank()) +
  theme(legend.position="right")

Hopefully I’ll get some time to dig into the differences and report on anything interesting. If you get to it before me definitely link to your blog post in a comment!


I still want to make an htmlwidgets version of the plots and also add the ability to get the index of roll call votes by Congress number and session to make it easier to iterate.

I’m also seriously considering creating different palettes. I used the ones from the source interactive site but am not 100% happy with them. Suggestions/PRs welcome.

Hopefully this package will make it easier for U.S. folks to track what’s going on in Congress and keep their representatives more accountable to the truth.

Everything’s on GitHub so please file issues, questions or PRs there.


To leave a comment for the author, please follow the link and comment on their blog: R –.

Scatter plots in survey sampling

Mon, 11/27/2017 - 21:35

(This article was first published on Data Literacy - The blog of Andrés Gutiérrez, and kindly contributed to R-bloggers)

You can find this post in Blogdown format by clicking here

When it comes to analyzing survey data, you have to take into account the stochastic structure of the sample that was selected to obtain the data. Plots and graphics should be no exception. The main aim of such studies is to infer the behavior of the outcomes of interest in the finite population.

For example, you may want to know how many people are in poverty. How about just counting the poor people in the sample? I know, it is not a good idea. You have to use the sampling weights to estimate the total number of poor people in the population. The total error paradigm follows the same approach when estimating any parameter of the finite population: you have to use the sampling weights in order to obtain unbiased estimators.

The idea behind this paradigm is the representative principle. If a person $k$ is included in the sample with a sampling weight $w_k$, he/she represents himself/herself and $w_k - 1$ other people. That way, you obtain design-unbiasedness.
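As a toy sketch of this principle (the values below are made up for illustration), the design-unbiased Horvitz-Thompson estimator of a population total simply weights each sampled value by $w_k = 1/\pi_k$:

```r
# Toy sample: y holds the study variable, pik the inclusion probabilities
y   <- c(10, 20, 30)
pik <- c(0.5, 0.25, 0.1)
wk  <- 1 / pik               # sampling weights

# Horvitz-Thompson estimate of the population total: sum of w_k * y_k
ht_total <- sum(wk * y)
ht_total
# 20 + 80 + 300 = 400
```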

I am not a big fan of using plain descriptive sample statistics to analyze survey data, because they can mask reality: person $k$ is included in the sample, but he/she is not alone. Behind person $k$ there are other people, not included in the sample, and you have to account for them.

So, let’s apply that principle to scatter plots. I am using the Lucy population from the TeachingSampling package to recreate my idea. The following code is used to draw a $\pi PS$ sample.

library(TeachingSampling)
library(ggplot2)
data(Lucy)

# The inclusion probability is proportional to Income
# The selected sample is of size n = 400
n <- 400
res <- S.piPS(n, Lucy$Income)
sam <- res[, 1]
# The sample is stored in the object data.sample
data.sample <- Lucy[sam, ]

The sampling weights will be stored in the data.sample data frame, in the column wk. They will be useful to reproduce our finite population from the sample data.

# Pik.s is the inclusion probability of units in the sample
data.sample$Pik.s <- res[,2]
# wk is the sampling weight
data.sample$wk <- 1/data.sample$Pik.s

Now, let's make a plot of the sample data (with just 400 observations). Recall that this scenario is somewhat misleading, because we want to know the behavior of the variables in the finite population.

ggplot(data = data.sample, aes(x = Income, y = Employees)) +
  geom_point()

The first option that comes to mind is to encode the sampling weights in the points of the scatter plot. However, this approach is not appealing to me, because it is not straightforward to visualize the entire finite population from this plot.

ggplot(data = data.sample, aes(x = Income, y = Employees, size = wk)) +
  geom_point()

In order to make the finite population scatter plot from the survey sample, I will replicate the rows of the data.sample object as many times as the sampling weight wk indicates. I am using the mefa::rep function to achieve this goal. The resulting newLucy object is an attempt to mimic the finite population by using the selected sample.

# mefa provides a rep() method for data frames
library(mefa)

newLucy <- NULL

for(i in 1:nrow(data.sample)){
  newLucy <- rbind(newLucy,
                   rep(data.sample[i, ],
                       # round the weight to a whole number of copies
                       times = round(data.sample$wk[i])))
}

newLucy <- as.data.frame(newLucy)

Now, with the newLucy population, I will make a scatter plot. As I am replicating the rows of the sample data, I will add a jitter to avoid overplotting of the points. This way, the plot (with 2396 observations) looks as if it came from the finite population.

ggplot(data = newLucy, aes(x = Income, y = Employees)) +
geom_point() + geom_jitter(width = 15, height = 15)


To leave a comment for the author, please follow the link and comment on their blog: Data Literacy - The blog of Andrés Gutiérrez.

Image Processing and Manipulation with magick in R

Mon, 11/27/2017 - 16:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

‘ImageMagick’ is one of the most famous open source libraries available for editing and manipulating images of different types (raster and vector). magick is an R package binding to ‘ImageMagick’ for advanced image processing in R, authored by Jeroen Ooms.

magick supports many common image formats, like png, jpeg and tiff, and manipulations like rotate, scale, crop, trim, blur, flip, annotate and much more.

This post is to help you get started with magick to process, edit and manipulate images in R, for anything from simple file format conversion (e.g. png to jpeg) to annotating R graphics output.

magick is available on CRAN and also on rOpenSci's GitHub.

#installing magick package from CRAN
install.packages('magick')

#from Github ropensci library - note: requires RTools
devtools::install_github('ropensci/magick')

Let us load the library and read our first image or images from the internet with image_read()

#Loading magick package
library(magick)

#reading a png image of frink
frink <- image_read("")

#reading a jpg image of ex-President Barack Obama from Wikipedia
obama <- image_read('')

Make sure you have an updated version of the curl package installed for the image reading above to succeed.

We can get basic details of the read images with image_info()

#image details
image_info(obama)
## format width height colorspace filesize
## 1 JPEG 800 999 sRGB 151770

image_info(frink)
## format width height colorspace filesize
## 1 PNG 220 445 sRGB 73494

Communicating with RStudio, magick lets you print/display image in RStudio Viewer pane.

#displaying the image
print(obama)

Whether you are a web developer or putting together a PowerPoint deck, image file format conversion is an operation you end up doing sooner or later, and with magick's image_write() it is literally a one-liner.

#Rendering JPG image into SVG
image_write(obama, path = 'obama.svg', format = 'svg')

Below is the PNG format of Obama Image that we read in from Wikipedia:

But you might be wondering: “Hey! This is just basic image processing. Didn’t you say this was advanced?” Yes, and here comes the advanced stuff, along with the good news that magick supports the pipe %>% operator.

This is what we are going to do:

    1. Take the Original Obama Image and Apply Charcoal Effect to it
    2. Composite Frink Image along with it
    3. Annotate some text – considering Obama would’ve dealt with Confidential data – Let it be CONFIDENTIAL stamped on the image
    4. Finally, Rotate the image slightly and write/save it.
#Applying Charcoal effect to Obama's image
#and compositing it with frink's image
#and finally annotating it with a text
image_charcoal(obama) %>%
  image_composite(frink) %>%
  image_annotate("CONFIDENTIAL", size = 50, color = "red",
                 boxcolor = "pink", degrees = 30,
                 location = "+100+100") %>%
  image_rotate(30) %>%
  image_write('obama_with_frink.png', 'png')

Gives this output:

How does the output image look? Artistic isn’t it ;)!

But this is not what data scientists want to do every day. Instead, we play with plots, and most of the time we want to annotate the R graphics output. Here's how you can do that with magick's image_annotate():

library(ggplot2)

img <- image_graph(600, 400, res = 96)
p <- ggplot(iris) + geom_point(aes(Sepal.Length, Petal.Length))
print(p)
dev.off()  # close the graphics device so img holds the rendered plot

img %>%
  image_annotate("CONFIDENTIAL", size = 50, color = "red",
                 boxcolor = "pink", degrees = 30,
                 location = "+100+100") %>%
  image_write('conf_ggplot.png', 'png')

Gives this output image:

This is nothing but a glimpse of what magick can do. Literally, there is a lot of magic left in magick. Try it out yourself and share your thoughts in comments. The code used here can be found on my github.


    Related Post

    1. Analyzing the Bible and the Quran using Spark
    2. Predict Customer Churn – Logistic Regression, Decision Tree and Random Forest
    3. Find Your Best Customers with Customer Segmentation in Python
    4. Principal Component Analysis – Unsupervised Learning
    5. ARIMA models and Intervention Analysis

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.