R-bloggers
R news and tutorials contributed by hundreds of R bloggers

R Programmers: What is your biggest problem when working with Census data?

Wed, 11/29/2017 - 20:18

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

A few weeks ago I announced my latest project: Data Science instruction at the Census Bureau.

In addition to announcing the project, I also snuck a bit of market research into the post: I asked people what types of analyses they do when working with Census data, and what packages they use when solving those problems.

23 people left comments, and they have been very helpful in shaping the curriculum of the course. Thank you to everyone who left a comment!

That was such an effective way to learn about the community of R Census users that I’ve decided to do it again. If you are an R programmer who has worked with Census data, please leave a comment with an answer to this question:

What is your biggest problem when working with Census data in R?

Understanding the obstacles people face has the potential to help us design better courses.

Leave your answer as a comment below!

The post R Programmers: What is your biggest problem when working with Census data? appeared first on AriLamstein.com.


Win-Vector LLC announces new “big data in R” tools

Wed, 11/29/2017 - 15:25

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the 0.5.0 version of seplyr (also now available on CRAN):

  • partition_mutate_se() / partition_mutate_qt(): these are query planners/optimizers that work over dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or Apache Spark) these planners can make your code faster and sequence steps to avoid critical issues (the complementary problems of overly long in-mutate dependence chains, too many mutate steps, and incidental bugs; all explained in the linked tutorials). A small dplyr illustration of such a dependence chain appears after this list.
  • if_else_device(): provides a dplyr::mutate() based simulation of per-row conditional blocks (including conditional assignment). This allows powerful imperative code (such as is often seen when porting from SAS) to be directly and legibly translated into performant dplyr::mutate() data flow code that works on Spark (via sparklyr) and databases.
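
To make the first bullet concrete, here is a small, hedged illustration in plain dplyr (not the seplyr planner API itself) of the kind of dependence chain the planners split into well-sequenced stages:

library(dplyr)

d <- tibble(x = 1:5)

# one mutate() with a dependence chain: each assignment uses the previous one
d %>% mutate(a = x + 1,
             b = a * 2,
             c = b - x)

# the same calculation split into stages, so that no step refers to a value
# created earlier in the same mutate(); partition_mutate_se()/_qt() work out
# such a split automatically while keeping the number of stages small
d %>%
  mutate(a = x + 1) %>%
  mutate(b = a * 2) %>%
  mutate(c = b - x)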


(Image credit: Jeff Kubina from Columbia, Maryland, CC BY-SA 2.0.)

For “big data in R” users these two function families (plus the included support functions and examples) are simple, yet game changing. These tools were developed by Win-Vector LLC to fill gaps identified by Win-Vector and our partners when standing up production-scale R plus Apache Spark projects.

We are happy to share these tools as open source, and very interested in consulting with your teams on developing R/Spark solutions (including porting existing SAS code). For more information please reach out to Win-Vector.

To help teams get started we are supplying initial documentation, discussion, and examples (linked from the original post).


Gender Diversity Analysis of Data Science Industry using Kaggle Survey Dataset in R

Wed, 11/29/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Kaggle recently released the dataset of an industry-wide survey that it conducted with 16K respondents.

This article aims to understand how the argument of Gender Diversity plays out in Data Science Practice.

Disclaimer: Yes, I understand this dataset is not the output of a randomized experiment, so it cannot be treated as representative of all Data Science practitioners, and it also contains selection bias, which I'm well aware of. Let us proceed with this disclaimer in mind.

Loading required R packages:

# Loading required libraries
library(dplyr)
library(stringr)
library(ggplot2)
library(ggthemes)
library(tidyr)
library(scales)

We are going to work only with the multipleChoiceResponses.csv dataset from the downloaded files, so let us load it into our R environment.

# Load input data
complete_data <- read.csv(".../multipleChoiceResponses.csv", header = TRUE, stringsAsFactors = FALSE)

The very first thing to check is the gender distribution of the respondents.

# Gender distribution
complete_data %>%
  filter(GenderSelect != '') %>%
  group_by(GenderSelect) %>%
  count() %>%
  ggplot(aes(x = GenderSelect, y = (n / sum(n)) * 100)) +
  geom_bar(stat = 'identity') +
  ylab('Percent') +
  theme_solarized() +
  theme(axis.text = element_text(size = 6)) +
  ggtitle('Gender Distribution of Kaggle Survey Respondents')

Gives this plot:

Unsurprisingly, like many other technical domains, Data Science is dominated by men, with more than 80% of respondents being male and less than 20% female. How does the male/female split look across the participants' countries?

complete_data %>%
  filter(GenderSelect %in% c('Male', 'Female')) %>%
  group_by(Country, GenderSelect) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  ggplot() +
  geom_bar(aes(Country, count, fill = GenderSelect), stat = 'identity') +
  # facet_grid(. ~ GenderSelect) +
  theme_solarized() +
  theme(axis.text = element_text(size = 9),
        axis.text.x = element_text(angle = 40, vjust = 0.5, hjust = 0.5)) +
  ggtitle('Country wise Survey Respondends - M/F')

Gives this plot:

With the US being the top country of respondents followed by India, we can see what the male/female split across countries looks like. While this chart represents the distribution of countries well, it doesn't reflect which country is doing well in terms of female representation in Data Science. To understand that, let us create a new KPI, the F2M ratio (Female to Male Ratio %, which can be interpreted as the number of women per 100 men).

complete_data %>%
  filter(GenderSelect %in% c('Male', 'Female') & Country != "") %>%
  group_by(Country, GenderSelect) %>%
  summarise(count = n()) %>%
  spread(GenderSelect, count) %>%
  mutate(F2M = (Female / Male) * 100) %>%
  arrange(desc(F2M)) %>%
  # mutate(F2M = percent(F2M)) %>%
  ggplot() +
  geom_bar(aes(Country, F2M, fill = F2M), stat = 'identity') +
  theme_solarized() +
  theme(axis.text = element_text(size = 9),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
  ggtitle('Female to Male Ratio - Country Wise') +
  scale_fill_continuous_tableau()

Gives this plot:

It turns out that no country has a female-to-male ratio above 50%, which is not a very healthy insight, but three countries fared above 40%: Ireland, Malaysia and Egypt. Ironically, the US and India, which were on top overall, come in below 30% and 20% respectively.

Let us see how the Age distribution looks between Male and Female:

complete_data %>%
  filter(GenderSelect %in% c('Male', 'Female')) %>%
  ggplot() +
  geom_histogram(aes(Age), binwidth = 1) +
  theme_solarized() +
  facet_grid(. ~ GenderSelect) +
  ggtitle('Age Distribution - Male vs Female')

Gives this plot:

It could be inferred from the above plot that the central tendency is similar for men and women, but it is very clear that men seem to start earlier than their female counterparts, which could be a big factor in establishing themselves in this industry with confidence.

And finally, which language do data enthusiasts recommend?

complete_data %>%
  group_by(LanguageRecommendationSelect) %>%
  summarize(count = n())
# A tibble: 14 x 2
   LanguageRecommendationSelect count
 1                               5718
 2 C/C++/C#                       307
 3 F#                               4
 4 Haskell                         17
 5 Java                           138
 6 Julia                           30
 7 Matlab                         238
 8 Other                           85
 9 Python                        6941
10 R                             2643
11 SAS                             88
12 Scala                           94
13 SQL                            385
14 Stata                           28

The answer is the obvious one: Python, followed by R (remember the selection bias disclaimer). So, which language does well with our female data enthusiasts?

complete_data %>%
  filter(GenderSelect %in% c('Male', 'Female')) %>%
  group_by(LanguageRecommendationSelect, GenderSelect) %>%
  summarize(count = n()) %>%
  spread(GenderSelect, count) %>%
  mutate(F2M = Female / Male) %>%
  arrange(desc(F2M)) %>%
  mutate(F2M = F2M * 100) %>%
  ggplot() +
  geom_bar(aes(LanguageRecommendationSelect, F2M, fill = F2M), stat = 'identity') +
  theme_solarized() +
  theme(axis.text = element_text(size = 9),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
  scale_fill_continuous_tableau() +
  ggtitle('F2M Ratio of Languages used by Kagglers')

Gives this plot:

Stata has outperformed R and Python among female data enthusiasts; a possible explanation is the greater penetration of Stata as a language in academia and research.

Well, that's a simple gender diversity analysis of the Data Science industry with the Kaggle dataset. The entire code used here is available on my Github.


    Flip axes in plotly histograms – few approaches

    Wed, 11/29/2017 - 09:00

    (This article was first published on http://r-addict.com, and kindly contributed to R-bloggers)

    This seems easy, as it just requires changing the x parameter to y in the plot specification. Well, there are some edge cases where R users might get in trouble!

    library(plotly)
    packageVersion('plotly')
    [1] '4.7.1'

    Before you go, let me just explain myself. I have only just started learning the R interface to the plotly library, and I am really amazed by the effort people have put in to make those wires connected and usable by a broad audience.

    set.seed(2)
    values <- rnorm(50)
    p1 <- plot_ly(x = values, type = "histogram", name = "default")
    p2 <- plot_ly(y = values, type = "histogram", name = "flipped")
    subplot(p1, p2)

    What if the plotly object specification is longer than several dozen lines and one would like to have this feature parametrized in a function call? For example in a shiny application, to offer the flip as an option?

    The quickest solution is to provide an if statement like

    #' @param values Numeric values to be used in the plot.
    #' @param flip Logical: should the plot be a horizontal histogram?
    to_flip_or_not_to_flip <- function(values, flip = FALSE) {
      if (flip) {
        p <- plot_ly(y = values, type = "histogram", name = "flipped")
      } else {
        p <- plot_ly(x = values, type = "histogram", name = "default")
      }
      p
    }
    #' @examples
    #' to_flip_or_not_to_flip(values)
    #' to_flip_or_not_to_flip(values, flip = TRUE)

    This is typical code redundancy. Imagine the plot being created in a loop with many extra layers, where in the end the specification is longer than 50 lines. Would you copy and paste all 50 lines just to substitute x with y?

    orientation parameter

    Maybe the orientation parameter is an option? According to the reference (https://plot.ly/r/reference/#histogram), one would try:

    p3 <- plot_ly(x = values, type = "histogram", name = "orient v", orientation = "v")
    p4 <- plot_ly(x = values, type = "histogram", name = "orient h", orientation = "h")
    subplot(p3, p4)

    but one gets the wrong plot: values should be assigned to the y parameter instead.

    Of course there is a plotly guide book for R users (where I’ve learned subplot()) but one is not going to read the whole documentation just to create a simple horizontal histogram (I suppose).

    The solution?

    One can create a list with all the parameters one would like to specify (besides the default parameters) and then change the name of the x parameter to y in that list, depending on the flip requirement.

    parameters <- list(
      x = values,
      type = "histogram",
      name = "x"
    )
    p5 <- do.call(plot_ly, parameters)
    # if (flip) {
      names(parameters)[1] <- "y"
      parameters$name <- "y"
    # }
    p6 <- do.call(plot_ly, parameters)
    subplot(p5, p6)

    My personal best?

    Change the object directly after you have specified the plot.
    One can easily guess what needs to be changed after looking at the str(plot) call.
    We will change the data attributes and rename the x element to y. Note that we can also modify values, not only the names of parameters.

    p7 <- p5
    str(p7$x$attrs)
    # if (flip) {
      names(p7$x$attrs[[1]])[1] <- "y"
      p7$x$attrs[[1]]$name <- "y - change object directly"
    # }
    subplot(p5, p7)

    Other solutions?

    Do you know a cleaner approach? Please don't hesitate to leave a comment on my blog.

    I suppose one could create a plot in ggplot2 and then apply the ggplotly() function, but I am not convinced that this function translates all possible further plots to the client-ready format, so I'd like to stick to the plotly interface.
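
    For reference, that alternative would look roughly like the sketch below (assuming the same values vector as above; whether ggplotly() converts every such plot faithfully is exactly the concern raised here):

    library(ggplot2)
    # ggplot2 histogram flipped with coord_flip(), then converted to a plotly object
    g <- ggplot(data.frame(values = values), aes(values)) +
      geom_histogram(bins = 30) +
      coord_flip()
    ggplotly(g)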

    Note: sorry for the screenshots. I could not get interactive plotly results in the R Markdown document compiled with jekyll and knitr.


    Rcpp now used by 1250 CRAN packages

    Wed, 11/29/2017 - 02:23

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    Earlier today Rcpp passed 1250 reverse dependencies on CRAN, another big milestone. The graph on the left depicts the growth of Rcpp usage (as measured by Depends, Imports and LinkingTo, but excluding Suggests) over time.
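
    As a rough sketch (not the script used for the article), one way to count those reverse dependencies from a current CRAN snapshot is:

    db <- available.packages(repos = "https://cloud.r-project.org")
    revdeps <- tools::package_dependencies("Rcpp", db = db,
                                           which = c("Depends", "Imports", "LinkingTo"),
                                           reverse = TRUE)[["Rcpp"]]
    length(revdeps)  # number of CRAN packages using Rcpp at the time you run it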

    Rcpp cleared 300 packages in November 2014. It passed 400 packages in June 2015 (when I only tweeted about it), 500 packages in late October 2015, 600 packages last March, 700 packages last July, 800 packages last October, 900 packages early January, and 1000 packages in April. The chart extends to the very beginning via manually compiled data from CRANberries and checked with crandb. The next part uses manually saved entries. The core (and by far largest) part of the data set was generated semi-automatically via a short script appending updates to a small file-based backend. A list of packages using Rcpp is kept on this page.

    Also displayed in the graph is the relative proportion of CRAN packages using Rcpp. The four percent hurdle was cleared just before useR! 2014, where I showed a similar graph (as two distinct graphs) in my invited talk. We passed five percent in December of 2014, six percent in July of 2015, seven percent just before Christmas 2015, eight percent last summer, nine percent in mid-December 2016 and then cracked ten percent this summer.

    1250 user packages is staggering. We can compare this to the progression of CRAN itself, compiled by Henrik in a series of posts and emails to the main development mailing list. A decade ago CRAN itself did not have 1250 packages, and here we are approaching 12k with Rcpp at 10% and growing steadily. Amazeballs.

    This puts a whole lot of responsibility on us in the Rcpp team as we continue to keep Rcpp as performant and reliable as it has been.

    And with that, and as always, a very big Thank You! to all users and contributors of Rcpp for help, suggestions, bug reports, documentation or, of course, code.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


    An introduction to Monte Carlo Tree Search

    Wed, 11/29/2017 - 01:00

    (This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers)

    Introduction

    We recently witnessed one of the biggest game AI events in history: AlphaGo became the first computer program to beat the world champion in a game of Go. The publication can be found here. Different techniques from machine learning and tree search were combined by developers from DeepMind to achieve this result. One of them is the Monte Carlo Tree Search (MCTS) algorithm. This algorithm is fairly simple to understand and, interestingly, has applications outside of game AI. Below, I will explain the concept behind the MCTS algorithm and briefly tell you how it was used at the European Space Agency for planning interplanetary flights.

    Perfect Information Games

    Monte Carlo Tree Search is an algorithm used when playing a so-called perfect information game. In short, perfect information games are games in which, at any point in time, each player has perfect information about all the actions that have previously taken place. Examples of such games are Chess, Go or Tic-Tac-Toe. But just because every move is known doesn't mean that every possible outcome can be calculated and extrapolated. For example, the number of possible legal game positions in Go is over 10^{170}.

    Every perfect information game can be represented as a tree data structure in the following way. At first, you have a root which encapsulates the beginning state of the game. For Chess that would be 16 white pieces and 16 black pieces placed in their proper places on the chessboard. For Tic-Tac-Toe it would simply be an empty 3×3 matrix. The first player has some number n_1 of possible choices to make. In the case of Tic-Tac-Toe this would be 9 possible places to draw a circle. Each such move changes the state of the game. These outcome states are the children of the root node. Then, for each of the n_1 children, the next player has n_2 possible moves to consider, each of them generating another state of the game and hence a child node. Note that n_2 might differ for each of the n_1 nodes. For instance, in chess you might make a move which forces your opponent to move their king, or consider another move which leaves your opponent with many other options.

    An outcome of a play is a path from the root node to one of the leaves. Each leaf contains definite information about which player (or players) won and which lost the game.

    Making a decision based on a tree

    There are two main problems we face when making a decision in a perfect information game. The first, and main, one is the size of the tree.

    This doesn't bother us with very limited games such as Tic-Tac-Toe. We have at most 9 child nodes (at the beginning) and this number gets smaller and smaller as we continue playing. It's a completely different story with Chess or Go. Here the corresponding tree is so huge that you cannot hope to search the entire tree. The way to approach this is to do a random walk on the tree for some time and get a subtree of the original decision tree.

    This, however, creates a second problem. If every time we play we just walk randomly down the tree, we don't care at all about the efficiency of our moves and do not learn from our previous games. Anyone who has played chess knows that making random moves on a chessboard won't get them very far. It might be good for a beginner to get an understanding of how the pieces move, but game after game it's better to learn how to distinguish good moves from bad ones.

    So, is there a way to somehow use the facts contained in the previously built decision trees to reason about our next move? As it turns out, there is.

    Multi-Armed Bandit Problem

    Imagine that you are at a casino and would like to play a slot machine. You can choose one randomly and enjoy your game. Later that night, another gambler sits next to you and wins more in 10 minutes than you have during the last few hours. You shouldn't compare yourself to the other guy, it's just luck. But still, it's natural to ask whether you can do better next time. Which slot machine should you choose to win the most? Maybe you should play more than one machine at a time?

    The problem you are facing is the Multi-Armed Bandit Problem. It was already known during the Second World War, but the most commonly known version today was formulated by Herbert Robbins in 1952. There are N slot machines, each one with a different expected return value (what you expect to net from a given machine). You don't know the expected return values for any machine. You are allowed to change machines at any time and play on each machine as many times as you'd like. What is the optimal strategy for playing?

    What does “optimal” mean in this scenario? Clearly your best option would be to play only on the machine with highest return value. An optimal strategy is a strategy for which you do as well as possible compared to the best machine.

    It was actually proven that you cannot do better than O( \ln n ) on average. So that’s the best you can hope for. Luckily, it was also proven that you can achieve this bound (again – on average). One way to do this is to do the following.

    (Read this paper if you are interested in the proof.)

    For each machine i we keep track of two things: how many times we have tried this machine (n_i) and what the mean return value (x_i) was. We also keep track of how many times (n) we have played in general. Then for each i we compute the confidence interval around x_i:

    x_i \pm \sqrt{ 2 \cdot \frac{\ln n}{n_i} }

    At every step we choose to play the machine with the highest upper bound for x_i (the “+” branch of the formula above).
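
    A minimal R sketch of this selection rule (not from the original post), with three simulated machines paying out 0 or 1:

    set.seed(1)
    true_p <- c(0.3, 0.5, 0.7)         # unknown win probabilities of three machines
    pulls  <- rep(0, length(true_p))   # n_i: how many times each machine was played
    wins   <- rep(0, length(true_p))   # total pay-out collected from each machine
    for (t in 1:1000) {
      if (any(pulls == 0)) {
        arm <- which(pulls == 0)[1]    # try every machine once first
      } else {
        ucb <- wins / pulls + sqrt(2 * log(t) / pulls)  # upper confidence bounds
        arm <- which.max(ucb)
      }
      reward <- rbinom(1, 1, true_p[arm])               # 1 = win, 0 = loss
      pulls[arm] <- pulls[arm] + 1
      wins[arm]  <- wins[arm] + reward
    }
    rbind(pulls = pulls, mean_payout = round(wins / pulls, 2))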

    This is a solution to the Multi-Armed Bandit Problem. Now note that we can use it for our perfect information game: just treat each possible next move (child node) as a slot machine. Each time we choose to play a move we end up winning, losing or drawing. This is our pay-out. For simplicity, I will assume that we are only interested in winning, so the pay-out is 1 if we have won and 0 otherwise.

    Real world application example

    MAB algorithms have multiple practical applications in the real world, for example price engine optimization or finding the best online campaign. Let's focus on the first one and see how we can implement it in R. Imagine you are selling your products online and want to introduce a new one, but are not sure how to price it. You came up with 4 price candidates based on your expert knowledge and experience: 99$, 100$, 115$ and 120$. Now you want to test how those prices will perform and which one to choose eventually.
    During the first day of your experiment, 4000 people visited your shop while the first price (99$) was tested and 368 bought the product; for the rest of the prices we have the following outcomes:

    • 100$ 4060 visits and 355 purchases,
    • 115$ 4011 visits and 373 purchases,
    • 120$ 4007 visits and 230 purchases.

    Now let’s look at the calculations in R and check which price was performing best during the first day of our experiment.

    library(bandit)
    set.seed(123)
    visits_day1   <- c(4000, 4060, 4011, 4007)
    purchase_day1 <- c(368, 355, 373, 230)
    prices        <- c(99, 100, 115, 120)
    post_distribution <- sim_post(purchase_day1, visits_day1, ndraws = 10000)
    probability_winning <- prob_winner(post_distribution)
    names(probability_winning) <- prices
    probability_winning
    ##     99    100    115    120
    ## 0.3960 0.0936 0.5104 0.0000

    We calculated the Bayesian probability that each price performs best and can see that 115$ has the highest probability (about 0.51). On the other hand, 120$ seems a bit too much for the customers.

    The experiment continues for a few more days.

    Day 2 results:

    visits_day2   <- c(8030, 8060, 8027, 8037)
    purchase_day2 <- c(769, 735, 786, 420)
    post_distribution <- sim_post(purchase_day2, visits_day2, ndraws = 1000000)
    probability_winning <- prob_winner(post_distribution)
    names(probability_winning) <- prices
    probability_winning
    ##       99      100      115      120
    ## 0.308623 0.034632 0.656745 0.000000

    After the second day, price 115$ still shows the best results, with 99$ and 100$ performing very similarly.

    Using the bandit package we can also perform a significance analysis, which is handy for an overall proportion comparison using prop.test.

    significance_analysis(purchase_day2, visits_day2)
    ##   successes totals estimated_proportion        lower      upper
    ## 1       769   8030           0.09576588 -0.004545319 0.01369494
    ## 2       735   8060           0.09119107  0.030860453 0.04700507
    ## 3       786   8027           0.09791952 -0.007119595 0.01142688
    ## 4       420   8037           0.05225831           NA         NA
    ##   significance rank best       p_best
    ## 1 3.322143e-01    2    1 3.086709e-01
    ## 2 1.437347e-21    3    1 2.340515e-06
    ## 3 6.637812e-01    1    1 6.564434e-01
    ## 4           NA    4    0 1.548068e-39

    At this point we can see that 120$ is still performing badly, so we drop it from the experiment and continue for the next day. The chance that this alternative is the best, according to p_best, is very small (p_best has a negligible value).

    Day 3 results:

    visits_day3   <- c(15684, 15690, 15672, 8037)
    purchase_day3 <- c(1433, 1440, 1495, 420)
    post_distribution <- sim_post(purchase_day3, visits_day3, ndraws = 1000000)
    probability_winning <- prob_winner(post_distribution)
    names(probability_winning) <- prices
    probability_winning
    ##       99      100      115      120
    ## 0.087200 0.115522 0.797278 0.000000
    value_remaining <- value_remaining(purchase_day3, visits_day3)
    potential_value <- quantile(value_remaining, 0.95)
    potential_value
    ##        95%
    ## 0.02670002

    Day 3 results led us to conclude that 115$ will generate the highest conversion rate and revenue.
    We are still unsure about the conversion probability for the best price, 115$, but whatever it is, one of the other prices might beat it by as much as 2.67% (the 95% quantile of the value remaining).

    The histograms below show what happens to the value-remaining distribution, the distribution of improvement amounts that another price might have over the current best price, as the experiment continues. With the larger sample we are much more confident about the conversion rate, and over time the other prices have lower chances of beating price 115$.
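
    The original figures are not reproduced here, but as a hedged sketch (assuming value_remaining is the simulated vector computed above), such a histogram can be drawn with base graphics:

    # distribution of possible improvement over the current best price (115$)
    hist(value_remaining, breaks = 50,
         main = "Value remaining after day 3",
         xlab = "Potential improvement over price 115$")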

    If this example was interesting to you, check out our other post about dynamic pricing.

    Monte Carlo Tree Search

    We are ready to learn how the Monte Carlo Tree Search algorithm works.

    As long as we have enough information to treat child nodes as slot machines, we choose the next node (move) as we would when solving the Multi-Armed Bandit Problem. This can be done when we have some information about the pay-outs for each child node.

    At some point we reach a node where we can no longer proceed in this manner because there is at least one child node with no statistics present. It's time to explore the tree to get new information. This can be done either completely randomly or by applying some heuristic methods when choosing child nodes (in practice this might be necessary for games with a high branching factor, like chess or Go, if we want to achieve good results).

    Finally, we arrive at a leaf. Here we can check whether we have won or lost.

    It's time to update the nodes we have visited on our way to the leaf. If the player making a move at the corresponding node turned out to be the winner, we increase the number of wins by one; otherwise we keep it the same. Whether we have won or not, we always increase the number of times the node was played (in the corresponding picture we can automatically deduce it from the number of losses and wins).

    That's it! We repeat this process until some condition is met: a timeout is reached or the confidence intervals we mentioned earlier stabilize (convergence). Then we make our final decision based on the information we have gathered during the search. We can choose the node with the highest upper bound for pay-out (as we would in each iteration of the Multi-Armed Bandit). Or, if you prefer, choose the one with the highest mean pay-out.

    The decision is made and a move is chosen. Now it's time for our opponent (or opponents). When they've finished, we arrive at a new node, somewhere deeper in the tree, and the story repeats.

    Not just for games

    As you might have noticed, Monte Carlo Tree Search can be considered a general technique for making decisions in perfect information scenarios. Therefore its use does not have to be restricted to games. The most amazing use case I have heard of is planning interplanetary flights. You can read about it at this website, but I will summarize it briefly.

    Think of an interplanetary flight as a trip during which you would like to visit more than one planet. For instance, you are flying from Earth to Jupiter via Mars.

    An efficient way to do this is to make use of these planets' gravitational fields (like they did in the movie 'The Martian') so you can take less fuel with you. The question is when the best time is to arrive at and leave from each planet's surface or orbit (for the first and last planet it's only leave and arrive, respectively).

    You can treat this problem as a decision tree. If you divide time into intervals, at each planet you make a decision: in which time slot to arrive and in which to leave. Each such choice determines the next. First of all, you cannot leave before you arrive. Second, your previous choices determine how much fuel you have left and, consequently, which places in the universe you can reach.

    Each such set of consecutive choices determines where you arrive at the end. If you visited all the required checkpoints, you've won; otherwise, you've lost. It's like a perfect information game: there is no opponent, and you make a move by determining the time slot for leaving or arriving. This can be treated using the above Monte Carlo Tree Search. As you can read here, it fares quite well in comparison with other known approaches to this problem. And at the same time, it does not require any domain-specific knowledge to implement.

    Write your questions and comments below. We'd love to hear what you think.


    Announcing a New rOpenSci Software Review Collaboration

    Wed, 11/29/2017 - 01:00

    (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

    rOpenSci is pleased to announce a new collaboration with Methods in Ecology and Evolution (MEE), a journal of the British Ecological Society published by Wiley press [1]. Publications destined for MEE that include the development of a scientific R package will now have the option of a joint review process whereby the R package is reviewed by rOpenSci, followed by fast-tracked review of the manuscript by MEE. Authors opting for this process will be recognized via a mark on both web and print versions of their paper.

    We are very excited for this partnership to improve the rigor of both scientific software and software publications and to provide greater recognition to developers in the fields of ecology and evolution. It is a natural outgrowth of our interest in supporting scientists in developing and maintaining software, and of MEE's mission of vetting and disseminating tools and methods for the research community. The collaboration formalizes and eases a path already pursued by researchers: the rotl, RNeXML, and treebase packages were all developed or reviewed by rOpenSci and subsequently had associated manuscripts published in MEE.

    About rOpenSci software review

    rOpenSci is a diverse community of researchers from academia, non-profit, government, and industry who collaborate to develop and maintain tools and practices around open data and reproducible research. The rOpenSci suite of tools is made of core infrastructure software developed and maintained by the project staff. The suite also contains numerous packages that are contributed by members of the broader R community. The volume of community submissions has grown considerably over the years necessitating a formal system of review quite analogous to that of a peer reviewed academic journal.

    rOpenSci welcomes full software submissions that fit within our aims and scope, with the option of a pre-submission inquiry in cases when the scope of a submission is not immediately obvious. This software peer review framework, known as the rOpenSci Onboarding process, operates with three editors and one editor in chief who carefully vet all incoming submissions. After an editorial review, editors solicit detailed, public and signed reviews from two reviewers, and the path to acceptance from then on is similar to a standard journal review process. Details about the system are described in various blog posts by the editorial team.

    Collaboration with journals

    This is our second collaboration with a journal. Since late 2015, rOpenSci has partnered with the Journal of Open Source Software (JOSS), an open access journal that publishes brief articles on research software. Packages accepted to rOpenSci can be submitted for fast-track publication at JOSS, in which case JOSS editors may evaluate them based on rOpenSci's reviews alone. As rOpenSci's review criteria are significantly more stringent and designed to be compatible with JOSS, these packages are generally accepted without additional review. We have had great success with this partnership, providing rOpenSci authors with an additional venue to publicize and archive their work. Given this success, we are keen on expanding to other journals and fields where there is potential for software reviewed and created by rOpenSci to play a significant role in supporting scientific findings.

    The details

    Our new partnership with MEE broadly resembles that with JOSS, with the major difference that MEE, rather than rOpenSci, leads review of the manuscript component. Authors with R packages and associated manuscripts that fit the Aims and Scope of both rOpenSci and MEE are encouraged to first submit to rOpenSci. The rotl, RNeXML, and treebase packages are all great examples of such packages. MEE editors may also refer authors to this option if authors submit an appropriate manuscript to MEE first.

    On submission to rOpenSci, authors can use our updated submission template to choose MEE as a publication venue. Following acceptance by rOpenSci, the associated manuscript will be reviewed by an expedited process at MEE, with reviewers and editors knowing that the software has already been reviewed and that the public reviews are available to them.

    Should the manuscript be accepted, a footnote will appear in the web version and the first page of the print version of the MEE article indicating that the software as well as the manuscript has been peer-reviewed, with a link to the rOpenSci open reviews.

    As with any collaboration, there may be a few hiccups early on and we welcome ideas to make the process more streamlined and efficient. We look forward to the community’s submissions and to your participation in this process.

    Many thanks to MEE’s Assistant Editor Chris Grieves and Senior Editor Bob O’Hara for working with us on this collaboration.

    1. See also MEE’s post from today at https://methodsblog.wordpress.com/2017/11/29/software-review/

    Some new time series packages

    Wed, 11/29/2017 - 01:00

    (This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

    This week I have finished preliminary versions of two new R packages for time series analysis. The first (tscompdata) contains several large collections of time series that have been used in forecasting competitions; the second (tsfeatures) is designed to compute features from univariate time series data. For now, both are only on github. I will probably submit them to CRAN after they’ve been tested by a few more people.
    tscompdata

    There are already two packages containing forecasting competition data: Mcomp (containing the M and M3 competition data) and Tcomp (containing the tourism competition data).
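
    Both new packages are GitHub-only for now; a minimal sketch for trying them out (assuming the repositories live under Rob Hyndman's GitHub account, as the post implies):

    # install the preliminary packages from GitHub (repository names assumed)
    # install.packages("devtools")
    devtools::install_github("robjhyndman/tscompdata")
    devtools::install_github("robjhyndman/tsfeatures")

    library(tscompdata)
    library(tsfeatures)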


    Competition: StanCon 2018 ticket

    Wed, 11/29/2017 - 01:00

    (This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers)

    Today we are happy to announce our Stan contest. Something we feel very strongly about at Jumping Rivers is giving back to the community. We have benefited immensely from the hard work of numerous people, so when possible, we try to give something back.

    This year we’re sponsoring StanCon 2018. If you don’t know,

    Stan is freedom-respecting, open-source software for facilitating statistical inference at the frontiers of applied statistics.

    Or to put it another way, it makes Bayesian inference fast and (a bit) easier. StanCon is the premier conference for all things Stan related and this year it will take place at the Asilomar Conference Grounds, a National Historic Landmark on the Monterey Peninsula right on the beach.

    Our interaction with Stan is via the excellent R package rstan and the Introduction to Rstan course we run.

    The prize

    One of the benefits of sponsoring a conference is free tickets. Unfortunately, we can't make StanCon this year, so we've decided to give away the ticket via a competition.

    How do I enter?

    Entering the contest is straightforward. We want a nice image to use in our introduction to Rstan course. This can be a cartoon, some neat looking densities, whatever you think!

    For example, my first attempt at a logo resulted in

    Pretty enough, but lacking something.

    Once you've come up with your awesome image, simply email info@jumpingrivers.com with a link to the image. If the image was generated in R (or another language), then the associated code would be nice. The deadline is the 6th of December.

    FAQ
    • Can I submit more than one image? Feel free.
    • Who owns the copyright? Our intention is to make the image CC0. However we’re happy to change the copyright to suit. The only proviso is that we can use the image for our rstan course.
    • Another question. Contact us via the contact page.


    Cyber Week Only: Save 50% on DataCamp!

    Tue, 11/28/2017 - 23:59

    (This article was first published on R-posts.com, and kindly contributed to R-bloggers)

    For Cyber Week only, DataCamp offers the readers of R-bloggers over $150 off for unlimited access to its data science library. That's over 90 courses and 5,200 exercises, of which a large chunk are R-focused, plus access to the mobile app, Practice Challenges, and hands-on Projects. All by expert instructors such as Hadley Wickham (RStudio), Matt Dowle (data.table), Garrett Grolemund (RStudio), and Max Kuhn (caret)! Claim your offer now! Offer ends 12/5!

    Career tracks:

    • Data Analyst with R

      • Learn how to translate numbers into plain English for businesses, whether it's sales figures, market research, logistics, or transportation costs, and get the most out of your data.

    • Data Scientist with R

      • Learn how to combine statistical and machine learning techniques with R programming to analyze and interpret complex data.

    • Quantitative Analyst with R

      • Learn how to ensure portfolios are risk balanced, help find new trading opportunities, and evaluate asset prices using mathematical models.

    Skill tracks:

    • Data Visualization in R

      • Communicate the most important features of your data by creating beautiful visualizations using ggplot2 and base R graphics.

    • Importing and Cleaning Data

      • Learn how to parse data in any format. Whether it’s flat files, statistical software, databases, or web data, you’ll learn to handle it all.

    • Statistics with R

      • Learn key statistical concepts and techniques like exploratory data analysis, correlation, regression, and inference.

    • Applied Finance

      • Apply your R skills to financial data, including bond valuation, financial trading, and portfolio analysis.


    A word about DataCamp

    For those of you who don't know DataCamp: it's the most intuitive way out there to learn data science thanks to its combination of short expert videos and immediate hands-on-the-keyboard exercises as well as its instant personalized feedback on every exercise. They focus only on data science to offer the best learning experience possible and rely on expert instructors to teach.


    How to make Python easier for the R user: revoscalepy

    Tue, 11/28/2017 - 17:30

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    by Siddarth Ramesh, Data Scientist, Microsoft

    I’m an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualizations. However, there is a huge community of Data Scientists and Analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organizations, and it is important for both languages to coexist.

    Many times, this means that R coders will develop a workflow in R but then must redesign and recode it in Python for their production systems. If the coder is lucky, this is easy, and the R model can be exported as a serialized object and read into Python. There are packages that do this, such as pmml. Unfortunately, many times, this is more challenging because the production system might demand that the entire end to end workflow is built exclusively in Python. That’s sometimes tough because there are aspects of statistical model building in R which are more intuitive than Python. 

    Python has many strengths, such as its robust data structures (for example, dictionaries), its compatibility with deep learning and Spark, and its ability to be a multipurpose language. However, many scenarios in enterprise analytics require people to go back to basic statistics and machine learning, for which the classic data science packages in Python are not as intuitive as R. The key difference is that many statistical methods are built into R natively. As a result, there is a gap when R users must build workflows in Python. To try to bridge this gap, this post will discuss a relatively new package developed by Microsoft, revoscalepy.

    Why revoscalepy?

    Revoscalepy is the Python implementation of the R package RevoScaleR included with Microsoft Machine Learning Server.

    The methods in ‘revoscalepy’ are the same, and more importantly, the way the R user can view data is the same. The reason this is so important is that for an R programmer, being able to understand the data shape and structure is one of the challenges with getting used to Python. In Python, data types are different, preprocessing the data is different, and the criteria to feed the processed dataset into a model is different.

    To understand how revoscalepy eases the transition from R to Python, the following section will compare building a decision tree using revoscalepy with building a decision tree using sklearn. The Titanic dataset from Kaggle will be used for this example. To be clear, this post is written from an R user’s perspective, as many of the challenges this post will outline are standard practices for native Python users.

    revoscalepy versus sklearn

    revoscalepy works on Python 3.5, and can be downloaded as a part of Microsoft Machine Learning Server. Once downloaded, set the Python environment path to the python executable in the MML directory, and then import the packages.

    The first chunk of code imports the revoscalepy, numpy, pandas, and sklearn packages, and imports the Titanic data. Pandas has some R roots in that it has its own implementation of DataFrames, as well as methods that resemble R's exploratory methods.

    import revoscalepy as rp
    import numpy as np
    import pandas as pd
    import sklearn as sk

    titanic_data = pd.read_csv('titanic.csv')
    titanic_data.head()

    Preprocessing with sklearn

    One of the challenges for an R user working with sklearn is that the decision tree model in sklearn can only handle numeric datatypes. Pandas has a categorical type that looks like factors in R, but sklearn's decision tree does not integrate with it. As a result, numerically encoding the categorical data becomes a mandatory step. This example will use a one-hot encoder to shape the categories in a way that sklearn's decision tree understands.

    The side effect of having to one-hot encode the variable is that if the dataset contains high cardinality features, it can be memory intensive and computationally expensive because each category becomes its own binary column. While implementing one-hot encoding itself is not a difficult transformation in Python and provides good results, it is still an extra step for an R programmer to have to manually implement. The following chunk of code detaches the categorical columns, label and one-hot encodes them, and then reattaches the encoded columns to the rest of the dataset.

    from sklearn import tree

    le = sk.preprocessing.LabelEncoder()
    x = titanic_data.select_dtypes(include=[object])
    x = x.drop(['Name', 'Ticket', 'Cabin'], 1)
    x = pd.concat([x, titanic_data['Pclass']], axis = 1)
    x['Pclass'] = x['Pclass'].astype('object')
    x = pd.DataFrame(x)
    x = x.fillna('Missing')
    x_cats = x.apply(le.fit_transform)
    enc = sk.preprocessing.OneHotEncoder()
    enc.fit(x_cats)
    onehotlabels = enc.transform(x_cats).toarray()
    encoded_titanic_data = pd.concat([pd.DataFrame(titanic_data.select_dtypes(include=[np.number])),
                                      pd.DataFrame(onehotlabels)], axis = 1)

    At this point, there are more columns than before, and the columns no longer have semantic names (they have been enumerated). This means that if a decision tree is visualized, it will be difficult to understand without going through the extra step of renaming these columns. There are techniques in Python to help with this, but it is still an extra step that must be considered.

    Preprocessing with revoscalepy

    Unlike sklearn, revoscalepy reads pandas’ ‘category’ type like factors in R. This section of code iterates through the DataFrame, finds the string types, and converts those types to ‘category’. In pandas, there is an argument to set the order to False, to prevent ordered factors.

    titanic_data_object_types = titanic_data.select_dtypes(include = ['object'])
    titanic_data_object_types_columns = np.array(titanic_data_object_types.columns)
    for column in titanic_data_object_types_columns:
        titanic_data[column] = titanic_data[column].astype('category', ordered = False)
    titanic_data['Pclass'] = titanic_data['Pclass'].astype('category', ordered = False)

    This dataset is now ready to be fed into the revoscalepy model.

    Training models with sklearn

    One difference between implementing a model in R and in sklearn in Python is that sklearn does not use formulas.

    Formulas are important and useful for modeling because they provide a consistent framework to develop models with varying degrees of complexity. With formulas, users can easily apply different types of variable cases, such as '+' for separate independent variables, ':' for interaction terms, and '*' to include both the variables and their interaction terms, along with many other convenient notations. Within a formula, users can do mathematical calculations, create factors, and include more complex entities such as third-order interactions. Furthermore, formulas allow for building highly complex models such as mixed effect models, which are next to impossible to build without them. In Python, there are packages such as 'statsmodels' which have more intuitive ways to build certain statistical models. However, statsmodels has a limited selection of models, and does not include tree-based models.
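
    For readers less familiar with R's formula notation, a few illustrative examples (hypothetical variable names, not tied to the Titanic data):

    # each line creates a formula object in R; nothing is fitted here
    y ~ x1 + x2                    # two separate independent variables
    y ~ x1:x2                      # the interaction term only
    y ~ x1 * x2                    # main effects plus their interaction
    y ~ poly(x1, 2) + factor(x3)   # a calculation and a factor inside the formula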

    With sklearn, model.fit expects the independent and dependent terms to be columns from the DataFrame. Interactions must be created manually as a preprocessing step for more complex examples. The code below trains the decision tree:

    model = tree.DecisionTreeClassifier(max_depth = 50)
    x = encoded_titanic_data.drop(['Survived'], 1)
    x = x.fillna(-1)
    y = encoded_titanic_data['Survived']
    model = model.fit(x, y)

    Training models with revoscalepy

    revoscalepy brings back formulas. Granted, users cannot view the formula the same way as they can in R, because formulas are strings in Python. However, importing code from R to Python is an easy transition because formulas are read the same way in the revoscalepy functions as the model fit functions in R. The below code fits the Decision Tree in revoscalepy.

    # rx_dtree works with formulas, just like models in R
    form = 'Survived ~ Pclass + Sex + Age + Parch + Fare + Embarked'
    titanic_data_tree = rp.rx_dtree(form, titanic_data, max_depth = 50)

    The resulting object, titanic_data_tree, has the same structure as the object that rxDTree() would create in R. Because the individual elements that make up the rx_dtree() object are the same as those of rxDTree(), R users can easily understand the decision tree without having to translate between two object structures.
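
    For comparison, a hedged sketch of the corresponding R-side call (assuming the RevoScaleR interface; argument names may differ slightly between versions):

    # R (RevoScaleR): roughly the equivalent of the rx_dtree() call above
    library(RevoScaleR)
    titanic_data_tree <- rxDTree(Survived ~ Pclass + Sex + Age + Parch + Fare + Embarked,
                                 data = titanic_data, maxDepth = 50)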

    Conclusion

    From the workflow, it should be clear how revoscalepy can help with transliteration between R and Python. Sklearn has different preprocessing considerations because the data must be fed into the model differently. The advantage to revoscalepy is that R programmers can easily convert their R code to Python without thinking too much about the ‘Pythonic way’ of implementing their R code. Categories replace factors, rx_dtree() reads the R-like formula, and the arguments are similar to the R equivalent. Looking at the big picture, revoscalepy is one way to ease Python for the R user and future posts will cover other ways to transition between R and Python.

    Microsoft Docs: Introducing revoscalepy


    Quick Round-Up – Visualising Flows Using Network and Sankey Diagrams in Python and R

    Tue, 11/28/2017 - 12:33

    (This article was first published on Rstats – OUseful.Info, the blog…, and kindly contributed to R-bloggers)

    Got some data, relating to how students move from one module to another. Rows are student ID, module code, presentation date. The flow is not well-behaved. Students may take multiple modules in one presentation, and modules may be taken in any order (which means there are loops…).

My first take on the data was just to treat it as a graph and chart flows without paying attention to time other than to direct the edges (module A taken directly after module B; if multiple modules are taken by the same student in the same presentation, they both have the same precursor(s) and follow-on module(s), if any) – the following (dummy) data shows the sort of thing we can get out using networkx and the pygraphviz output:

    The data can also be filtered to show just the modules taken leading up to a particular module, or taken following a particular module.
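The post builds the graph with networkx in Python; an equivalent edge-building step can be sketched in R with dplyr, assuming a data frame with student, module and presentation columns as described above (this sketch ignores the complication of several modules taken in the same presentation):

    library(dplyr)

    # dummy data: one row per student / module / presentation
    flows <- tibble::tribble(
      ~student, ~module, ~presentation,
      "s1", "A", "2015J",
      "s1", "B", "2016B",
      "s1", "C", "2016J",
      "s2", "A", "2016B",
      "s2", "C", "2016J"
    )

    # each module points at the module the same student took next
    edges <- flows %>%
      arrange(student, presentation) %>%
      group_by(student) %>%
      mutate(next_module = lead(module)) %>%
      ungroup() %>%
      filter(!is.na(next_module)) %>%
      count(module, next_module)

    edges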

The R diagram package has a couple of functions that can generate similar sorts of network diagram; its plotweb() function, for example, can produce a simple flow graph:

    Or something that looks a bit more like a finite state machine diagram:

    (In passing, I also note the R diagram package can be used to draw electrical circuit diagrams/schematics.)

    Another way of visualising this blocked data might be to use a simple two dimensional flow diagram, such as a transition plot from the R Gmisc package.

    For example, the following could describe total flow from one module to another over a given period of time:

    If there are a limited number of presentations (or modules) of interest, we could further break down each category to show the count of students taking a module in a particular presentation (or going directly on to / having directly come from a particular module; in this case, we may want an “other” group to act as a catch all for modules outside a set of modules we are interested in; getting the proportions right might also be a fudge).

    Another way we might be able to look at the data “out of time” to show flow between modules is to use a Sankey diagram that allows for the possibility of feedback loops.

The Python sankeyview package (described in Hybrid Sankey diagrams: Visual analysis of multidimensional data for understanding resource use) looks like it could be useful here, if I can work out how to do the set-up correctly!

    Again, it may be appropriate to introduce a catch-all category to group modules into a generic Other bin where there is only a small flow to/from that module to reduce clutter in the diagram.

    The sankeyview package is actually part of a family of packages that includes the d3-sankey-diagram and  the ipysankeywidget.

    We can use the ipysankeywidget to render a simple graph data structure of the sort that can be generated by networkx.

One big problem with the view I took of the data is that it doesn’t respect time, or the numbers of students taking a particular presentation of a course. This detail could help tell a story about the evolving curriculum as new modules come on stream, for example, and perhaps change student behaviour about the module they take next from a particular module. So how could we capture it?

If we can linearise the flow by using module_presentation keyed nodes, rather than just module identified nodes, and limit the data to just show students progressing from one presentation to the next, we should be able to use something like a categorical parallel co-ordinates plot, such as an alluvial diagram from the R alluvial package.

    With time indexed modules, we can also explore richer Sankey style diagrams that require a one way flow (no loops).

    So for example, here are a few more packages that might be able to help with that, as well as the aforementioned Python sankeyview and ipysankeywidget  packages.

    First up, the R networkD3 package includes support for Sankey diagrams. Data can be sourced from igraph and then exported into the JSON format expected by the package:
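For example, a minimal sankeyNetwork() call might look like the following (a sketch with made-up module flows; the package expects links indexed into the node table starting at zero):

    library(networkD3)

    nodes <- data.frame(name = c("Module A", "Module B", "Module C"))
    links <- data.frame(
      source = c(0, 0, 1),   # zero-indexed into `nodes`
      target = c(1, 2, 2),
      value  = c(10, 5, 8)   # number of students following each path
    )

    sankeyNetwork(Links = links, Nodes = nodes,
                  Source = "source", Target = "target",
                  Value = "value", NodeID = "name")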

    If you prefer Google charts, the googleVis R package has a gvisSankey function (that I’ve used elsewhere).
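A comparable gvisSankey() sketch, again with dummy flows (plot() opens the chart in a browser):

    library(googleVis)

    flows <- data.frame(
      From   = c("Module A", "Module A", "Module B"),
      To     = c("Module B", "Module C", "Module C"),
      Weight = c(10, 5, 8)
    )

    sk <- gvisSankey(flows, from = "From", to = "To", weight = "Weight")
    plot(sk)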

    The R riverplot package also supports Sankey diagrams – and the gallery includes a demo of how to recreate Minard’s visualisation of Napoleon’s 1812 march.

    The R sankey package generates a simple Sankey diagram from a simple data table:

    Back in the Python world, the pySankey package can generate a simple Sankey diagram from a pandas dataframe.

matplotlib also has support for sankey diagrams via matplotlib.sankey() (see also this tutorial):

    What I really need to do now is set up a Binder demo of each of them… but that will have to wait till another day…

    If you know of any other R or Python packages / demos that might be useful for visualising flows, please let me know via the comments and I’ll add them to the post.


    To leave a comment for the author, please follow the link and comment on their blog: Rstats – OUseful.Info, the blog…. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Secret Santa is a graph traversal problem

    Tue, 11/28/2017 - 07:00

    (This article was first published on Higher Order Functions, and kindly contributed to R-bloggers)

    Last week at Thanksgiving, my family drew names from a hat for our annual game
    of Secret Santa. Actually, it wasn’t a hat but you know what I mean. (Now that I
    think about it, I don’t think I’ve ever seen names drawn from a literal hat
    before!) In our family, the rules of Secret Santa are pretty simple:

    • The players’ names are put in “a hat”.
    • Players randomly draw a name from a hat, become that person’s Secret Santa,
      and get them a gift.
    • If a player draws their own name, they draw again.

    Once again this year, somebody asked if we could just use an app or a website to
    handle the drawing for Secret Santa. Or I could write a script to do it I
    thought to myself. The problem nagged at the back of my mind for the past few
    days. You could just shuffle the names… no, no, no. It’s trickier than that.

    In this post, I describe a couple of algorithms for Secret Santa sampling using
    R and directed graphs. I use the
    DiagrammeR package which creates
    graphs from dataframes of nodes and edges, and I liberally use
    dplyr verbs to manipulate tables of
    edges.

    If you would like a more practical way to use R for Secret Santa, including
    automating the process of drawing names and emailing players, see
    this blog post.

    Making a graph, connecting nodes twice

    Let’s start with a subset of my family’s game of just five players. I assign
    each name a unique ID number.

library(DiagrammeR)
library(magrittr)
library(dplyr, warn.conflicts = FALSE)

players <- tibble::tribble(
  ~ Name,     ~ Number,
  "Jeremy",   1,
  "TJ",       2,
  "Jonathan", 3,
  "Alex",     4,
  "Marissa",  5)

    We can think of the players as nodes in a directed graph. An edge connecting two
    players indicates a “gives-to” (Secret Santa) relationship. Suppose I drew
    Marissa’s name. Then the graph will have an edge connecting me to her. In the
    code below, I use DiagrammeR to create a graph by combining a node dataframe
    (create_node_df()) and an edge dataframe (create_edge_df()).

    nodes <- create_node_df( n = nrow(players), type = players$Name, label = players$Name) tj_drew_marissa <- create_edge_df( from = 2, to = 5, rel = "gives-to", color = "#FF4136", penwidth = 1) create_graph(nodes, tj_drew_marissa) %>% render_graph()





[Rendered graph: the five players as nodes, with a single red “gives-to” edge from TJ to Marissa.]


    Before the game starts, anyone could draw anyone else’s name, so let’s visualize
    all possible gives-to relations. We can do this by using combn(n, 2) to
    generate all n-choose-2 pairs and creating two edges for each pair.

    combn(players$Name, 2) #> [,1] [,2] [,3] [,4] [,5] [,6] [,7] #> [1,] "Jeremy" "Jeremy" "Jeremy" "Jeremy" "TJ" "TJ" "TJ" #> [2,] "TJ" "Jonathan" "Alex" "Marissa" "Jonathan" "Alex" "Marissa" #> [,8] [,9] [,10] #> [1,] "Jonathan" "Jonathan" "Alex" #> [2,] "Alex" "Marissa" "Marissa" # All the edge-manipulating functions in this post take an optional take `...` # argument for setting the style of edges. create_all_giving_edges <- function(xs, ...) { aes_options <- quos(...) pairs <- combn(seq_along(xs), 2) # Each column from `combn()` is a pair. We make an edge moving down the column # and another edge up the column by having each row as a `from` index and a # `to` index. from <- c(pairs[1, ], pairs[2, ]) to <- c(pairs[2, ], pairs[1, ]) create_edge_df(from = from, to = to) %>% mutate(!!! aes_options) %>% as_tibble() } all_possible_edges <- create_all_giving_edges( players$Name, rel = "potential-gift", penwidth = .5, color = "#CCCCCC90") create_graph(nodes, all_possible_edges) %>% render_graph()





[Rendered graph: all possible gives-to edges between the five players, drawn as light grey arrows in both directions.]


    A fast, simple solution is a Hamiltonian path

    Do you need to organize a gift-giving drawing for a group of people? The easiest
    solution is to shuffle the names and have the first name give to the second name,
the second to the third, and so on, with the last name looping back around to give to
    the first name. This solution is equivalent to walking through the graph and
    visiting every node just once. Such a path is called a
    Hamiltonian path.
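Before the graph version below, here is that shuffle-and-loop idea in a couple of lines of base R (a sketch reusing the players table defined above):

    shuffled <- sample(players$Name)

    # each person gives to the next name in the shuffled order;
    # the last person loops back around to the first
    data.frame(giver = shuffled, receiver = c(shuffled[-1], shuffled[1]))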

    Here we find a Hamiltonian path and create a helper function overwrite_edges()
    to update the edges that fall on the path.

    overwrite_edges <- function(old_df, new_df) { old_df %>% anti_join(new_df, by = c("from", "to")) %>% bind_rows(new_df) } create_hamiltonian_gift_edges <- function(xs, ...) { loop_from <- sample(seq_along(xs)) # last name gives to first loop_to <- c(loop_from[-1], loop_from[1]) create_edge_df(from = loop_from, to = loop_to, ...) } # For reproducible blogging set.seed(11282017) hamiltonian_edges <- create_hamiltonian_gift_edges( players$Name, rel = "gives-to", color = "#FF4136", penwidth = 1 ) all_possible_edges %>% overwrite_edges(hamiltonian_edges) %>% create_graph(nodes, .) %>% render_graph()





[Rendered graph: the grey edges with a red Hamiltonian path overlaid, looping through all five players exactly once.]


    As promised, the red paths loop through all nodes exactly once. No one is their
    own gift giver :white_check_mark:, and everyone has an incoming red path
    :white_check_mark: and an outgoing red path :white_check_mark:. Very nice.
    Actually… let’s put that checklist into a validation function.

# Check for valid gift-giving edges
has_valid_gift_edges <- function(edge_df, indices) {
  indices <- sort(unique(indices))
  pairs <- edge_df %>% filter(rel == "gives-to")
  no_self_loop <- !any(pairs$from == pairs$to)
  exhaustive_from <- isTRUE(all.equal(sort(pairs$from), indices))
  exhaustive_to <- isTRUE(all.equal(sort(pairs$to), indices))
  all(no_self_loop, exhaustive_from, exhaustive_to)
}

has_valid_gift_edges(hamiltonian_edges, all_possible_edges$from)
#> [1] TRUE

    Despite its elegance, this solution does not simulate drawing names from a
    hat! Because each node is visited only once, there is no backtracking, so there
is no reciprocal gift-giving or other sub-circuits in the graph.

    Whether you think this is a bad thing is a matter of preference. Personally, I
    would like all remaining pairs to be equally probable at each step of the
    drawing. This is not the case when backtracking is not allowed. (For example, if
    I draw Marissa, then all of the remaining edges are not equally likely because
    P(Marissa draws TJ | TJ draws Marissa) = 0.)

    Okay, do the hat-drawing thing already

    Let’s think about what happens when I draw Marissa’s name from a nice big red
    Santa hat.

    • The edge from TJ to Marissa is fixed. (I drew her name.)
    • All other edges from TJ become illegal. (I can’t draw any more names.)
    • All other edges onto Marissa become illegal. (No one else can draw her name
      either.)

    To simulate a single hat-drawing, we randomly select a legal edge, fix it, and
    delete all illegal edges. Let’s work through a couple of examples.

    First, we need some helper functions.

    draw_secret_santa_edge <- function(edge_df, ...) { aes_options <- quos(...) edge_df %>% filter(rel != "gives-to") %>% sample_n(1) %>% mutate(!!! aes_options) } find_illegal_edges <- function(edge_df, edge, ...) { aes_options <- quos(...) outgoing <- edge_df %>% filter(from %in% edge$from) incoming <- edge_df %>% filter(to %in% edge$to) # The one edge that is not illegal is in both # the incoming and outgoing sets to_keep <- dplyr::intersect(outgoing, incoming) outgoing %>% bind_rows(incoming) %>% anti_join(to_keep, by = c("from", "to")) %>% mutate(!!! aes_options) }

    Here we draw a single edge (red with fat arrow). All of the other edges that
    point to the same node are illegal (navy) as are all of the edges that have the
    same origin as the drawn edge.

    current_pick <- draw_secret_santa_edge( all_possible_edges, rel = "gives-to", color = "#FF4136", penwidth = 1, arrowsize = 1) current_illegal_edges <- all_possible_edges %>% find_illegal_edges(current_pick, color = "#001f3f", penwidth = .5) all_possible_edges %>% overwrite_edges(current_pick) %>% overwrite_edges(current_illegal_edges) %>% create_graph(nodes, .) %>% render_graph(title = "Selected vs. illegal")





[Rendered graph, “Selected vs. illegal”: one randomly drawn gives-to edge in red with a fat arrow; all edges sharing its origin or destination in navy.]


We delete those illegal edges, leaving us with the following graph.

    edges_after_pick1 <- all_possible_edges %>% overwrite_edges(current_pick %>% mutate(arrowsize = NULL)) %>% anti_join(current_illegal_edges, by = "id") create_graph(nodes, edges_after_pick1) %>% render_graph(title = "After one draw")





[Rendered graph, “After one draw”: the fixed red edge plus the remaining legal grey edges.]


    The name has been removed from the hat, and the graph is simpler now!

    Let’s do it again. Draw a random legal edge (fat arrow) and identify all the
    illegal paths (navy).

    current_pick <- edges_after_pick1 %>% draw_secret_santa_edge( rel = "gives-to", color = "#FF4136", penwidth = 1, arrowsize = 1) current_illegal_edges <- edges_after_pick1 %>% find_illegal_edges(edge = current_pick, color = "#001f3f", penwidth = .5) edges_after_pick1 %>% overwrite_edges(current_pick) %>% overwrite_edges(current_illegal_edges) %>% create_graph(nodes, .) %>% render_graph(title = "Selected vs. illegal")





[Rendered graph, “Selected vs. illegal”: a second randomly drawn edge in red; the newly illegal edges in navy.]


    After deleting illegal edges, the problem simplifies further.

    edges_after_pick2 <- edges_after_pick1 %>% overwrite_edges(current_pick %>% mutate(arrowsize = NULL)) %>% anti_join(current_illegal_edges, by = "id") create_graph(nodes, edges_after_pick2) %>% render_graph(title = "After two draws")





[Rendered graph, “After two draws”: two fixed red edges and an even smaller set of remaining legal edges.]


    You can tell where this is going… Loop Town!

    To finish up, we are going to repeat this process until there are only
    gift-giving edges left. We will control the loop with this helper function
    which tells us if there are any free edges remaining.

has_free_edge <- function(edge_df) {
  edges_left <- edge_df %>%
    filter(rel != "gives-to") %>%
    nrow()
  edges_left != 0
}

    In the function below, the while-loop does the same steps as above: Randomly
    selecting a free edge and removing illegal edges.

    draw_edges_from_hat <- function(edge_df, ...) { aes_options <- quos(...) raw_edge_df <- edge_df indices <- unique(c(raw_edge_df$from, raw_edge_df$to)) while (has_free_edge(edge_df)) { pick <- edge_df %>% draw_secret_santa_edge(!!! aes_options) %>% mutate(rel = "gives-to") illegal_edges <- edge_df %>% find_illegal_edges(pick) edge_df <- edge_df %>% overwrite_edges(pick) %>% anti_join(illegal_edges, by = "id") } if (!has_valid_gift_edges(edge_df, indices)) { warning(call. = FALSE, "Invalid drawing. Trying again.") edge_df <- Recall(raw_edge_df, !!! aes_options) } edge_df }

    After the while-loop, the function checks if it has a valid set of gift edges,
    and if it doesn’t, the function calls itself again. This bit is intended to
    handle more constrained situations. Such as…

    The nibling gift exchange

I lied above. Secret Santa is not so simple in my family. For my generation (with
    me, my siblings and our partners), there’s a rule that a player can’t draw their
    partner’s name. Similarly, my nieces and nephews (and now also my child
    :sparkling_heart:) have
    their own gift exchange with an added constraint: A player can’t give their
    sibling a gift. The elegant and simple Hamiltonian solution fails under these
    constraints unless you write a special shuffling algorithm. Our hat-drawing
    approach, however, handles this situation with minimal effort. Let’s work
    through an example with my nieces and nephews (and :baby:!). To protect the very
    young, I have replaced their names with Pokemon names.

    Below we define the children and their families and do some data-wrangling so
that we have columns with the family at each end of every edge.

    niblings <- tibble::tribble( ~ Family, ~ Name, ~ Number, "Water", "Squirtle", 1, "Water", "Wartortle", 2, "Electric", "Pikachu", 3, "Plant", "Bulbasaur", 4, "Plant", "Ivysaur", 5, "Plant", "Venusaur", 6, "Fighting", "Machamp", 7, "Fighting", "Machoke", 8, "Normal", "Ratata", 9, "Normal", "Raticate", 10, "Psychic", "Mew", 11, "Psychic", "Mewtwo", 12) nibling_edges <- create_all_giving_edges( niblings$Name, rel = "potential-gift", penwidth = .5, color = "#CCCCCC90") %>% left_join(niblings, by = c("from" = "Number")) %>% rename(from_fam = Family) %>% select(-Name) %>% left_join(niblings, by = c("to" = "Number")) %>% rename(to_fam = Family) %>% select(-Name) %>% select(id, from, to, rel, from_fam, to_fam, everything()) nibling_edges #> # A tibble: 132 x 8 #> id from to rel from_fam to_fam penwidth color #> #> 1 1 1 2 potential-gift Water Water 0.5 #CCCCCC90 #> 2 2 1 3 potential-gift Water Electric 0.5 #CCCCCC90 #> 3 3 1 4 potential-gift Water Plant 0.5 #CCCCCC90 #> 4 4 1 5 potential-gift Water Plant 0.5 #CCCCCC90 #> 5 5 1 6 potential-gift Water Plant 0.5 #CCCCCC90 #> 6 6 1 7 potential-gift Water Fighting 0.5 #CCCCCC90 #> 7 7 1 8 potential-gift Water Fighting 0.5 #CCCCCC90 #> 8 8 1 9 potential-gift Water Normal 0.5 #CCCCCC90 #> 9 9 1 10 potential-gift Water Normal 0.5 #CCCCCC90 #> 10 10 1 11 potential-gift Water Psychic 0.5 #CCCCCC90 #> # ... with 122 more rows

In the graph below, we draw an olive edge to connect each pair of siblings.
    These edges are illegal before we even start drawing names.

    sibling_edges <- nibling_edges %>% filter(from_fam == to_fam) %>% mutate( rel = "sibling", color = "#3D9970", penwidth = 1) # Update edges that represent siblings nibling_edges <- nibling_edges %>% overwrite_edges(sibling_edges) nibling_nodes <- create_node_df( n = nrow(niblings), type = niblings$Name, label = niblings$Name) nibling_edges %>% overwrite_edges(sibling_edges) %>% create_graph(nibling_nodes, .) %>% render_graph(height = 400)





[Rendered graph: the twelve niblings with all potential gives-to edges in grey and sibling edges highlighted in olive.]


    The solution for this trickier, more constrained version of the problem is to
    delete the illegal edges beforehand so that they can never be drawn from the
    hat. After that, the same algorithm works as before.

    nibling_edges %>% filter(rel != "sibling") %>% draw_edges_from_hat(color = "#FF4136") %>% create_graph(nibling_nodes, .) %>% render_graph(height = 500) #> Warning: Invalid drawing. Trying again. #> Warning: Invalid drawing. Trying again.





[Rendered graph: the final drawing for the niblings, a set of red gives-to edges with no self-loops and no sibling pairings.]


    Like usual, I’m not sure how to close one of these blog posts. I guess: When a
    problem involves relations among individuals, always consider whether there is a
    graph hiding in the background. Even the simple process of drawing names from a
    hat for Secret Santa describes a graph: a graph of gift-giving relations among
individuals. This graph wasn’t obvious to me until I thought about the Hamiltonian
    path solution. Hey, wait a minute, that’s a big gift-giving circle! It’s like
    some kind of network… ooooooh.


    To leave a comment for the author, please follow the link and comment on their blog: Higher Order Functions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Vectorized Block ifelse in R

    Tue, 11/28/2017 - 06:33

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    Win-Vector LLC has been working on porting some significant large scale production systems from SAS to R.

    From this experience we want to share how to simulate, in R with Apache Spark (via Sparklyr), a nifty SAS feature: the vectorized “block if(){}else{}” structure.

    When porting code from one language to another you hope the expressive power and style of the languages are similar.

    • If the source language is too weak then the original code will be very long (and essentially over specified), meaning a direct transliteration will be unlikely to be efficient, as you are not using the higher order operators of the target language.
    • If the source language is too strong you will have operators that don’t have direct analogues in the target language.

    SAS has some strong and powerful operators. One such is what I am calling “the vectorized block if(){}else{}“. From SAS documentation:

    The subsetting IF statement causes the DATA step to continue processing only those raw data records or those observations from a SAS data set that meet the condition of the expression that is specified in the IF statement.

    That is a really wonderful operator!

R has some related operators available: base::ifelse(), dplyr::if_else(), and dplyr::mutate_if(). However, none of these has the full expressive power of the SAS operator, which can, per data row (a small dplyr contrast follows the list below):

    • Conditionally choose where different assignments are made to (not just choose conditionally which values are taken).
    • Conditionally specify blocks of assignments that happen together.
    • Be efficiently nested and chained with other IF statements.
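To see the gap, here is a rough plain-dplyr sketch (not the seplyr device itself) of two assignments that share one condition; without a block construct the condition has to be repeated for every assigned column:

    library(dplyr)

    d <- tibble(x = c(-2, 1, 5))

    d %>%
      mutate(
        sign      = if_else(x < 0, "negative", "non-negative"),
        magnitude = if_else(x < 0, -x, x)
      )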

To help achieve such expressive power in R, Win-Vector is introducing seplyr::if_else_device(). When combined with seplyr::partition_mutate_se() you get a good high-performance simulation of the SAS power in R. These are now available in the open source R package seplyr.

    For more information please reach out to us here at Win-Vector or try help(if_else_device).

    Also, we will publicize more documentation and examples shortly (especially showing big data scale use with Apache Spark via Sparklyr).


    To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    #11: (Much) Faster Package (Re-)Installation via Caching

    Tue, 11/28/2017 - 03:19

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Welcome to the eleventh post in the rarely rued R rants series, or R4 for short. Time clearly flies as it has been three months since our last post on significantly reducing library size via stripping. I had been meaning to post on today’s topic for quite some time, but somehow something (working on a paper, releasing a package, …) got in the way.

Just a few days ago Colin (of Efficient R Programming fame) posted about speed(ing up) package installation. His recommendation? Remember that we (usually) have multiple cores and use several of them via options(Ncpus = XX). It is an excellent point, and it bears repeating.
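For reference, that looks like the following (the value 4 is just an example; pick the number of cores you actually have):

    options(Ncpus = 4)              # let install.packages() run installs in parallel
    install.packages("tidyverse")   # any install with many dependencies benefits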

But it turns out I have not one but two salient recommendations too. Today covers the first; we should hopefully get to the second pretty soon. Both have one thing in common: you will be fastest if you avoid doing the work in the first place.

    What?

One truly outstanding tool for this in the context of the installation of compiled packages is ccache. It is actually a pretty old tool that has been out for well over a decade, and it comes from the folks that gave us the Samba filesystem.

What does it do? Well, in a nutshell, it "hashes" a checksum of a source file once the preprocessor has operated on it and stores the resulting object file. In the case of a rebuild with unchanged code you get the object code back pretty much immediately. The idea is very similar to memoisation (as implemented in R, for example, in the excellent little memoise package by Hadley, Jim, Kirill and Daniel). The idea is the same: if you have to do something even moderately expensive a few times, do it once and then recall it the other times.
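As a tiny illustration of that memoisation idea (a sketch, unrelated to ccache itself):

    library(memoise)

    slow_square <- function(x) { Sys.sleep(1); x^2 }
    fast_square <- memoise(slow_square)

    fast_square(4)   # first call: takes about a second
    fast_square(4)   # second call: returns the cached result immediately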

This happens (at least to me) more often than not in package development. Maybe you change just one of several source files. Maybe you just change the R code, the Rd documentation or a test file—yet still need a full reinstallation. In all these cases, ccache can help tremendously as illustrated below.

    How?

    Because essentially all our access to compilation happens through R, we need to set this in a file read by R. I use ~/.R/Makevars for this and have something like these lines on my machines:

VER=
CCACHE=ccache
CC=$(CCACHE) gcc$(VER)
CXX=$(CCACHE) g++$(VER)
CXX11=$(CCACHE) g++$(VER)
CXX14=$(CCACHE) g++$(VER)
FC=$(CCACHE) gfortran$(VER)
F77=$(CCACHE) gfortran$(VER)

    That way, when R calls the compiler(s) it will prefix with ccache. And ccache will then speed up.

There is an additional issue due to R use. Often we install from a .tar.gz. These will be freshly unpackaged, and hence have "new" timestamps. This would usually lead ccache to skip the file (fear of "false positives") so we have to override this. Similarly, the tarball is usually unpackaged in a temporary directory with an ephemeral name, creating a unique path. That too needs to be overwritten. So in my ~/.ccache/ccache.conf I have this:

max_size = 5.0G
# important for R CMD INSTALL *.tar.gz as tarballs are expanded freshly -> fresh ctime
sloppiness = include_file_ctime
# also important as the (temp.) directory name will differ
hash_dir = false

Show Me

A quick illustration will round out the post. Some packages are meatier than others. More C++ with more templates usually means longer build times. Below is a quick chart comparing times for a few such packages (i.e. RQuantLib, dplyr, rstan) as well as igraph ("merely" a large C package) and lme4 as well as Rcpp. The worst among these is still my own RQuantLib package wrapping (still just parts of) the ginormous and Boost-heavy QuantLib library.

    Pretty dramatic gains. Best of all, we can of course combine these with other methods such as Colin’s use of multiple CPUs, or even a simple MAKE=make -j4 to have multiple compilation units being considered in parallel. So maybe we all get to spend less time on social media and other timewasters as we spend less time waiting for our builds. Or maybe that is too much to hope for…

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Rule Your Data with Tidy Validation Reports. Design

    Tue, 11/28/2017 - 01:00

    (This article was first published on QuestionFlow , and kindly contributed to R-bloggers)

The story of the design of the ruler package: dplyr-style exploration and validation of data frame-like objects.

    Prologue

Some time ago I had a task to write data validation code. As for most R practitioners, this led to an exploration of existing solutions. I was looking for a package with the following features:

    • Relatively small amount of time should be spent learning before comfortable usage. Preferably, it should be built with tidyverse in mind.
    • It should be quite flexible in terms of types of validation rules.
    • Package should offer functionality for both validations (with relatively simple output format) and assertions (with relatively flexible behaviour).
    • Pipe-friendliness.
    • Validating only data frames would be enough.

After devoting a couple of days to research, I didn’t find any package fully (subjectively) meeting my needs (for a composed list look here). So I decided to write one myself. More precisely, it turned into not one but two packages: ruler and keyholder, which powers some of ruler’s functionality.

This post is a rather long story about key moments in the journey of ruler’s design process. To learn about other aspects see its README (for a relatively brief introduction) or vignettes (for a more thorough description of package capabilities).

    Overview

    In my mind, the whole process of data validation should be performed in the following steps:

    • Create conditions (rules) for data to meet.
    • Expose data to them and obtain some kind of unified report as a result.
    • Act based on the report.

The design process went through a slightly different sequence of definition steps:

    Of course, there was switching between these items in order to ensure they would work well together, but I feel this order was decisive for the end result.

suppressMessages(library(dplyr))
suppressMessages(library(purrr))
library(ruler)

Validation result

Dplyr data units

I started with an attempt at a simple and clear formulation of validation: it is a process of checking whether something satisfies certain conditions. As it was enough to validate only data frames, that something should be thought of as parts of a data frame, which I will call data units. Certain conditions can be represented as functions, which I will call rules, each associated with some data unit and returning TRUE if the condition is satisfied and FALSE otherwise.

I decided to make the dplyr package the default tool for creating rules. The reason is, basically, that it satisfies most conditions I had in mind. I also tend to use it for interactive validation of data frames, as, I am sure, do many other R users. Its pipe-friendliness creates another important feature: interactive code can be transformed into a function just by replacing the initial data frame variable with a dot .. This creates a functional sequence, “a function which applies the entire chain of right-hand sides in turn to its input”.

    dplyr offers a set of tools for operating with the following data units (see comments):

is_integerish <- function(x) {all(x == as.integer(x))}
z_score <- function(x) {abs(x - mean(x)) / sd(x)}
mtcars_tbl <- mtcars %>% as_tibble()

# Data frame as a whole
validate_data <- . %>%
  summarise(nrow_low = n() >= 15, nrow_up = n() <= 20)
mtcars_tbl %>% validate_data()
## # A tibble: 1 x 2
##   nrow_low nrow_up
##
## 1     TRUE   FALSE

# Group as a whole
validate_groups <- . %>%
  group_by(vs, am) %>%
  summarise(vs_am_low = n() >= 7) %>%
  ungroup()
mtcars_tbl %>% validate_groups()
## # A tibble: 4 x 3
##      vs    am vs_am_low
##
## 1     0     0      TRUE
## 2     0     1     FALSE
## 3     1     0      TRUE
## 4     1     1      TRUE

# Column as a whole
validate_columns <- . %>%
  summarise_if(is_integerish, funs(is_enough_sum = sum(.) >= 14))
mtcars_tbl %>% validate_columns()
## # A tibble: 1 x 6
##   cyl_is_enough_sum hp_is_enough_sum vs_is_enough_sum am_is_enough_sum
##
## 1              TRUE             TRUE             TRUE            FALSE
## # ... with 2 more variables: gear_is_enough_sum ,
## #   carb_is_enough_sum

# Row as a whole
validate_rows <- . %>%
  filter(vs == 1) %>%
  transmute(is_enough_sum = rowSums(.) >= 200)
mtcars_tbl %>% validate_rows()
## # A tibble: 14 x 1
##   is_enough_sum
##
## 1          TRUE
## 2          TRUE
## 3          TRUE
## 4          TRUE
## 5          TRUE
## # ... with 9 more rows

# Cell
validate_cells <- . %>%
  transmute_if(is.numeric, funs(is_out = z_score(.) > 1)) %>%
  slice(-(1:5))
mtcars_tbl %>% validate_cells()
## # A tibble: 27 x 11
##   mpg_is_out cyl_is_out disp_is_out hp_is_out drat_is_out wt_is_out
##
## 1      FALSE      FALSE       FALSE     FALSE        TRUE     FALSE
## 2      FALSE       TRUE        TRUE      TRUE       FALSE     FALSE
## 3      FALSE       TRUE       FALSE      TRUE       FALSE     FALSE
## 4      FALSE       TRUE       FALSE     FALSE       FALSE     FALSE
## 5      FALSE      FALSE       FALSE     FALSE       FALSE     FALSE
## # ... with 22 more rows, and 5 more variables: qsec_is_out ,
## #   vs_is_out , am_is_out , gear_is_out , carb_is_out

Tidy data validation report

    After realizing this type of dplyr structure, I noticed the following points.

In order to use dplyr as a tool for creating rules, there should be one extra level of abstraction for the whole functional sequence. It is not a rule but rather several rules. In other words, it is a function that answers multiple questions about one type of data unit. I decided to call this a rule pack or simply pack.

In order to identify whether some data unit obeys some rule, one needs to describe that data unit, the rule and the result of validation. Descriptions of the last two are simple: for a rule it is a combination of pack and rule names (which should always be defined), and for a validation result it is the value TRUE or FALSE.

    Description of data unit is trickier. After some thought, I decided that the most balanced way to do it is with two variables:

    • var (character) which represents the variable name of data unit:
      • Value “.all” is reserved for “all columns as a whole”.
      • Value equal to some column name indicates column of data unit.
      • Value not equal to some column name indicates the name of group: it is created by uniting (with delimiter) group levels of grouping columns.
    • id (integer) which represents the row index of data unit:
      • Value 0 is reserved for “all rows as a whole”.
      • Value not equal to 0 indicates the row index of data unit.

Combinations of these variables describe all mentioned data units (a small illustrative table follows the list below):

    • var == '.all' and id == 0: Data as a whole.
    • var != '.all' and id == 0: Group (var shouldn’t be an actual column name) or column (var should be an actual column name) as a whole.
    • var == '.all' and id != 0: Row as a whole.
    • var != '.all' and id != 0: Described cell.
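To make the encoding concrete, here is a purely illustrative table (the group name “0.1” assumes grouping by vs and am with a dot delimiter; none of these rows come from ruler itself):

    tibble::tribble(
      ~var,   ~id, ~described_data_unit,
      ".all",   0, "data frame as a whole",
      "0.1",    0, "group with vs = 0 and am = 1, as a whole",
      "mpg",    0, "column mpg as a whole",
      ".all",   3, "row 3 as a whole",
      "mpg",    3, "cell of column mpg in row 3"
    )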

    With this knowledge in mind, I decided that the tidy data validation report should be a tibble with the following columns:

    • pack : Pack name.
    • rule : Rule name inside pack.
    • var : Variable name of data unit.
    • id : Row index of data unit.
    • value : Whether the described data unit obeys the rule.
    Exposure

Using only the described report as validation output is possible if only information about breakers (data units which do not obey their respective rules) is of interest. However, reproducibility is a great deal in the R community, and keeping information about the call can be helpful for future use.

This idea led to the creation of another object in ruler called packs info. It is also a tibble which contains all information about the exposure call:

    • name : Name of the rule pack. This column is used to match column pack in tidy report.
    • type : Name of pack type. Indicates which data unit pack checks.
    • fun : List of actually used rule pack functions.
    • remove_obeyers : Value of convenience argument of the future expose function. It indicates whether rows about obeyers (data units that obey certain rule) were removed from report after applying pack.

To fully represent validation, the two described tibbles should be returned together. So the actual validation result is decided to be an exposure, which is basically an S3 class list with two tibbles, packs_info and report. This data structure is fairly easy to understand and use. For example, exposures can be bound together, which is useful for combining several validation results. Also its elements are regular tibbles which can be filtered, summarised, joined, etc.

Rules definition

Interpretation of dplyr output

I was willing to use pure dplyr in creating rule packs, i.e. without extra knowledge of the data unit to be validated. However, I found it impossible to do without running into annoying edge cases. The problem with this approach is that all dplyr outputs are tibbles with similar structures. The only differentiating features are:

    • summarise without grouping returns tibble with one row and user-defined column names.
    • summarise with grouping returns tibble with number of rows equal to number of summarised groups. Columns consist from grouping and user-defined ones.
    • transmute returns tibble with number of rows as in input data frame and user-defined column names.
    • Scoped variants of summarise and transmute differ from regular ones in another mechanism of creating columns. They apply all supplied functions to all chosen columns. Resulting names are “the shortest … needed to uniquely identify the output”. It means that:
      • In case of one function they are column names.
      • In case of more than one function and one column they are function names.
      • In case of more than one column and function they are combinations of column and function names, pasted with character _ (which, unfortunately, is hardcoded). To force this behaviour in previous cases both columns and functions should be named inside of helper functions vars and funs. To match output columns with combination of validated column and rule, this option is preferred. However, there is a need of different separator between column and function names, as character _ is frequently used in column names.

The first attempt was to use the following algorithm to interpret the output (i.e. identify the validated data unit):

    • If there is at least one non-logical column then groups are validated. The reason is that in most cases grouping columns are character or factor ones. This already introduces edge case with logical grouping columns.
    • Combination of whether number of rows equals 1 (n_rows_one) and presence of name separator in all column names (all_contain_sep) is used to make interpretation:
      • If n_rows_one == TRUE and all_contain_sep == FALSE then data is validated.
      • If n_rows_one == TRUE and all_contain_sep == TRUE then columns are validated.
      • If n_rows_one == FALSE and all_contain_sep == FALSE then rows are validated. This introduces an edge case when output has one row which is intended to be validated. It will be interpreted as ‘data as a whole’.
      • If n_rows_one == FALSE and all_contain_sep == TRUE then cells are validated. This also has edge case when output has one row in which cells are intended to be validated. It will be interpreted as ‘columns as a whole’.

Despite having edge cases, this algorithm is good for guessing the validated data unit, which can be useful for interactive use. Its important prerequisite is to have a simple way of forcing extended naming in scoped dplyr verbs, with a custom, rarely used separator.

    Pack creation

Research into a pure dplyr-style way of creating rule packs left no choice but to create a mechanism for supplying information about the data unit of interest along with the pack functions. It consists of the following important principles.

    Use ruler’s function rules() instead of funs(). Its goals are to force usage of full naming in scoped dplyr verbs as much as possible and impute missing rule names (as every rule should be named for validation report). rules is just a wrapper for funs but with extra functionality of naming its every output element and adding prefix to that names (which will be used as a part of separator between column and rule name). By default prefix is a string ._.. It is chosen for its, hopefully, rare usage inside column names and symbolism (it is the Morse code of letter ‘R’).

    funs(mean, sd) ## ## $ mean: mean(.) ## $ sd : sd(.) rules(mean, sd) ## ## $ ._.rule..1: mean(.) ## $ ._.rule..2: sd(.) rules(mean, sd, .prefix = "___") ## ## $ ___rule..1: mean(.) ## $ ___rule..2: sd(.) rules(fn_1 = mean, fn_2 = sd) ## ## $ ._.fn_1: mean(.) ## $ ._.fn_2: sd(.)

    Note that in case of using only one column in scoped verb it should be named within dplyr::vars in order to force full naming.

    Use functions supported by keyholder to build rule packs. One of the main features I was going to implement is a possibility of validating only a subset of all possible data units. For example, validation of only last two rows (or columns) of data frame. There is no problem with columns: they can be specified with summarise_at. However, the default way of specifying rows is by subsetting data frame, after which all information about original row position is lost. To solve this, I needed a mechanism of tracking rows as invisibly for user as possible. This led to creation of keyholder package (which is also on CRAN now). To learn details about it go to its site or read my previous post.

    Use specific rule pack wrappers for certain data units. Their goal is to create S3 classes for rule packs in order to carry information about data unit of interest through exposing process. All of them always return a list with supplied functions but with changed attribute class (with additional group_vars and group_sep for group_packs()). Note that packs might be named inside these functions, which is recommended. If not, names will be imputed during exposing process. Also note that supplied functions are not checked to be correct in terms of validating specified data unit. This is done during exposure (exposing process).

# Data unit. Rule pack is manually named 'my_data'
my_data_packs <- data_packs(my_data = validate_data)
map(my_data_packs, class)
## $my_data
## [1] "data_pack" "rule_pack" "fseq"      "function"

# Group unit. Need to supply grouping variables explicitly
my_group_packs <- group_packs(validate_groups, .group_vars = c("vs", "am"))
map(my_group_packs, class)
## [[1]]
## [1] "group_pack" "rule_pack"  "fseq"       "function"

# Column unit. Need to be rewritten using `rules`
my_col_packs <- col_packs(
  my_col = . %>%
    summarise_if(is_integerish, rules(is_enough_sum = sum(.) >= 14))
)
map(my_col_packs, class)
## $my_col
## [1] "col_pack"  "rule_pack" "fseq"      "function"

# Row unit. One can supply several rule packs
my_row_packs <- row_packs(
  my_row_1 = validate_rows,
  my_row_2 = . %>% transmute(is_vs_one = vs == 1)
)
map(my_row_packs, class)
## $my_row_1
## [1] "row_pack"  "rule_pack" "fseq"      "function"
##
## $my_row_2
## [1] "row_pack"  "rule_pack" "fseq"      "function"

# Cell unit. Also needs to be rewritten using `rules`.
my_cell_packs <- cell_packs(
  my_cell = . %>%
    transmute_if(is.numeric, rules(is_out = z_score(.) > 1)) %>%
    slice(-(1:5))
)
map(my_cell_packs, class)
## $my_cell
## [1] "cell_pack" "rule_pack" "fseq"      "function"

Exposing process

After sorting things out with the formats of the validation result and rule packs, it was time to combine them in ruler’s main function: expose(). I had the following requirements:

    • It should be insertable inside common %>% pipelines as smoothly and flexibly as possible. Two main examples are validating data frame before performing some operations with it and actually obtaining results of validation.
    • There should be possibility of sequential apply of expose with different rule packs. In this case exposure (validation report) after first call should be updated with new exposure. In other words, the result should be as if those rule packs were both supplied in expose by one call.

    These requirements led to the following main design property of expose: it never modifies content of input data frame but possibly creates or updates attribute exposure with validation report. To access validation data there are wrappers get_exposure(), get_report() and get_packs_info(). The whole exposing process can be described as follows:

    • Apply all supplied rule packs to keyed with keyholder::use_id version of input data frame.
    • Impute names of rule packs based on possible present exposure (from previous use of expose) and validated data units.
    • Bind possible present exposure with new ones and create/update attribute exposure with it.

    Also it was decided (for flexibility and convenience) to add following arguments to expose:

    • .rule_sep. It is a regular expression used to delimit column and function names in the output of scoped dplyr verbs. By default it is a string ._. possibly surrounded by punctuation characters. This is done to account of dplyr’s hardcoded use of _ in scoped verbs. Note that .rule_sep should take into account separator used in rules().
    • .remove_obeyers. It is a logical argument indicating whether to automatically remove elements, which obey rules, from tidy validation report. It can be very useful because the usual result of validation is a handful of rule breakers. Without possibility of setting .remove_obeyers to TRUE (which is default) validation report will grow unnecessary big.
    • .guess. By default expose guesses the type of unsupported rule pack type with algorithm described before. In order to write strict and robust code this can be set to FALSE in which case error will be thrown after detecting unfamiliar pack type.

    Some examples:

mtcars_tbl %>%
  expose(my_data_packs, my_col_packs) %>%
  get_exposure()
## Exposure
##
## Packs info:
## # A tibble: 2 x 4
##      name      type   fun remove_obeyers
##
## 1 my_data data_pack                 TRUE
## 2  my_col  col_pack                 TRUE
##
## Tidy data validation report:
## # A tibble: 2 x 5
##      pack          rule  var    id value
##
## 1 my_data       nrow_up .all     0 FALSE
## 2  my_col is_enough_sum   am     0 FALSE

# Note that `id` starts from 6 as rows 1:5 were removed from validating
mtcars_tbl %>%
  expose(my_cell_packs, .remove_obeyers = FALSE) %>%
  get_exposure()
## Exposure
##
## Packs info:
## # A tibble: 1 x 4
##      name      type   fun remove_obeyers
##
## 1 my_cell cell_pack                FALSE
##
## Tidy data validation report:
## # A tibble: 297 x 5
##      pack   rule  var    id value
##
## 1 my_cell is_out  mpg     6 FALSE
## 2 my_cell is_out  mpg     7 FALSE
## 3 my_cell is_out  mpg     8 FALSE
## 4 my_cell is_out  mpg     9 FALSE
## 5 my_cell is_out  mpg    10 FALSE
## # ... with 292 more rows

# Note name imputation and guessing
mtcars_tbl %>%
  expose(my_data_packs, .remove_obeyers = FALSE) %>%
  expose(validate_rows) %>%
  get_exposure()
## Exposure
##
## Packs info:
## # A tibble: 2 x 4
##          name      type   fun remove_obeyers
##
## 1     my_data data_pack                FALSE
## 2 row_pack..1  row_pack                 TRUE
##
## Tidy data validation report:
## # A tibble: 3 x 5
##          pack          rule  var    id value
##
## 1     my_data      nrow_low .all     0  TRUE
## 2     my_data       nrow_up .all     0 FALSE
## 3 row_pack..1 is_enough_sum .all    19 FALSE

Act after exposure

    After creating data frame with attribute exposure, it is pretty straightforward to design how to perform any action. It is implemented in function act_after_exposure with the following arguments:

    • .tbl which should be the result of using expose().
    • .trigger: function which takes .tbl as argument and returns TRUE if some action needs to be performed.
• .actor: function which takes .tbl as argument and performs the action.

    Basically act_after_exposure() is doing the following:

    • Check that .tbl has a proper exposure attribute.
    • Compute whether to perform intended action by computing .trigger(.tbl).
    • If trigger results in TRUE then .actor(.tbl) is returned. In other case .tbl is returned.

It is a good idea for .actor to do one of two things:

    • Making side effects. For example throwing an error (if condition in .trigger is met), printing some information and so on. In this case it should return .tbl to be used properly inside a pipe.
    • Changing .tbl based on exposure information. In this case it should return the imputed version of .tbl.

    As a main use case, ruler has function assert_any_breaker. It is a wrapper for act_after_exposure with .trigger checking presence of any breaker in exposure and .actor being notifier about it.

mtcars_tbl %>%
  expose(my_data_packs) %>%
  assert_any_breaker()
## Breakers report
## Tidy data validation report:
## # A tibble: 1 x 5
##      pack    rule  var    id value
##
## 1 my_data nrow_up .all     0 FALSE
## Error: assert_any_breaker: Some breakers found in exposure.

Conclusions
    • Design process of a package deserves its own story.
    • Package ruler offers tools for dplyr-style exploration and validation of data frame like objects. With its help validation is done with 3 commands/steps each designed for specific purpose.

    sessionInfo()

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
##
## locale:
##  [1] LC_CTYPE=ru_UA.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=ru_UA.UTF-8        LC_COLLATE=ru_UA.UTF-8
##  [5] LC_MONETARY=ru_UA.UTF-8    LC_MESSAGES=ru_UA.UTF-8
##  [7] LC_PAPER=ru_UA.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base
##
## other attached packages:
## [1] bindrcpp_0.2 ruler_0.1.0  purrr_0.2.4  dplyr_0.7.4
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13     knitr_1.17       bindr_0.1        magrittr_1.5
##  [5] tidyselect_0.2.3 R6_2.2.2         rlang_0.1.4      stringr_1.2.0
##  [9] tools_3.4.2      htmltools_0.3.6  yaml_2.1.14      rprojroot_1.2
## [13] digest_0.6.12    assertthat_0.2.0 tibble_1.3.4     bookdown_0.5
## [17] tidyr_0.7.2      glue_1.2.0       evaluate_0.10.1  rmarkdown_1.7
## [21] blogdown_0.2     stringi_1.1.5    compiler_3.4.2   keyholder_0.1.1
## [25] backports_1.1.1  pkgconfig_2.0.1

    To leave a comment for the author, please follow the link and comment on their blog: QuestionFlow . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    MICE (Multiple Imputation by Chained Equations) in R – sketchnotes from MünsteR Meetup

    Tue, 11/28/2017 - 01:00

    (This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

    Last night, the MünsteR R user-group had another great meetup:

    Karin Groothuis-Oudshoorn, Assistant Professor at the University of Twente, presented her R package mice about Multivariate Imputation by Chained Equations.

    It was a very interesting talk and here are my sketchnotes that I took during it:

    MICE talk sketchnotes

    Here is the link to the paper referenced in my notes: https://www.jstatsoft.org/article/view/v045i03

    “The mice package implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model. The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. In addition, MICE can impute continuous two-level data, and maintain consistency between imputations by means of passive imputation. Many diagnostic plots are implemented to inspect the quality of the imputations.” (https://cran.r-project.org/web/packages/mice/README.html)

    For more information on the package go to http://stefvanbuuren.github.io/mice/.
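    As a minimal illustration of the package (this snippet is not from the talk; it uses the nhanes example data that ships with mice):

    library(mice)

    # nhanes ships with mice and contains missing values
    imp <- mice(nhanes, m = 5, method = "pmm", seed = 1)  # create 5 imputed data sets
    fit <- with(imp, lm(bmi ~ age + chl))                 # fit a model on each imputed set
    pool(fit)                                             # pool the results across imputations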


    To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Building the oomsifyer

    Tue, 11/28/2017 - 01:00

    (This article was first published on Clean Code, and kindly contributed to R-bloggers)

    Today I will show you a quick hack (OK, it took me about 4 hours during my travels yesterday and today)
    on how to add a dancing banana to any picture.

    Now, you might be thinking: really, why would you add a dancing banana to
    a picture? But I don’t have time for that kind of negativity in my life.

    Why oomsifier?

    Jeroen Ooms is one of those crazy productive people in the R world. The way he
    wraps C++ libraries into packages makes you think his middle name is c-plus-plus.

    At the satRday in Budapest in 2016 (?), Jeroen demonstrated the magick package.
    You can now control images from inside R using dark magic and the bindings to
    the ImageMagick library. In honor of this work, and because I needed a cool name,
    I will demonstrate THE OOMSIFYER.

    Solving the lifelong dream of people all around the world … of adding dancing banana gifs to pictures…

    If you are like me, you would have thought this would be really easy, but you would be wrong.

    First the function, then the explanation, and some stuff that took me waaay too long
    to find out.

    The function

    library(magick)  # add_banana() uses magick throughout

    banana <- image_read("images/banana.gif") # this assumes you have a project with the folder /images/ inside.

    add_banana <- function(image, offset = NULL, debug = FALSE){
      image_in <- magick::image_read(image)
      banana <- image_read("images/banana.gif") # 365w 360 h
      image_info <- image_info(image_in)
      if("gif" %in% image_info$format){stop("gifs are too difficult for me now")}
      stopifnot(nrow(image_info) == 1)
      # scale banana to correct size:
      # take smallest dimension.
      target_height <- min(image_info$width, image_info$height)
      # scale banana to 1/3 of the size
      scaling <- (target_height / 3)
      front <- image_scale(banana, scaling)
      # place in lower right corner
      # offset is width and height minus the dimensions of the scaled banana
      scaled_dims <- image_info(front)
      x_c <- image_info$width - scaled_dims$width
      y_c <- image_info$height - scaled_dims$height
      offset_value <- ifelse(is.null(offset), paste0("+", x_c, "+", y_c), offset)
      if(debug) print(offset_value)
      frames <- lapply(as.list(front), function(x) image_composite(image_in, x, offset = offset_value))
      result <- image_animate(image_join(frames), fps = 10)
      result
    }

    So what can it do?

    This function takes an image, e.g. a ggplot2 plot that you saved to disk, and adds the dancing banana gif to the bottom-right corner.
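    For instance, assuming you have saved a plot to images/myplot.png (the file names here are hypothetical), a call could look like this:

    # Hypothetical file names; image_write() comes from magick
    banana_gif <- add_banana("images/myplot.png")
    banana_gif                                           # preview in the viewer
    image_write(banana_gif, "images/myplot_banana.gif")  # save the animated result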

    I had to combine information from the magick vignette and an earlier blogpost about magick in R.

    Things I learned:

    • the magick package returns image information as a data frame
    • a gif is a succession of images (frames)
    • a normal picture is a single frame
    • to combine a gif and a single frame you have to have exactly as many frames of your original picture as the number of frames in the gif
    • for every frame you have to merge the gif and image with each other into a composite image
    • the offset is the number of pixels from the left and from the top of the frame. I wanted to put the dancing banana at the bottom right of the picture AND I wanted to scale the banana so that it would not take over the entire image, so the offset needed to be responsive to both the scaling and the input dimensions
    • the practical dev has many silly o-reilly like book covers that I find hilarious

    In the following posts I might turn this function into an API, so stay tuned!

    Building the oomsifyer was originally published at Clean Code on November 28, 2017.


    To leave a comment for the author, please follow the link and comment on their blog: Clean Code. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Customer Analytics: Using Deep Learning With Keras To Predict Customer Churn

    Tue, 11/28/2017 - 01:00

    (This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

    Customer churn is a problem that all companies need to monitor, especially those that depend on subscription-based revenue streams. The simple fact is that most organizations have data that can be used to target these individuals and to understand the key drivers of churn, and we now have Keras for Deep Learning available in R (yes, in R!!). In this article we use it to predict customer churn with 82% accuracy. We’re super excited because we are using the new keras package to produce an Artificial Neural Network (ANN) model on the IBM Watson Telco Customer Churn Data Set! As for most business problems, it’s equally important to explain what features drive the model, which is why we’ll use the lime package for explainability. We also cross-check the LIME results with a Correlation Analysis using the corrr package. We’re not done yet: in addition, we use three new packages to assist with Machine Learning (ML): recipes for preprocessing, rsample for sampling data, and yardstick for model metrics. These are relatively new additions to CRAN developed by Max Kuhn at RStudio (creator of the caret package). It seems that R is quickly developing ML tools that rival Python, which is good news if you’re interested in applying Deep Learning in R! We are, so let’s get going!!

    Customer Churn: Hurts Sales, Hurts Company

    Customer churn refers to the situation when a customer ends their relationship with a company, and it’s a costly problem. Customers are the fuel that powers a business. Loss of customers impacts sales. Further, it’s much more difficult and costly to gain new customers than it is to retain existing customers. As a result, organizations need to focus on reducing customer churn.

    The good news is that machine learning can help. For many businesses that offer subscription based services, it’s critical to both predict customer churn and explain what features relate to customer churn. Older techniques such as logistic regression can be less accurate than newer techniques such as deep learning, which is why we are going to show you how to model an ANN in R with the keras package.

    Churn Modeling With Artificial Neural Networks (Keras)

    Artificial Neural Networks (ANN) are now a staple within the sub-field of Machine Learning called Deep Learning. Deep learning algorithms can be vastly superior to traditional regression and classification methods (e.g. linear and logistic regression) because of the ability to model interactions between features that would otherwise go undetected. The challenge becomes explainability, which is often needed to support the business case. The good news is we get the best of both worlds with keras and lime.

    IBM Watson Dataset (Where We Got The Data)

    The dataset used for this tutorial is the IBM Watson Telco Dataset. According to IBM, the business challenge is…

    A telecommunications company [Telco] is concerned about the number of customers leaving their landline business for cable competitors. They need to understand who is leaving. Imagine that you’re an analyst at this company and you have to find out who is leaving and why.

    The dataset includes information about:

    • Customers who left within the last month: The column is called Churn
    • Services that each customer has signed up for: phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
    • Customer account information: how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
    • Demographic info about customers: gender, age range, and if they have partners and dependents
    Deep Learning With Keras (What We Did With The Data)

    In this example we show you how to use keras to develop a sophisticated and highly accurate deep learning model in R. We walk you through the preprocessing steps, investing time into how to format the data for Keras. We inspect the various classification metrics, and show that an un-tuned ANN model can easily get 82% accuracy on the unseen data. Here’s the deep learning training history visualization.

    We have some fun with preprocessing the data (yes, preprocessing can actually be fun and easy!). We use the new recipes package to simplify the preprocessing workflow.

    We end by showing you how to explain the ANN with the lime package. Neural networks used to be frowned upon because of their “black box” nature, meaning these sophisticated models (ANNs are highly accurate) are difficult to explain using traditional methods. Not anymore, thanks to LIME! Here’s the feature importance visualization.

    We also cross-checked the LIME results with a Correlation Analysis using the corrr package. Here’s the correlation visualization.

    We even built an ML-Powered Interactive PowerBI Web Application with a Customer Scorecard to monitor customer churn risk and to make recommendations on how to improve customer health! Feel free to take it for a spin.

    View in Full Screen Mode for best experience

    Credits

    We saw that just last week the same Telco customer churn dataset was used in the article, Predict Customer Churn – Logistic Regression, Decision Tree and Random Forest. We thought the article was excellent.

    This article takes a different approach with Keras, LIME, Correlation Analysis, and a few other cutting edge packages. We encourage the readers to check out both articles because, although the problem is the same, both solutions are beneficial to those learning data science and advanced modeling.

    Prerequisites

    We use the following libraries in this tutorial:

    Install the following packages with install.packages().

    pkgs <- c("keras", "lime", "tidyquant", "rsample", "recipes", "yardstick", "corrr")
    install.packages(pkgs)

    Load Libraries

    Load the libraries.

    # Load libraries
    library(keras)
    library(lime)
    library(tidyquant)
    library(rsample)
    library(recipes)
    library(yardstick)
    library(corrr)

    If you have not previously run Keras in R, you will need to install Keras using the install_keras() function.

    # Install Keras if you have not installed before
    install_keras()

    Import Data

    Download the IBM Watson Telco Data Set here. Next, use read_csv() to import the data into a nice tidy data frame. We use the glimpse() function to quickly inspect the data. We have the target “Churn” and all other variables are potential predictors. The raw data set needs to be cleaned and preprocessed for ML.

    # Import data churn_data_raw <- read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv") glimpse(churn_data_raw) ## Observations: 7,043 ## Variables: 21 ## $ customerID "7590-VHVEG", "5575-GNVDE", "3668-QPYBK"... ## $ gender "Female", "Male", "Male", "Male", "Femal... ## $ SeniorCitizen 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... ## $ Partner "Yes", "No", "No", "No", "No", "No", "No... ## $ Dependents "No", "No", "No", "No", "No", "No", "Yes... ## $ tenure 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, ... ## $ PhoneService "No", "Yes", "Yes", "No", "Yes", "Yes", ... ## $ MultipleLines "No phone service", "No", "No", "No phon... ## $ InternetService "DSL", "DSL", "DSL", "DSL", "Fiber optic... ## $ OnlineSecurity "No", "Yes", "Yes", "Yes", "No", "No", "... ## $ OnlineBackup "Yes", "No", "Yes", "No", "No", "No", "Y... ## $ DeviceProtection "No", "Yes", "No", "Yes", "No", "Yes", "... ## $ TechSupport "No", "No", "No", "Yes", "No", "No", "No... ## $ StreamingTV "No", "No", "No", "No", "No", "Yes", "Ye... ## $ StreamingMovies "No", "No", "No", "No", "No", "Yes", "No... ## $ Contract "Month-to-month", "One year", "Month-to-... ## $ PaperlessBilling "Yes", "No", "Yes", "No", "Yes", "Yes", ... ## $ PaymentMethod "Electronic check", "Mailed check", "Mai... ## $ MonthlyCharges 29.85, 56.95, 53.85, 42.30, 70.70, 99.65... ## $ TotalCharges 29.85, 1889.50, 108.15, 1840.75, 151.65,... ## $ Churn "No", "No", "Yes", "No", "Yes", "Yes", "... Preprocess Data

    We’ll go through a few steps to preprocess the data for ML. First, we “prune” the data, which is nothing more than removing unnecessary columns and rows. Then we split into training and testing sets. After that we explore the training set to uncover transformations that will be needed for deep learning. We save the best for last. We end by preprocessing the data with the new recipes package.

    Prune The Data

    The data has a few columns and rows we’d like to remove:

    • The “customerID” column is a unique identifier for each observation that isn’t needed for modeling. We can de-select this column.
    • The data has 11 NA values all in the “TotalCharges” column. Because it’s such a small percentage of the total population (99.8% complete cases), we can drop these observations with the drop_na() function from tidyr. Note that these may be customers that have not yet been charged, and therefore an alternative is to replace with zero or -99 to segregate this population from the rest.
    • My preference is to have the target in the first column so we’ll include a final select operation to do so.

    We’ll perform the cleaning operation with one tidyverse pipe (%>%) chain.

    # Remove unnecessary data churn_data_tbl <- churn_data_raw %>% select(-customerID) %>% drop_na() %>% select(Churn, everything()) glimpse(churn_data_tbl) ## Observations: 7,032 ## Variables: 20 ## $ Churn "No", "No", "Yes", "No", "Yes", "Yes", "... ## $ gender "Female", "Male", "Male", "Male", "Femal... ## $ SeniorCitizen 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... ## $ Partner "Yes", "No", "No", "No", "No", "No", "No... ## $ Dependents "No", "No", "No", "No", "No", "No", "Yes... ## $ tenure 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, ... ## $ PhoneService "No", "Yes", "Yes", "No", "Yes", "Yes", ... ## $ MultipleLines "No phone service", "No", "No", "No phon... ## $ InternetService "DSL", "DSL", "DSL", "DSL", "Fiber optic... ## $ OnlineSecurity "No", "Yes", "Yes", "Yes", "No", "No", "... ## $ OnlineBackup "Yes", "No", "Yes", "No", "No", "No", "Y... ## $ DeviceProtection "No", "Yes", "No", "Yes", "No", "Yes", "... ## $ TechSupport "No", "No", "No", "Yes", "No", "No", "No... ## $ StreamingTV "No", "No", "No", "No", "No", "Yes", "Ye... ## $ StreamingMovies "No", "No", "No", "No", "No", "Yes", "No... ## $ Contract "Month-to-month", "One year", "Month-to-... ## $ PaperlessBilling "Yes", "No", "Yes", "No", "Yes", "Yes", ... ## $ PaymentMethod "Electronic check", "Mailed check", "Mai... ## $ MonthlyCharges 29.85, 56.95, 53.85, 42.30, 70.70, 99.65... ## $ TotalCharges 29.85, 1889.50, 108.15, 1840.75, 151.65,... Split Into Train/Test Sets

    We have a new package, rsample, which is very useful for sampling methods. It has the initial_split() function for splitting data sets into training and testing sets. The return is a special rsplit object.

    # Split test/training sets
    set.seed(100)
    train_test_split <- initial_split(churn_data_tbl, prop = 0.8)
    train_test_split
    ## <5626/1406/7032>

    We can retrieve our training and testing sets using training() and testing() functions.

    # Retrieve train and test sets
    train_tbl <- training(train_test_split)
    test_tbl  <- testing(train_test_split)

    Exploration: What Transformation Steps Are Needed For ML?

    This phase of the analysis is often called exploratory analysis, but basically we are trying to answer the question, “What steps are needed to prepare for ML?” The key concept is knowing what transformations are needed to run the algorithm most effectively. Artificial Neural Networks are best when the data is one-hot encoded, scaled and centered. In addition, other transformations may be beneficial as well to make relationships easier for the algorithm to identify. A full exploratory analysis is not practical in this article. With that said we’ll cover a few tips on transformations that can help as they relate to this dataset. In the next section, we will implement the preprocessing techniques.

    Discretize The “tenure” Feature

    Numeric features like age, years worked, length of time in a position can generalize a group (or cohort). We see this in marketing a lot (think “millennials”, which identifies a group born in a certain timeframe). The “tenure” feature falls into this category of numeric features that can be discretized into groups.

    We can split into six cohorts that divide up the user base by tenure in roughly one year (12 month) increments. This should help the ML algorithm detect if a group is more/less susceptible to customer churn.
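    As a rough illustration of the idea (the actual binning is done later with recipes::step_discretize(); this snippet assumes tenure tops out around 72 months, as it does in this dataset):

    # Illustration only: manual 12-month cohorts for tenure
    train_tbl %>%
      mutate(tenure_cohort = cut(tenure, breaks = seq(0, 72, by = 12), include.lowest = TRUE)) %>%
      count(tenure_cohort, Churn)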

    Transform The “TotalCharges” Feature

    What we don’t like to see is when a lot of observations are bunched within a small part of the range.

    We can use a log transformation to even out the data into more of a normal distribution. It’s not perfect, but it’s quick and easy to get our data spread out a bit more.

    Pro Tip: A quick test is to see if the log transformation increases the magnitude of the correlation between “TotalCharges” and “Churn”. We’ll use a few dplyr operations along with the corrr package to perform a quick correlation.

    • correlate(): Performs tidy correlations on numeric data
    • focus(): Similar to select(). Takes columns and focuses on only the rows/columns of importance.
    • fashion(): Makes the formatting aesthetically easier to read.
    # Determine if log transformation improves correlation
    # between TotalCharges and Churn
    train_tbl %>%
      select(Churn, TotalCharges) %>%
      mutate(
        Churn = Churn %>% as.factor() %>% as.numeric(),
        LogTotalCharges = log(TotalCharges)
      ) %>%
      correlate() %>%
      focus(Churn) %>%
      fashion()
    ##           rowname Churn
    ## 1    TotalCharges  -.20
    ## 2 LogTotalCharges  -.25

    The correlation between “Churn” and “LogTotalCharges” is greatest in magnitude indicating the log transformation should improve the accuracy of the ANN model we build. Therefore, we should perform the log transformation.

    One-Hot Encoding

    One-hot encoding is the process of converting categorical data to sparse data, which has columns of only zeros and ones (this is also called creating “dummy variables” or a “design matrix”). All non-numeric data will need to be converted to dummy variables. This is simple for binary Yes/No data because we can simply convert to 1’s and 0’s. It becomes slightly more complicated with multiple categories, which requires creating new columns of 1’s and 0’s for each category (actually one less). We have four features that are multi-category: Contract, Internet Service, Multiple Lines, and Payment Method.

    Feature Scaling

    ANNs typically perform faster, and often with higher accuracy, when the features are scaled and/or normalized (aka centered and scaled, also known as standardizing). Because ANNs use gradient descent, weights tend to update faster. According to Sebastian Raschka, an expert in the field of Deep Learning, several examples when feature scaling is important are:

    • k-nearest neighbors with an Euclidean distance measure if want all features to contribute equally
    • k-means (see k-nearest neighbors)
    • logistic regression, SVMs, perceptrons, neural networks etc. if you are using gradient descent/ascent-based optimization, otherwise some weights will update much faster than others
    • linear discriminant analysis, principal component analysis, kernel principal component analysis since you want to find directions of maximizing the variance (under the constraints that those directions/eigenvectors/principal components are orthogonal); you want to have features on the same scale since you’d emphasize variables on “larger measurement scales” more. There are many more cases than I can possibly list here … I always recommend you to think about the algorithm and what it’s doing, and then it typically becomes obvious whether we want to scale your features or not.

    The interested reader can read Sebastian Raschka’s article for a full discussion on the scaling/normalization topic. Pro Tip: When in doubt, standardize the data.

    Preprocessing With Recipes

    Let’s implement the preprocessing steps/transformations uncovered during our exploration. Max Kuhn (creator of caret) has been putting some work into R’s ML tooling lately, and the payoff is beginning to take shape. A new package, recipes, makes creating ML data preprocessing workflows a breeze! It takes a little getting used to, but I’ve found that it really helps manage the preprocessing steps. We’ll go over the nitty gritty as it applies to this problem.

    Step 1: Create A Recipe

    A “recipe” is nothing more than a series of steps you would like to perform on the training, testing and/or validation sets. Think of preprocessing data like baking a cake (I’m not a baker but stay with me). The recipe is our steps to make the cake. It doesn’t do anything other than create the playbook for baking.

    We use the recipe() function to implement our preprocessing steps. The function takes a familiar object argument, which is a modeling formula such as object = Churn ~ ., meaning “Churn” is the outcome (aka response, target) and all other features are predictors. The function also takes the data argument, which gives the “recipe steps” perspective on how to apply them during baking (next).

    A recipe is not very useful until we add “steps”, which are used to transform the data during baking. The package contains a number of useful “step functions” that can be applied. The entire list of Step Functions can be viewed here. For our model, we use:

    1. step_discretize() with the option = list(cuts = 6) to cut the continuous variable “tenure” (number of months as a customer) and group customers into cohorts.
    2. step_log() to log transform “TotalCharges”.
    3. step_dummy() to one-hot encode the categorical data. Note that this adds columns of one/zero for categorical data with three or more categories.
    4. step_center() to mean-center the data.
    5. step_scale() to scale the data.

    The last step is to prepare the recipe with the prep() function. This step is used to “estimate the required parameters from a training set that can later be applied to other data sets”. This is important for centering and scaling and other functions that use parameters defined from the training set.

    Here’s how simple it is to implement the preprocessing steps that we went over!

    # Create recipe
    rec_obj <- recipe(Churn ~ ., data = train_tbl) %>%
      step_discretize(tenure, options = list(cuts = 6)) %>%
      step_log(TotalCharges) %>%
      step_dummy(all_nominal(), -all_outcomes()) %>%
      step_center(all_predictors(), -all_outcomes()) %>%
      step_scale(all_predictors(), -all_outcomes()) %>%
      prep(data = train_tbl)
    ## step 1 discretize training
    ## step 2 log training
    ## step 3 dummy training
    ## step 4 center training
    ## step 5 scale training

    We can print the recipe object if we ever forget what steps were used to prepare the data. Pro Tip: We can save the recipe object as an RDS file using saveRDS(), and then use it to bake() (discussed next) future raw data into ML-ready data in production!
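    A minimal sketch of that Pro Tip (the file name and the new_raw_tbl object are hypothetical placeholders):

    # Save the prepared recipe for reuse in production
    saveRDS(rec_obj, "rec_obj.rds")

    # Later, e.g. in a scoring script:
    rec_obj      <- readRDS("rec_obj.rds")
    new_data_tbl <- bake(rec_obj, newdata = new_raw_tbl)  # new_raw_tbl = incoming raw data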

    # Print the recipe object
    rec_obj
    ## Data Recipe
    ##
    ## Inputs:
    ##
    ##       role #variables
    ##    outcome          1
    ##  predictor         19
    ##
    ## Training data contained 5626 data points and no missing data.
    ##
    ## Steps:
    ##
    ## Dummy variables from tenure [trained]
    ## Log transformation on TotalCharges [trained]
    ## Dummy variables from ~gender, ~Partner, ... [trained]
    ## Centering for SeniorCitizen, ... [trained]
    ## Scaling for SeniorCitizen, ... [trained]

    Step 2: Baking With Your Recipe

    Now for the fun part! We can apply the “recipe” to any data set with the bake() function, and it processes the data following our recipe steps. We’ll apply to our training and testing data to convert from raw data to a machine learning dataset. Check our training set out with glimpse(). Now that’s an ML-ready dataset prepared for ANN modeling!!

    # Predictors x_train_tbl <- bake(rec_obj, newdata = train_tbl) x_test_tbl <- bake(rec_obj, newdata = test_tbl) glimpse(x_train_tbl) ## Observations: 5,626 ## Variables: 35 ## $ SeniorCitizen -0.4351959, -0.4351... ## $ MonthlyCharges -1.1575972, -0.2601... ## $ TotalCharges -2.275819130, 0.389... ## $ gender_Male -1.0016900, 0.99813... ## $ Partner_Yes 1.0262054, -0.97429... ## $ Dependents_Yes -0.6507747, -0.6507... ## $ tenure_bin1 2.1677790, -0.46121... ## $ tenure_bin2 -0.4389453, -0.4389... ## $ tenure_bin3 -0.4481273, -0.4481... ## $ tenure_bin4 -0.4509837, 2.21698... ## $ tenure_bin5 -0.4498419, -0.4498... ## $ tenure_bin6 -0.4337508, -0.4337... ## $ PhoneService_Yes -3.0407367, 0.32880... ## $ MultipleLines_No.phone.service 3.0407367, -0.32880... ## $ MultipleLines_Yes -0.8571364, -0.8571... ## $ InternetService_Fiber.optic -0.8884255, -0.8884... ## $ InternetService_No -0.5272627, -0.5272... ## $ OnlineSecurity_No.internet.service -0.5272627, -0.5272... ## $ OnlineSecurity_Yes -0.6369654, 1.56966... ## $ OnlineBackup_No.internet.service -0.5272627, -0.5272... ## $ OnlineBackup_Yes 1.3771987, -0.72598... ## $ DeviceProtection_No.internet.service -0.5272627, -0.5272... ## $ DeviceProtection_Yes -0.7259826, 1.37719... ## $ TechSupport_No.internet.service -0.5272627, -0.5272... ## $ TechSupport_Yes -0.6358628, -0.6358... ## $ StreamingTV_No.internet.service -0.5272627, -0.5272... ## $ StreamingTV_Yes -0.7917326, -0.7917... ## $ StreamingMovies_No.internet.service -0.5272627, -0.5272... ## $ StreamingMovies_Yes -0.797388, -0.79738... ## $ Contract_One.year -0.5156834, 1.93882... ## $ Contract_Two.year -0.5618358, -0.5618... ## $ PaperlessBilling_Yes 0.8330334, -1.20021... ## $ PaymentMethod_Credit.card..automatic. -0.5231315, -0.5231... ## $ PaymentMethod_Electronic.check 1.4154085, -0.70638... ## $ PaymentMethod_Mailed.check -0.5517013, 1.81225... Step 3: Don’t Forget The Target

    One last step, we need to store the actual values (truth) as y_train_vec and y_test_vec, which are needed for modeling our ANN. We convert to a series of numeric ones and zeros which can be accepted by the Keras ANN modeling functions. We add “vec” to the name so we can easily remember the class of the object (it’s easy to get confused when working with tibbles, vectors, and matrix data types).

    # Response variables for training and testing sets
    y_train_vec <- ifelse(pull(train_tbl, Churn) == "Yes", 1, 0)
    y_test_vec  <- ifelse(pull(test_tbl, Churn) == "Yes", 1, 0)

    Model Customer Churn With Keras (Deep Learning)

    This is super exciting!! Finally, Deep Learning with Keras in R! The team at RStudio has done fantastic work recently to create the keras package, which implements Keras in R. Very cool!

    Background On Artificial Neural Networks

    For those unfamiliar with Neural Networks (and those that need a refresher), read this article. It’s very comprehensive, and you’ll leave with a general understanding of the types of deep learning and how they work.

    Source: Xenon Stack

    Deep Learning has been available in R for some time, but the primary packages used in the wild have not been (Keras, TensorFlow, Theano, etc., are all Python libraries). It’s worth mentioning that a number of other Deep Learning packages exist in R, including h2o, mxnet, and others. The interested reader can check out this blog post for a comparison of deep learning packages in R.

    Building A Deep Learning Model

    We’re going to build a special class of ANN called a Multi-Layer Perceptron (MLP). MLPs are one of the simplest forms of deep learning, but they are both highly accurate and serve as a jumping-off point for more complex algorithms. MLPs are quite versatile as they can be used for regression as well as binary and multi-class classification (and are typically quite good at classification problems).

    We’ll build a three-layer MLP with Keras. Let’s walk through the steps before we implement it in R.

    1. Initialize a sequential model: The first step is to initialize a sequential model with keras_model_sequential(), which is the beginning of our Keras model. The sequential model is composed of a linear stack of layers.

    2. Apply layers to the sequential model: Layers consist of the input layer, hidden layers and an output layer. The input layer is the data and provided it’s formatted correctly there’s nothing more to discuss. The hidden layers and output layers are what controls the ANN inner workings.

      • Hidden Layers: Hidden layers form the neural network nodes that enable non-linear activation using weights. The hidden layers are created using layer_dense(). We’ll add two hidden layers. We’ll apply units = 16, which is the number of nodes. We’ll select kernel_initializer = "uniform" and activation = "relu" for both layers. The first layer needs to have the input_shape = 35, which is the number of columns in the training set. Key Point: While we are arbitrarily selecting the number of hidden layers, units, kernel initializers and activation functions, these parameters can be optimized through a process called hyperparameter tuning that is discussed in Next Steps.

      • Dropout Layers: Dropout layers are used to control overfitting. During training, dropout randomly drops a fraction of the layer’s units, which keeps the network from relying too heavily on any single weight. We use the layer_dropout() function to add two dropout layers with rate = 0.10, dropping 10% of the units.

      • Output Layer: The output layer specifies the shape of the output and the method of assimilating the learned information. The output layer is applied using the layer_dense(). For binary values, the shape should be units = 1. For multi-classification, the units should correspond to the number of classes. We set the kernel_initializer = "uniform" and the activation = "sigmoid" (common for binary classification).

    3. Compile the model: The last step is to compile the model with compile(). We’ll use optimizer = "adam", which is one of the most popular optimization algorithms. We select loss = "binary_crossentropy" since this is a binary classification problem. We’ll select metrics = c("accuracy") to be evaluated during training and testing. Key Point: The optimizer is often included in the tuning process.

    Let’s codify the discussion above to build our Keras MLP-flavored ANN model.

    # Building our Artificial Neural Network model_keras <- keras_model_sequential() model_keras %>% # First hidden layer layer_dense( units = 16, kernel_initializer = "uniform", activation = "relu", input_shape = ncol(x_train_tbl)) %>% # Dropout to prevent overfitting layer_dropout(rate = 0.1) %>% # Second hidden layer layer_dense( units = 16, kernel_initializer = "uniform", activation = "relu") %>% # Dropout to prevent overfitting layer_dropout(rate = 0.1) %>% # Output layer layer_dense( units = 1, kernel_initializer = "uniform", activation = "sigmoid") %>% # Compile ANN compile( optimizer = 'adam', loss = 'binary_crossentropy', metrics = c('accuracy') ) model_keras ## Model ## ______________________________________________________________________ ## Layer (type) Output Shape Param # ## ====================================================================== ## dense_1 (Dense) (None, 16) 576 ## ______________________________________________________________________ ## dropout_1 (Dropout) (None, 16) 0 ## ______________________________________________________________________ ## dense_2 (Dense) (None, 16) 272 ## ______________________________________________________________________ ## dropout_2 (Dropout) (None, 16) 0 ## ______________________________________________________________________ ## dense_3 (Dense) (None, 1) 17 ## ====================================================================== ## Total params: 865 ## Trainable params: 865 ## Non-trainable params: 0 ## ______________________________________________________________________

    We use the fit() function to run the ANN on our training data. The object is our model, and x and y are our training data in matrix and numeric vector forms, respectively. The batch_size = 50 sets the number of samples per gradient update within each epoch. We set epochs = 35 to control the number of training cycles. Typically we want to keep the batch size high since this decreases the error within each training cycle (epoch). We also want epochs to be large, which is important in visualizing the training history (discussed below). We set validation_split = 0.30 to hold out 30% of the data for model validation, which helps guard against overfitting. The training process should complete in 15 seconds or so.

    # Fit the keras model to the training data
    fit_keras <- fit(
      object           = model_keras,
      x                = as.matrix(x_train_tbl),
      y                = y_train_vec,
      batch_size       = 50,
      epochs           = 35,
      validation_split = 0.30
    )

    We can inspect the final model. We want to make sure there is minimal difference between the validation accuracy and the training accuracy.

    # Print the final model
    fit_keras
    ## Trained on 3,938 samples, validated on 1,688 samples (batch_size=50, epochs=35)
    ## Final epoch (plot to see history):
    ## val_loss: 0.4215
    ##  val_acc: 0.8057
    ##     loss: 0.399
    ##      acc: 0.8101

    We can visualize the Keras training history using the plot() function. What we want to see is the validation accuracy and loss leveling off, which means the model has completed training. We see that there is some divergence between training loss/accuracy and validation loss/accuracy. This model indicates we can possibly stop training at an earlier epoch. Pro Tip: Only use enough epochs to get a high validation accuracy. Once validation accuracy curve begins to flatten or decrease, it’s time to stop training.

    # Plot the training/validation history of our Keras model
    plot(fit_keras) +
      theme_tq() +
      scale_color_tq() +
      scale_fill_tq() +
      labs(title = "Deep Learning Training Results")
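    If you want to automate the Pro Tip above rather than eyeballing the curve, keras also supports early stopping via a callback. A hedged sketch (not part of the original workflow; the patience value is an arbitrary choice):

    # Stop training once validation loss stops improving for 3 epochs
    fit_early <- fit(
      object           = model_keras,
      x                = as.matrix(x_train_tbl),
      y                = y_train_vec,
      batch_size       = 50,
      epochs           = 35,
      validation_split = 0.30,
      callbacks        = list(callback_early_stopping(monitor = "val_loss", patience = 3))
    )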

    Making Predictions

    We’ve got a good model based on the validation accuracy. Now let’s make some predictions from our keras model on the test data set, which was unseen during modeling (we use this for the true performance assessment). We have two functions to generate predictions:

    • predict_classes: Generates class values as a matrix of ones and zeros. Since we are dealing with binary classification, we’ll convert the output to a vector.
    • predict_proba: Generates the class probabilities as a numeric matrix indicating the probability of being a class. Again, we convert to a numeric vector because there is only one column output.
    # Predicted Class
    yhat_keras_class_vec <- predict_classes(object = model_keras, x = as.matrix(x_test_tbl)) %>%
      as.vector()

    # Predicted Class Probability
    yhat_keras_prob_vec <- predict_proba(object = model_keras, x = as.matrix(x_test_tbl)) %>%
      as.vector()

    Inspect Performance With Yardstick

    The yardstick package has a collection of handy functions for measuring performance of machine learning models. We’ll overview some metrics we can use to understand the performance of our model.

    First, let’s get the data formatted for yardstick. We create a data frame with the truth (actual values as factors), estimate (predicted values as factors), and the class probability (probability of yes as numeric). We use the fct_recode() function from the forcats package to assist with recoding as Yes/No values.

    # Format test data and predictions for yardstick metrics estimates_keras_tbl <- tibble( truth = as.factor(y_test_vec) %>% fct_recode(yes = "1", no = "0"), estimate = as.factor(yhat_keras_class_vec) %>% fct_recode(yes = "1", no = "0"), class_prob = yhat_keras_prob_vec ) estimates_keras_tbl ## # A tibble: 1,406 x 3 ## truth estimate class_prob ## ## 1 yes no 0.328355074 ## 2 yes yes 0.633630514 ## 3 no no 0.004589651 ## 4 no no 0.007402068 ## 5 no no 0.049968336 ## 6 no no 0.116824441 ## 7 no yes 0.775479317 ## 8 no no 0.492996633 ## 9 no no 0.011550998 ## 10 no no 0.004276015 ## # ... with 1,396 more rows

    Now that we have the data formatted, we can take advantage of the yardstick package.

    Confusion Table

    We can use the conf_mat() function to get the confusion table. We see that the model was by no means perfect, but it did a decent job of identifying customers likely to churn.

    # Confusion Table
    estimates_keras_tbl %>% conf_mat(truth, estimate)
    ##           Truth
    ## Prediction  no yes
    ##        no  950 161
    ##        yes  99 196

    Accuracy

    We can use the metrics() function to get an accuracy measurement from the test set. We are getting roughly 82% accuracy.

    # Accuracy
    estimates_keras_tbl %>% metrics(truth, estimate)
    ## # A tibble: 1 x 1
    ##    accuracy
    ##
    ## 1 0.8150782

    AUC

    We can also get the ROC Area Under the Curve (AUC) measurement. AUC is often a good metric used to compare different classifiers and to compare to randomly guessing (AUC_random = 0.50). Our model has AUC = 0.85, which is much better than randomly guessing. Tuning and testing different classification algorithms may yield even better results.

    # AUC
    estimates_keras_tbl %>% roc_auc(truth, class_prob)
    ## [1] 0.8523951

    Precision And Recall

    Precision answers: when the model predicts “yes”, how often is it actually “yes”? Recall (also known as sensitivity or the true positive rate) answers: when the actual value is “yes”, how often is the model correct? We can get precision() and recall() measurements using yardstick.

    # Precision tibble( precision = estimates_keras_tbl %>% precision(truth, estimate), recall = estimates_keras_tbl %>% recall(truth, estimate) ) ## # A tibble: 1 x 2 ## precision recall ## ## 1 0.8550855 0.9056244

    Precision and recall are very important to the business case: The organization is concerned with balancing the cost of targeting and retaining customers at risk of leaving with the cost of inadvertently targeting customers that are not planning to leave (and potentially decreasing revenue from this group). The threshold above which to predict Churn = “Yes” can be adjusted to optimize for the business problem. This becomes a Customer Lifetime Value optimization problem that is discussed further in Next Steps.
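    For example, here is a hedged sketch of re-scoring the test set with a lower, hypothetical cutoff of 0.30 instead of the default 0.50:

    # Hypothetical business-driven cutoff of 0.30
    estimates_keras_tbl %>%
      mutate(estimate_30 = ifelse(class_prob > 0.30, "yes", "no") %>% as.factor()) %>%
      conf_mat(truth, estimate_30)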

    F1 Score

    We can also get the F1-score, which is the harmonic mean of precision and recall (F1 = 2 × precision × recall / (precision + recall)). Machine learning classifier thresholds are often adjusted to maximize the F1-score. However, this is often not the optimal solution to the business problem.

    # F1-Statistic
    estimates_keras_tbl %>% f_meas(truth, estimate, beta = 1)
    ## [1] 0.8796296

    Explain The Model With LIME

    LIME stands for Local Interpretable Model-agnostic Explanations, and is a method for explaining black-box machine learning model classifiers. For those new to LIME, this YouTube video does a really nice job explaining how LIME helps to identify feature importance with black box machine learning models (e.g. deep learning, stacked ensembles, random forest).

    Setup

    The lime package implements LIME in R. One thing to note is that it’s not set up out-of-the-box to work with keras. The good news is that with a few functions we can get everything working properly. We’ll need to make two custom functions:

    • model_type: Used to tell lime what type of model we are dealing with. It could be classification, regression, survival, etc.

    • predict_model: Used to allow lime to perform predictions that its algorithm can interpret.

    The first thing we need to do is identify the class of our model object. We do this with the class() function.

    class(model_keras) ## [1] "keras.models.Sequential" ## [2] "keras.engine.training.Model" ## [3] "keras.engine.topology.Container" ## [4] "keras.engine.topology.Layer" ## [5] "python.builtin.object"

    Next we create our model_type() function. Its only input is x, the keras model. The function simply returns “classification”, which tells LIME we are classifying.

    # Setup lime::model_type() function for keras
    model_type.keras.models.Sequential <- function(x, ...) {
      return("classification")
    }

    Now we can create our predict_model() function, which wraps keras::predict_proba(). The trick here is to realize that its inputs must be x, a model; newdata, a data frame object (this is important); and type, which is not used but can be used to switch the output type. The output is also a little tricky because it must be in the format of probabilities by classification (this is important; shown next).

    # Setup lime::predict_model() function for keras
    predict_model.keras.models.Sequential <- function(x, newdata, type, ...) {
      pred <- predict_proba(object = x, x = as.matrix(newdata))
      return(data.frame(Yes = pred, No = 1 - pred))
    }

    Run this next script to show you what the output looks like and to test our predict_model() function. See how it’s the probabilities by classification. It must be in this form for model_type = "classification".

    # Test our predict_model() function predict_model(x = model_keras, newdata = x_test_tbl, type = 'raw') %>% tibble::as_tibble() ## # A tibble: 1,406 x 2 ## Yes No ## ## 1 0.328355074 0.6716449 ## 2 0.633630514 0.3663695 ## 3 0.004589651 0.9954103 ## 4 0.007402068 0.9925979 ## 5 0.049968336 0.9500317 ## 6 0.116824441 0.8831756 ## 7 0.775479317 0.2245207 ## 8 0.492996633 0.5070034 ## 9 0.011550998 0.9884490 ## 10 0.004276015 0.9957240 ## # ... with 1,396 more rows

    Now the fun part: we create an explainer using the lime() function. Just pass the training data set without the target column. The input must be a data frame, which is OK since our predict_model() function converts it to the matrix format keras expects. Set model = model_keras, our ANN model, and bin_continuous = FALSE. We could tell the algorithm to bin continuous variables, but this may not make sense for categorical numeric data that we didn’t change to factors.

    # Run lime() on training set
    explainer <- lime::lime(
      x              = x_train_tbl,
      model          = model_keras,
      bin_continuous = FALSE)

    Now we run the explain() function, which returns our explanation. This can take a minute to run so we limit it to just the first ten rows of the test data set. We set n_labels = 1 because we care about explaining a single class. Setting n_features = 4 returns the top four features that are critical to each case. Finally, setting kernel_width = 0.5 allows us to increase the “model_r2” value by shrinking the localized evaluation.

    # Run explain() on explainer
    explanation <- lime::explain(
      x_test_tbl[1:10, ],
      explainer    = explainer,
      n_labels     = 1,
      n_features   = 4,
      kernel_width = 0.5)

    Feature Importance Visualization

    The payoff for the work we put in using LIME is this feature importance plot. This allows us to visualize each of the first ten cases (observations) from the test data. The top four features for each case are shown. Note that they are not the same for each case. The green bars mean that the feature supports the model conclusion, and the red bars contradict. A few important features based on frequency in first ten cases:

    • Tenure (7 cases)
    • Senior Citizen (5 cases)
    • Online Security (4 cases)
    plot_features(explanation) +
      labs(title = "LIME Feature Importance Visualization",
           subtitle = "Hold Out (Test) Set, First 10 Cases Shown")

    Another excellent visualization can be performed using plot_explanations(), which produces a facetted heatmap of all case/label/feature combinations. It’s a more condensed version of plot_features(), but we need to be careful because it does not provide exact statistics and it makes it less easy to investigate binned features (Notice that “tenure” would not be identified as a contributor even though it shows up as a top feature in 7 of 10 cases).

    plot_explanations(explanation) +
      labs(title = "LIME Feature Importance Heatmap",
           subtitle = "Hold Out (Test) Set, First 10 Cases Shown")

    Check Explanations With Correlation Analysis

    One thing we need to be careful about with the LIME visualization is that we are only examining a sample of the data, in our case the first 10 test observations. Therefore, we are gaining a very localized understanding of how the ANN works. However, we also want to know, from a global perspective, what drives feature importance.

    We can perform a correlation analysis on the training set as well to help glean what features correlate globally to “Churn”. We’ll use the corrr package, which performs tidy correlations with the function correlate(). We can get the correlations as follows.

    # Feature correlations to Churn corrr_analysis <- x_train_tbl %>% mutate(Churn = y_train_vec) %>% correlate() %>% focus(Churn) %>% rename(feature = rowname) %>% arrange(abs(Churn)) %>% mutate(feature = as_factor(feature)) corrr_analysis ## # A tibble: 35 x 2 ## feature Churn ## ## 1 gender_Male -0.006690899 ## 2 tenure_bin3 -0.009557165 ## 3 MultipleLines_No.phone.service -0.016950072 ## 4 PhoneService_Yes 0.016950072 ## 5 MultipleLines_Yes 0.032103354 ## 6 StreamingTV_Yes 0.066192594 ## 7 StreamingMovies_Yes 0.067643871 ## 8 DeviceProtection_Yes -0.073301197 ## 9 tenure_bin4 -0.073371838 ## 10 PaymentMethod_Mailed.check -0.080451164 ## # ... with 25 more rows

    The correlation visualization helps in distinguishing which features are relevant to Churn.

    # Correlation visualization
    corrr_analysis %>%
      ggplot(aes(x = Churn, y = fct_reorder(feature, desc(Churn)))) +
      geom_point() +
      # Positive Correlations - Contribute to churn
      geom_segment(aes(xend = 0, yend = feature),
                   color = palette_light()[[2]],
                   data = corrr_analysis %>% filter(Churn > 0)) +
      geom_point(color = palette_light()[[2]],
                 data = corrr_analysis %>% filter(Churn > 0)) +
      # Negative Correlations - Prevent churn
      geom_segment(aes(xend = 0, yend = feature),
                   color = palette_light()[[1]],
                   data = corrr_analysis %>% filter(Churn < 0)) +
      geom_point(color = palette_light()[[1]],
                 data = corrr_analysis %>% filter(Churn < 0)) +
      # Vertical lines
      geom_vline(xintercept = 0, color = palette_light()[[5]], size = 1, linetype = 2) +
      geom_vline(xintercept = -0.25, color = palette_light()[[5]], size = 1, linetype = 2) +
      geom_vline(xintercept = 0.25, color = palette_light()[[5]], size = 1, linetype = 2) +
      # Aesthetics
      theme_tq() +
      labs(title = "Churn Correlation Analysis",
           subtitle = "Positive Correlations (contribute to churn), Negative Correlations (prevent churn)",
           y = "Feature Importance")

    The correlation analysis helps us quickly identify features that the LIME analysis may be excluding. We can see that the following features are highly correlated (magnitude > 0.25):

    Increases Likelihood of Churn (Red):

    • Tenure = Bin 1 (<12 Months)
    • Internet Service = “Fiber Optic”
    • Payment Method = “Electronic Check”

    Decreases Likelihood of Churn (Blue):

    • Contract = “Two Year”
    • Total Charges (Note that this may be a byproduct of additional services such as Online Security)
    Feature Investigation

    We can investigate features that are most frequent in the LIME feature importance visualization along with those that the correlation analysis shows an above normal magnitude. We’ll investigate:

    • Tenure (7/10 LIME Cases, Highly Correlated)
    • Contract (Highly Correlated)
    • Internet Service (Highly Correlated)
    • Payment Method (Highly Correlated)
    • Senior Citizen (5/10 LIME Cases)
    • Online Security (4/10 LIME Cases)
    Tenure (7/10 LIME Cases, Highly Correlated)

    LIME cases indicate that the ANN model is using this feature frequently and high correlation agrees that this is important. Investigating the feature distribution, it appears that customers with lower tenure (bin 1) are more likely to leave. Opportunity: Target customers with less than 12 month tenure.

    Contract (Highly Correlated)

    While LIME did not indicate this as a primary feature in the first 10 cases, the feature is clearly correlated with those electing to stay. Customers with one and two year contracts are much less likely to churn. Opportunity: Offer promotion to switch to long term contracts.

    Internet Service (Highly Correlated)

    While LIME did not indicate this as a primary feature in the first 10 cases, the feature is clearly correlated with those electing to stay. Customers with fiber optic service are more likely to churn while those with no internet service are less likely to churn. Improvement Area: Customers may be dissatisfied with fiber optic service.

    Payment Method (Highly Correlated)

    While LIME did not indicate this as a primary feature in the first 10 cases, the feature is clearly correlated with those electing to stay. Customers with electronic check are more likely to leave. Opportunity: Offer customers a promotion to switch to automatic payments.

    Senior Citizen (5/10 LIME Cases)

    Senior citizen appeared in several of the LIME cases, indicating it was important to the ANN for the 10 samples. However, it was not highly correlated to Churn, which may indicate that the ANN is using it in a more sophisticated manner (e.g. as an interaction). It’s difficult to say that senior citizens are more likely to leave, but non-senior citizens appear less at risk of churning. Opportunity: Target users in the lower age demographic.

    Online Security (4/10 LIME Cases)

    Customers that did not sign up for online security were more likely to leave while customers with no internet service or online security were less likely to leave. Opportunity: Promote online security and other packages that increase retention rates.

    Next Steps: Business Science University

    We’ve just scratched the surface with the solution to this problem, but unfortunately there’s only so much ground we can cover in an article. Here are a few next steps that I’m pleased to announce will be covered in a Business Science University course coming in 2018!

    Customer Lifetime Value

    Your organization needs to see the financial benefit, so always tie your analysis to sales, profitability or ROI. Customer Lifetime Value (CLV) is a methodology that ties business profitability to the retention rate. While we did not implement the CLV methodology herein, a full customer churn analysis would tie the churn to a classification cutoff (threshold) optimization to maximize the CLV with the predictive ANN model.

    The simplified CLV model is:

    CLV = GC * 1 / (1 + d - r)

    Where,

    • GC is the gross contribution per customer
    • d is the annual discount rate
    • r is the retention rate
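    As a quick numeric illustration of the formula above (the numbers are made up, not from the article):

    # Hypothetical inputs
    GC <- 1000   # gross contribution per customer per year
    d  <- 0.10   # annual discount rate
    r  <- 0.80   # retention rate

    CLV <- GC * 1 / (1 + d - r)
    CLV  # ~3333: each retained customer is worth roughly 3.3x the annual contribution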
    ANN Performance Evaluation and Improvement

    The ANN model we built is good, but it could be better. How we understand our model accuracy and improve on it is through the combination of two techniques:

    • K-Fold Cross Validation: Used to obtain bounds for accuracy estimates.
    • Hyper Parameter Tuning: Used to improve model performance by searching for the best parameters possible.

    We need to implement K-Fold Cross Validation and Hyper Parameter Tuning if we want a best-in-class model.
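    As a starting point for the cross-validation piece, rsample (already loaded above) can generate the folds. A hedged sketch, not a full implementation:

    # 10-fold cross-validation splits of the training data
    set.seed(100)
    cv_folds <- vfold_cv(train_tbl, v = 10)
    cv_folds  # each split can then be used to re-fit and evaluate the Keras model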

    Distributing Analytics

    It’s critical to communicate data science insights to decision makers in the organization. Most decision makers in organizations are not data scientists, but these individuals make important decisions on a day-to-day basis. The PowerBI application below includes a Customer Scorecard to monitor customer health (risk of churn). The application walks the user through the machine learning journey for how the model was developed, what it means to stakeholders, and how it can be used in production.

    View in Full Screen Mode for best experience

    For those seeking options for distributing analytics, two good options are:

    • Shiny Apps for rapid prototyping: Shiny web applications offer the maximum flexibility with R algorithms built in. Shiny is more complex to learn, but Shiny applications are incredible / limitless.

    • Microsoft PowerBI and Tableau for Visualization: Enable distributed analytics with the advantage of an intuitive structure, but with some flexibility sacrificed. Can be difficult to build ML into.

    Business Science University

    You’re probably wondering why we are going into so much detail on next steps. We are happy to announce a new project for 2018: Business Science University, an online school dedicated to helping data science learners improve in the areas of:

    • Advanced machine learning techniques
    • Developing interactive data products and applications using Shiny and other tools
    • Distributing data science within an organization

    Learning paths will be focused on business and financial applications. We’ll keep you posted via social media and our blog (please follow us / subscribe to stay updated).

    Please let us know if you are interested in joining Business Science University. Let us know what you think in the Disqus Comments below.

    Conclusions

    Customer churn is a costly problem. The good news is that machine learning can solve churn problems, making the organization more profitable in the process. In this article, we saw how Deep Learning can be used to predict customer churn. We built an ANN model using the new keras package that achieved 82% predictive accuracy (without tuning)! We used three new machine learning packages to help with preprocessing and measuring performance: recipes, rsample and yardstick. Finally, we used lime to explain the Deep Learning model, which traditionally was impossible! We checked the LIME results with a Correlation Analysis, which brought to light other features to investigate. For the IBM Telco dataset, tenure, contract type, internet service type, payment method, senior citizen status, and online security status were useful in diagnosing customer churn. We hope you enjoyed this article!

    About Business Science

    Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business and financial applications. We help businesses that seek to add this competitive advantage but may not have the resources currently to implement predictive analytics. Business Science can help to expand into predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!

    Follow Business Science on Social Media

    To leave a comment for the author, please follow the link and comment on their blog: business-science.io - Articles. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    changes: easy Git-based version control from R

    Tue, 11/28/2017 - 01:00

    (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

    Are you new to version control and always running into trouble with Git?
    Or are you a seasoned user, haunted by the traumas of learning Git and reliving them whilst trying to teach it to others?
    Yeah, us too.

    Git is a version control tool designed for software development, and it is extraordinarily powerful. It didn’t actually dawn on me quite how amazing Git is until I spent a weekend in Melbourne with a group of Git whizzes using Git to write a package targeted toward Git beginners. Whew, talk about total Git immersion! I was taking part in the 2017 rOpenSci ozunconf, in which forty-odd developers, scientists, researchers, nerds, teachers, starving students, cat ladies, and R users of all descriptions formed teams to create new R packages, each fulfilling some new and useful function. Many of the groups used Git for their collaborative workflows all weekend.

    Unfortunately, just like many a programming framework, Git can often be a teensy bit (read: extremely, prohibitively) intimidating, especially for beginners who don’t need all of Git’s numerous and baffling features.
    It’s one of those platforms that makes your life a million times better once you know how to use it, but if you’re trying to teach yourself the basics using the internet, or—heaven forbid—trying to untangle yourself from some Git-branch tangle that you’ve unwittingly become snarled in… (definitely done that one…) well, let’s just say using your knuckles to break a brick wall can sometimes seem preferable.
    Just ask the Git whizzes.
    They laugh, because they’ve been there, done that.

    The funny thing is, doing basic version control in Git only requires a few commands.
    After browsing through the available project ideas and settling into teams, a group of eight of us made a list of the commands that we use on a daily basis, and the list came to about a dozen.
    We looked up our Git histories and compiled a Git vocabulary, which came out to less than 50 commands, including combination commands.

    As Nick Golding so shrewdly recognized in the lead up to this year’s unconference, the real obstacle for new Git users is not the syntax, it’s actually (a) the scary, scary terminal window and (b) the fact that Git terminology was apparently chosen by randomly opening a verb dictionary and blindly pointing to a spot on the page.
    (Ok, I’m exaggerating, but the point is that the terminology is pretty confusing).
    We decided to address these two problems by making a package that uses the R console and reimagining the version control vocabulary and workflow for people who are new to version control and only need some of its many features.

    Somewhat ironically, nine people worked for two days on a dozen branches, using Git and GitHub to seamlessly merge our workflows.
    It was wonderful to see how so many people’s various talents can be combined to make something that no group members could have done all on their own.

    Enter changes (repo, website – made using pkgdown), our new R package to do version control with a few simple commands.
    It uses Git and Git2r under the hood, but new users don’t need to know any Git to begin using version control with changes.
    Best of all, it works seamlessly with regular Git. So if a user thinks they’re ready to expand their horizons, they can start using git commands via the Githug package, RStudio’s git interface, or on the command line.

    Here is an overview of some of the ways we’ve made simple version control easy with changes:

    Simple terminology

    It uses simple and deliberately un-git-like terminology:

    • You start a new version control project with create_repo(), which is like git init but it can set up a nice project directory structure for you, automatically ignoring things like output folders.
    • All of the steps involved in committing edits have been compressed into one function: record(). All files that aren’t ignored will be committed, so users don’t need to know the difference between tracking, staging, and committing files.
    • It’s easy to set which files to omit from version control with ignore(), and to change your mind with unignore().
    • changes() lets you know which files have changed since the last record, like a hybrid of git status and git diff.
    • You can look back in history with timeline() (a simplified version of git log), go_to() a previous record (like git checkout), and scrub() any unwanted changes since the last record (like git reset --hard).
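
    To give a flavour of the workflow, here is a minimal sketch of how these commands might fit together. The project name, record message, and version number are hypothetical, and the exact argument signatures are assumptions on our part; check the package documentation for the current interface.

        library(changes)

        # Start a new version-controlled project; create_repo() also sets up a
        # sensible directory structure and ignores things like output folders
        create_repo("my_analysis")             # project name is hypothetical

        # ...edit some files, then snapshot everything that isn't ignored
        record("added exploratory analysis")   # message argument is an assumption

        # See what has changed since the last record, and browse the history
        changes()
        timeline()

        # Jump back to an earlier record by its version number
        go_to(2)
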
    It’s linear

    After a long discussion, we decided that changes won’t provide an interface to Git branches (at least not yet), as the merge conflicts it leads to are one of the scariest things about version control for beginners.
    With linear version control, users can easily go_to() a past record using a version number rather than an unfamiliar SHA. These numbers appear in a lovely visual representation of their timeline():

    (1) initial commit
     |  2017-11-18 02:55
    (2) set up project structure
     |  2017-11-18 02:55
    (3) added stuff to readme
        2017-11-18 02:55

    If you want to roll your project back to a previous record, you can retrieve() it, and changes will simply append that record at the top of your timeline (storing all the later records, just in case).
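
    For example, assuming retrieve() takes the record number shown by timeline() (an assumption on our part):

        retrieve(2)   # record (2) becomes the newest entry; later records are kept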

    Readable messages and automatic reminders

    Some of Git’s messages and helpfiles are totally cryptic to all but the most hardened computer scientists.
    Having been confronted with our fair share of detached HEADs and offers to update remote refs along with associated objects, we were keen to make sure all the error messages and helpfiles in changes are as intuitive and understandable as possible.

    It can also be hard to get into the swing of recording edits, so changes will give you reminders to encourage you to use record() regularly. You can change the time interval for reminders, or switch them off, using remind_me().

    Coming soon

    We made a lot of progress in two days, but there’s plenty more we’re planning to add soon:

    1. Simplified access to GitHub with a sync() command to automagically handle most uses of git fetch, git merge, and git push.
    2. A Git training-wheels mode, so that people who want to move on to using Git can view the Git commands changes is using under the hood.
    3. Added flexibility – we are working on adding functionality to handle simple deviations from the defaults, such as recording changes only to named files, or to all except some excluded files.

    We’d be really keen to hear your suggestions too, so please let us know your ideas via the changes issue tracker!

    I have only recently started using Git and GitHub, and this year’s rOpenSci ozunconf was a big eye-opener for me, in several ways.
    Beyond finally understanding the power of proper version control, I met a group of wonderful people dedicated to participating in the R community.
    Now as it turns out, R users take the word “community” very seriously.
    Each and every person I met during the event was open and friendly.
    Each person had ideas for attracting new users to R, making it easier to learn, making methods and data more readily available, and creating innovative new functionality.
    Even before the workshop began, dozens of ideas for advancement circulated on GitHub Issues.
    Throughout the conference, it was a pleasure to be a part of the ongoing conversation and dialogue about growing and improving the R community.
    That’s right, you can delete any lingering ‘introverted computer geek’ stereotypes you might still be harbouring in a cobwebbed attic of your mind.
    In today’s day and age, programming is as much about helping each other, communicating, learning, and networking as it is about solving problems.
    And building the community is a group effort.

    R users come from all sorts of backgrounds, but I was gratified to see scientists and researchers well-represented at the unconference.
    Gone are the days when I needed to feel like the ugly duckling for being the only R user in my biology lab!
    If you still find yourself isolated, join the blooming online R users community, or any one of a number of meetups and clubs that are popping up everywhere.
    I have dipped my toe in those waters, and boy am I glad I did!


    To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.
