R news and tutorials contributed by hundreds of R bloggers

### Why we wrote wrapr to/unpack

Wed, 01/22/2020 - 19:36

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.) Want to share your content on R-bloggers? Click here if you have a blog, or here if you don't.

One reason we are developing the wrapr to/unpack methods is the following: we wanted to spruce up the R vtreat interface a bit.

We had recently back-ported a Python sklearn Pipeline-step style interface from the Python vtreat to R (announcement here). But that doesn’t mean we have stopped enhancing the R-style interfaces, using features that are unique to R.

With the upcoming wrapr unpacking features we can re-write use of vtreat’s mkCrossFrame*Experiment() cross-frame interfaces from this:

```r
# remotes::install_github("WinVector/wrapr")
library(vtreat)
library(wrapr)

# ...

cross_frame_experiment <- mkCrossFrameCExperiment(
  d,
  varlist = varlist,
  outcomename = outcomename,
  outcometarget = outcometarget)

# unpack the items we want from the returned list
treatments <- cross_frame_experiment$treatments
cross_frame <- cross_frame_experiment$crossFrame
```

To this:

```r
mkCrossFrameCExperiment(
  d,
  varlist = varlist,
  outcomename = outcomename,
  outcometarget = outcometarget) %.>%
  to[
    treatments <- treatments,
    cross_frame <- crossFrame
  ]
```

In our first unpack note we mentioned there were R functions that returned named lists. The vtreat mkCrossFrame*Experiment() functions are just that: functions that return named lists. Functions that return named lists/classes really call out for a named multiple assignment operator (for a positional multiple assignment operator see zeallot, and for a different named multiple assignment system see also vadr).

The assignments in an unpacking block (currently written as to, or unpack) can use any of R’s local assignment-style operators (<-, ->, =, or :=). We also have the convention that a name alone, such as “a”, is shorthand for “a <- a”. These unpack operators now look like a full named multi-assignment system for R.

Below is a small stand-alone example of the assignment notation.

```r
# remotes::install_github("WinVector/wrapr")
library(wrapr)

to[
  a_variable <- a_value,
  b_variable <- b_value
] <- list(a_value = 1, b_value = 2)

a_variable
#> [1] 1

b_variable
#> [1] 2
```
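For readers who do not have wrapr installed, the effect of the name-alone shorthand can be emulated in a few lines of base R. This is only a sketch of the idea, not wrapr's actual implementation, and unpack_by_name() is a name we made up for illustration:

```r
# Emulate "to[a_value, b_value] <- lst" for the bare-name shorthand:
# copy each requested named element of the list into a variable of
# the same name in the caller's environment.
unpack_by_name <- function(lst, names, envir = parent.frame()) {
  for (nm in names) assign(nm, lst[[nm]], envir = envir)
  invisible(lst)
}

unpack_by_name(list(a_value = 1, b_value = 2), c("a_value", "b_value"))

a_value
#> [1] 1

b_value
#> [1] 2
```

The real wrapr operators additionally support renaming (treatments <- crossFrame style) and check that the requested names are actually present in the list.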

We are working on a new vignette about named multiple assignment here.



### rstudio::conf 2020 conference preview

Wed, 01/22/2020 - 12:35

[This article was first published on r – Appsilon Data Science | End­ to­ End Data Science Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Two weeks ago I flew from New York to Warsaw, Poland to begin a new role at Appsilon Data Science. On Saturday, I’ll be flying right back in the other direction to attend rstudio::conf 2020. This is convenient because my bags aren’t even unpacked yet, and also exciting because rstudio::conf is an all-star weekend for the R community.

If you are planning to attend the conference, please drop by the Appsilon Data Science booth in the Yosemite room and say hello to me and a few other members of the Appsilon team! You can also find us at the Bay Area useR Group meetup at 6:30pm on January 27 at the Hilton San Francisco Union Square. We attend and sponsor a lot of R events in Europe so we’re excited to check in with the community in the San Francisco Bay Area. Conference tickets are not required for the meetup, so you have no excuse not to join us.

And if you want to book a free session with one of our experts to discuss your R/Shiny or data science project, you can do so here.

It would also be a good idea to find some time to attend my colleagues’ presentations. One of our founders, Damian Rodziewicz, is giving a presentation entitled “Building a native iPad dashboard using plumber and RStudio Connect in Pharma” on Jan. 29th at 4:23pm in Room 4. Damian will also present an e-poster: “What does it take to make a Shiny prototype ready for production?”

One of our talented developers, Pedro Silva will present an e-poster “From Front-End to Shiny development. Augmenting the power of your dashboards with modern front-end tools.” E-posters will be shown during the opening reception on Tuesday (Jan. 28) from 5:30pm to 7:30pm.

I’ve asked my colleagues what presentations they’re looking forward to most at rstudio::conf. Here’s what they said:

PAWEL

I am interested in stopping by Reproducible Shiny apps with shinymeta with Carson Sievert. shinymeta was developed fairly recently and offers insights into the shiny development workflow. I would like to learn about its various features.

DAMIAN

I’m pretty excited to attend the Modern Geospatial Data Analysis with R Workshop with Zev Ross. Here is the description: “An explosion of packages for working with spatial data means you can ditch your GIS software and do geospatial analysis in R end-to-end.” I am very interested in analyzing satellite images, so I am looking forward to this presentation.

PEDRO

I might try to catch the Javascript for Shiny Users Workshop. I’m an avid Javascript user and I believe that there is a real place for js in Shiny apps. Some people believe that Javascript should be eliminated completely, but they are wrong. As long as there is an internet, there will be Javascript.

I asked my other colleagues who will be staying in Poland: If you could attend the conference, which presentations would you be most excited about?

MARIA

I would attend Introduction to Data Science in the Tidyverse Workshop with Amelia McNamara and Hadley Wickham. I just started my journey with R and I believe it would be very valuable for me.

KRZYSIEK

I think I would go with “Deep Learning with Keras and TensorFlow in R Workflow” with Brad Boehmke because I know a bit of the theory of Deep Learning, but what I really need is to gain some practice. Confidence in doing, as they say. As a company we have been doing some interesting things with Deep Learning as of late, so I’m looking forward to joining the fun.

TOMASZ

This one looks fun: Learning R with humorous side projects with Ryan Timpe, who actually has a really cool-sounding job: Senior Data Scientist at @LEGO_Group. I would also love to see this one in person: Practical Plumber Patterns with James Blair. Plumber appears to be on the rise, and I’d love to see a production-grade setting with this framework. And, last but not least: Updates on Spark, MLflow, and the broader ML ecosystem with Javier Luraschi looks very interesting. Python definitely has a head start here, and I would love to learn the state of affairs in the R world.

Here are my picks:

Creating custom vector classes with the vctrs package with Jesse Sadler: it’s a new implementation that allows you to build custom vector classes. I’d be curious to hear about the updates to the package.

Asynchronous programming in R with Winston Chang: It’ll be very powerful when we’re talking about its application to Shiny and Plumber. Winston Chang does great work.

R: Then and Now with Jared Lander: it would be really good to go back and see how R looked 10 years ago, and how it’s developed over time.

MARCIN

Krzysiek and Marcin at WhyR? Warsaw 2019

Modern Geospatial Data Analysis with R Workshop: Like Damian, I am particularly interested in maps and putting data on them. This is something I learned at university and I would like to hear about the recent developments in this area.

What They Forgot to Teach You about R Workshop with Jenny Bryan, Jim Hester, and Kara Woo: I am expecting to learn something extraordinary, something I wouldn’t have discovered myself! That’s an impressive line-up of speakers.

Deploying End-To-End Data Science with Shiny, Plumber, and Pins: Pins was recently introduced by RStudio and I would like to learn more about it!

KRYSTIAN

I’d be happy to attend Deploying End-To-End Data Science with Shiny, Plumber, and Pins by Alex Gold. My daily work is strongly related to the topic so I’d be happy to learn how to improve my approach and make use of tips I could hear during the workshop. I’d also like to get inspiration on different approaches for scaling R solutions so for sure Practical Plumber Patterns by James Blair and Asynchronous programming in R by Winston Chang would be a good choice.

Follow Appsilon Data Science on Social Media



### Schön‘ juten Tach & Moin Moin: R trainings in Berlin & Hamburg

Wed, 01/22/2020 - 11:31

[This article was first published on R-Bloggers – eoda GmbH, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R is one of the leading programming languages for data analysis and data visualization. Unlock the potential of data science in our popular R trainings and learn how to turn data into added value with the open source language.

With more than 1,500 satisfied participants, eoda’s R trainings are among the leading courses in the German-speaking world. In April and October 2020 we will bring our popular trainings “Introduction to R” and “Machine Learning with R” to Berlin and Hamburg.

Course contents:

Introduction to R

With practical tips and exercises, this introductory course lays the foundation for further, independent use of R in your own applications. The focus is on teaching the logic and terminology of R. The course is aimed at beginners and therefore requires no in-depth prior knowledge.

Machine Learning with R

Use machine learning and data mining algorithms to develop artificial intelligence applications based on data. Also learn about the challenges you face and how you can master them with R.

By means of practical examples and hands-on exercises, this course will give you the skills to independently implement machine learning procedures in R.

Berlin

Introduction to R 21.04. – 22.04.2020 | Machine Learning with R 23.04. – 24.04.2020

Hamburg

Introduction to R 13.10. – 14.10.2020 | Machine Learning with R 15.10. – 16.10.2020

These two introductory courses can also be booked as a bundle for an attractive price.

Secure one of the coveted places and become a data science expert with R! We look forward to your registration. You can find more about the trainings here: Berlin and Hamburg.

Even more: our Python trainings also give you a practical insight into the feature set of this universal programming language in a data science context. Contact us!



### Descriptive statistics in R

Wed, 01/22/2020 - 01:00

[This article was first published on R on Stats and R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

This article explains how to compute the main descriptive statistics in R and how to present them graphically. To learn more about the reasoning behind each descriptive statistic, how to compute them by hand and how to interpret them, read the article “Descriptive statistics by hand”.

To briefly recap what was said in that article: descriptive statistics (in the broad sense of the term) is a branch of statistics aiming at summarizing, describing and presenting a series of values or a dataset. Descriptive statistics is often the first step in, and an important part of, any statistical analysis. It allows you to check the quality of the data, and it helps to “understand” the data by giving a clear overview of it. If well presented, descriptive statistics is already a good starting point for further analyses.

There exist many measures to summarize a dataset. They are divided into two types: (i) location measures and (ii) dispersion measures. Location measures give an understanding of the central tendency of the data, whereas dispersion measures give an understanding of the spread of the data. In this article, we focus only on the implementation in R of the most common descriptive statistics and their visualizations (when deemed appropriate). See the above-mentioned article for more information about the purpose and usage of each measure.

Data

We use the dataset iris throughout the article. This dataset is imported by default in R; you only need to load it:

```r
dat <- iris # load the iris dataset and rename it dat
```

Below is a preview of this dataset and its structure:

```r
head(dat) # first 6 observations
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

str(dat) # structure of dataset
## 'data.frame': 150 obs. of 5 variables:
##  $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```

The dataset contains 150 observations and 5 variables, representing the length and width of the sepal and petal, and the species, of 150 flowers. Length and width of the sepal and petal are numeric variables, and the species is a factor with 3 levels (indicated by num and Factor w/ 3 levels after the names of the variables). See the different variable types in R if you need a refresher.

Regarding plots, we present the default graphs and the graphs from the well-known {ggplot2} package. Graphs from the {ggplot2} package usually look better, but they require more advanced coding skills. If you need to publish or share your graphs, I suggest using {ggplot2} if you can; otherwise the default graphics will do the job.

All plots displayed in this article can be customized. For instance, it is possible to edit the title, x- and y-axis labels, colors, etc. However, customizing plots is beyond the scope of this article, so all plots are presented without any customization. Interested readers will find numerous resources online.

Minimum and maximum

The minimum and maximum can be found with the min() and max() functions:

```r
min(dat$Sepal.Length)
## [1] 4.3

max(dat$Sepal.Length)
## [1] 7.9
```

Alternatively, the range() function:

```r
rng <- range(dat$Sepal.Length)
rng
## [1] 4.3 7.9
```

gives you the minimum and maximum directly. Note that the output of the range() function is actually an object containing the minimum and maximum (in that order). This means you can actually access the minimum with:

```r
rng[1] # rng = name of the object specified above
## [1] 4.3
```

and the maximum with:

```r
rng[2]
## [1] 7.9
```

This reminds us that, in R, there are often several ways to arrive at the same result. The method that uses the shortest piece of code is usually preferred as a shorter piece of code is less prone to coding errors and more readable.

Range

The range can then be easily computed, as you have guessed, by subtracting the minimum from the maximum:

```r
max(dat$Sepal.Length) - min(dat$Sepal.Length)
## [1] 3.6
```

To my knowledge, there is no default function to compute the range. However, if you are familiar with writing functions in R, you can create your own function to compute the range:

```r
range2 <- function(x) {
  range <- max(x) - min(x)
  return(range)
}

range2(dat$Sepal.Length)
## [1] 3.6
```

which is equivalent to the $$max – min$$ presented above.

Mean

The mean can be computed with the mean() function:

```r
mean(dat$Sepal.Length)
## [1] 5.843333
```

Tips:

• If there is at least one missing value in your dataset, use mean(dat$Sepal.Length, na.rm = TRUE) to compute the mean with the NAs excluded. This argument can be used for most functions presented in this article, not only mean().
• For a truncated mean, use mean(dat$Sepal.Length, trim = 0.10) and change the trim argument to your needs.
Median

The median can be computed thanks to the median() function:

```r
median(dat$Sepal.Length)
## [1] 5.8
```

or with the quantile() function:

```r
quantile(dat$Sepal.Length, 0.5)
## 50% 
## 5.8
```

since the quantile of order 0.5 ($$q_{0.5}$$) corresponds to the median.

First and third quartile

As with the median, the first and third quartiles can be computed with the quantile() function, by setting the second argument to 0.25 or 0.75:

```r
quantile(dat$Sepal.Length, 0.25) # first quartile
## 25% 
## 5.1

quantile(dat$Sepal.Length, 0.75) # third quartile
## 75% 
## 6.4
```

You may have noticed that the results above differ slightly from the results you would find if you computed the first and third quartiles by hand. That is normal; there are many methods to compute them (R's quantile() function actually implements 9 different algorithms, selected via its type argument). However, the methods presented here and in the article “Descriptive statistics by hand” are the easiest and most “standard” ones. Furthermore, results do not change dramatically between methods.
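To see how the choice of algorithm matters, here is a small sketch comparing two of the algorithms offered by base R's quantile() (type = 7 is the default; the data vector is ours, chosen for illustration):

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# default method: linear interpolation between order statistics
quantile(x, 0.25, type = 7) # 3.25

# inverse of the empirical distribution function: always a data value
quantile(x, 0.25, type = 1) # 3
```

The differences shrink as the sample grows, which is why the text above notes that results do not change dramatically between methods.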

Other quantiles

As you have guessed, any quantile can also be computed with the quantile() function. For instance, the $$4^{th}$$ decile or the $$98^{th}$$ percentile:

```r
quantile(dat$Sepal.Length, 0.4) # 4th decile
## 40% 
## 5.6

quantile(dat$Sepal.Length, 0.98) # 98th percentile
## 98% 
## 7.7
```

Interquartile range

The interquartile range (i.e., the difference between the first and third quartile) can be computed with the IQR() function:

IQR(dat$Sepal.Length)
## [1] 1.3

or alternatively with the quantile() function again:

quantile(dat$Sepal.Length, 0.75) - quantile(dat$Sepal.Length, 0.25)
## 75%
## 1.3

As mentioned earlier, when possible it is usually recommended to use the shortest piece of code to arrive at the result. For this reason, the IQR() function is preferred for computing the interquartile range.

Standard deviation and variance

The standard deviation and the variance are computed with the sd() and var() functions:

sd(dat$Sepal.Length) # standard deviation
## [1] 0.8280661
var(dat$Sepal.Length) # variance
## [1] 0.6856935

Remember from this article that the standard deviation and the variance differ depending on whether we compute them for a sample or a population (see the difference between the two here). In R, the standard deviation and the variance are computed as if the data represent a sample (so the denominator is $$n - 1$$, where $$n$$ is the number of observations). To my knowledge, there is no function by default in R that computes the standard deviation or variance for a population.

Tip: to compute the standard deviation (or variance) of multiple variables at the same time, use lapply() with the appropriate statistic as second argument:

lapply(dat[, 1:4], sd)
## $Sepal.Length
## [1] 0.8280661
##
## $Sepal.Width
## [1] 0.4358663
##
## $Petal.Length
## [1] 1.765298
##
## $Petal.Width
## [1] 0.7622377

The command dat[, 1:4] selects variables 1 to 4, as the fifth variable is a qualitative variable and the standard deviation cannot be computed on that type of variable. See a recap of the different data types in R if needed.

Summary

You can compute the minimum, $$1^{st}$$ quartile, median, mean, $$3^{rd}$$ quartile and the maximum for all numeric variables of a dataset at once using summary():

summary(dat)
##  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50

Tip: if you need these descriptive statistics by group, use the by() function:

by(dat, dat$Species, summary)
## dat$Species: setosa
##  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600
##        Species
##  setosa    :50
##  versicolor: 0
##  virginica : 0
## ------------------------------------------------------------
## dat$Species: versicolor
##  Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species
##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0
##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50
##  Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0
##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326
##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500
##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800
## ------------------------------------------------------------
## dat$Species: virginica
##  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400
##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800
##  Median :6.500   Median :3.000   Median :5.550   Median :2.000
##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026
##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300
##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    : 0
##  versicolor: 0
##  virginica :50

where the arguments are the name of the dataset, the grouping variable and the summary function. Follow this order, or specify the names of the arguments if you do not follow it.

If you need more descriptive statistics, use stat.desc() from the package {pastecs}:

library(pastecs)
stat.desc(dat)
##              Sepal.Length  Sepal.Width Petal.Length  Petal.Width Species
## nbr.val      150.00000000 150.00000000  150.0000000 150.00000000      NA
## nbr.null       0.00000000   0.00000000    0.0000000   0.00000000      NA
## nbr.na         0.00000000   0.00000000    0.0000000   0.00000000      NA
## min            4.30000000   2.00000000    1.0000000   0.10000000      NA
## max            7.90000000   4.40000000    6.9000000   2.50000000      NA
## range          3.60000000   2.40000000    5.9000000   2.40000000      NA
## sum          876.50000000 458.60000000  563.7000000 179.90000000      NA
## median         5.80000000   3.00000000    4.3500000   1.30000000      NA
## mean           5.84333333   3.05733333    3.7580000   1.19933333      NA
## SE.mean        0.06761132   0.03558833    0.1441360   0.06223645      NA
## CI.mean.0.95   0.13360085   0.07032302    0.2848146   0.12298004      NA
## var            0.68569351   0.18997942    3.1162779   0.58100626      NA
## std.dev        0.82806613   0.43586628    1.7652982   0.76223767      NA
## coef.var       0.14171126   0.14256420    0.4697441   0.63555114      NA

You can have even more statistics (i.e., skewness, kurtosis and a normality test) by adding the argument norm = TRUE to the previous function. Note that the variable Species is not numeric, so descriptive statistics cannot be computed for it and NA is displayed.

Coefficient of variation

The coefficient of variation can be found with stat.desc() (see the line coef.var in the table above) or computed manually (remember that the coefficient of variation is the standard deviation divided by the mean):

sd(dat$Sepal.Length) / mean(dat$Sepal.Length)
## [1] 0.1417113

Mode

To my knowledge there is no built-in function in R to find the mode of a variable.
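Since sd() and var() use the sample denominator $$n - 1$$, a population version has to be written by hand. Here is a minimal sketch; the helper names pop_var and pop_sd are my own, not base R functions:

```r
# Population variance and standard deviation (denominator n, not n - 1).
# pop_var / pop_sd are illustrative helpers, not part of base R.
pop_var <- function(x) mean((x - mean(x))^2)
pop_sd  <- function(x) sqrt(pop_var(x))

pop_sd(iris$Sepal.Length) # slightly smaller than sd(iris$Sepal.Length)
```

Multiplying var(x) by (n - 1) / n gives the same result, since that undoes the sample correction.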
However, we can easily find it thanks to the functions table() and sort():

tab <- table(dat$Sepal.Length) # number of occurrences for each unique value
sort(tab, decreasing = TRUE) # sort highest to lowest
##
##   5 5.1 6.3 5.7 6.7 5.5 5.8 6.4 4.9 5.4 5.6   6 6.1 4.8 6.5 4.6 5.2 6.2 6.9 7.7
##  10   9   9   8   8   7   7   7   6   6   6   6   6   5   5   4   4   4   4   4
## 4.4 5.9 6.8 7.2 4.7 6.6 4.3 4.5 5.3   7 7.1 7.3 7.4 7.6 7.9
##   3   3   3   3   2   2   1   1   1   1   1   1   1   1   1

table() gives the number of occurrences for each unique value, then sort() with the argument decreasing = TRUE displays the number of occurrences from highest to lowest. The mode of the variable Sepal.Length is thus 5. This code to find the mode can also be applied to qualitative variables such as Species:
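The table()-and-sort() recipe above can be wrapped into a small reusable helper. A sketch (the name stat_mode is my own, not a base R function; note it returns all values tied for the highest count):

```r
# Return the mode(s) of a vector: the value(s) with the highest count.
# stat_mode is an illustrative helper, not a base R function.
stat_mode <- function(x) {
  tab <- table(x)             # count occurrences of each unique value
  names(tab)[tab == max(tab)] # keep every value tied for the maximum
}

stat_mode(iris$Sepal.Length) # "5", matching the result above
```

Because table() returns named counts, the result is always a character vector; wrap it in as.numeric() if a numeric mode is needed.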

sort(table(dat$Species), decreasing = TRUE)
##
##     setosa versicolor  virginica
##         50         50         50

or:

summary(dat$Species)
##     setosa versicolor  virginica
##         50         50         50

Contingency table

table() introduced above can also be used on two qualitative variables to create a contingency table. The dataset iris has only one qualitative variable so we create a new qualitative variable just for this example. We create the variable size which corresponds to small if the length of the petal is smaller than the median of all flowers, big otherwise:

dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length),
  "small", "big"
)

Here is a recap of the occurrences by size:

table(dat$size)
##
##   big small
##    77    73

We now create a contingency table of the two variables Species and size with the table() function:

table(dat$Species, dat$size)
##
##              big small
##   setosa       1    49
##   versicolor  29    21
##   virginica   47     3

or with the xtabs() function:

xtabs(~ dat$Species + dat$size)
##             dat$size
## dat$Species  big small
##   setosa       1    49
##   versicolor  29    21
##   virginica   47     3

The contingency table gives the number of cases in each subgroup. For instance, there is only one big setosa flower, while there are 49 small setosa flowers in the dataset.

Note that Species are in rows and size in column because we specified Species and then size in table(). Change the order if you want to switch the two variables.

Instead of having the frequencies (i.e., the number of cases) you can also have the relative frequencies in each subgroup by adding the table() function inside the prop.table() function:

prop.table(table(dat$Species, dat$size))
##
##                      big       small
##   setosa     0.006666667 0.326666667
##   versicolor 0.193333333 0.140000000
##   virginica  0.313333333 0.020000000

Note that you can also compute the percentages by row or by column by adding a second argument to the prop.table() function: 1 for row, or 2 for column:

# percentages by row:
round(prop.table(table(dat$Species, dat$size), 1), 2) # round to 2 digits with round()
##
##               big small
##   setosa     0.02  0.98
##   versicolor 0.58  0.42
##   virginica  0.94  0.06

# percentages by column:
round(prop.table(table(dat$Species, dat$size), 2), 2) # round to 2 digits with round()
##
##               big small
##   setosa     0.01  0.67
##   versicolor 0.38  0.29
##   virginica  0.61  0.04

Barplot

Barplots can only be done on qualitative variables (see the difference with a quantitative variable here). A barplot is a tool to visualize the distribution of a qualitative variable. We draw a barplot on the qualitative variable size:

barplot(table(dat$size)) # table() is mandatory

You can also draw a barplot of the relative frequencies instead of the frequencies by adding prop.table() as we did earlier:

barplot(prop.table(table(dat$size)))

In {ggplot2}:

library(ggplot2) # needed each time you open RStudio
# The package ggplot2 must be installed first

ggplot(dat) +
  aes(x = size) +
  geom_bar()

Histogram

A histogram gives an idea about the distribution of a quantitative variable. The idea is to break the range of values into intervals and count how many observations fall into each interval. Histograms are a bit similar to barplots, but histograms are used for quantitative variables whereas barplots are used for qualitative variables. To draw a histogram in R, use hist():

hist(dat$Sepal.Length)

Add the argument breaks = inside the hist() function if you want to change the number of bins. A simple rule of thumb (the square-root choice) is that the number of bins should be the rounded value of the square root of the number of observations. (Sturges' rule, which hist() uses by default, instead takes roughly $$\log_2 n + 1$$ bins.) The dataset includes 150 observations, so in this case the number of bins can be set to 12.

In {ggplot2}:

ggplot(dat) +
  aes(x = Sepal.Length) +
  geom_histogram()

By default, the number of bins is 30. You can change this value with geom_histogram(bins = 12) for instance.

Boxplot

Boxplots are really useful in descriptive statistics and are often underused (mostly because they are not well understood by the public). A boxplot graphically represents the distribution of a quantitative variable by visually displaying five common location summaries (minimum, median, first and third quartiles and maximum) and any observation that was classified as a suspected outlier using the interquartile range (IQR) criterion. The IQR criterion means that all observations above $$q_{0.75} + 1.5 \cdot IQR$$ and below $$q_{0.25} - 1.5 \cdot IQR$$ (where $$q_{0.25}$$ and $$q_{0.75}$$ correspond to the first and third quartile respectively) are considered potential outliers by R. The minimum and maximum in the boxplot are represented without these suspected outliers. Seeing all this information on the same plot helps to get a good first overview of the dispersion and the location of the data. Before drawing a boxplot of our data, see below a graph explaining the information present on a boxplot:

Detailed boxplot. Source: LFSAB1105 at UCLouvain

Now an example with our dataset:

boxplot(dat$Sepal.Length)
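The IQR criterion described above can also be checked numerically, which is handy for listing the exact observations that boxplot() would flag. A minimal sketch, assuming dat is the iris data loaded earlier in the article:

```r
dat <- iris # same dataset as in the text

q1  <- quantile(dat$Sepal.Length, 0.25)
q3  <- quantile(dat$Sepal.Length, 0.75)
iqr <- IQR(dat$Sepal.Length)

# Observations outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] are suspected outliers
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
dat$Sepal.Length[dat$Sepal.Length < lower | dat$Sepal.Length > upper]
# numeric(0): no suspected outliers for this variable
```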

Boxplots are even more informative when presented side-by-side for comparing and contrasting distributions from two or more groups. For instance, we compare the length of the sepal across the different species:

boxplot(dat$Sepal.Length ~ dat$Species)

In {ggplot2}:

ggplot(dat) + aes(x = Species, y = Sepal.Length) + geom_boxplot()

Scatterplot

Scatterplots allow us to check whether there is a potential link between two quantitative variables. For instance, here is a scatterplot of the length of the sepal against the length of the petal:

plot(dat$Sepal.Length, dat$Petal.Length)

There seems to be a positive association between the two variables.

In {ggplot2}:

ggplot(dat) + aes(x = Sepal.Length, y = Petal.Length) + geom_point()

Like boxplots, scatterplots are even more informative when differentiating the points according to a factor, in this case the species:

ggplot(dat) + aes(x = Sepal.Length, y = Petal.Length, colour = Species) + geom_point() + scale_color_hue()

QQ-plot

For a single variable

In order to check the normality assumption of a variable (normality means that the data follow a normal distribution, also known as a Gaussian distribution), we usually use histograms and/or QQ-plots.1

Histograms have been presented earlier, so here is how to draw a QQ-plot:

# Draw points on the qq-plot:
qqnorm(dat$Sepal.Length)
# Draw the reference line:
qqline(dat$Sepal.Length)

Or a QQ-plot with confidence bands with the qqPlot() function from the {car} package:

library(car) # package must be installed first
qqPlot(dat$Sepal.Length)
## [1] 132 118

If points are close to the reference line (sometimes referred to as Henry's line) and within the confidence bands, the normality assumption can be considered met. The bigger the deviation between the points and the reference line, and the more points lie outside the confidence bands, the less likely it is that the normality condition is met. The variable Sepal.Length does not seem to follow a normal distribution because several points lie outside the confidence bands. When facing a non-normal distribution, the first step is usually to apply the logarithm transformation to the data and recheck whether the log-transformed data are normally distributed. Applying the logarithm transformation can be done with the log() function.

In {ggplot2}:

# method 1
qplot(sample = Sepal.Length, data = dat)

# method 2
ggplot(dat, aes(sample = Sepal.Length)) +
  stat_qq()

By groups

For some statistical tests, the normality assumption is required in all groups. One solution is to draw a QQ-plot for each group by manually splitting the dataset into different groups and then drawing a QQ-plot for each subset of the data (with the methods shown above). Another (easier) solution is to draw a QQ-plot for each group automatically with the argument groups = in the function qqPlot() from the {car} package:

qqPlot(dat$Sepal.Length, groups = dat$size)

In {ggplot2}:

qplot(
  sample = Sepal.Length,
  data = dat,
  col = size,
  shape = size
)

It is also possible to differentiate groups by only shape or only color. To do so, remove one of the arguments col or shape in the qplot() function above.

Density plot

A density plot is a smoothed version of the histogram and is used for the same purpose, that is, to represent the distribution of a numeric variable. The functions plot() and density() are used together to draw a density plot:

plot(density(dat$Sepal.Length))

In {ggplot2}:

ggplot(dat) + aes(x = Sepal.Length) + geom_density()

Thanks for reading. I hope this article helped you to do descriptive statistics in R. If you would like to do the same by hand or understand what these statistics represent, read the article “Descriptive statistics by hand”. As always, if you find a mistake/bug or if you have any questions do not hesitate to let me know in the comment section below, raise an issue on GitHub or contact me. Get updates every time a new article is published by subscribing to this blog.

1. Normality tests such as Shapiro-Wilk or Kolmogorov-Smirnov tests can also be used to test whether the data follow a normal distribution or not. However, in practice, normality tests are often considered as too conservative in the sense that sometimes a very limited number of observations may cause the normality condition to be violated. For this reason, it is often the case that the normality condition is verified only based on histograms and QQ-plots.
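For completeness, the Shapiro-Wilk test mentioned in this footnote is available in base R via shapiro.test(); a p-value below the chosen significance level (often 0.05) suggests a departure from normality:

```r
# Shapiro-Wilk normality test on the same variable used in the article;
# the printed output reports the W statistic and the p-value.
shapiro.test(iris$Sepal.Length)
```

The test accepts samples of 3 to 5000 observations; for larger data, graphical checks such as the QQ-plots above remain the practical option.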


### RStudio Connect 1.8.0

Wed, 01/22/2020 - 01:00

RStudio Connect helps data science teams quickly make an impact by enabling them
to share reports, models, dashboards, and applications created in R and
Python with business stakeholders. The 1.8.0 release makes it even easier for
teams to start sharing.

For Data Scientists, highlights include:

• Python Support: Improvements that make it easier to share Jupyter Notebooks and mixed R and Python content.
• pins: An easy way to share data, models, and other objects.
• Custom Emails: Use code to create beautiful emails with plots and results inline.

For DevOps / IT Administrators, 1.8.0 makes it easier to support data science teams in
production. For Data Science Leaders, 1.8.0 simplifies onboarding new team members and eases
collaboration.

Full details are available on the RStudio Connect website. An easy way to get started is
with the RStudio Team Quickstart to experience
all of RStudio's products on an easy-to-use virtual machine, or begin a free 45-day evaluation of RStudio Connect.

Python Support

RStudio Connect has supported both R and Python for over a year, and during this time we've made significant improvements. Data scientists can now publish Jupyter Notebooks with a single click or by using version control. Data scientists who use both R and Python also have more flexibility, helping them deploy mixed content using the reticulate R package.

Pins

The pins package makes it easy to share data, models, and other objects on RStudio Connect. Pins are especially useful if you have data that regularly updates; simply schedule an R Markdown document to process your data and pin your results or model. Once pinned, your reports, applications, and APIs can automatically pull the updates. Learn more or see an example.
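The schedule-then-pin workflow described above can be sketched as follows. This uses the pins 0.x API that was current at the time of this release; the server URL, environment variable, and pin name are placeholders of my own, not values from the announcement:

```r
library(pins)

# Register the RStudio Connect board
# (server and key are placeholders; supply your own)
board_register("rsconnect",
               server = "https://connect.example.com",
               key    = Sys.getenv("CONNECT_API_KEY"))

# A scheduled R Markdown job could pin refreshed data or a model...
pin(mtcars, name = "latest_car_data", board = "rsconnect")

# ...and downstream reports, apps, and APIs pull the update automatically
fresh <- pin_get("latest_car_data", board = "rsconnect")
```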

Custom Emails

Sending plots, tables, and results inline in an email is a powerful way for data scientists to make an impact. RStudio Connect customers use custom emails to send daily reminders, conditional alerts, and to track key metrics. The latest release of the blastula package makes it even easier for data scientists to specify these emails programmatically:

if (demand_forecast > 1000) {
  render_connect_email(input = "alert-supply-team-email.Rmd") %>%
    attach_connect_email(
      subject = sprintf("ALERT: Forecasted increase of %g units", increase),
      attach_output = TRUE,
      attachments = c("demand_forecast_data.csv")
    )
} else {
  suppress_scheduled_email()
}

Jump Start Examples

A common challenge facing data science teams is onboarding new users. Data
scientists have to learn new tools, methods, and often a new domain. We’ve
created a set of examples to help data scientists learn common best practices. A
new tutorial in the product helps users publish different data products, which is
particularly helpful for data scientists exploring new content types, such as
reports, models, or datasets.

Git-centered Deployments

RStudio Connect's push-button publishing is a convenient and simple way for data scientists to share their work. However, some teams prefer Git-centric workflows, especially when deploying content in production. RStudio Connect supports these workflows, making it effortless for data science teams to adopt version control best practices without maintaining additional infrastructure or remembering complex workflows. Data scientists simply commit to Git, and RStudio Connect will update the content, saving you any extra steps.

Scheduled Content Calendar

For system and application administrators, RStudio Connect makes it simple to
audit data science work. For data science teams, one powerful application of
RStudio Connect is the ability to schedule tasks. These tasks can be everything
from simple ETL jobs to daily reports. In 1.8.0, we’ve made it easier for
administrators to track these tasks across all publishers in a single place.
This new view makes it possible to identify conflicts or times when the server
is being overbooked.

• RStudio Connect no longer supports Ubuntu 14.04 LTS (Trusty Tahr).
• The Postgres.URL database connection URL will always use a nonempty Postgres.Password value as its password. Previous releases would use the password only when a {$} placeholder was present. Support for the Postgres.URL placeholder {$} is deprecated and will be removed in a future release.
• Duplicate user names are now accepted in some conditions for LDAP, SAML, and proxied authentication providers. See the release notes for details.
• The setting HTTPS.ExcludedCiphers has been removed and is no longer supported. The HTTPS.MinimumTLS setting should be used to specify a minimum accepted TLS version. We recommend running a secure proxy when your organization has more complex HTTPS requirements.
• The deprecated setting Applications.DisabledProtocols has been removed. Use Applications.DisabledProtocol instead. Multiple values should be placed one per line with this new setting.
• The deprecated settings OAuth2.AllowedDomains and OAuth2.AllowedEmails have been removed. Use OAuth2.AllowedDomain and OAuth2.AllowedEmail instead. Multiple values should be placed one per line with these new settings.
• TensorFlow Model API deployment supports models created with TensorFlow up to version 1.15.0. A systems administrator will need to update the version of libtensorflow.so installed on your RStudio Connect system.

For changes in earlier versions, we recommend consulting the release posts for the 1.7 series.

This release also includes numerous bug fixes, the full release notes document all of the changes. Some of our favorites:

• The “Info” panel for Jupyter notebook content includes a summary of recent usage.
• Python-based content can now use an application RunAs setting.
• Scheduled reports no longer run multiple times if the application and database clocks disagree, or in the case of concurrent rendering attempts in a multi-node installation.

Aside from the deprecations and breaking changes listed above,
there are no other special considerations, and upgrading should require less than
five minutes. If you are upgrading from a version earlier than 1.7.8, be sure to consult
the release notes for the intermediate releases, as well.

Get Started with RStudio Connect

If you have not yet tried RStudio Connect, we encourage you to do so.
RStudio Connect is the best way to share all the work that you do in R and Python with
collaborators, colleagues, or customers.

You can find more details or download a 45-day evaluation of the product at
https://www.rstudio.com/products/connect/.
Additional resources can be found below:


### a very quick Riddle

Wed, 01/22/2020 - 00:19

[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers].

A very quick Riddler’s riddle last week with the question

Find the (integer) fraction with the smallest (integer) denominator strictly located between 1/2020 and 1/2019.

and the brute force resolution

for (t in (2020*2019):2021){
  a = ceiling(t/2020)
  if (a*2019 < t) sol = c(a, t)
}
sol

leading to 2/4039 as the target.
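As a quick sanity check (my own sketch, not from the original post), one can verify directly that 2/4039 lies strictly inside the interval and that no smaller denominator admits any fraction there: a denominator t works only if some integer a satisfies t/2020 < a < t/2019.

```r
# 2/4039 is strictly between 1/2020 and 1/2019
stopifnot(1/2020 < 2/4039, 2/4039 < 1/2019)

# A denominator t fits if the smallest integer strictly above t/2020
# is still strictly below t/2019
fits <- function(t) floor(t/2020) + 1 < t/2019

# No denominator below 4039 works, and 4039 does
stopifnot(!any(sapply(2:4038, fits)), fits(4039))
```

This matches the Stern-Brocot intuition: 2/4039 is the mediant of 1/2020 and 1/2019, which is the fraction with the smallest denominator in the open interval between them.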


### Using unpack to Manage Your R Environment

Tue, 01/21/2020 - 21:33

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers].

In our last note we stated that unpack is a good tool for loading R RDS files into your working environment. Here is the idea expanded into a worked example.

# remotes::install_github("WinVector/wrapr")
library(wrapr)

a <- 5
b <- 7
do_not_want <- 13

# save the elements of our workspace we want
saveRDS(as_named_list(a, b), 'example_data.RDS')

# clear values out of our workspace for the example
rm(list = ls())
ls()
#> character(0)
# notice the workspace environment is now empty

# read back while documenting what we expect to read in
unpack[a, b] <- readRDS('example_data.RDS')

# confirm what we have; the extra unpack is a side
# effect of the []<- notation. To avoid this instead
# use one of:
#   unpack(readRDS('example_data.RDS'), a, b)
#   readRDS('example_data.RDS') %.>% unpack(., a, b)
#   readRDS('example_data.RDS') %.>% unpack[a, b]
ls()
#> [1] "a" "b" "unpack"
# notice do_not_want is not present

print(a)
#> [1] 5
print(b)
#> [1] 7

We have new documentation on the new as_named_list helper function here.

The idea is: this is a case where non-standard evaluation is working for us (and clarity/safety) as it forces the user to document each object they want written out (by explicitly naming them), and exactly what objects they expect to come back (again by explicitly naming them right at the assignment location).


### Explanatory Model Analysis with modelStudio

Tue, 01/21/2020 - 15:15

[This article was first published on Stories by Przemyslaw Biecek on Medium, and kindly contributed to R-bloggers].

Version 0.2.1 of the modelStudio package reached CRAN this week. Let me briefly go over the coolest new features (list of all changes).

Short reminder: The modelStudio package creates an interactive serverless D3.js dashboard for exploration of predictive models, which is based on principles of Explanatory Model Analysis. With modelStudio interface, one can juxtapose instance-level model explanations (Break Down, Shapley values and Ceteris Paribus profiles), dataset-level model explanations (Partial Dependence Plots, Feature Importance, Accumulated Local Effects Plot) and data exploration plots (histogram and scatterplots). Examples below use a GBM model trained on the kaggle FIFA 19 dataset to predict player’s value based on selected 40 player’s characteristics. Play with the live demo here.

New plots for EDA

There are two new plots in modelStudio for exploratory data analysis: Target vs Feature and Average Target vs Feature (useful for classification problems). They are especially helpful when examining Partial Dependence profiles. Both show the relation between the target and a selected feature, but the former shows the raw relation in the data, while the latter shows the relationship learned by a model.

Better defaults

Some defaults changed in version v0.2.1 to improve the general usability.

If no new instances are provided for local explanations, then by default, a small sample is taken at random from the training data.

When modelStudio is plotted, by default the first panel is set to a Break Down plot while the second panel is clicked. This saves 3 clicks!

All plots for categorical variables now have the same order of levels.

Verbose and stable calculations

modelStudio is serverless, so all computations need to be done in advance, which may take some time. By default, the whole process is more verbose, shows a progress bar and information about the current calculations.

The try-catch blocks reassure that even if some parts fail, the rest will finish and plots will show up in the dashboard.

Feature importance plots now have boxplots that show how stable the calculations are for individual variables.

Easy to create

The code chunk below creates a random forest model for the apartments dataset and then creates a modelStudio dashboard.

library(DALEX)
library(randomForest)
model <- randomForest(m2.price ~., data = apartments)
explainer <- explain(model, data = apartments[,-1], y = apartments[,1])

library(modelStudio)
modelStudio(explainer)


### Neural Network Machine Learning for NLP

Tue, 01/21/2020 - 11:24

[This article was first published on bnosac :: open analytical helpers, and kindly contributed to R-bloggers].

Last week, we updated package ruimtehol on CRAN. The package provides an easy interface for R users to Starspace which is a general purpose neural embedding model for text data.

Notable changes are that the package now also builds fine on Mac OS and runs fine on all CRAN platforms. If you are interested to see what the package can do, have a look at the presentation below or visit the package vignette at https://cran.r-project.org/web/packages/ruimtehol/vignettes/ground-control-to-ruimtehol.pdf

If you like it, give it a star at https://github.com/bnosac/ruimtehol and if you need commercial support on text mining, get in touch.

Upcoming training schedule

Interested in NLP? Then you might as well be interested in the following courses provided in Belgium. Hope to see you there!

• 2020-02-19&20: Advanced R programming: Subscribe here
• 2020-03-12&13: Computer Vision with R and Python: Subscribe here
• 2020-03-16&17: Deep Learning/Image recognition: Subscribe here
• 2020-04-22&23: Text Mining with R: Subscribe here
• 2020-05-06&07: Text Mining with Python: Subscribe here

### Product Price Prediction: A Tidy Hyperparameter Tuning and Cross Validation Tutorial

Tue, 01/21/2020 - 11:23

Product price estimation and prediction is one of the skills I teach frequently – it's a great way to analyze competitor product information, your own company's product data, and develop key insights into which product features influence product prices. Learn how to model car prices and calculate depreciation curves using the brand new tune package for hyperparameter tuning machine learning models. This is an advanced machine learning tutorial.

R Packages Covered

Machine Learning

• tune – Hyperparameter tuning framework
• parsnip – Machine learning framework
• dials – Grid search development framework
• rsample – Cross-validation and sampling framework
• recipes – Preprocessing pipeline framework
• yardstick – Accuracy measurements

EDA

• correlationfunnel – Detect relationships using binary correlation analysis
• DataExplorer – Investigate data cleanliness
• skimr – Summarize data by data type

Hyperparameter tuning and cross-validation have previously been quite difficult using parsnip, the new machine learning framework that is part of the tidymodels ecosystem. Everything has changed with the introduction of the tune package – the tidymodels hyperparameter tuning framework that integrates:

• parsnip for machine learning
• recipes for preprocessing
• dials for grid search
• rsample for cross-validation.

This is a machine learning tutorial where we model auto prices (MSRP) and estimate depreciation curves.

Estimate Depreciation Curves using Machine Learning

To implement the Depreciation Curve Estimation, you develop a machine learning model that is hyperparameter tuned using a 3-Stage Nested Hyper Parameter Tuning Process with 5-Fold Cross-Validation.

3-Stage Nested Hyperparameter Tuning Process

3 Stage Hyperparameter Tuning Process:

1. Find Parameters: Use Hyper Parameter Tuning on a “Training Dataset” that sections your training data into 5-Folds. The output at Stage 1 is the parameter set.

2. Compare and Select Best Model: Evaluate the performance on a hidden “Test Dataset”. The output at Stage 2 is the best model.

3. Train Final Model: Once we have selected the best model, we train on the full dataset. This model goes into production.
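The three stages above can be sketched with the tidymodels packages listed earlier. This is a schematic outline only: the penalized linear model, grid size, and the mtcars placeholder data are my own illustrative choices, not the tutorial's actual model or dataset:

```r
library(tidymodels) # loads parsnip, recipes, rsample, tune, dials, yardstick

set.seed(123)
split <- initial_split(mtcars, prop = 0.8) # placeholder data
train <- training(split)
test  <- testing(split)

# Stage 1: Find Parameters - tune on 5-fold CV over the training set
folds <- vfold_cv(train, v = 5)
spec  <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")
rec   <- recipe(mpg ~ ., data = train) %>%
  step_normalize(all_predictors())
wf    <- workflow() %>% add_recipe(rec) %>% add_model(spec)
grid  <- grid_regular(penalty(), mixture(), levels = 5)
tuned <- tune_grid(wf, resamples = folds, grid = grid)
best  <- select_best(tuned, metric = "rmse")

# Stage 2: Compare and Select - evaluate the finalized model on the test set
final_wf  <- finalize_workflow(wf, best)
fit_train <- fit(final_wf, data = train)
predict(fit_train, test) %>%
  bind_cols(test) %>%
  rmse(truth = mpg, estimate = .pred)

# Stage 3: Train Final Model - refit on the full dataset for production
final_fit <- fit(final_wf, data = mtcars)
```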

Need to learn Data Science for Business? This is an advanced tutorial, but you can get the foundational skills, advanced machine learning, business consulting, and web application development using R, Shiny (Apps), H2O (ML), AWS (Cloud), and tidyverse (data manipulation and visualization). I recommend my 4-Course R-Track for Business Bundle.

Why Product Pricing is Important

Product price prediction is an important tool for businesses. There are key pricing actions that machine learning and algorithmic price modeling can be used for.

Which brands will customers pay more for?

Defend against inconsistent pricing

There is nothing more confusing to customers than pricing products inconsistently. Price too high, and customers fail to see the difference between competitor products and yours. Price too low and profit can suffer. Use machine learning to price products consistently based on the key product features that your customers care about.

Learn which features drive pricing and customer purchase decisions

In competitive markets, pricing is based on supply and demand economics. Sellers adjust prices to maximize profitability given market conditions. Use machine learning and explainable ML techniques to interpret the value of features such as brand (luxury vs economy), performance (engine horsepower), age (vehicle year), and more.

Develop price profiles (Appreciation / Depreciation Curves)

Another important concept for products like homes and automobiles is the ability to monitor the effect of time. Homes tend to appreciate, and machinery (including automobiles) tends to depreciate in value over time. Use machine learning to develop price curves. We’ll do just that in this tutorial, examining the MSRP of vehicles manufactured across time.

Depreciation Curve for Dodge Ram 1500 Pickup
Read on to learn how to make this plot

Product Price Tutorial

Onward – To the Product Price Prediction and Hyperparameter Tuning Tutorial.

1.0 Libraries and Data

# Tidymodels
library(tune)
library(dials)
library(parsnip)
library(rsample)
library(recipes)
library(textrecipes)
library(yardstick)
library(vip)

# Parallel Processing
library(doParallel)
all_cores <- parallel::detectCores(logical = FALSE)
registerDoParallel(cores = all_cores)

# Data Cleaning
library(janitor)

# EDA
library(skimr)
library(correlationfunnel)
library(DataExplorer)

# ggplot2 Helpers
library(gghighlight)
library(patchwork)

# Core
library(tidyverse)
library(tidyquant)
library(knitr)

Next, get the data used for this tutorial. This data set containing Car Features and MSRP was scraped from “Edmunds and Twitter”.

Car Features and MSRP Dataset

2.0 Data Understanding

Read the data using read_csv() and use clean_names() from the janitor package to clean up the column names.

car_prices_tbl <- read_csv("2020-01-21-tune/data/data.csv") %>%
    clean_names() %>%
    select(msrp, everything())

car_prices_tbl %>% head(5) %>% kable()

| msrp | make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 46135 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335 | 6 | MANUAL | rear wheel drive | 2 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 |
| 40650 | BMW | 1 Series | 2011 | premium unleaded (required) | 300 | 6 | MANUAL | rear wheel drive | 2 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 |
| 36350 | BMW | 1 Series | 2011 | premium unleaded (required) | 300 | 6 | MANUAL | rear wheel drive | 2 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 |
| 29450 | BMW | 1 Series | 2011 | premium unleaded (required) | 230 | 6 | MANUAL | rear wheel drive | 2 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 |
| 34500 | BMW | 1 Series | 2011 | premium unleaded (required) | 230 | 6 | MANUAL | rear wheel drive | 2 | Luxury | Compact | Convertible | 28 | 18 | 3916 |

We can get a sense of the data using ggplot2 visualizations and correlation analysis to detect key features in the dataset.

2.1 Engine Horsepower vs MSRP

First, let’s take a look at two factors I find interesting:

• MSRP (vehicle price) – Our target, what customers on average pay for the vehicle, and
• Engine horsepower – A measure of product performance
car_plot_data_tbl <- car_prices_tbl %>%
    mutate(make_model = str_c(make, " ", model)) %>%
    select(msrp, engine_hp, make_model)

car_plot_data_tbl %>%
    ggplot(aes(engine_hp, msrp)) +
    geom_point(color = palette_light()["blue"], alpha = 0.15) +
    geom_smooth(method = "lm") +
    scale_y_log10(label = scales::dollar_format()) +
    theme_tq() +
    labs(title = "Engine Horsepower vs MSRP by Vehicle Age",
         x = "Engine HP", y = "MSRP (log scale)")

We can see that there are two distinct groups in the plots. We can inspect a bit closer with gghighlight and patchwork. I specifically focus on the high end of both groups:

• Vehicles with MSRP greater than $650,000
• Vehicles with Engine HP greater than 350 and MSRP less than $10,000
p1 <- car_plot_data_tbl %>%
    ggplot(aes(engine_hp, msrp)) +
    geom_point(color = palette_light()["blue"]) +
    scale_y_log10(label = scales::dollar_format()) +
    gghighlight(msrp > 650000,
                label_key = make_model,
                unhighlighted_colour = alpha("grey", 0.05),
                label_params = list(size = 2.5)) +
    theme_tq() +
    labs(title = "Vehicles with MSRP > $650K",
         x = "Engine HP", y = "MSRP (log scale)")

p2 <- car_plot_data_tbl %>%
    ggplot(aes(engine_hp, msrp)) +
    geom_point(color = palette_light()["blue"]) +
    scale_y_log10(label = scales::dollar_format()) +
    gghighlight(msrp < 10000, engine_hp > 350,
                label_key = make_model,
                unhighlighted_colour = alpha("grey", 0.05),
                label_params = list(size = 2.5)) +
    theme_tq() +
    labs(title = "Vehicles with MSRP < $10K and Engine HP > 350",
         x = "Engine HP", y = "MSRP (log scale)")

# Patchwork for stacking ggplots on top of each other
p1 / p2

Key Points:

• The first group makes total sense – Lamborghini, Bugatti and Maybach. These are all super-luxury vehicles.
• The second group is more interesting – it includes the BMW 8 Series and Mercedes-Benz 600-Class. This is odd to me because these vehicles normally have a higher starting price.
2.2 Correlation Analysis – correlationfunnel

Next, I’ll use my correlationfunnel package to home in on the low-priced vehicles. I want to find which features correlate most with low prices.

# correlationfunnel 3-step process
car_prices_tbl %>%
    drop_na(engine_hp, engine_cylinders, number_of_doors, engine_fuel_type) %>%
    binarize(n_bins = 5) %>%
    correlate(msrp__-Inf_18372) %>%
    plot_correlation_funnel()

Key Points:

Ah hah! The reduction in price is related to vehicle age. We can see that Vehicle Year less than 2004 is highly correlated with Vehicle Price (MSRP) less than $18,372. This explains why some of the 600 Series Mercedes-Benz vehicles (a luxury brand) are in the low price group. Good – I’m not going crazy.

2.3 Engine HP, MSRP by Vehicle Age

Let’s explain this story by redoing the visualization from 2.1, this time segmenting by Vehicle Year. I’ll segment by year: 2000 or older versus 2001 or newer.

car_plot_data_tbl <- car_prices_tbl %>%
    mutate(
        make_model    = str_c(make, " ", model),
        year_lte_2000 = ifelse(year <= 2000, "2000 or Older", "2001 or Newer")
    ) %>%
    select(msrp, year_lte_2000, engine_hp, make_model)

car_plot_data_tbl %>%
    ggplot(aes(engine_hp, msrp, color = year_lte_2000)) +
    geom_point(alpha = 0.05) +
    scale_color_tq() +
    scale_y_log10(label = scales::dollar_format()) +
    geom_smooth(method = "lm") +
    theme_tq() +
    labs(title = "Engine Horsepower vs MSRP by Vehicle Age",
         x = "Engine HP", y = "MSRP (log scale)")

Key Points:

As Joshua Starmer would say, “Double Bam!” Vehicle Year is the culprit.

3.0 Exploratory Data Analysis

Ok, now that I have a sense of what is going on with the data, I need to figure out what’s needed to prepare the data for machine learning. The data set was web-scraped. Datasets like this commonly have issues with missing data, unformatted data, and lack of cleanliness, and they need a lot of preprocessing to get into the format needed for modeling. We’ll fix that up with a preprocessing pipeline.

Goal

Our goal in this section is to identify data issues that need to be corrected. We will then use this information to develop a preprocessing pipeline. We will make heavy use of skimr and DataExplorer, two EDA packages I highly recommend.

3.1 Data Summary – skimr

I’m using the skim() function to break down the data by data type so I can assess missing values, number of unique categories, numeric value distributions, etc.
skim(car_prices_tbl)

## Skim summary statistics
##  n obs: 11914
##  n variables: 16
##
## ── Variable type:character ──────────────────────────────────────────
##           variable missing complete     n min max empty n_unique
##      driven_wheels       0    11914 11914  15  17     0        4
##   engine_fuel_type       3    11911 11914   6  44     0       10
##               make       0    11914 11914   3  13     0       48
##    market_category       0    11914 11914   3  54     0       72
##              model       0    11914 11914   1  35     0      915
##  transmission_type       0    11914 11914   6  16     0        5
##       vehicle_size       0    11914 11914   5   7     0        3
##      vehicle_style       0    11914 11914   5  19     0       16
##
## ── Variable type:numeric ────────────────────────────────────────────
##          variable missing complete     n     mean      sd   p0   p25   p50
##          city_mpg       0    11914 11914    19.73    8.99    7    16    18
##  engine_cylinders      30    11884 11914     5.63    1.78    0     4     6
##         engine_hp      69    11845 11914   249.39  109.19   55   170   227
##       highway_mpg       0    11914 11914    26.64    8.86   12    22    26
##              msrp       0    11914 11914 40594.74 60109.1  2000 21000 29995
##   number_of_doors       6    11908 11914     3.44    0.88    2     2     4
##        popularity       0    11914 11914  1554.91 1441.86    2   549  1385
##              year       0    11914 11914  2010.38    7.58 1990  2007  2015
##       p75    p100     hist
##        22     137 ▇▂▁▁▁▁▁▁
##         6      16 ▁▇▇▃▁▁▁▁
##       300    1001 ▅▇▃▁▁▁▁▁
##        30     354 ▇▁▁▁▁▁▁▁
##  42231.25 2065902 ▇▁▁▁▁▁▁▁
##         4       4 ▃▁▁▁▁▁▁▇
##      2009    5657 ▇▅▆▁▁▁▁▂
##      2016    2017 ▁▁▁▁▁▂▁▇

Key Points:

• We have missing values in a few features
• We have several categories with low and high numbers of unique values
• We have pretty high skew in several features

Let’s go deeper with DataExplorer.

3.2 Missing Values – DataExplorer

Let’s use the plot_missing() function to identify which columns have missing values. We’ll take care of these missing values using imputation in Section 4.

car_prices_tbl %>% plot_missing()

Key Point: We’ll zap missing values and replace them with estimated values using imputation.

3.3 Categorical Data – DataExplorer

Let’s use the plot_bar() function to identify the distribution of categories.
We have several columns with a lot of categories that have few values. We can lump these into an “Other” category.

car_prices_tbl %>% plot_bar(maxcat = 50, nrow = 2)

One category that doesn’t show up in the plot is “market_category”. This has 72 unique values – too many to plot. Let’s take a look.

car_prices_tbl %>% count(market_category, sort = TRUE)

## # A tibble: 72 x 2
##    market_category              n
##
##  1 N/A                       3742
##  2 Crossover                 1110
##  3 Flex Fuel                  872
##  4 Luxury                     855
##  5 Luxury,Performance         673
##  6 Hatchback                  641
##  7 Performance                601
##  8 Crossover,Luxury           410
##  9 Luxury,High-Performance    334
## 10 Exotic,High-Performance    261
## # … with 62 more rows

Hmmm. This is actually a “tag”-style category (where one vehicle can have multiple tags). We can clean this up using term-frequency feature engineering.

Key Points:

• We need to lump categories with few observations
• We can use text-based feature engineering on the market categories that have multiple tags

3.4 Numeric Data – DataExplorer

Let’s use plot_histogram() to investigate the numeric data. Some of these features are skewed and others are actually discrete (and best analyzed as categorical data). Skew won’t be much of an issue with tree-based algorithms, so we’ll leave those features alone (better from an explainability perspective). Discrete features like engine cylinders and number of doors are better represented as factors.

car_prices_tbl %>% plot_histogram()

Here’s a closer look at engine cylinders vs MSRP. We definitely need to encode this one as a factor, which will better capture the non-linear relationship to price. Note that cars with zero engine cylinders are electric.

car_prices_tbl %>%
    ggplot(aes(as.factor(engine_cylinders), msrp)) +
    geom_violin() +
    geom_jitter(alpha = 0.05) +
    scale_y_log10()

Key Points:

• We’ll encode engine cylinders and number of doors as categorical data to better represent non-linearity
• Tree-based algorithms can handle skew. I’ll leave the skewed features alone for explainability purposes.
Note that you can use transformations like Box-Cox to reduce skew for linear algorithms. But this transformation makes it more difficult to explain the results to non-technical stakeholders.

4.0 Machine Learning Strategy

You’re probably thinking, “Wow – that’s a lot of work just to get to this step.” Yes, you’re right. That’s why good Data Scientists are in high demand. Data Scientists that understand business, know data wrangling and visualization techniques, can identify how to treat data prior to Machine Learning, and can communicate what’s going on to the organization – these data scientists get good jobs.

At this point, it makes sense to reiterate that I have a 4-Course R-Track Curriculum that will turn you into a good Data Scientist in 6 months or less. Yeahhhh!

ML Game Plan

We have a strategy that we are going to use called “Nested Cross Validation”. It involves 3 stages.

Stage 1: Find Parameters

We need to create several machine learning models and try them out. To accomplish this, we do:

• Initial Splitting – Separate into random training and test data sets
• Preprocessing – Make a pipeline to turn raw data into a dataset ready for ML
• Cross Validation Specification – Sample the training data into 5 splits
• Model Specification – Select model algorithms and identify key tuning parameters
• Grid Specification – Set up a grid using wise parameter choices
• Hyperparameter Tuning – Implement the tuning process

Stage 2: Select Best Model

Once we have the optimal algorithm parameters for each machine learning algorithm, we can move into Stage 2. Our goal here is to compare each of the models on the “Test Dataset”, data that were not used during the parameter tuning process. We re-train on the “Training Dataset”, then evaluate against the “Test Dataset”. The best model has the best accuracy on this unseen data.
Stage 3: Retrain on the Full Dataset

Once we have the best model identified in Stage 2, we retrain the model using the best parameters from Stage 1 on the entire dataset. This gives us the best model to go into production with.

Ok, let’s get going.

5.0 Stage 1 – Preprocessing, Cross Validation, and Tuning

Time for machine learning! Just a few more steps and we’ll make and tune high-accuracy models.

5.1 Initial Train-Test Split – rsample

The first step is to split the data into Training and Testing sets. We use an 80/20 random split with the initial_split() function from rsample. The set.seed() function is used for reproducibility. Note that we have 11,914 cars (observations), of which 9,532 are randomly assigned to training and 2,382 are randomly assigned to testing.

set.seed(123)
car_initial_split <- initial_split(car_prices_tbl, prop = 0.80)
car_initial_split

## <9532/2382/11914>

5.2 Preprocessing Pipeline – recipes

What the heck is a preprocessing pipeline? A “preprocessing pipeline” (aka a “recipe”) is a set of preprocessing steps that transform raw data into data formatted for machine learning. The key advantage of a preprocessing pipeline is that it can be re-used on new data. So when you go into production, you can use the recipe to process new incoming data.

Remember in Section 3.0 when we used EDA to identify issues with our data? Now it’s time to fix those data issues using the Training Dataset. We use the recipe() function from the recipes package, then progressively add step_* functions to transform the data. The recipe we implement applies the following transformations:

1. Encode character and discrete numeric data as categorical data
2. Text-based term-frequency feature engineering for the market_category column
3. Consolidate low-frequency categories
4. Impute missing data using K-Nearest Neighbors with 5 neighbors (kNN is a fast and accurate imputation method)
5. Remove unnecessary columns (e.g. model)

preprocessing_recipe <- recipe(msrp ~ ., data = training(car_initial_split)) %>%

    # Encode Categorical Data Types
    step_string2factor(all_nominal()) %>%
    step_mutate(
        engine_cylinders = as.factor(engine_cylinders),
        number_of_doors  = as.factor(number_of_doors)
    ) %>%

    # Feature Engineering - Market Category
    step_mutate(market_category = market_category %>% str_replace_all("_", "")) %>%
    step_tokenize(market_category) %>%
    step_tokenfilter(market_category, min_times = 0.05, max_times = 1, percentage = TRUE) %>%
    step_tf(market_category, weight_scheme = "binary") %>%

    # Combine low-frequency categories
    step_other(all_nominal(), threshold = 0.02, other = "other") %>%

    # Impute missing data
    step_knnimpute(engine_fuel_type, engine_hp, engine_cylinders, number_of_doors,
                   neighbors = 5) %>%

    # Remove unnecessary columns
    step_rm(model) %>%
    prep()

preprocessing_recipe

## Data Recipe
##
## Inputs:
##
##       role #variables
##    outcome          1
##  predictor         15
##
## Training data contained 9532 data points and 84 incomplete rows.
##
## Operations:
##
## Factor variables from make, model, ... [trained]
## Variable mutation for engine_cylinders, number_of_doors [trained]
## Variable mutation for market_category [trained]
## Tokenization for market_category [trained]
## Text filtering for market_category [trained]
## Term frequency with market_category [trained]
## Collapsing factor levels for make, model, engine_fuel_type, ... [trained]
## K-nearest neighbor imputation for make, model, year, engine_hp, ... [trained]
## Variables removed model [trained]

The preprocessing recipe hasn’t yet changed the data. We’ve just come up with the recipe. To transform the data, we use bake(). I create a new variable to hold the preprocessed training dataset.

car_training_preprocessed_tbl <- preprocessing_recipe %>%
    bake(training(car_initial_split))

We can use DataExplorer to verify that the dataset has been processed. First, let’s inspect the Categorical Features.
• The category distributions have been fixed – an “other” category now lumps together infrequent categories.
• The text feature processing has added several new columns beginning with “tf_market_category_”.
• Number of doors and engine cylinders are now categorical.

car_training_preprocessed_tbl %>% plot_bar(maxcat = 50)

We can review the Numeric Features. The remaining numeric features have been left alone to preserve explainability.

car_training_preprocessed_tbl %>% plot_histogram()

5.3 Cross Validation Specification – rsample

Now that we have a preprocessing recipe and the initial training/testing split, we can develop the Cross Validation Specification. Standard practice is to use either a 5-fold or 10-fold cross validation:

• 5-Fold: I prefer 5-fold cross validation to speed up results by using 5 folds and an 80/20 split in each fold.
• 10-Fold: Others prefer a 10-fold cross validation to use more training data, with a 90/10 split in each fold. The downside is that this calculation requires twice as many models as 5-fold, which is already an expensive (time-consuming) operation.

To implement 5-Fold Cross Validation, we use vfold_cv() from the rsample package. Make sure to use your training dataset (training()) and then apply the preprocessing recipe using bake() before the vfold_cv() cross validation sampling. You now have the 5-Fold Cross Validation specification for your training dataset.

set.seed(123)
car_cv_folds <- training(car_initial_split) %>%
    bake(preprocessing_recipe, new_data = .) %>%
    vfold_cv(v = 5)
car_cv_folds

## #  5-fold cross-validation
## # A tibble: 5 x 2
##   splits id
## 1        Fold1
## 2        Fold2
## 3        Fold3
## 4        Fold4
## 5        Fold5

5.4 Model Specifications – parsnip

We’ll specify two competing models:

• glmnet – Uses an Elastic Net, which combines the LASSO and Ridge Regression techniques. This is a linear algorithm, which can have difficulty with skewed numeric data, which is present in our numeric features.
• xgboost – A tree-based algorithm that uses gradient boosted trees to develop high-performance models. Tree-based algorithms are not sensitive to skewed numeric data, which can easily be sectioned by the tree-splitting process.

5.4.1 glmnet – Model Spec

We use the linear_reg() function from parsnip to set up the initial specification. We use the tune() function from tune to identify the tuning parameters. We use set_engine() from parsnip to specify the engine as the glmnet library.

glmnet_model <- linear_reg(
    mode    = "regression",
    penalty = tune(),
    mixture = tune()
) %>%
    set_engine("glmnet")

glmnet_model

## Linear Regression Model Specification (regression)
##
## Main Arguments:
##   penalty = tune()
##   mixture = tune()
##
## Computational engine: glmnet

5.4.2 xgboost – Model Spec

A similar process is used for the XGBoost model. We specify boost_tree() and identify the tuning parameters. We set the engine to the xgboost library. Note that an update to the xgboost library has changed the default objective from reg:linear to reg:squarederror. I’m specifying this by adding an objective argument in set_engine() that gets passed to the underlying xgb.train(params = list([goes here])).

xgboost_model <- boost_tree(
    mode       = "regression",
    trees      = 1000,
    min_n      = tune(),
    tree_depth = tune(),
    learn_rate = tune()
) %>%
    set_engine("xgboost", objective = "reg:squarederror")

xgboost_model

## Boosted Tree Model Specification (regression)
##
## Main Arguments:
##   trees = 1000
##   min_n = tune()
##   tree_depth = tune()
##   learn_rate = tune()
##
## Engine-Specific Arguments:
##   objective = reg:squarederror
##
## Computational engine: xgboost

5.5 Grid Specification – dials

Next, we need to set up the grid that we plan to use for Grid Search. Grid Search is the process of specifying a variety of parameter values to be used with your model. The goal is to find which combination of parameters yields the best accuracy (lowest prediction error) for each model.
We use the dials package to set up the hyperparameters. Key functions:

• parameters() – Used to specify ranges for the tuning parameters
• grid_*() – Grid functions including max entropy, hypercube, etc.

5.5.1 glmnet – Grid Spec

For the glmnet model, we specify a parameter set, parameters(), that includes penalty() and mixture().

glmnet_params <- parameters(penalty(), mixture())
glmnet_params

## Collection of 2 parameters for tuning
##
##       id parameter type object class
##  penalty        penalty    nparam[+]
##  mixture        mixture    nparam[+]

Next, we use the grid_max_entropy() function to make a grid of 20 values using the parameters. I use set.seed() to make this random process reproducible.

set.seed(123)
glmnet_grid <- grid_max_entropy(glmnet_params, size = 20)
glmnet_grid

## # A tibble: 20 x 2
##     penalty mixture
##  1 2.94e- 1  0.702
##  2 1.48e- 4  0.996
##  3 1.60e- 1  0.444
##  4 5.86e- 1  0.975
##  5 1.69e- 9  0.0491
##  6 1.10e- 5  0.699
##  7 2.76e- 2  0.988
##  8 4.95e- 8  0.753
##  9 1.07e- 5  0.382
## 10 7.87e- 8  0.331
## 11 4.07e- 1  0.180
## 12 1.70e- 3  0.590
## 13 2.52e-10  0.382
## 14 2.47e-10  0.666
## 15 2.31e- 9  0.921
## 16 1.31e- 7  0.546
## 17 1.49e- 6  0.973
## 18 1.28e- 3  0.0224
## 19 7.49e- 7  0.0747
## 20 2.37e- 3  0.351

Because this is a 2-dimensional hyperparameter space (only 2 tuning parameters), we can visualize what the grid_max_entropy() function did. The grid selections were evenly spaced out to create uniformly distributed hyperparameter selections. Note that the penalty parameter is on the log base-10 scale by default (refer to the dials::penalty() function documentation). This functionality results in smarter choices for critical parameters, a big benefit of using the tidymodels framework.
glmnet_grid %>%
    ggplot(aes(penalty, mixture)) +
    geom_point(color = palette_light()["blue"], size = 3) +
    scale_x_log10() +
    theme_tq() +
    labs(title = "Max Entropy Grid", x = "Penalty (log scale)", y = "Mixture")

5.5.2 xgboost – Grid Spec

We can follow the same process for the xgboost model, specifying a parameter set using parameters(). The tuning parameters we select for our grid are made easy by the min_n(), tree_depth(), and learn_rate() functions from dials.

xgboost_params <- parameters(min_n(), tree_depth(), learn_rate())
xgboost_params

## Collection of 3 parameters for tuning
##
##          id parameter type object class
##       min_n          min_n    nparam[+]
##  tree_depth     tree_depth    nparam[+]
##  learn_rate     learn_rate    nparam[+]

Next, we set up the grid space. Because this is now a 3-dimensional hyperparameter space, I up the size of the grid to 30 points. Note that this will drastically increase the time it takes to tune the models, because the xgboost algorithm must be run 30 x 5 = 150 times. Each run uses 1000 trees, so we are talking 150,000 tree calculations. My point is that it will take a bit to run the algorithm once we get to Section 5.6 Hyperparameter Tuning.

set.seed(123)
xgboost_grid <- grid_max_entropy(xgboost_params, size = 30)
xgboost_grid

## # A tibble: 30 x 3
##    min_n tree_depth    learn_rate
##  1    39          1 0.00000106
##  2    37          9 0.000749
##  3     2          9 0.0662
##  4    27          9 0.00000585
##  5     5         14 0.00000000259
##  6    17          2 0.000413
##  7    34         11 0.0000000522
##  8    10          4 0.0856
##  9    28          3 0.000000156
## 10    24          8 0.00159
## # … with 20 more rows

5.6 Hyperparameter Tuning – tune

Now that we have specified the recipe, models, cross validation spec, and grid spec, we can use tune to bring them all together to implement the Hyperparameter Tuning with 5-Fold Cross Validation.

5.6.1 glmnet – Hyperparameter Tuning

Tuning the model using 5-fold Cross Validation is straightforward with the tune_grid() function. We specify the formula, model, resamples, grid, and metrics.
The only piece that I haven’t explained is the metrics. These come from the yardstick package, which has functions including mae(), mape(), rmse() and rsq() for calculating regression accuracy. We can specify any number of these using metric_set(). Just make sure to use only regression metrics, since this is a regression problem. For classification, you can use all of the normal measures like AUC, Precision, Recall, F1, etc.

glmnet_stage_1_cv_results_tbl <- tune_grid(
    formula   = msrp ~ .,
    model     = glmnet_model,
    resamples = car_cv_folds,
    grid      = glmnet_grid,
    metrics   = metric_set(mae, mape, rmse, rsq),
    control   = control_grid(verbose = TRUE)
)

Use the show_best() function to quickly identify the best hyperparameter values.

glmnet_stage_1_cv_results_tbl %>% show_best("mae", n = 10, maximize = FALSE)

## # A tibble: 10 x 7
##     penalty mixture .metric .estimator   mean     n std_err
##  1 1.28e- 3  0.0224 mae     standard   16801.     5    117.
##  2 1.69e- 9  0.0491 mae     standard   16883.     5    114.
##  3 7.49e- 7  0.0747 mae     standard   16891.     5    112.
##  4 4.07e- 1  0.180  mae     standard   16898.     5    111.
##  5 7.87e- 8  0.331  mae     standard   16910.     5    110.
##  6 2.37e- 3  0.351  mae     standard   16910.     5    110.
##  7 1.07e- 5  0.382  mae     standard   16911.     5    110.
##  8 2.52e-10  0.382  mae     standard   16911.     5    110.
##  9 1.60e- 1  0.444  mae     standard   16912.     5    111.
## 10 1.31e- 7  0.546  mae     standard   16914.     5    113.

Key Point: A key observation is that the Mean Absolute Error (MAE) is $16,801, meaning the glmnet model is performing poorly. This is partly because we left the numeric features untransformed. Try updating the recipe with step_BoxCox() and see if you can do better. Note that your MSRP will be transformed, so you need to invert the MAE back to the correct scale by finding the power used for the Box-Cox transformation. But I digress.
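The Box-Cox suggestion above can be sketched as a recipe tweak. This is a hedged sketch, not part of the tutorial’s actual analysis: it assumes recipes’ step_BoxCox() applied to the outcome, and that the estimated lambda is recovered via tidy() so predictions can be inverted back to dollars.

```r
# Hypothetical tweak (not run in this tutorial): Box-Cox transform the outcome.
recipe_bc <- recipe(msrp ~ ., data = training(car_initial_split)) %>%
    step_BoxCox(msrp) %>%   # lambda is estimated from the training data
    prep()

# Recover the estimated lambda for the msrp transformation (first step)
lambda <- tidy(recipe_bc, number = 1)$value

# Inverse Box-Cox: y = (lambda * y' + 1)^(1/lambda), or exp(y') as lambda -> 0
invert_boxcox <- function(y_trans, lambda) {
    if (abs(lambda) < 1e-6) exp(y_trans) else (lambda * y_trans + 1)^(1 / lambda)
}
```

Predictions (and hence the MAE) computed on the transformed scale would be passed through invert_boxcox() before comparing against the untransformed results above.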

5.6.2 xgboost – Hyperparameter Tuning

Follow the same tuning process for the xgboost model using tune_grid().

Warning: This takes approximately 20 minutes to run with a 6-core parallel backend. I have recommendations for speeding this up at the end of the article.

xgboost_stage_1_cv_results_tbl <- tune_grid(
    formula   = msrp ~ .,
    model     = xgboost_model,
    resamples = car_cv_folds,
    grid      = xgboost_grid,
    metrics   = metric_set(mae, mape, rmse, rsq),
    control   = control_grid(verbose = TRUE)
)

We can see the best xgboost tuning parameters with show_best().

xgboost_stage_1_cv_results_tbl %>% show_best("mae", n = 10, maximize = FALSE)

## # A tibble: 10 x 8
##    min_n tree_depth learn_rate .metric .estimator   mean     n std_err
##  1     2          9   0.0662   mae     standard    3748.     5   151.
##  2     9         15   0.0117   mae     standard    4048.     5    44.1
##  3    21         12   0.0120   mae     standard    4558.     5    91.3
##  4    10          4   0.0856   mae     standard    4941.     5   113.
##  5    24          3   0.00745  mae     standard    7930.     5   178.
##  6    37          2   0.00547  mae     standard    9169.     5   210.
##  7    24          8   0.00159  mae     standard    9258.     5   347.
##  8    37          9   0.000749 mae     standard   19514.     5   430.
##  9     3          1   0.000476 mae     standard   27905.     5   377.
## 10    17          2   0.000413 mae     standard   28124.     5   385.

Key Point:

A key observation is that the Mean Absolute Error (MAE) is $3,748, meaning the xgboost model is performing about 5X better than the glmnet model. However, we won’t know for sure until we move to Stage 2, Evaluation.

6.0 Stage 2 – Compare and Select Best Model

Feels like a whirlwind to get to this point. You’re doing great. Just a little bit more to go. Now, let’s compare our models.

The “proper” way to perform model selection is not to use the cross validation results, because in theory we’ve optimized the results to the training data. This is why we left out the testing set by doing the initial split at the beginning.

Now, do I agree with this? Let me just say that normally the Cross Validation winner is the true winner. But for the sake of showing you the correct way, I will continue with the model comparison.

6.1 Select the Best Parameters

Use the select_best() function from the tune package to get the best parameters. We’ll do this for both the glmnet and xgboost models.

params_glmnet_best <- glmnet_stage_1_cv_results_tbl %>%
    select_best("mae", maximize = FALSE)
params_glmnet_best

## # A tibble: 1 x 2
##   penalty mixture
## 1 0.00128  0.0224

params_xgboost_best <- xgboost_stage_1_cv_results_tbl %>%
    select_best("mae", maximize = FALSE)
params_xgboost_best

## # A tibble: 1 x 3
##   min_n tree_depth learn_rate
## 1     2          9     0.0662

6.2 Finalize the Models

Next, we can use the finalize_model() function to apply the best parameters to each of the models. Do this for both the glmnet and xgboost models.
glmnet_stage_2_model <- glmnet_model %>%
    finalize_model(parameters = params_glmnet_best)
glmnet_stage_2_model

## Linear Regression Model Specification (regression)
##
## Main Arguments:
##   penalty = 0.00127732582071729
##   mixture = 0.0224261444527656
##
## Computational engine: glmnet

xgboost_stage_2_model <- xgboost_model %>%
    finalize_model(params_xgboost_best)
xgboost_stage_2_model

## Boosted Tree Model Specification (regression)
##
## Main Arguments:
##   trees = 1000
##   min_n = 2
##   tree_depth = 9
##   learn_rate = 0.0661551553016808
##
## Engine-Specific Arguments:
##   objective = reg:squarederror
##
## Computational engine: xgboost

6.3 Calculate Performance on Test Data

We can create a helper function, calc_test_metrics(), to calculate the performance on the test set.

calc_test_metrics <- function(formula, model_spec, recipe, split) {

    train_processed <- training(split) %>% bake(recipe, new_data = .)
    test_processed  <- testing(split)  %>% bake(recipe, new_data = .)

    target_expr <- recipe %>%
        pluck("last_term_info") %>%
        filter(role == "outcome") %>%
        pull(variable) %>%
        sym()

    model_spec %>%
        fit(formula = as.formula(formula), data = train_processed) %>%
        predict(new_data = test_processed) %>%
        bind_cols(testing(split)) %>%
        metrics(!! target_expr, .pred)
}

6.3.1 glmnet – Test Performance

Use the calc_test_metrics() function to calculate the test performance of the glmnet model.

glmnet_stage_2_metrics <- calc_test_metrics(
    formula    = msrp ~ .,
    model_spec = glmnet_stage_2_model,
    recipe     = preprocessing_recipe,
    split      = car_initial_split
)
glmnet_stage_2_metrics

## # A tibble: 3 x 3
##   .metric .estimator .estimate
## 1 rmse    standard   36200.
## 2 rsq     standard       0.592
## 3 mae     standard   16593.

6.3.2 xgboost – Test Performance

Use the calc_test_metrics() function to calculate the test performance of the xgboost model.
xgboost_stage_2_metrics <- calc_test_metrics(
    formula    = msrp ~ .,
    model_spec = xgboost_stage_2_model,
    recipe     = preprocessing_recipe,
    split      = car_initial_split
)
xgboost_stage_2_metrics

## # A tibble: 3 x 3
##   .metric .estimator .estimate
## 1 rmse    standard   11577.
## 2 rsq     standard       0.961
## 3 mae     standard    3209.

6.4 Model Winner!

Winner: The xgboost model had an MAE of $3,209 on the test data set. We select this model, with the following parameters, to move forward.

xgboost_stage_2_model

## Boosted Tree Model Specification (regression)
##
## Main Arguments:
##   trees = 1000
##   min_n = 2
##   tree_depth = 9
##   learn_rate = 0.0661551553016808
##
## Engine-Specific Arguments:
##   objective = reg:squarederror
##
## Computational engine: xgboost

7.0 Stage 3 – Final Model (Full Dataset)

Now that we have a winner from Stage 2, we move forward with the xgboost model and train on the full data set. This gives us the best possible model to move into production with.

Use the fit() function from parsnip to train the final xgboost model on the full data set. Make sure to apply the preprocessing recipe to the original car_prices_tbl.

model_final <- xgboost_stage_2_model %>%
    fit(msrp ~ ., data = bake(preprocessing_recipe, new_data = car_prices_tbl))

8.0 Making Predictions on New Data

We’ve gone through the full analysis and have a “production-ready” model. Now for some fun – let’s use it on some new data. To avoid repetitive code, I created the helper function, predict_msrp(), to quickly apply my final model and preprocessing recipe to new data.

predict_msrp <- function(new_data, model = model_final, recipe = preprocessing_recipe) {
    new_data %>%
        bake(recipe, new_data = .) %>%
        predict(model, new_data = .)
}

8.1 What Does a Dodge Ram Cost?

I’ll simulate a 2008 Dodge Ram 1500 Pickup and calculate the predicted price.

dodge_ram_tbl <- tibble(
    make = "Dodge",
    model = "Ram 1500 Pickup",
    year = 2008,
    engine_fuel_type = "regular unleaded",
    engine_hp = 300,
    engine_cylinders = 8,
    transmission_type = "AUTOMATIC",
    driven_wheels = "four wheel drive",
    number_of_doors = "2",
    market_category = "N/A",
    vehicle_size = "Compact",
    vehicle_style = "Extended Cab Pickup",
    highway_mpg = 19,
    city_mpg = 13,
    popularity = 1851
)

dodge_ram_tbl %>% predict_msrp()
## # A tibble: 1 x 1
##    .pred
##    <dbl>
## 1 30813.

8.2 What’s the “Luxury” Effect?

Let’s play with the model a bit. Next, let’s change the market_category from “N/A” to “Luxury”. The price jumps from about $30K to $50K.

luxury_dodge_ram_tbl <- tibble(
    make = "Dodge",
    model = "Ram 1500 Pickup",
    year = 2008,
    engine_fuel_type = "regular unleaded",
    engine_hp = 300,
    engine_cylinders = 8,
    transmission_type = "AUTOMATIC",
    driven_wheels = "four wheel drive",
    number_of_doors = "2",
    # Change from N/A to Luxury
    market_category = "Luxury",
    vehicle_size = "Compact",
    vehicle_style = "Extended Cab Pickup",
    highway_mpg = 19,
    city_mpg = 13,
    popularity = 1851
)

luxury_dodge_ram_tbl %>% predict_msrp()
## # A tibble: 1 x 1
##    .pred
##    <dbl>
## 1 49986.

8.3 How Do Prices Vary By Model Year?

Let’s see how our XGBoost model views the effect of changing the model year. First, we’ll create a data frame of Dodge Ram 1500 Pickups where the only feature that changes is the model year.

dodge_rams_tbl <- tibble(year = 2017:1990) %>%
    bind_rows(dodge_ram_tbl) %>%
    fill(names(.), .direction = "up") %>%
    distinct()

dodge_rams_tbl
## # A tibble: 28 x 15
##     year make  model engine_fuel_type engine_hp engine_cylinders
##    <dbl> <chr> <chr> <chr>                <dbl>            <dbl>
##  1  2017 Dodge Ram … regular unleaded       300                8
##  2  2016 Dodge Ram … regular unleaded       300                8
##  3  2015 Dodge Ram … regular unleaded       300                8
##  4  2014 Dodge Ram … regular unleaded       300                8
##  5  2013 Dodge Ram … regular unleaded       300                8
##  6  2012 Dodge Ram … regular unleaded       300                8
##  7  2011 Dodge Ram … regular unleaded       300                8
##  8  2010 Dodge Ram … regular unleaded       300                8
##  9  2009 Dodge Ram … regular unleaded       300                8
## 10  2008 Dodge Ram … regular unleaded       300                8
## # … with 18 more rows, and 9 more variables: transmission_type <chr>,
## #   driven_wheels <chr>, number_of_doors <chr>, market_category <chr>,
## #   vehicle_size <chr>, vehicle_style <chr>, highway_mpg <dbl>,
## #   city_mpg <dbl>, popularity <dbl>

Next, we can use ggplot2 to visualize this “Year” effect, which is essentially a depreciation curve. We can see a steep drop in predicted price for models built in the 1990s and earlier.

predict_msrp(dodge_rams_tbl) %>%
    bind_cols(dodge_rams_tbl) %>%
    ggplot(aes(year, .pred)) +
    geom_line(color = palette_light()["blue"]) +
    geom_point(color = palette_light()["blue"]) +
    expand_limits(y = 0) +
    theme_tq() +
    labs(title = "Depreciation Curve - Dodge Ram 1500 Pickup")

8.4 Which Features Are Most Important?

Finally, we can check which features have the most impact on the MSRP using the vip() function from the vip (variable importance) package.

vip(model_final, aesthetics = list(fill = palette_light()["blue"])) +
    labs(title = "XGBoost Model Importance - MSRP Prediction") +
    theme_tq()

9.0 Conclusion and Next Steps

The tune package provides a much-needed tool for hyperparameter tuning in the tidymodels ecosystem. We saw how useful it was for performing 5-fold cross validation, a standard technique for improving machine learning performance.

An advanced machine learning package that I’m a huge fan of is h2o. H2O provides automatic machine learning (AutoML), which takes a similar approach (minus the Stage 2 comparison step) by automating the cross validation and hyperparameter tuning process. Because H2O is written in Java, it is much faster and more scalable, which is great for large-scale machine learning projects. If you are interested in learning h2o, I recommend my 4-Course R-Track for Business Program – The 201 Course teaches Advanced ML with H2O.

The last step is to productionalize the model inside a Shiny Web Application. I teach two courses on production with Shiny:

• Shiny Dashboards – Build 2 Predictive Web Dashboards
• Shiny Developer with AWS – Build Full-Stack Web Applications in the Cloud

The two Shiny courses are part of the 4-Course R-Track for Business Program.


### Explainable AI (XAI)… Explained! Or: How to whiten any Black Box with LIME

Tue, 01/21/2020 - 09:00

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We already covered the so-called Accuracy-Interpretability Trade-Off which states that oftentimes the more accurate the results of an AI are the harder it is to interpret how it arrived at its conclusions (see also: Learning Data Science: Predicting Income Brackets).

This is especially true for Neural Networks: while often delivering outstanding results, they are basically black boxes and notoriously hard to interpret (see also: Understanding the Magic of Neural Networks).

There is a hot new area of research aiming to make black-box models interpretable, called Explainable Artificial Intelligence (XAI). If you want to gain some intuition on one such approach (called LIME), read on!

Before we dive right into it, it is important to point out when and why you would need interpretability of an AI. While it might be a desirable goal in itself, it is not necessary in many fields, at least not for the users of an AI: with text translation, character recognition and speech recognition, for example, it is not that important why they do what they do, but simply that they work.

In other areas, like medical applications (determining whether tissue is malignant), financial applications (granting a loan to a customer) or applications in the criminal-justice system (gauging the risk of recidivism) it is of the utmost importance (and sometimes even required by law) to know why the machine arrived at its conclusions.

One approach to make AI models explainable is called LIME for Local Interpretable Model-Agnostic Explanations. There is already a lot in this name!

1. First, you are able to make any (!) model interpretable (so not only neural networks).
2. Second, the explanation always refers to a specific case: the method tries to explain, e.g., why this specific customer wasn’t approved for a loan; it does not give general explanations of what is important to get a loan.
3. Third, and now we are getting at the heart of the method already, it approximates the complex, unintelligible model with a linear model.

This idea is, in my opinion, quite sexy because it has its equivalent in calculus: if you zoom in deep enough, you can build most (even very complicated) functions out of linear building blocks. This is basically what LIME does!

To gain some intuition we will use a very similar method and compare the results with those of LIME. We will even use the same illustrative picture from the original paper (“Why Should I Trust You?”: Explaining the Predictions of Any Classifier) and create a toy example out of it:

The paper explains (p. 4):

Toy example to present intuition for LIME. The black-box model’s complex decision function f (unknown to LIME) is represented by the blue/pink background, which cannot be approximated well by a linear model. The bold red cross is the instance being explained. LIME samples instances, gets predictions using f, and weighs them by the proximity to the instance being explained (represented here by size). The dashed line is the learned explanation that is locally (but not globally) faithful.

We are now taking this picture as our actual black-box model and approximate the decision boundary linearly. We do this by using logistic regression (see also: Learning Data Science: The Supermarket knows you are pregnant before your Dad does). LIME itself uses LASSO regression (see also: Cambridge Analytica: Microtargeting or How to catch voters with the LASSO).

Another difference is that we don’t weight instances by proximity but instead randomly create more data points close to the respective case (by sampling from a multivariate normal distribution); the idea, however, is the same.

Now we are ready to get started (you can find the prepared image here lime2.jpg, the packages jpeg and lime are on CRAN):

library(jpeg)
library(lime)

img <- readJPEG("pics/lime2.jpg") # adjust path
border_col <- mean(img[ , , 1])
model <- ifelse(img[ , , 1] < border_col, 0, 1)

image(model, axes = FALSE, xlab = "x1", ylab = "x2",
      col = c("#B9CDE5", "#F1DCDB"))
axis(1, at = seq(0, 1, .1), labels = seq(0, nrow(model), round(nrow(model)/10)))
axis(2, at = seq(0, 1, .1), labels = seq(0, ncol(model), round(ncol(model)/10)))

# some S3 magic
class(model) <- "black_box"
predict.black_box <- function(model, newdata, type = "prob") {
  newdata <- as.matrix(newdata)
  apply(newdata, 1, function(x) model[x[1], x[2]])
}

# the case to be analyzed
x1 <- 140; x2 <- 145
points(x1/nrow(model), x2/ncol(model), pch = 3, lwd = 6, col = "red")
predict(model, cbind(x1, x2))
## [1] 1

# approximate locally by logistic regression
set.seed(123)
x1_prox <- round(rnorm(100, x1, 18))
x2_prox <- round(rnorm(100, x2, 18))
data <- cbind(x1_prox, x2_prox)
df <- cbind.data.frame(y = predict(model, data), data)
logreg <- glm(y ~ ., data = df, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logreg)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = df)
##
## Deviance Residuals:
##        Min         1Q     Median         3Q        Max
## -6.423e-04  2.000e-08  2.000e-08  2.000e-08  5.100e-04
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  12378.05  735651.83   0.017    0.987
## x1_prox        -94.11    5606.33  -0.017    0.987
## x2_prox         15.67     952.47   0.016    0.987
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1.0279e+02  on 99  degrees of freedom
## Residual deviance: 8.5369e-07  on 97  degrees of freedom
## AIC: 6
##
## Number of Fisher Scoring iterations: 25

slope <- -coef(logreg)[2] / coef(logreg)[3]
intercept <- -coef(logreg)[1] / coef(logreg)[3]
segments(0, intercept/ncol(model), 1,
         (intercept + nrow(model) * slope)/ncol(model),
         lty = 2, lwd = 6, col = "grey40")

Here we can clearly see the locally approximated linear decision boundary. Now for the interpretation of the coefficients of the linear model:
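The slope and intercept computed at the end of the code block above follow directly from the fitted logistic regression: the decision boundary is the set of points where the predicted probability equals 0.5, i.e. where the log-odds are zero. Writing the fitted coefficients as $b_0, b_1, b_2$:

```latex
b_0 + b_1 x_1 + b_2 x_2 = 0
\quad\Longrightarrow\quad
x_2 = -\frac{b_0}{b_2} - \frac{b_1}{b_2}\, x_1
```

which is exactly intercept <- -coef(logreg)[1] / coef(logreg)[3] and slope <- -coef(logreg)[2] / coef(logreg)[3] in the code.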

# interpretation
barplot(coef(logreg)[3:2], horiz = TRUE,
        col = ifelse(coef(logreg)[3:2] < 0, "firebrick", "steelblue"),
        border = NA, xlab = "Weight", ylab = "Feature",
        main = "Coefficients of linear model")
legend("bottom", horiz = TRUE, c("Supports", "Contradicts"),
       fill = c("steelblue", "firebrick"))

The bar chart can be interpreted like so: attribute x1 has a strong negative influence on the resulting class (i.e. when you increase it the class will quickly change), while attribute x2 has a comparatively mild positive influence (i.e. when you increase it the class won’t change but the model will get even more confident, but only mildly so). This interpretation can also readily be understood when looking at the decision boundary above.

We are now going to compare this with the original LIME method:

# compare with original lime
data <- expand.grid(1:nrow(model), 1:ncol(model))
colnames(data) <- c("x1", "x2")
train <- data.frame(data, y = predict(model, data))
explainer <- lime(train, model)
## Warning: y does not contain enough variance to use quantile binning. Using
## standard binning instead.
model_type.black_box <- function(x, ...) 'classification'
explanation <- explain(data.frame(x1 = 140, x2 = 145), explainer,
                       n_labels = 1, n_features = 2)
explanation
## # A tibble: 2 x 13
##   model_type case  label label_prob model_r2 model_intercept
## 1 classific~ 1     p              1   0.0515           0.511
## 2 classific~ 1     p              1   0.0515           0.511
## # ... with 7 more variables: model_prediction, feature,
## #   feature_value, feature_weight, feature_desc,
## #   data, prediction
plot_features(explanation, ncol = 1)

While the results are not a full match (because of the slightly different approach taken here), the directions and the relative magnitudes are very similar.

All in all, I think LIME is a very powerful and intuitive method to whiten any black-box model and XAI will be one of the most important and relevant research areas in machine learning and artificial intelligence in the future!


### Tips and tricks in RStudio and R Markdown

Tue, 01/21/2020 - 01:00

[This article was first published on R on Stats and R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

If you have the chance to work with an experienced programmer, you may be amazed by how fast she can write code. In this article, I share some tips and shortcuts you can use in RStudio and R Markdown to speed up the writing of your code.

Run code

You most probably already know this shortcut but I still mention it for new R users. From your script you can run a chunk of code with:

command + Enter on Mac
Ctrl + Enter on Windows

Insert a comment in R and R Markdown

To insert a comment:

command + Shift + C on Mac Ctrl + Shift + C on Windows

This shortcut can be used both for:

• R code when you want to comment your code. It will add a # at the beginning of the line
• for text in R Markdown. It will add <!-- and --> around the text

Note that if you want to comment more than one line, select all the lines you want to comment then use the shortcut. If you want to uncomment a comment, apply the same shortcut.

Knit a R Markdown document

You can knit R Markdown documents by using this shortcut:

command + Shift + K on Mac
Ctrl + Shift + K on Windows

Code snippets

A code snippet is usually a few characters long and is used as a shortcut to insert a common piece of code. You simply type a few characters, then press Tab, and the snippet will expand into a larger piece of code. Tab is then used again to navigate through the parts of the code that require customization. For instance, if you type fun and then press Tab, it will auto-complete with the code required to create a function:

name <- function(variables) { }

Pressing Tab again will jump through the placeholders for you to edit them. So you can first edit the name of the function, then the variables, and finally the code inside the function (try it yourself!).

There are many code snippets by default in RStudio. Here are the code snippets I use most often:

• lib to call library()
library(package)
• mat to create a matrix
matrix(data, nrow = rows, ncol = cols)
• if, el, and ei to create conditional expressions such as if() {}, else {} and else if () {}
if (condition) { }
else { }
else if (condition) { }
• fun to create a function
name <- function(variables) { }
• for to create for loops
for (variable in vector) { }
• ts to insert a comment with the current date and time (useful if you have very long code and share it with others so they see when it has been edited)
# Tue Jan 21 20:20:14 2020 ------------------------------
• shiny to create the skeleton of a Shiny app
library(shiny)

ui <- fluidPage(
)

server <- function(input, output, session) {
}

shinyApp(ui, server)

You can see all default code snippets and add yours by clicking on Tools > Global Options… > Code (left sidebar) > Edit Snippets…
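For reference, a user-defined snippet in the Edit Snippets… dialog looks like the sketch below (the snippet name and placeholders are my own example, not a default RStudio snippet; note that RStudio requires the snippet body to be indented with tabs, and ${n:label} marks the Tab-navigable placeholders):

```
snippet ggp
	ggplot(${1:data}, aes(${2:mapping})) +
		geom_${3:point}()
```

After saving, typing ggp followed by Tab would expand to the ggplot skeleton, with Tab jumping between the data, mapping, and geom placeholders.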

Ordered list in R Markdown

In R Markdown, when creating an ordered list such as this one:

1. Item 1
2. Item 2
3. Item 3

Instead of bothering with the numbers and typing

1. Item 1 2. Item 2 3. Item 3

you can simply type

1. Item 1 1. Item 2 1. Item 3

for the exact same result (try it yourself or check the code of this article!). This way you do not need to bother which number is next when creating a new item.

To go even further, any number will actually render the same result, as long as the first item carries the number you want to start from. For example, you could type:

1. Item 1 7. Item 2 3. Item 3

which renders

1. Item 1
2. Item 2
3. Item 3

However, I suggest always using the number you want the list to start from for all items, because if you move an item to the top, the list will start with that item’s number. For instance, if we move 7. Item 2 from the previous list to the top, the list becomes:

7. Item 2 1. Item 1 3. Item 3

which incorrectly renders

1. Item 2
2. Item 1
3. Item 3
New code chunk in R Markdown

When editing R Markdown documents, you will need to insert a new R code chunk many times. The following shortcuts will make your life easier:

command + option + I on Mac (or command + alt + I depending on your keyboard)
Ctrl + Alt + I on Windows

New R code chunk in R Markdown

Reformat code

A clear and readable code is always easier and faster to read (and look more professional when sharing it to collaborators). To automatically apply the most common coding guidelines such as whitespaces, indents, etc., use:

command + Shift + A on Mac
Ctrl + Shift + A on Windows

So for example the following code which does not respect the guidelines (and which is not easy to read):

1+1 for(i in 1:10){if(!i%%2){next} print(i) }

becomes much more neat and readable:

1 + 1

for (i in 1:10) {
  if (!i %% 2) {
    next
  }
  print(i)
}

Others

Similar to many other programs, you can also use:

• command + Shift + N on Mac and Ctrl + Shift + N on Windows to open a new R Script
• command + S on Mac and Ctrl + S on Windows to save your current script or R Markdown document

Thanks for reading. I hope you find these tips and tricks useful. If you are using others, feel free to share them in the comment section. See more articles on R. As always, if you find a mistake/bug or if you have any questions do not hesitate to let me know in the comment section below, raise an issue on GitHub or contact me. Get updates every time a new article is published by subscribing to this blog.


### Automatic DAG learning – part 2

Tue, 01/21/2020 - 01:00

[This article was first published on R on Just be-cause, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Intro

We’ve seen on a previous post that one of the main differences between classic ML and Causal Inference is the additional step of using the correct adjustment set for the predictor features.

In order to find the correct adjustment set we need a DAG that represents the relationships between all features relevant to our problem.

One way of obtaining the DAG would be consulting domain experts. That, however, makes the process less accessible to wide audiences and more manual in nature. Learning the DAG automatically from data can thus make Causal Inference more streamlined and usable.

In my last blog post we hit a dead end when it seemed none of the algorithms surveyed was able to learn the DAG accurately enough.

Enter: orientDAG package

While the algorithms surveyed in my last post were able to converge on the correct DAG skeleton (finding edge presence without its orientation), they all failed to orient the edges correctly. This means that for a given pair of causally linked variables X, Y, the algorithms were able to tell that they are linked, but unable to determine whether X causes Y or vice versa.

We can utilize the literature centered around finding causal direction to correctly orient the DAG’s edges after learning its skeleton.

When both X and Y are continuous, we can use the generalized correlation measure developed by Vinod.

When both X and Y are discrete, we can use the distance correlation measure (see Liu and Chan, 2016).

When either X or Y is discrete and the other is continuous, we can discretize the continuous variable (using, for example, infotheo::discretize) and use the method for two discrete variables mentioned above.
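The three cases above can be summarized in a small dispatch sketch (illustrative only, not the actual orientDAG internals):

```r
# Pick a causal-direction measure based on the types of x and y,
# following the three cases described above (illustrative sketch)
direction_measure <- function(x, y) {
  if (is.numeric(x) && is.numeric(y)) {
    "generalized correlation (Vinod)"           # both continuous
  } else if (!is.numeric(x) && !is.numeric(y)) {
    "distance correlation (Liu and Chan, 2016)" # both discrete
  } else {
    # mixed case: discretize the continuous variable first,
    # e.g. with infotheo::discretize(), then proceed as for
    # two discrete variables
    "discretize, then distance correlation"
  }
}

direction_measure(rnorm(10), rnorm(10))
## [1] "generalized correlation (Vinod)"
```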

The orientDAG package uses the approach outlined above to orient DAG skeleton edges. To demonstrate how it works let’s go back to the example shown on my last post.

We used the simMixedDAG package to simulate datasets from the DAG below:

Following the bench-marking results on my last post we’ll restrict our attention to the bnlearn::tabu function for learning the DAG skeleton.

The main function in orientDAG is fittingly called orient_dag. It takes as input an adjacency matrix where a 1 in the i,j position denotes an arrow from node i to node j.

Below we can see the adjacency matrix corresponding to the DAG above:

##            age ageGroup educ educGroup gender nativeBorn vocab year
## age          0        0    0         1      0          1     0    0
## ageGroup     0        0    0         0      0          0     0    0
## educ         1        0    0         0      1          0     0    1
## educGroup    0        0    0         0      0          0     0    0
## gender       0        0    0         0      0          0     0    0
## nativeBorn   0        1    0         0      0          0     1    0
## vocab        0        0    0         0      1          0     0    1
## year         0        0    0         0      0          0     0    0
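To make the encoding concrete, here is a small base-R sketch with generic node names (the helper function is my own, not part of orientDAG) that reads the arrows back out of such an adjacency matrix:

```r
# A 1 in position [i, j] encodes an arrow from node i to node j
adj <- matrix(0, nrow = 3, ncol = 3,
              dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
adj["a", "b"] <- 1  # a -> b
adj["b", "c"] <- 1  # b -> c

# Hypothetical helper: list the arrows as a from/to edge list
adjmatrix_to_edges <- function(m) {
  idx <- which(m == 1, arr.ind = TRUE)
  data.frame(from = rownames(m)[idx[, "row"]],
             to   = colnames(m)[idx[, "col"]],
             row.names = NULL)
}

adjmatrix_to_edges(adj)
##   from to
## 1    a  b
## 2    b  c
```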

The package also contains utility functions that facilitate DAG conversion between different representations. We can thus take the fitted bn DAG object from the tabu function and convert it to an adjacency matrix using the function bn_to_adjmatrix. The DAG below shows all conversions available in the package:

As before, we’ll measure DAG learning accuracy by the Structural Intervention Distance. For every sample size $$n$$ I simulate 100 datasets and measure SID. Below I compare the SID distribution for 3 algorithms:

1. tabu: DAGs obtained using the tabu function from the bnlearn package
2. tabu + orientDAG: DAGs obtained by further orienting the DAG’s from tabu using the orientDAG package
3. random: DAGs obtained by randomly re-orienting the DAGs obtained by the tabu function

We can see that the tabu performance deteriorates as sample size grows while tabu + orientDAG improves.

From SID we can derive a more intuitive measure: The probability of finding the correct adjustment set for a randomly chosen pair of treatment-exposure variables:

$P(\text{correct adj set}) = 1 - \frac{SID}{\#\{\text{possible treatment-exposure pairs}\}}$
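As a minimal sketch of this conversion (the helper name is mine, not part of orientDAG), assuming a DAG with k nodes has k(k-1) ordered treatment-exposure pairs:

```r
# Hypothetical helper: convert a Structural Intervention Distance (SID)
# into the probability of finding the correct adjustment set for a
# randomly chosen ordered treatment-exposure pair
p_correct_adj_set <- function(sid, n_nodes) {
  n_pairs <- n_nodes * (n_nodes - 1)  # ordered pairs of distinct nodes
  1 - sid / n_pairs
}

# The DAG above has 8 nodes, so 8 * 7 = 56 possible pairs:
p_correct_adj_set(sid = 14, n_nodes = 8)
## [1] 0.75
```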

Below we can see the corresponding plot:

For $$n = 80,000$$ the orientDAG package about doubles the performance compared with tabu.

Bayesian Network DGP

In the above example the underlying data generating process (DGP) didn’t conform to the Bayesian Network (BN) assumptions, which might explain the deteriorating performance of the tabu function.

Let’s see how the different algorithms fare when the underlying DGP does conform to the BN assumptions.

Below I plot the “mehra” DAG taken from the bnlearn package:

The above DAG contains 24 nodes, of which 8 are categorical. The nodes are connected by 71 arrows.

Below we can see SID for the different algorithms (this time running only 10 simulations per sample size due to the network size):

We can now see that when the BN assumptions hold, the tabu function fares best, as expected, while tabu + orientDAG fares as badly as a random guess, or even worse. It’s possible that the tabu + orientDAG performance is worse because it uses tests unrelated to the BN assumptions, thus introducing noise to edges which are already oriented correctly by tabu.

We can mitigate such performance deterioration by introducing regularization parameters which force orient_dag to re-orient an edge only in cases where there is strong evidence for a given orientation. Namely, the difference between distance correlation/generalized correlations must exceed some threshold. We’ll call this approach tabu + orientDAG – conservative.

Below we can see the resulting performance measures:

While the tabu + orientDAG – conservative approach still fares worse than tabu, it is at least able to marginally outperform a random guess.

Conclusion

It would seem that automatic learning of DAGs is highly dependent on knowing the underlying DGP assumptions and also requires very large sample sizes – thus making automatic learning of DAGs unreliable (at least for the examples surveyed in this post).

It may be that a hybrid solution, where an expert white-lists some edges (their presence and orientation are assumed known) and black-lists others (assumed known not to exist), followed by an automatic algorithm, can produce more reliable results.


### Calculating the Required Sample Size for a Binomial Test in R

Tue, 01/21/2020 - 01:00

[This article was first published on R on Sastibe's Data Science Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A Standard Problem: Determining Sample Size

Recently, I was tasked with a straightforward question: “In an A/B test setting, how many samples do I have to collect in order to obtain significant results?” As usual in statistics, the answer is not quite as straightforward as the question, and it depends quite a bit on the framework. In this case, the A/B test was supposed to test whether the effect of a treatment on the success rate p had the assumed size e. The success rate had to be estimated in both the test and the control group, i.e. p_test and p_control. In short, the test hypotheses were thus

H0: p_test = p_control vs.

H1: p_test = p_control + e

Now, for each statistical test, we aim at minimizing (or at least controlling for) the following types of error:

• Type I: Even though H0 is true, the test decides for H1
• Type II: Even though H1 is true, the test decides for H0.

Since I can never remember stuff like this, I immediately started looking for a simple mnemonic, and I found this one:

The null hypothesis is often represented as H0. Although mathematicians may disagree, where I live 0 is an even number, as evidenced by the fact that it is both preceded and followed by an odd number. Even numbers go together well. An even number and an odd number do not go together well. Hence the null hypothesis (even) is rejected by the type I error (odd), but accepted by the type II error (even).

The test statistic

In the given setup, the test now runs as follows: Calculate the contingency matrix of successes in both groups, and apply Fisher’s exact test. If the test is negative, i.e. does not reject the null hypothesis, we have to repeat the experiment. However, we are quite sure that the null hypothesis is wrong and would like to prove that with as little effort as possible.

The basic question in this situation is: “How many observations do I need to collect in order to avoid both errors of Type I and II to an appropriate degree of certainty?” The “appropriate degree of certainty” is parametrized by the probabilities of errors of Type I (significance level) and Type II (power). The default choices for these values are 0.05 for the significance level and 0.8 for the power: in 5% of cases, we reject a “true” H0, and in 20% of cases we reject a “true” H1. Quite clearly, only the power of the test (and not the significance level) depends on the difference between the parameters p_test and p_control.
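As a quick illustration of the test itself in base R (with made-up counts), Fisher’s exact test takes the 2x2 contingency table of successes and failures in both groups; see ?fisher.test for the exact convention behind the alternative argument:

```r
# Made-up A/B results: 500 observations per group
ab_table <- matrix(c(60, 440,    # control:   60 successes, 440 failures
                     85, 415),   # treatment: 85 successes, 415 failures
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("control", "treatment"),
                                   c("success", "failure")))

# One-sided test: reject H0 if the p-value falls below the
# chosen significance level
fisher.test(ab_table, alternative = "less")$p.value
```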

Existing functions in R

Are there already pre-defined functions to calculate minimal required sample sizes? A bit of digging around yields a match in the package Hmisc, where the authors implement a method developed by Fleiss, Tytun and Ury [1]. However, according to the documentation, the function is written only for the two-sided test case and does not include the continuity correction. I disagree with both decisions:

• the continuity correction term can grow quite large, and is always positive (see (5) in the cited paper). Thus, neglecting this term will always end in an underestimation of the necessary number of observations and may therefore lead to unsuccessful experiments.
• the two-sided case is not the norm, but rather the exception. When testing p_control vs. p_test, the alternative hypothesis will almost always read “p_test > p_control”, since the measures taken are assumed to have, if any, a positive effect.
A new R function: calculate_binomial_samplesize

After these considerations, I decided to write my own function. Below is the code. The function allows for switching the continuity correction off, and for differentiating between the one-sided and the two-sided case. In the two-sided case without continuity correction, it coincides with Hmisc::bsamsize, as can be seen from the example provided.

#' Calculate the Required Sample Size for Testing Binomial Differences
#'
#' @description
#' Based on the method of Fleiss, Tytun and Ury, this function tests the null
#' hypothesis p0 against p1 > p0 in a one-sided or two-sided test with
#' significance level alpha and power beta.
#'
#' @usage
#' calculate_binomial_samplesize(ratio0, p0, p1)
#'
#' @param ratio0 Numeric, proportion of sample of observations in group 0,
#' the control group
#' @param p1 Numeric, postulated binomial parameter in the treatment group.
#' @param p0 Numeric, postulated binomial parameter in the control group.
#' @param alpha Desired significance level for the test, defaults to 0.05
#' @param beta Desired power for the test, defaults to 0.8
#' @param one_sided Bool, whether the test is supposed to be one-sided or
#' two-sided.
#' @param continuity_correction Bool, whether the sample size should be
#' corrected for continuity. Defaults to TRUE.
#'
#' @return
#' A named numeric vector, containing the required sample size for the
#' treatment group, the control group, and the required total (the sum of
#' both numbers).
#'
#' @seealso [Hmisc::bsamsize()]
#'
#' @author Sebastian Schweer \email{sastibe_r@@sastibe.de}
#'
#' @references
#' Fleiss JL, Tytun A, Ury HK (1980): A simple approximation for calculating
#' sample sizes for comparing independent proportions. Biometrics 36:343-346.
#'
#' @examples
#' # Same result
#' alpha = 0.02; power = 0.9; fraction = 0.4; p_lower = 0.23; p_higher = 0.34
#'
#' Hmisc::bsamsize(p1 = p_lower, p2 = p_higher, fraction = fraction,
#'                 alpha = alpha, power = power)
#'
#' calculate_binomial_samplesize(ratio0 = fraction, p1 = p_higher, p0 = p_lower,
#'                               alpha = alpha, beta = power, one_sided = FALSE,
#'                               continuity_correction = FALSE)
#'
#' @export
calculate_binomial_samplesize <- function(ratio0, p1, p0,
                                          alpha = 0.05, beta = 0.8,
                                          one_sided = TRUE,
                                          continuity_correction = TRUE){

  if(!is.numeric(ratio0) | !is.numeric(p1) | !is.numeric(p0))
    stop("Input parameters ratio0, p0 and p1 need to be numeric.")

  if(!is.numeric(alpha) | !is.numeric(beta))
    stop("Input parameters alpha and beta need to be numeric.")

  if(max(c(alpha, beta, ratio0, p0, p1)) >= 1 |
     min(c(alpha, beta, ratio0, p0, p1)) <= 0)
    stop("Input parameters ratio0, p0, p1, alpha, beta need to be in the interval (0,1)")

  delta = p1 - p0     # Nomenclature as in the paper
  r = 1 / ratio0 - 1  # Uniting the definitions

  if(one_sided == FALSE) {  # Last statement of the paper
    alpha = alpha / 2
    delta = abs(p1 - p0)
  }

  p_bar = (p0 + r*p1)/(r+1)

  m_dash_1 = qnorm(1 - alpha, mean = 0, sd = 1)*sqrt((r+1)*p_bar*(1 - p_bar))
  m_dash_2 = qnorm(beta, mean = 0, sd = 1)*sqrt(p1*(1-p1) + r*p0*(1-p0))
  m_dash = ( m_dash_1 + m_dash_2 )^2 / (r * delta^2)

  if(continuity_correction == TRUE){
    m_dash = m_dash + (r + 1) / (r*delta)
  }

  return(c(size_0 = m_dash,
           size_1 = r*m_dash,
           size_overall = m_dash + r*m_dash))
}


### Graph Theory 101 with corruption cases in Spain

Tue, 01/21/2020 - 01:00

[This article was first published on R on Coding Club UC3M, and kindly contributed to R-bloggers].

by María Medina Pérez

According to the CIS barometer, political corruption is the second biggest concern in Spain, behind only unemployment, and it has held that position since 2013, as Spanish news reports on open trials and new investigations on a regular basis. The European Commission estimates that corruption alone costs the EU economy 120 billion euros per year, just a little less than the annual budget of the European Union.

Even though corruption in Spain can seem endemic, given its volume and extent, some political leaders claim that these cases are “isolated”, not related to each other, and a thing of the past.

Data about all corruption cases in Spain can be modeled as a graph to unveil the networks of “friends” that are always charged together. We can then use Graph Theory concepts to analyze the relations and try to determine if these are indeed isolated cases. Let’s go and see what we find out!

What is a graph?

Graphs are mathematical structures that model relations between entities. The entities are called vertices, or nodes, and are represented with a circle, and the relations are called edges, or links, and are represented with lines drawn from one circle to another.

Some relations between entities might be stronger than others. That is modeled by assigning weights to the edges of the graph, which are usually represented by using different line widths.

Below we can see an example of a graph with 6 nodes and 7 edges:
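As a concrete sketch of such a graph, we can build a small weighted example with igraph. The node names and weights below are made up for illustration:

```r
library(igraph)

# hypothetical edge list: 6 nodes, 7 edges, with made-up weights
edges <- data.frame(
  from   = c("A", "A", "B", "B", "C", "D", "E"),
  to     = c("B", "C", "C", "D", "E", "E", "F"),
  weight = c(1, 2, 1, 3, 1, 2, 1))

g <- graph_from_data_frame(edges, directed = FALSE)

vcount(g)   # number of nodes: 6
ecount(g)   # number of edges: 7

# edge weights are drawn as different line widths
plot(g, edge.width = E(g)$weight)
```

The `weight` column in the edge data frame becomes an edge attribute automatically, which is what lets us map it to line width when plotting.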

In our case, we are going to build two different graphs:

• Graph of people accused: with the people involved in the cases as nodes. Two people are related if they have been accused in the same case. The weight of the edges is the number of cases they share.

• Graph of cases: with the corruption cases as nodes. Two cases are related if they share a person that has been accused in both. The weight of the edges is the number of people the cases have in common.

Getting the data

Now, to build these graphs we need data of all the political corruption cases in Spain, and which people are involved in them. We can get that from Casos Aislados, a web page that is doing the work of collecting all this information from news and official court documents, and releasing it to the public.

To avoid saturating the webpage with requests, I have already downloaded and processed the data, resulting in three files that we’ll use to build the graphs.

• The list of cases and the people accused in each of them can be found here.
• The file of relations between accused people can be found here.
• The file of relations between corruption cases can be found here.

We can load them into R by executing the following:
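The files themselves are behind the links above; as a sketch of what the loaded data is assumed to look like (column names inferred from the graph-building code later in the post), here are tiny inline stand-ins:

```r
# tiny stand-ins for the three downloaded files; in practice each would be
# loaded with read.csv() from the files linked above
list_accused_in_cases <- data.frame(
  case    = c("Case 1", "Case 1", "Case 2"),
  accused = c("Person A", "Person B", "Person B"),
  stringsAsFactors = FALSE)

relations_accused <- data.frame(
  accused_1 = "Person A", accused_2 = "Person B",
  cases_in_common = 1, stringsAsFactors = FALSE)

relations_cases <- data.frame(
  case_1 = "Case 1", case_2 = "Case 2",
  accused_in_common = 1, stringsAsFactors = FALSE)
```

The case and person names here are placeholders; the real files contain the full Casos Aislados data.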

Once we have the data loaded into R, it’s time to create the graphs and start working with them. For that, we’ll use a package called igraph, which is the most popular package in R for working with graphs.

If you don’t have it already, you can install the package with

install.packages("igraph")

and then load the library typing

library(igraph)

We can now create the graphs from our dataframes as follows:

# Graph of people accused
g_accused <- graph_from_data_frame(
  relations_accused[, c("accused_1", "accused_2")],
  directed = FALSE,
  vertices = unique(list_accused_in_cases$accused))
E(g_accused)$weight <- relations_accused$cases_in_common

# Graph of corruption cases
g_cases <- graph_from_data_frame(
  relations_cases[c("case_1", "case_2")],
  directed = FALSE,
  vertices = unique(list_accused_in_cases$case))
E(g_cases)$weight <- relations_cases$accused_in_common

Note that the function E(graph) retrieves the edges of a graph. We can use that to set edge attributes.

We have two ways of getting an overview of what the graph looks like.
The first one is using the summary() function, where we see that the graph of accused people has 3100 nodes, 31375 edges, a name attribute for the nodes, and a weight attribute for the edges.

summary(g_accused)
## IGRAPH c8e74b7 UNW- 3100 31375 --
## + attr: name (v/c), weight (e/n)

The second way is plotting the graph to see its shape:

plot(g_cases, vertex.label=NA, vertex.size=2, vertex.color="#0CCF02")

Connected Components

We can see that the cases graph is composed of many single dots. Those are, indeed, isolated cases: corruption cases whose people involved haven’t been accused in any other case. But we can also appreciate some structures that are formed with interconnected nodes. Each of those structures is called a connected component in Graph Theory. Isolated nodes are also connected components themselves, with size 1.
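The idea of connected components is easy to see on a toy graph (the node names here are made up):

```r
library(igraph)

# toy graph: one component of size 3, one of size 2, and one isolated node
g_toy <- graph_from_literal(A - B, B - C, D - E, F)

comp <- components(g_toy)
comp$no          # number of connected components: 3
comp$csize       # component sizes: 3, 2, 1
comp$membership  # which component each node belongs to
```

`components()` returns a list whose `$csize` field holds the size of each component, which is what we use on the real cases graph below.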

With igraph we can calculate the connected components of a graph and check their sizes:

components_cases <- components(g_cases)
components_cases_sizes <- components_cases$csize

If we have a look at the sizes, we can see that there are 40 connected components with more than one node. Those components are not isolated corruption cases, but corruption plots. Let's explore them further.

Induced Subgraphs

If we want to focus on the corruption plots, we’ll need to create a new graph that only contains the cases that are involved in them. In Graph Theory, a subset of a graph created by selecting specific nodes and the edges that join them is called an induced subgraph.

Of course, igraph has a function for this. First we need to get the names of the nodes that are part of a complex component, and then we can use those names to subset the graph with induced_subgraph():

components_plots <- which(components_cases_sizes > 1)

g_cases_plots <- induced_subgraph(g_cases,
  vids = which(membership(components_cases) %in% components_plots))

summary(g_cases_plots)
## IGRAPH ca24e1a UNW- 235 778 --
## + attr: name (v/c), weight (e/n)

The next step is getting the induced subgraph of the people accused in cases that belong to a corruption plot, that is, the cases we have just selected:

accused_plots <- subset(list_accused_in_cases, case %in% V(g_cases_plots)$name)

g_accused_plots <- induced_subgraph(g_accused, vids = accused_plots$accused)

summary(g_accused_plots)
## IGRAPH ca2e7e8 UNW- 1836 23527 --
## + attr: name (v/c), weight (e/n)

As a result, we now see that 235 corruption cases are part of one of those 40 plots, and that 1836 people, out of the 3100 we originally had, are involved in at least one of them.

And what exactly is the size of each connected component? We can see that in the picture below:

Take a moment and try to guess which cases correspond to the biggest corruption plots that appear in the image, you have probably heard about most of them.

• In the first place, with 84 cases and 893 people involved, we have a plot located mainly in Madrid, Valencia and Baleares and corresponding to the cases: Gürtel, Púnica, Lezo, Noos, Bárcenas, Bankia, Tarjetas Black, …
• In the second place, with 48 cases and 204 people involved, we have a plot located purely in Andalucía, comprising the ERE cases, Facturas Falsas, Cursos Formación, etc.
• In the third place we have a much smaller plot with 9 cases and 75 people involved, located mainly in Galicia and with the primary case being the Pokemon case.

The biggest component of a graph is called the giant component. We will now focus on this one, the biggest corruption plot, to learn a few more concepts of Graph Theory and Social Network Analysis.

biggest_plot <- which.max(components_cases_sizes)

g_cases_plot1 <- induced_subgraph(g_cases,
  vids = which(membership(components_cases) == biggest_plot))

accused_plot1 <- subset(list_accused_in_cases, case %in% V(g_cases_plot1)$name)
g_accused_plot1 <- induced_subgraph(g_accused, vids = accused_plot1$accused)

Since we are going to plot the same graph repeatedly, it's good practice to calculate the layout beforehand and then pass it to the plot function. That way our graph will always have the same shape, and we save computing time.

layout_accused_plot1 <- layout_nicely(g_accused_plot1)

plot(g_accused_plot1, layout=layout_accused_plot1, vertex.label=NA, vertex.size=3)

Shortest Paths

Another important concept regarding graph topology is shortest paths.

We can think of the graph of accused people as a network of contacts, and wonder how many calls someone has to make if they want to get in contact with another person of the graph, asking for new telephone numbers until they get the one they want. Of course, this person would choose the sequence that minimizes the number of calls to make, so the path followed would be the shortest path between the two people.

Shortest paths can be different if the edges of the graph are weighted: the person looking for the phone might prefer to call the closest friend in each step, even though that implies making a bigger number of calls.
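A toy example makes this concrete: the node names and weights below are made up, and show how adding weights can change which path is shortest.

```r
library(igraph)

# toy triangle: the direct edge from A to C is "heavy" (a weak contact)
g_toy <- graph_from_literal(A - B, B - C, A - C)
E(g_toy)$weight <- c(1, 1, 5)

# ignoring weights (weights = NA): the direct A-C edge wins, one hop
p_unweighted <- as_ids(shortest_paths(g_toy, from = "A", to = "C",
                                      weights = NA)$vpath[[1]])
p_unweighted

# using the weights: going through B (total cost 2) beats the direct edge (cost 5)
p_weighted <- as_ids(shortest_paths(g_toy, from = "A", to = "C")$vpath[[1]])
p_weighted
```

By default igraph's `shortest_paths()` uses the `weight` edge attribute when one is present; passing `weights = NA` ignores it.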

The function we have to use in igraph is shortest_paths(). Let's calculate, for example, the shortest path between Gallardón and Bárcenas:

path <- shortest_paths(g_accused_plot1,
                       from = "Alberto Ruiz-Gallardón Jiménez",
                       to = "Luis Francisco Bárcenas Gutiérrez")
path$vpath
## [[1]]
## + 4/893 vertices, named, from ca37cd5:
## [1] Alberto Ruiz-Gallardón Jiménez    Ignacio González González
## [3] Alberto López Viejo               Luis Francisco Bárcenas Gutiérrez

The shortest sequence would be: Gallardón → Ignacio González → Alberto López Viejo → Bárcenas, with a length of 3 steps.

If we just need to know the distance, and not the details of the steps, we can use the function distances(), which takes less time to execute:

distances(g_accused_plot1,
          v = "Alberto Ruiz-Gallardón Jiménez",
          to = "Luis Francisco Bárcenas Gutiérrez")
##                                Luis Francisco Bárcenas Gutiérrez
## Alberto Ruiz-Gallardón Jiménez                                 3

Besides calculating the length of the shortest path between a specific pair of nodes, with the distances() function we can also calculate the length of the shortest path between every pair of nodes in the graph, getting a distance matrix:

paths_lengths <- distances(g_accused_plot1)
mean(paths_lengths)
## [1] 4.29711

If we calculate the mean value of these distances, we see that, on average, every two nodes of the graph are separated by 4 or 5 steps. Networks where most nodes can be reached from every other node in a small number of steps are called small-world networks. Many real-life networks have this property.

Centrality Measures

Another interesting thing we can do with graphs is trying to determine which nodes are the most relevant in the network. There is no official definition of relevance, but there are several ways of measuring the centrality of a node in a graph that we can use:

• Degree centrality: given by the degree of the node in the graph, that is, the number of edges connected to the node. In our people network, we could use this metric to define relevance as a function of the number of contacts of a node.
• Betweenness: given by the number of shortest paths in the graph that pass through the node. This can be seen as a measure of how essential the node is to the graph.

• Closeness: given by the length of the shortest paths from the node to every other node in the graph. With this metric, relevance means being well connected, with a short distance to everybody in the graph.

We can calculate all these metrics using igraph as follows:

nodes_degree <- degree(g_accused_plot1)
nodes_betweenness <- betweenness(g_accused_plot1)
nodes_closeness <- closeness(g_accused_plot1)

Community Extraction

Finally, the last thing we are going to analyze is the groups of people that are more connected to each other, the groups of “friends” in the graph. These groups or clusters are called communities in Graph Theory.

There are many different algorithms for calculating communities, and each of them can result in a different division. A lot of them are implemented in igraph, in functions starting with cluster_. In this case, we are going to calculate the groups of friends with an algorithm called Walktrap, which is based on random walks through the graph. We can play with different numbers of steps until we like the groups we get.

comm_accused_plot1 <- cluster_walktrap(g_accused_plot1, steps=50)

The goodness of a division into communities can be measured with modularity, which determines how well separated the clusters are from each other.

modularity(comm_accused_plot1)
## [1] 0.8216408

Modularity takes values from -1 to 1, with a value close to 1 indicating a strong community structure. Therefore, we can see that the division we have achieved is pretty good. And what does that division look like?
We can plot the graph together with the community structure using an enhanced version of the plot() function that accepts the community object and the graph object as parameters:

plot(comm_accused_plot1, g_accused_plot1, layout=layout_accused_plot1, vertex.label=NA, vertex.size=3)

To leave a comment for the author, please follow the link and comment on their blog: R on Coding Club UC3M.

### A brief account (via simulation) of the ROC (and its AUC)

Tue, 01/21/2020 - 01:00

[This article was first published on ouR data generation, and kindly contributed to R-bloggers].

The ROC (receiver operating characteristic) curve visually depicts the ability of a measure or classification model to distinguish two groups. The area under the ROC (AUC) quantifies the extent of that ability. My goal here is to describe as simply as possible a process that serves as a foundation for the ROC, and to provide an interpretation of the AUC that is defined by that curve.

A prediction problem

The classic application for the ROC is a medical test designed to identify individuals with a particular medical condition or disease. The population is comprised of two groups of individuals: those with the condition and those without.
What we want is some sort of diagnostic tool (such as a blood test or diagnostic scan) that will identify which group a particular patient belongs to. The question is: how well does that tool or measure help us distinguish between the two groups? The ROC (and AUC) is designed to help answer that question.

True and false predictions

While we might not know group membership for an individual, we assume that they do indeed belong to one of the two groups. When we base a prediction of group membership on a test, we may or may not be right. There are four scenarios. Our prediction can be (1) a true positive (the patient has the condition and that is what we predict), (2) a false positive (the patient does not have the condition, but we predict they do), (3) a false negative (the patient has the condition but we believe they are healthy), or (4) a true negative (the patient is healthy and that is our prediction). A “good” test is one that maximizes true positive predictions while minimizing false positive predictions.

We can only assess the quality of the test if we know the true group membership of the individuals. So, our plan is to take measurements on a known sample, make predictions based on the test, and see how our predictions match up to reality. The ROC is one way to characterize how well our test matches up to reality.

Binary decision informed by a continuous measure

While we make a binary decision about group membership – either we think they have the condition or they do not – the underlying measure used to make that determination may be continuous, like a score. For example, a hypothetical test might return a score between -10 and 10. We can pick a threshold anywhere along the continuum to form the basis of our prediction. For example, we might say that any score > 0 indicates the condition is present, otherwise it is not.
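Under a threshold of zero, the four prediction outcomes can be tallied directly in a small table. The scores and condition labels in this sketch are made up:

```r
# made-up data: true condition status (1 = disease) and continuous test scores
condition <- c(1, 1, 0, 0, 1, 0)
score     <- c(2.5, -0.5, -3.0, 1.2, 4.1, -2.2)

# predict "condition present" when the score exceeds the threshold of zero
pred <- as.integer(score > 0)

# rows: truth, columns: prediction
# true positives sit at (1, 1), false positives at (0, 1),
# false negatives at (1, 0), true negatives at (0, 0)
table(truth = condition, prediction = pred)
```

Counting the four cells of this table at every possible threshold is exactly what the ROC summarizes.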
This simple test will be useful as a tool to discriminate between the disease and non-disease groups if that threshold indeed distinguishes the groups. This is probably best demonstrated with a simple simulation. The sample we will generate has 100 individuals, around 40% of whom have the condition in question. The average score for the non-disease group is set at -5, and the average score for the disease group is 5. Both have variance 3.5:

library(simstudy)

# define data
defx <- defData(varname = "condition", formula = .4, dist = "binary")
defx <- defData(defx, "x", formula = "-5 + 10*condition",
                variance = 3.5, dist = "normal")

# generate data
set.seed(1873)
dx <- genData(100, defx)
head(dx)

##    id condition     x
## 1:  1         0 -5.83
## 2:  2         1  4.66
## 3:  3         1  4.23
## 4:  4         0 -3.87
## 5:  5         1  1.78
## 6:  6         0 -4.87

Looking at the plot below, a threshold of zero appears to do an excellent job of distinguishing the groups. All of those with the condition (depicted in red) are above the threshold, whereas all of those without the condition (depicted in green) fall below it:

The world is not always so neat and clean

Of course, we don't usually have a measure or test that separates the groups so cleanly. Let's say the average of the disease group is 2.5 and the non-disease group is -3. The threshold of zero still works pretty well, but it is not perfect. Some with the disease fall below the threshold (false negatives), and some without the disease lie above it (false positives). In fact, only 87% of those with the disease are correctly identified (true positives), while 13% of those without the condition are incorrectly identified as having the disease (false positives).

defx <- updateDef(defx, changevar = "x", newformula = "-3 + 5.5*condition",
                  newvariance = 6)
dx <- genData(100, defx)

Generating the ROC

Zero isn't the only possible threshold we could use for the diagnosis test.
We can lower the threshold below zero to ensure that we have 100% true positives, but we will have to sacrifice by increasing the proportion of false positives. Likewise, we could reduce the proportion of false positives by raising the threshold above zero, but we would reduce the proportion of true positives in the process. There are, in fact, an infinite number of possible thresholds.

Here is a sequence of plots of the same data with a number of different thresholds ranging from 8 to -8. The percent of true positives is shown on the top and the percent of false positives is shown on the bottom.

The ROC is really just a summarized version of this sequence of plots. The X-axis is the proportion of false positives at a particular threshold, and the Y-axis is the proportion of true positives. As we lower the threshold, we move from left to right. So, in the plot below, each point represents one of the sections above.

The ROC above is built from only 9 thresholds. If we consider all possible thresholds (continuous between -10 and 10), this is the more complete curve:

Area under the ROC

The AUC is, well, the area under the ROC. The maximum AUC is 1 when there is complete separation (there is an example of this below), and the minimum is 0.5 (depicted by the diagonal line) when there is no separation by the test measure (again, an example will follow). We can estimate this area by integrating an approximate function defined by the data between 0 and 1.

f <- approxfun(x = roc$false.pos, y = roc$true.pos)
integrate(f, lower = 0, upper = 1)

## 0.957 with absolute error < 0.00011

There is actually a meaningful interpretation of the AUC, described in a classic 1982 paper by Hanley & McNeil (if you want a deeper understanding of the issues, this paper is not a bad place to start – there is, of course, a huge literature on the topic of ROCs).
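The integration step above uses a roc data frame that isn't constructed in the excerpt. One minimal way it could be built, assuming the column names false.pos and true.pos used in the calls, is to sweep a grid of thresholds; the simulated data here stands in for the post's dx:

```r
# sketch: (false positive, true positive) rates over a grid of thresholds;
# the column names false.pos / true.pos match the approxfun() call above
make_roc <- function(score, condition, thresholds) {
  data.frame(
    false.pos = sapply(thresholds, function(t) mean(score[condition == 0] > t)),
    true.pos  = sapply(thresholds, function(t) mean(score[condition == 1] > t)))
}

# made-up, well-separated stand-in data (means -3 and 2.5, variance 6)
set.seed(1)
condition <- rbinom(100, 1, 0.4)
score <- rnorm(100, mean = ifelse(condition == 1, 2.5, -3), sd = sqrt(6))

roc <- make_roc(score, condition, thresholds = seq(10, -10, length.out = 200))

f <- approxfun(x = roc$false.pos, y = roc$true.pos)
auc_est <- integrate(f, lower = 0, upper = 1)$value
auc_est
```

Lowering the threshold moves both rates from 0 toward 1, tracing the curve from left to right.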
The AUC is actually equivalent to the probability that the test measure of a random draw from the diseased group will be greater than the test measure of a random draw from the healthy group. So, an AUC of 0.90 indicates that 90% of the time, when we draw a test measure from each of the disease and non-disease groups, the measure from the disease group will be greater.

Here is a simple function that returns TRUE if a random draw from the disease group is greater:

randcomp <- function(ds) {
  ds[condition == 1, sample(x, 1)] > ds[condition == 0, sample(x, 1)]
}

And here is the proportion of 1000 draws where the measure from the disease group is greater (this is expected to be close to the AUC, which was estimated above to be 0.957):

mean(sapply(1:1000, function(x) randcomp(dx)))

## [1] 0.958

Of course, R has several packages that provide ROCs and calculate AUCs. I'm using the package pROC here just to show that my AUC estimate is not totally crazy:

library(pROC)

roc_obj <- roc(response = dx$condition, predictor = dx$x)
auc(roc_obj)

## Area under the curve: 0.958

Alternative scenarios

As indicated above, the AUC can generally range from 0.5 to 1.0. There is no hard and fast rule about what makes a “good” AUC – it will depend on the application. Certainly, anything below 0.7 or maybe even 0.8 is pretty weak. I am going to conclude by generating data at the two extremes.

Minimal separation

When the test measure is equally distributed in both groups, there is unlikely to be any threshold for which the proportion of true positives exceeds the proportion of false positives. If this is the case, we should probably look for another test measure – or be prepared to make a lot of mistakes in the non-disease group.
defx <- updateDef(defx, changevar = "x", newformula = "0 + 0*condition",
                  newvariance = 8)
dx <- genData(100, defx)

As we move the threshold lower, both the proportion of true positives and the proportion of false positives steadily increase. As a result, the ROC hangs fairly close to the diagonal lower bound. We would expect the AUC to be fairly close to 0.5, which it is:

f <- approxfun(x = roc$false.pos, y = roc$true.pos)
integrate(f, lower = 0, upper = 1)

## 0.623 with absolute error < 4.5e-05

mean(sapply(1:1000, function(x) randcomp(dx)))

## [1] 0.613

Complete separation

At the other extreme, the mean of the disease group is high enough that there is no overlap between the two groups. In this case, the curve follows the Y-axis before going across the X-axis. We can achieve 100% true positives and no false positives if the threshold is set at some point below the minimum of the disease group and above the maximum of the non-disease group. Zero will be the ideal cut-off point for this example.

defx <- updateDef(defx, changevar = "x", newformula = "-4 + 8*condition",
                  newvariance = 3.5)
dx <- genData(100, defx)

As expected, the AUC is equal to 1:

f <- approxfun(x = roc$false.pos, y = roc$true.pos)
integrate(f, lower = 0, upper = 1)

## 0.996 with absolute error < 9.2e-05

mean(sapply(1:1000, function(x) randcomp(dx)))

## [1] 1

Logistic regression and the ROC

Just a quick note to conclude. The ROC is often used in conjunction with classification problems based on logistic regression modeling. In that case, we may not have a single underlying test measure, but rather multiple predictors or measures. The group assignment decision then needs to be based on a summary of these multiple measures; one logical candidate is the individual's predicted probability estimated by the model. If the specified logistic regression model provides good separation between the two groups, the predicted probabilities will be quite different for each group (higher AUC).
However, if the model is not a strong classifier, the predicted probabilities for the two groups will be much closer together (lower AUC).

References:

Hanley, J.A. and McNeil, B.J., 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), pp.29-36.

Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, p. 77. DOI: 10.1186/1471-2105-12-77.

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

### unpack Your Values in R

Mon, 01/20/2020 - 19:53

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers].

I would like to introduce an exciting feature in the upcoming 1.9.6 version of the wrapr R package: value unpacking. The unpacking notation is made available if you install wrapr version 1.9.6 from GitHub:

remotes::install_github("WinVector/wrapr")

We will likely send this version to CRAN in a couple of weeks.

Here is an example of the unpack feature in use. First we set up some example data.
library(wrapr)
packageVersion('wrapr')
#> [1] '1.9.6'

# make some example data
d <- data.frame(
  x = 1:2,
  g = c('test', 'train'),
  stringsAsFactors = FALSE)

Now we demonstrate the new feature: unpacking a named list returned from a function. In this case we will demonstrate the effect using the function base::split(), which splits data into a named list, with the names coming from the grouping vector. Our unpack feature will conveniently assign these sub-dataframes into our environment for us.

# unpack the data into our workspace
# notation is assignment-like: NEW_VARIABLE = NAMED_ITEM
unpack[train_set = train, test_set = test] <- split(d, d$g)

In the above example base::split() built a named list of sub-dataframes from our original data frame d. We used unpack[] to assign these named items into our working environment as the new variables: train_set and test_set. The unpacking was triggered by assigning the split results into the special unpack[] notation. Notice the unpack specification itself also looks like assignments (or more precisely argument bindings) with new names (used to say where values will be written) on the left and old names (used to say where values are found) on the right.

We can confirm that the training data is in the train_set variable, and the test data is in the test_set variable:

# confirm we have the new variables
print(train_set)
#>   x     g
#> 2 2 train

print(test_set)
#>   x    g
#> 1 1 test

The unpacking notation, when used in this manner, doesn’t depend on the order of the values. This makes for very safe code that concisely documents intent.

There is a small side effect: due to R's assignment rules, using the []<- notation will write a value named “unpack” into the working environment. If one wishes to avoid this, they can use either a function notation:

unpack(split(d, d$g), train_set = train, test_set = test)

Or a “pipe into function” notation:

split(d, d$g) %.>% unpack(., train_set = train, test_set = test)

(Note: the above uses the wrapr dot-pipe. Currently unpack() does not work with the magrittr pipe, as in that case the results appear to get written into a temporary intermediate environment created by magrittr, and then lost. This difference between the pipes isn't so much a problem with unpack as a reflection of the extent to which the wrapr dot-pipe is designed for user extension.)

A killer application of unpack is replacing the save(a, b, file = FNAME)/load(file = FNAME) pattern with the much safer and more explicit saveRDS(list(a = a, b = b, ...), file = FNAME)/unpack(readRDS(file = FNAME), a, b, ...) pattern. Here we are using the convention that a name alone, such as “a”, implicitly stands for “a = a”.
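That pattern can be sketched end to end; this assumes a wrapr version with unpack (1.9.6 or later), and uses a throwaway temp file and made-up values:

```r
library(wrapr)

a <- 1:3
b <- "fit"

# explicit, name-checked alternative to save()/load()
fname <- tempfile(fileext = ".rds")
saveRDS(list(a = a, b = b), file = fname)

rm(a, b)

# bare names use the convention that "a" is shorthand for "a = a"
unpack(readRDS(file = fname), a, b)

print(a)
print(b)
```

Unlike load(), which silently writes whatever names the file happens to contain, this pattern states in the code exactly which variables are expected to come back.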

unpack also supports positional unpacking, as we see below.

list(x = 1:2, y = 3:4) -> unpack_i[a, b]

print(a)
#> [1] 1 2

print(b)
#> [1] 3 4

Though we feel the named pattern is more compatible with existing R functions and style.

We are still working on choosing names for this function. Likely we will pick “unpack” for the functional form, and perhaps one of “to” or “into” for the array-bracket assignment form. Right now we implement all 3 names, each with all functionality.

If you don’t want to bring in all of wrapr with library(wrapr), you can bring in just a few bits as follows:

unpack <- wrapr::unpack
`%.>%` <- wrapr::`%.>%`

We are designing unpack to be very strict in its name checking before writing any values to the workspace. This is to avoid partial assignments, where some fields are written and others are missing.

We hope you check out the unpack feature and use it in your projects.

Related work includes:

• The zeallot::%<-% package already supplies excellent positional or ordered unpacking. But we feel that style may be more appropriate in the Python world where many functions return un-named tuples of results. Python functions tend to have positional tuple return values because the Python language has had positional tuple unpacking as a core language feature for a very long time (thus positional structures have become “Pythonic”). R has not emphasized positional unpacking, so R functions tend to return named lists or named structures. For named lists or named structures it may not be safe to rely on value positions. So I feel it is more “R-like” to use named unpacking.
• vadr::bind supplies named unpacking, but appears to use a “SOURCE = DESTINATION” notation. That is the reverse of the “DESTINATION = SOURCE” notation in which both R assignments and argument bindings are already written.
• base::attach. base::attach adds items to the search path with names controlled by the object being attached (instead of by the user).
• wrapr::let(). wrapr::let() re-maps names during code execution using a “TARGET = NEWNAME” target replacement scheme, where TARGET acts as if it had the name stored in NEWNAME for the duration of the let-block.