
Microsoft R Open 3.5.1 now available

Tue, 08/14/2018 - 19:33

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Microsoft R Open 3.5.1 has been released, combining the latest R language engine with multi-processor performance and tools for managing R packages reproducibly. You can download Microsoft R Open 3.5.1 for Windows, Mac and Linux from MRAN now. Microsoft R Open is 100% compatible with all R scripts and packages, and works with all your favorite R interfaces and development environments.

This update brings a number of minor fixes to the R language engine from the R core team. It also makes available a host of new R packages contributed by the community, including packages for downloading financial data, connecting with analytics systems, applying machine learning algorithms and statistical models, and many more. New R packages are released every day, and you can access packages released after the 1 August 2018 CRAN snapshot used by MRO 3.5.1 using the checkpoint package.
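
As a hedged illustration of that workflow, a minimal sketch of pinning a project to a later MRAN snapshot with the checkpoint package might look like this (the snapshot date below is only an example, not a recommendation):

# Illustrative sketch: resolve package installs against a fixed MRAN snapshot date.
library(checkpoint)
checkpoint("2018-09-01")  # installs/uses packages as they existed on this date

# From here on, library() calls resolve against the snapshot-specific library
library(data.table)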

We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. You can follow the development of Microsoft R Open at the MRO Github repository. To download Microsoft R Open, simply follow the link below.

MRAN: Download Microsoft R Open



data.table is Really Good at Sorting

Tue, 08/14/2018 - 07:30

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes.

The graph uses a log-log scale (so differences are visually compressed), but data.table is routinely about 7 times faster than dplyr. The ratio of run times is shown below.

Notice that on the above semi-log plot the run-time ratio grows roughly linearly. This makes sense: data.table uses a radix sort, which has the potential to run in near linear time (faster than the n log(n) lower bound known for comparison sorting) for a range of problems (also, we are only showing example sorting times, not worst-case sorting times).

In fact, if we divide the y values in the above graph by log(rows) we get something approaching a constant.

The above is consistent with data.table not only being faster than dplyr, but also having a fundamentally different asymptotic running time.
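
For readers who want to try a comparison like this on their own machine, here is a minimal, hedged sketch (the data, column names, and sizes are arbitrary; this is not the benchmarking code behind the plots above):

# Minimal sorting comparison sketch; not the code used for the timings above.
library(data.table)
library(dplyr)
library(microbenchmark)

n <- 1e6
df <- data.frame(x = runif(n), y = sample(letters, n, replace = TRUE))
dt <- as.data.table(df)

microbenchmark(
  data.table = setorder(copy(dt), y, x),  # radix sort; copy() so each run sorts fresh data
  dplyr      = arrange(df, y, x),
  times      = 10
)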

Performance like the above is one of the reasons you should strongly consider data.table for your R projects.

All details of the timings can be found here.



2018 Update of ffanalytics R Package

Tue, 08/14/2018 - 05:21

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Introduction

As part of our effort to give users ways to replicate our analyses and improve their performance in fantasy football, we are continuously looking for ways to improve our tools. Since the ffanalytics package was introduced back in 2016, we have had a tremendous response and plenty of feedback. We have worked through some of the kinks and are now introducing a new version of the ffanalytics package that should be more robust and provide more consistent results.

Improvements

One of the major concerns with data scrapes is that websites change, and so does the way data are rendered on individual pages. Where the old version of the ffanalytics package was highly dependent on how columns were ordered on the different sites, the new version does not depend on column order. Instead, the package looks for specific components of the page and, based on a predefined mapping, determines which columns are which.

The other major concern is that different sites list the same player under slightly different names (e.g. Rob Kelly vs. Robert Kelly) or at different positions, so matching up the data can be tricky. The new version of the package maps player IDs from the different sites back to the player ID used by MyFantasyLeague.com, so it no longer matters what name or position a player is listed under on a specific site: the data for that player will match up correctly.

Finally, the new version of the package relies heavily on the vocabulary and terminology used in the tidyverse packages. We highly recommend familiarizing yourself with them.

It is our intent that the new version of the package will provide more consistent scrapes for users.

Player information

The player information used in the package is based on player information from MyFantasyLeague.com (MFL). When the package is loaded, it pulls the most recent player data from MFL into the player_table object. When the scrape runs, it determines the player ID for the player on the site that is being scraped and maps that ID back to the MFL player ID. The scraped data contains an id column which holds the player ID for MFL, and a src_id column that has the ID for the site. However, when aggregating the stats and generating the projections the player ID used is the MFL player ID.
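
As a hedged illustration of why this matters (the data frames and values below are made up, not actual package output; only the id and src_id column names come from the description above), carrying the MFL player ID alongside each site's own ID makes combining scrapes from different sites a simple group-by:

library(dplyr)

# Hypothetical scraped rows from two sites; player names differ, but both carry
# the same MFL id, so they line up without any name matching.
site_a <- tibble::tibble(id = c("13130", "13404"), src_id = c("a1", "a2"),
                         pass_yds = c(4500, 4100))
site_b <- tibble::tibble(id = c("13130", "13404"), src_id = c("b9", "b7"),
                         pass_yds = c(4300, 4250))

bind_rows(site_a, site_b) %>%
  group_by(id) %>%
  summarise(mean_pass_yds = mean(pass_yds))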

Custom calculations

The customization of calculations is done as input to the projections_table() function. You can specify your own scoring rules, analyst weights, tier thresholds and VOR baselines to be used in the calculations. See ?projections_table for specifics, and check out the scoring_settings vignette for how to customize scoring rules.

Installation and getting started

We have updated the instructions on installing and getting started with the package here, so check that page out, and feel free to leave comments with feedback and suggestions.

The post 2018 Update of ffanalytics R Package appeared first on Fantasy Football Analytics.



Mongolite 2.0: GridFS, connection pooling, and more

Tue, 08/14/2018 - 02:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

This week version 2.0 of the mongolite package has been released to CRAN. Major new features in this release include support for MongoDB 4.0, GridFS, running database commands, and connection pooling.

Mongolite is primarily an easy-to-use client to get data in and out of MongoDB. However, it supports a growing number of advanced features, such as aggregation, indexing, map-reduce, streaming, encryption, and enterprise authentication. The mongolite user manual provides a great introduction with details and worked examples.
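
For readers new to the package, a minimal sketch of the basic workflow looks like the following (the collection name and query are arbitrary examples, and a local mongod on the default port is assumed):

library(mongolite)

# Assumes a local mongod on the default port; collection and db names are arbitrary
m <- mongo(collection = "mtcars", db = "test")
m$insert(mtcars)                 # write a data frame
m$find('{"cyl": 6}', limit = 3)  # JSON query, returned as a data frame
m$count()
m$drop()                         # clean up the example collection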

GridFS Support!

New in version 2.0 is support for the MongoDB GridFS system. A GridFS is a special type of Mongo collection for storing binary data, such as files. To the user, a GridFS looks like a key-value server with potentially very large values.

We support two interfaces for sending/receiving data from/to GridFS. The fs$read() and fs$write() methods are the most flexible and can stream data from/to an R connection, such as a file, socket or url. These methods support a progress counter and can be interrupted if needed. These methods are recommended for reading or writing single files.

# Assume 'mongod' is running on localhost
fs <- gridfs()
buf <- serialize(nycflights13::flights, NULL)
fs$write(buf, 'flights')
#> [flights]: written 45.11 MB (done)
#> List of 6
#>  $ id      : chr "5b70a37a47a302506117c352"
#>  $ name    : chr "flights"
#>  $ size    : num 45112028
#>  $ date    : POSIXct[1:1], format: "2018-08-12 23:15:38"
#>  $ type    : chr NA
#>  $ metadata: chr NA

# Read serialized data:
out <- fs$read('flights')
flights <- unserialize(out$data)
# [flights]: read 45.11 MB (done)
#> A tibble: 6 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
#> 1  2013     1     1      517            515         2      830            819        11 UA        1545
#> 2  2013     1     1      533            529         4      850            830        20 UA        1714
#> 3  2013     1     1      542            540         2      923            850        33 AA        1141
#> 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725
#> 5  2013     1     1      554            600        -6      812            837       -25 DL         461
#> 6  2013     1     1      554            558        -4      740            728        12 UA        1696
#> # ... with 8 more variables: tailnum, origin, dest, air_time, distance,
#> #   hour, minute, time_hour

The fs$upload() and fs$download() methods, on the other hand, copy directly between GridFS and your local disk. This API is vectorized, so it can transfer many files at once. Head over to the gridfs chapter in the manual for more examples.

Running Commands

MongoDB exposes an enormous number of database commands, and mongolite cannot provide wrappers for each one. As a compromise, the new version of mongolite exposes an API to run raw commands, so you can manually run the commands for which we do not expose wrappers.

The result data from the command automatically gets simplified into nice R data structures using the jsonlite simplification system (but you can set simplify = FALSE if you prefer JSON structures).

m <- mongo()
m$run('{"ping": 1}')
#> $ok
#> [1] 1

For example, we can run the listDatabases command to list all databases on the server:

admin <- mongo(db = "admin")
admin$run('{"listDatabases":1}')
#> $databases
#>     name sizeOnDisk empty
#> 1  admin      32768 FALSE
#> 2 config      73728 FALSE
#> 3  local      73728 FALSE
#> 4   test   72740864 FALSE
#>
#> $totalSize
#> [1] 72921088
#>
#> $ok
#> [1] 1

The server tools chapter in the manual has some more examples.

Connection Pooling

Finally, another much-requested feature was connection pooling. Previously, mongolite would open a new database connection for each collection object in R. However, as of version 2.0, mongolite will use existing connections to the same database when possible.

# This will use a single database connection
db.flights <- mongolite::mongo("flights")
db.iris <- mongolite::mongo("iris")
db.files <- mongolite::gridfs()

A database connection is automatically closed when all objects that were using it have either been removed or explicitly disconnected with the disconnect() method. For example, continuing the example above:

# Connection still alive
rm(db.flights)
db.files$disconnect()

# Now it will disconnect
db.iris$disconnect()

Mongolite collection and GridFS objects automatically reconnect when needed if they have been disconnected (either explicitly, or automatically, e.g. when restarting your R session). For example:

> db.files$find()
Connection lost. Trying to reconnect with mongo...
#>                         id    name     size                date type metadata
#> 1 5b70a37a47a302506117c352 flights 45112028 2018-08-12 23:15:38


PCA revisited: using principal components for classification of faces

Tue, 08/14/2018 - 00:03

(This article was first published on The Beginner Programmer, and kindly contributed to R-bloggers)

This is a short post following the previous one (PCA revisited).

In this post I’m going to apply PCA to a toy problem: the classification of faces. Again I’ll be working on the Olivetti faces dataset. Please visit the previous post PCA revisited to read how to download it.

The goal of this post is to fit a simple classification model to predict, given an image, the label to which it belongs. I’m going to fit two support vector machine models and then compare their accuracy.

The two models are the following:

  1. The first model uses as input data the raw (scaled) pixels (all 4096 of them). Let’s call this model “the data model”.
  2. The second model uses as input data only some principal components. Let’s call this model “the PCA model”.

This will not be an in-depth and detailed fit and subsequent evaluation of the model, but rather a simple proof of concept to prove that a limited number of principal components can be used to perform a classification task instead of using the actual data.

The setup of the two models

In order to get an idea of what accuracy our models can achieve, I decided to do a simple 20-fold cross-validation of each model.

The data model is a simple support vector machine that is fitted using all the faces in the training set: its input data is a vector of 4096 components.

The PCA model is again the same support vector machine (with the same hyperparameters, which however may need some tweaking) fitted using 30 PCs.

By performing PCA on the dataset I transformed the data and, according to the analysis, 30 PCs account for about 82% of the total variance in the dataset. Indeed you can see from the plots below that the first 30 eigenvalues are those with the highest magnitude.
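
As a hedged sketch of that step (object names here are illustrative and this is not the author's original code; faces_matrix stands for the matrix of scaled pixel intensities, one face per row):

# Sketch only: faces_matrix is assumed to be the 400 x 4096 matrix of pixels.
pca_fit <- prcomp(faces_matrix, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the leading components
var_explained <- cumsum(pca_fit$sdev^2) / sum(pca_fit$sdev^2)
var_explained[30]              # roughly 0.8 on this dataset, per the text

scores_30 <- pca_fit$x[, 1:30] # the 30 PCs used as inputs to the PCA model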

 

Results

After running the two cross-validations on the two different models, the main results I found are the following:

  1. The average accuracy of the two models is about the same (the PCA model has a slightly better average accuracy): 93% vs 94%. The standard deviation is again in favour of the PCA model: 3.2% vs 2.3%. The accuracy isn't great (I obtained 98% on this dataset using other models), but of course this is just a test “out of the box” with no hyperparameter tuning or other fine-tuning.
  2. The second model fits the data much faster than the first one.

The results are summarised in the boxplot below.

Note that the PCA model has less information available than the model using the original data; however, that information is condensed into fewer variables (the 30 PCs) in a more compact form. This technique is great when you are using models sensitive to correlation in the data: since the PCs are not correlated, you can transform your original data (which might be correlated and causing problems in the model) and use the principal components instead.

Another nice advantage is that since the PCA model uses fewer variables, it is faster. Note that I decided to apply PCA to the whole dataset, not only to the training set of each round of cross-validation; it could be argued that in practice you would only have access to the training set (of course). I then used the eigenvectors to obtain the PCs for both the training and testing sets of each cross-validation iteration.
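
A minimal sketch of one cross-validation fold following that approach (using e1071 for the SVM; all object names are illustrative and this is not the author's original code) might look like:

library(e1071)

# Sketch: scores_30 are the 30 PCs computed on the whole dataset (see above),
# labels is the factor of subject ids, train_idx indexes one fold's training rows.
fold_accuracy <- function(train_idx) {
  fit  <- svm(x = scores_30[train_idx, ], y = labels[train_idx])
  pred <- predict(fit, scores_30[-train_idx, ])
  mean(pred == labels[-train_idx])   # accuracy on the held-out faces
}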

The full code I used can be found in the original post.



In-brief: splashr update + High Performance Scraping with splashr, furrr & TeamHG-Memex’s Aquarium

Mon, 08/13/2018 - 23:35

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

The development version of splashr now supports authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and/or Splash: the latter is a lightweight alternative to tools like Selenium, and the former is an R interface to it. Unlike xml2::read_html(), splashr renders a URL exactly as a browser does (because it uses a virtual browser) and can return far more than just the HTML from a web page. Splash does need to be running, and it’s best to use it in a Docker container.

If you have a large number of sites to scrape, working with splashr and Splash “as-is” can be a bit frustrating since there’s a limit to what a single instance can handle. Sure, it’s possible to set up your own highly available, multi-instance Splash cluster and use it, but that’s work. Thankfully, the folks behind TeamHG-Memex created Aquarium, which uses docker and docker-compose to stand up multiple Splash instances behind a pre-configured HAProxy instance so you can take advantage of parallel requests to the Splash API. As long as you have docker and docker-compose handy (and Python), following the steps on the aforelinked GitHub page should have you up and running with Aquarium in minutes. You use the same default port (8050) to access the Splash API and you get a bonus port of 8036 to watch in your browser (the HAProxy stats page).

This works well when combined with furrr, an R package that makes parallel operations very tidy.

One way to use furrr, splashr and Aquarium together might look like this:

library(splashr)
library(HARtools)
library(urltools)
library(furrr)
library(tidyverse)

list_of_urls_with_unique_urls <- c("http://...", "http://...", ...)

make_a_splash <- function(org_url) {
  splash(
    host = "ip/name of system you started aquarium on",
    user = "your splash api username",
    pass = "your splash api password"
  ) %>%
    splash_response_body(TRUE) %>%       # we want to get all the content
    splash_user_agent(ua_win10_ie11) %>% # splashr has many pre-configured user agents to choose from
    splash_go(org_url) %>%
    splash_wait(5) %>%                   # pick a reasonable timeout; modern web sites with javascript are bloated
    splash_har()
}

safe_splash <- safely(make_a_splash) # splashr/Splash work well but can throw errors. Let's be safe

plan(multiprocess, workers = 5) # don't overwhelm the default setup or your internet connection

future_map(list_of_urls_with_unique_urls, ~{

  org <- safe_splash(.x) # go get it!

  if (is.null(org$result)) {
    sprintf("Error retrieving %s (%s)", .x, org$error$message) # this gives us good error messages
  } else {
    HARtools::writeHAR(   # HAR format saves *everything*. the files are YUGE
      har = org$result,
      file = file.path("/place/to/store/stuff", sprintf("%s.har", domain(.x))) # saved with the base domain; you may want to use a UUID via uuid::UUIDgenerate()
    )
    sprintf("Successfully retrieved %s", .x)
  }

}) -> results

(Those with a keen eye will grok why splashr supports Splash API basic authentication, now)

The parallel iterator will return a list we can flatten to a character vector (I don’t do that by default since it’s safer to get a list back as it can hold anything and map_chr() likes to check for proper objects) to check for errors with something like:

flatten_chr(results) %>%
  keep(str_detect, "Error")
## [1] "Error retrieving www.1.example.com (Service Unavailable (HTTP 503).)"
## [2] "Error retrieving www.100.example.com (Gateway Timeout (HTTP 504).)"
## [3] "Error retrieving www.3000.example.com (Bad Gateway (HTTP 502).)"
## [4] "Error retrieving www.a.example.com (Bad Gateway (HTTP 502).)"
## [5] "Error retrieving www.z.examples.com (Gateway Timeout (HTTP 504).)"

Timeouts would suggest you may need to up the timeout parameter in your Splash call. Service unavailable or bad gateway errors may suggest you need to tweak the Aquarium configuration to add more workers or reduce your plan(…). It’s not unusual to have to create a scraping process that accounts for errors and retries a certain number of times.
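
As a hedged sketch of one way to add simple retries around the safe_splash() helper defined above (the retry count and back-off time are arbitrary choices, not part of splashr):

# Sketch: retry a Splash request a few times before giving up.
# 'safe_splash' is the safely()-wrapped helper from the code above.
splash_with_retries <- function(org_url, tries = 3, sleep = 2) {
  for (i in seq_len(tries)) {
    org <- safe_splash(org_url)
    if (!is.null(org$result)) return(org)  # success; stop retrying
    Sys.sleep(sleep)                       # back off before the next attempt
  }
  org  # return the last (failed) attempt so the caller can log the error
}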

If you were stuck in the splashr/Splash slow-lane before, give this a try to help save you some time and frustration.



Time series intervention analysis with fuel prices by @ellis2013nz

Mon, 08/13/2018 - 20:30

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

A couple of months ago I blogged about consumer spending on vehicle fuel by income. The impetus for that post was the introduction on 1 July 2018 of an 11.5 cent per litre “regional fuel tax” in Auckland.

One vehement critic of the levy, largely on the grounds of its impact on the poor, has been Sam Warburton (@Economissive on Twitter) of the New Zealand Initiative. Since early May 2018 (ie seven weeks before the fuel levy began), Sam has been collecting fuel prices from pricewatch.co.nz, and he recently made the data available.

One of the issues of controversy about a levy like this is whether it will lead to “price spreading” – fuel companies absorbing some of the extra tax in Auckland and increasing prices in other regions. A relatively small number of firms make retail pricing decisions about fuel in New Zealand so it’s plausible that imperfect competition is making this possible. I had a look at the data to see if the intervention of the fuel levy in New Zealand’s biggest city can be seen to impact on fuel pricing in the rest of the country.

To cut to the chase, this graphic exemplifies my approach and results:

We see that after the spike at the time the tax was introduced, fuel prices in other regions have converged somewhat on Auckland’s prices (particularly when considering the relative change happening before the tax). The impact of the tax is still clearly felt much more strongly in Auckland than anywhere else (as of course would be expected – the question at issue is whether anywhere else would be impacted at all). More on that later.

The data

First, let’s look at the data Sam’s collected. The Twitter thread linked to above provides an Excel workbook on Dropbox. There’s a worksheet for each region (regions are defined similarly, but not identically, to New Zealand’s official Regions) as well as one describing the source. For each region we have data on fuel prices for combinations of Company, Date, fuel type (eg diesel, 91 octane, etc) and Region. If we plot all the data other than liquid petroleum gas (which has particularly sparse observations), it looks like this:

“Companies” have been sorted in order of increasing average price for that graphic, but fairly crudely (ie not taking into account different mixes of fuel type by Company).

We can see we have more data for 91 octane petrol than the other types. For the rest of this post I’ll be focusing on just 91 octane.

Here’s the R code to tidy up the data to this point and draw the graphic. It assumes you’ve manually downloaded the Excel workbook to your working folder (I’m currently working with severely restricted internet, so couldn’t experiment in automating that process.)

library(tidyverse)
library(scales)
library(openxlsx)
library(nlme)

# Import data:
sn <- getSheetNames("fuel price data.xlsx")
sn <- sn[sn != "Source"]

fuel_orig <- list()
for(i in 1:length(sn)){
  tmp <- read.xlsx("fuel price data.xlsx", sheet = sn[i], cols = 1:7,
                   detectDates = TRUE, na.strings = c("NA", "n/a"))
  tmp[ , "region"] <- sn[i]
  fuel_orig[[i]] <- tmp
}

# Combine into a single data frame
fuel_df <- do.call("rbind", fuel_orig)

# some useful extra information:
south_island <- c("Canterbury", "Nelson", "Otago", "Southland", "West Coast")
big_four <- c("CALTEX", "Z ENERGY", "BP", "MOBIL")

# Make long, thin, tidy version:
fuel_tidy <- fuel_df %>%
  select(-LPG) %>%
  gather(fueltype, value, -Company, -Date, -region) %>%
  filter(!is.na(value)) %>%
  mutate(island = ifelse(region %in% south_island, "South", "North"),
         company_type = ifelse(Company %in% big_four, "Big Four", "Smaller")) %>%
  mutate(region = fct_reorder(region, as.numeric(as.factor(island))),
         Company = fct_reorder(Company, value))

# Overview graphic:
fuel_tidy %>%
  ggplot(aes(x = Date, y = value, colour = Company)) +
  facet_grid(fueltype~region, scales = "free_y") +
  geom_point(size = 0.8) +
  scale_y_continuous("Price per litre at the pump", label = dollar) +
  labs(x = "Date in 2018",
       caption = "Source: pricewatch.co.nz, collated by @Economissive") +
  ggtitle("Petrol prices in New Zealand over several months in mid 2018") +
  guides(colour = guide_legend(override.aes = list(size = 5))) +
  scale_colour_brewer(palette = "Set1")

Regional comparisons with Auckland

I tried a couple of different ways of comparing prices in individual regions with those in Auckland. I think this graphic is probably the most informative and straightforward:

The grey line in the background of each facet represents Auckland’s price; the shaded blue rectangle is the post-tax period (ie 1 July 2018 and onwards). The grey shaded area shows the difference between the given region’s price and that of Auckland.

We can see a lot of regional variation here, and an interesting pattern with three (or maybe even all) of the South Island regions experiencing price declines in June before picking up in July. Of course, the 11.5 cent increase in price in Auckland is very obvious in the grey lines and shading. Later on I’ll be using time series intervention analysis on this data; this is an approach commonly used in evaluating the impact of interventions. If we were only after the direct impact, there would be no need for any statistical tests beyond the graphic above; the big spike in prices hits you between the eyes, and there is no doubt about the discontinuity in Auckland’s prices on 1 July! The question, of course, is how sustained that impact is, and whether it bled into secondary impacts in other regions.

Here’s a second graphic that tries to visually simplify what’s going on, by calculating a single line for the ratio of prices in each region to those in Auckland. I think what it gains in visual simplicity (fewer lines and less shading) it loses in clear interpretability. In particular, it’s not possible to tell from this graphic which changes come from changes in Auckland and which come from changes in the comparison region. That’s not a deal-breaker for using a graphic like this, but it does strongly suggest we should also include the first one, with the rawer average prices per region plainly shown without transformation, for context.

Note that even after the introduction of the Auckland-specific fuel tax, 91 octane petrol in several regions still costs more than in Auckland. The regions with prices higher than Auckland in the most recent data in the collection are West Coast, Otago, Nelson, Canterbury and Wellington.

Here’s the R code for those two graphics:

# construct two convenient summaries of the data, different ways of comparing regions to Auckland:
p91 <- fuel_tidy %>%
  filter(fueltype == "91") %>%
  group_by(region, island, Date) %>%
  summarise(value = mean(value, tr = 0.2)) %>%
  ungroup()

p91_rel <- p91 %>%
  group_by(Date) %>%
  mutate(Auckland = value[region == "Auckland"]) %>%
  filter(! region %in% c("Auckland", "Wairarapa")) %>%
  mutate(perc_of_auck = value / Auckland)

# Plot showing original price data
ggplot() +
  # annoying trick necessary here to draw a semi-transparent background rectangle:
  geom_rect(data = data.frame("hello world"),
            xmin = as.Date("2018-07-01"), xmax = Inf, ymin = -Inf, ymax = Inf,
            fill = "blue", alpha = 0.1) +
  # now draw the actual data:
  geom_ribbon(data = p91_rel, aes(x = Date, ymin = Auckland, ymax = value),
              fill = "grey", alpha = 0.5) +
  geom_line(data = p91_rel, aes(x = Date, y = Auckland), colour = "grey50") +
  geom_line(data = p91_rel, aes(x = Date, y = value, colour = island), size = 1.2) +
  facet_wrap(~region, ncol = 3) +
  scale_y_continuous("Price of 91 octane petrol compared to in Auckland\n", label = dollar) +
  labs(x = "2018; grey line shows Auckland",
       caption = "Source: pricewatch.co.nz, collated by @Economissive")

# ratio of Auckland prices to others
ggplot() +
  geom_hline(yintercept = 1, colour = "grey50") +
  geom_rect(data = data.frame("hello world"),
            xmin = as.Date("2018-07-01"), xmax = Inf, ymin = -Inf, ymax = Inf,
            fill = "blue", alpha = 0.1) +
  geom_line(data = p91_rel, aes(x = Date, y = perc_of_auck, colour = island)) +
  facet_wrap(~region, ncol = 3) +
  scale_y_continuous("Price of 91 octane petrol as a percentage of in Auckland\n", label = percent) +
  labs(x = "2018",
       caption = "Source: pricewatch.co.nz, collated by @Economissive")

Modelling the impact of an intervention over time

When it came to directly addressing our question of interest regarding price spreading, I opted to group all non-Auckland regions together and compare average prices there with those in Auckland. There are better ways of modelling this that make full use of the granular data available (mostly involving mixed effects models, and more complex ways of representing the trend over time than linearly; they would also take into account weighting from the spread in population over regions), but they come with big costs in complexity that I don’t have time for right now. Plus, the difference-of-averages method struck me as the easiest way to interpret, communicate and think about the question of whether prices were converging back towards each other after the initial shock of the addition of the tax. This leads me to the graphic I showed earlier in this post:

The linear regression lines shown in that graphic are a simplified version of the formal statistical model we want to fit and use to test our hypothesis. We’re looking for evidence that the slope of the post-tax line is materially less than the slope of the pre-tax line; in other words, is the gap in pricing between Auckland and other regions declining after the initial 11.5 cent shock of the tax.

I defined this as a simple linear model, but fit it using generalized least squares with time series residuals (auto-regressive moving average of order (1, 1)). This is straightforward to specify and fit using the gls function in Pinheiro, Bates et al’s nlme package, but there are other ways of doing it too.
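
As a quick, hedged illustration of why time series residuals are worth specifying here, one could fit the same formula with plain lm() and inspect the autocorrelation of its residuals (D is the convenience subset of diff_data defined in the code at the end of this post):

# Sketch: check whether ordinary least squares residuals are autocorrelated.
# Clear autocorrelation would mean i.i.d. standard errors are too optimistic,
# which is what the corARMA(1, 1) structure in gls() guards against.
ols_fit <- lm(value ~ Date * post_tax, data = D)
acf(residuals(ols_fit))
pacf(residuals(ols_fit))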

Fitting the GLS model results in the following:

==========================================
                     Dependent variable:
                     ---------------------
                            value
------------------------------------------
Date                        0.001**
                           (0.0003)
post_tax                   19.159**
                            (7.804)
Date:post_tax              -0.001**
                           (0.0004)
Constant                  -10.470**
                            (4.946)
------------------------------------------
Observations                   93
Log Likelihood              330.591
Akaike Inf. Crit.          -647.182
Bayesian Inf. Crit.        -629.761
==========================================
Note:          *p<0.1; **p<0.05; ***p<0.01

…which simply confirms what is visually obvious, that there is indeed statistically significant evidence of the slope changing direction downwards after the tax is introduced. In other words, we do have evidence consistent with some degree of “spreading” taking place. After the initial clean shock of the introduction of the tax, prices in Auckland and in the rest of the country are indeed converging somewhat; although nowhere near as much as the full cost of the tax.

This effect holds whether I use all of New Zealand as the comparison point or just the South Island (which has less competition in fuel retailers) or just the North Island, although in the latter case the effect is not as strong (as can be seen in the graphic). It also doesn’t seem to matter whether we use all available prices, or just those of the “big four” companies that are present in all regions.

We can’t say for sure the effect comes from introducing the tax from just looking at the numbers. Drawing that conclusion would require carefully considering any other possible causality options. For example, one driver of the pattern we’re seeing is clearly that prices in Canterbury, Nelson and Otago stopped declining and started rising slightly in July. What are other plausible causes of that pattern? Understanding and considering such alternative theories would need more knowledge of the fuel market in New Zealand than I have, so I’ll leave it to others to debate that. All I can safely conclude is what I wrote above:

after the spike caused by the tax, fuel prices in Auckland and in the rest of the country are converging somewhat (although much less than the full cost of the tax), and plausibly this is because of companies’ price adjustments down in Auckland and up elsewhere to spread the cost of the tax over a broader base.

Here’s the code for that final graphic and statistical modelling:

# Data on the difference between Auckland's average price and those in other areas:
diff_data <- fuel_tidy %>%
  filter(fueltype == "91" & company_type == "Big Four") %>%
  group_by(Date) %>%
  summarise(auck_v_rest = mean(value[region == "Auckland"]) - mean(value[region != "Auckland"]),
            auck_v_si = mean(value[region == "Auckland"]) - mean(value[island == "South"]),
            auck_v_ni = mean(value[region == "Auckland"]) - mean(value[island == "North" & region != "Auckland"])) %>%
  mutate(post_tax = as.integer(Date >= as.Date("2018-07-01"))) %>%
  gather(comparison, value, -Date, -post_tax) %>%
  mutate(comparison = case_when(
    comparison == "auck_v_si" ~ "Compared to South Island",
    comparison == "auck_v_ni" ~ "Compared to rest of North island",
    comparison == "auck_v_rest" ~ "Compared to all NZ except Auckland"))

# Graphic:
ggplot(diff_data, aes(x = Date, y = value)) +
  facet_wrap(~comparison, ncol = 3) +
  geom_line() +
  geom_smooth(aes(group = post_tax), method = "lm") +
  scale_y_continuous("Average price of 91 octane petrol in Auckland\nminus average price in comparison area",
                     label = dollar) +
  labs(x = "Date in 2018\nAverage prices have not been weighted by population or sales",
       caption = "Source: pricewatch.co.nz, collated by @Economissive") +
  ggtitle("Fuel prices in Auckland compared to three other comparison areas",
          "Restricted to prices from BP, Caltex, Mobil and Z Energy")

# Modelling:
# first, make a convenient subset of the data (useful later in various tests and diagnostics):
D <- subset(diff_data, comparison == "Compared to all NZ except Auckland")

# Fit model, taking care to specify time series residuals, which aren't as useful for inference
# as i.i.d. residuals and hence lead to more conservative inference:
model <- gls(value ~ Date * post_tax,
             data = subset(diff_data, comparison == "Compared to all NZ except Auckland"),
             cor = corARMA(p = 1, q = 1))

# print-friendly summary of coefficients
stargazer::stargazer(model, type = "html")

# more comprehensive summary (not shown in blog):
summary(model)

Observations per region

A topic of interest in the fuel pricing debate in New Zealand is the number of companies present in each region, with a particular focus on the presence of Gull. In case of interest, here are the observation counts per company and region in the pricewatch data collected by Sam Warburton:

region               –    ALTERNATE FUEL  GULL  CHALLENGE  CALTEX  MOBIL  Z ENERGY  GAS ALLEY   BP
Auckland             206             143   349        251     279    372       279        279  372
Bay of Plenty        262             200   295        202     278    264       279        161  370
Coromandel             3               6   249          4     209    219       233         69  277
Waikato                0               0   228        227     276    302       278        220  367
Hawke’s Bay            0              45   217         16     261    228       255        132  345
Northland              0               0   205         52     274    229       256        268  275
Manawatū-Whanganui     0             108   193        120     275    265       277        176  371
Taranaki               0             184   178        188     243     88       274        188  301
Wellington             0              91    78        252     278    277       279        213  372
East Coast             0              41    71        158     221    122       145        104  211
Central Plateau        0               0    44         51     159    244       275          0  295
Wairarapa              0               1     0          0       2     42         3          2  103
Canterbury             0             262     0        279     279    278       279        245  372
Nelson                 0              50     0         34     216    237       263        163  237
Otago                  0             225     0        226     273    265       277        116  364
Southland              0             114     0        197     249    195       226         98  245
West Coast             0             115     0        227     119    121       115         65  242

That table was generated by:

library(knitr)

fuel_tidy %>%
  group_by(region, Company) %>%
  summarise(freq = n()) %>%
  spread(Company, freq, fill = 0) %>%
  arrange(desc(GULL)) %>%
  kable


Evaluating Olive McBride with the Arkham Horror LCG Chaos Bag Simulator in R

Mon, 08/13/2018 - 17:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

(If you care, there may be spoilers in this post.)

Introduction

I love Arkham Horror: The Card Game. I love it more than I really should; it’s ridiculously fun. It’s a cooperative card game where you build a deck representing a character in the Cthulhu mythos universe, and with that deck you play scenarios in a narrative campaign1 where you grapple with the horrors of the mythos. Your actions in (and between) scenarios have repercussions for how the campaign plays out, changing the story, and you use experience points accumulated in the scenarios to upgrade your deck with better cards.

The game is hard. Really hard. And I don’t mean it’s hard because it has a learning curve (although if you’re not used to deep games like this, it probably does have a learning curve). This is a Cthulhu game, which means that the mythos is going to do everything to beat you down. Your best-laid plans are going to get derailed. The worst thing that could possibly happen to you will happen. This is why this game is unreasonably fun: it’s a game for masochists.

And it makes a damn ton of good stories. I often want to tell the tale of how amazingly awful the events played out, or how I just snatched victory out of the jaws of defeat. Bloomberg recently published an article about the rise of board games in companies. If you’re in one of these companies where employees often play board games together after hours, you should consider adding Arkham Horror (the card game; I’ve never played the board game and I’ve heard mixed reviews) to the mix. It’s a great team builder.

Two elements make Arkham Horror so damn hard and unpredictable: the encounter deck and the chaos bag. The encounter deck is a deck that all players draw from every turn that spawns monsters and awful events. The chaos bag is used for the skill tests you make to advance your board state. You draw tokens from the chaos bag, add the modifier of the revealed token to your skill value for that skill test (most of them are negative numbers), and check the new value against the difficulty of the test; if your modified skill value is at least as great as the difficulty of the test, you pass, otherwise you fail. (You can pitch cards from your hand to increase your skill value before you reveal a token.)


(Image source: Ars Technica)

This is the chaos bag in a nutshell, but the chaos bag does more. The elder sign token represents good luck and often helps your board state; it’s considered a great success. But most icon tokens in the bag are trying to hurt you. There are four icon tokens: the skulls, the cultist, the tablets, and the elder things. These often not only apply a modifier but do something bad to you (which changes between scenarios).

And of course there’s this little bastard.

This is the auto-fail token. You just fail. Thanks for playing.

Simulating the Chaos Bag

Bad things happen in Arkham Horror. Sometimes, though, they feel like they happen a lot. For example, you can draw two auto-fails in a row. Or three. I cannot overstate how devastating that can be in a game. Sometimes it feels like this happens a lot. Sometimes it feels like this particular game went unusually poorly, and unusually poor games seem to happen frequently.

That’s a contradiction, of course. I think that this sense of bad games happening often emerges from humans’ fundamentally poor understanding of probability and how it actually works. I’m not just referring to the mathematics; people’s intuition of what should happen according to probability does not match what probability says happens. This phenomenon is well-documented (see, for example, the book Irrationality, by Stuart Sutherland) and is one of the explanations for why people seemed to underestimate Donald Trump’s chances of winning the 2016 election (in short, unlikely events occur more frequently than people perceive; see also Nate Silver’s series of articles). In fact, you are more likely to find any pattern than no pattern at all (and in Arkham Horror, patterns are usually bad for you).

This perception of “unusually many auto-fails” (which, as a statistician, I know cannot be right no matter what my feelings say) prompted me to write an R function that generates a string of pulls from the chaos bag, each pull independent of the previous pull. Here’s the function:

chaos_bag_string_sim <- function(dicebag = list('E' = 1,  # Elder Sign
                                                'P' = 1,  # +1
                                                '0' = 2,
                                                '1' = 3,  # -1
                                                '2' = 2,  # -2
                                                '3' = 1,
                                                '4' = 1,
                                                'S' = 2,  # Skull
                                                'C' = 1,  # Cultist
                                                'T' = 1,  # Tablet
                                                'L' = 1,  # Elder Thing
                                                'A' = 1), # Autofail
                                 n = 24, replace = TRUE) {
  paste0(sample(names(dicebag), replace = replace,
                prob = as.vector(unlist(dicebag)), size = n),
         collapse = '')
}

Notice that the function takes a list, dicebag. This list uses one-character codes representing chaos bag tokens, and the actual contents of the list are numbers representing the frequency of that token in the bag. The bag that serves as a default is a fictitious chaos bag that I believe is representative of Arkham Horror on standard difficulty. Since most numbered tokens are negative, I denote the +1 token with 'P' and the negative tokens with their numbers.

How many pulls from the bag occur in a game? That's really hard to figure out; in fact, this will vary from scenario to scenario dramatically. So let's just make an educated guess to represent the "typical" game. A round consists of a mythos phase, then three actions per player, followed by a monsters phase and an upkeep phase. The mythos phase often prompts skill tests, and I would guess that the average number of skill tests made during each player's turn is two. We'll guess that about half of the draws from the encounter deck prompt skill tests. A game may last about 16 rounds. This adds up to about 40 skill tests per player per game. As a consequence, a four-player game (I mostly play multiplayer, and the more the merrier) would yield 160 skill tests.
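
Written out as arithmetic (these are the rough assumptions from the paragraph above, not measured values):

# Back-of-envelope count of skill tests in a "typical" game, per the guesses above
tests_per_player_turn <- 2
encounter_tests_per_round <- 0.5   # about half of encounter draws prompt a test
rounds <- 16
players <- 4

per_player <- rounds * (tests_per_player_turn + encounter_tests_per_round)
per_player            # about 40 tests per player per game
per_player * players  # about 160 tests in a four-player game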

Let's simulate a string of 160 skill tests, each independent of the other2.

(game <- chaos_bag_string_sim(n = 160))
[1] "311113E12121E2LS20A32A4ETT1SE3T3P0AT2S12PLSC033CP22A11CLL31L024P12ES24A1S0EE3PSSS12C13T21224104L43PCT13TTCSSASE20SA1TT2SSC2CPC20012CC41S234PPEP101E0LS1P1110TSS1"

I represented pulls from the chaos bag as a string because we can use regular expressions to find patterns in the string. For instance, we can use the expression AA to find all occurrences of double auto-fails in the string.

grepl("AA", game) [1] FALSE

The grepl() function returns a boolean (logical) value identifying whether the pattern appeared in the string; in this game, no two auto-fails appeared in a row.

What about two auto-fails appearing, separated by two other tests? This is matched by the expression A..A (remember, . matches any character).

grepl("A..A", game) [1] TRUE

We can take this further and use our simulator to estimate the probability events occur. After all, in probability, there are two ways to find the probability of events:

  1. Create a probability model and use mathematics to compute the probability an event of interest occurs; for example, use a model of a die roll to compute the probability of rolling a 6
  2. Perform the experiment many times and count how often the event of interest occurred; for example, roll a die 1000 times and count how many times it rolled a 6 (this is usually done on a computer)

While the latter approach produces estimates, these estimates are guaranteed to get better the more simulations are performed. In fact, if you want at least a 95% chance of your estimate being accurate up to two decimal places, you could perform the simulation 10,000 times.3
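
As a rough, hedged check of that claim, the worst-case 95% margin of error for an estimated proportion is 1.96 * sqrt(p * (1 - p) / n), which is largest at p = 0.5:

# Worst-case 95% margin of error for a proportion estimated from n simulations
moe <- function(n, p = 0.5) 1.96 * sqrt(p * (1 - p) / n)
moe(10000)  # about 0.0098, i.e. accurate to roughly two decimal places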

Using this, we can estimate the probability of seeing two auto-fails in a row.

mean(grepl("AA", replicate(10000, {chaos_bag_string_sim(n = 160)}))) [1] 0.4019

There's about a 40% chance of seeing two auto-fail tokens in a row in a single four-player game. That's pretty damn likely.

Let's ask a slightly different question: what is the average number of times we will see two auto-fails in a row in a game? For this we will want to dip into the stringr package, using the str_count() function.

library(stringr)

str_count(game, "AA")
[1] 0

str_count(game, "A..A")  # How many times did we see auto-fails separated by two tests?
[1] 1

Or we can ask how often we saw two "bad" tokens in a row. In Arkham Horror on standard difficulty, players often aim to pass a test when drawing a -2 or better. This means that "bad" tokens include -3, -4, the auto-fail, and often the tablet and elder thing tokens, too. The regular expression pattern that matches "two bad things in a row" is [34TLA]{2} (translation: see either 3, 4, T, L, or A exactly two times).

str_count(game, "[34TLA]{2}") [1] 14

How could we estimate the average number of times this would occur in a four-player game? The simulation trick still works. Pick a large number of simulations to get an accurate estimate; the larger, the more accurate (but also more work for your computer).

mean(str_count(replicate(10000, {chaos_bag_string_sim(n = 160)}), "[34TLA]{2}"))
[1] 10.6717

You can imagine that this can keep going and perhaps there are queries that you would like to see answered. Below I leave you with an R script that you can use to do these types of experiments. This script is designed to be run from a Unix-flavored command line, though you could source() the script into an R session to use the chaos_bag_string_sim() function interactively. Use regular expressions to define a pattern you are interested in (this is a good opportunity to learn regular expressions for those who want the exercise).

#!/usr/bin/Rscript
#######################################
# AHChaosBagSimulator.R
#######################################
# Curtis Miller
# 2018-08-03
# A script for simulating Arkham Horror's chaos bag
#######################################

chaos_bag_string_sim <- function(dicebag = list('E' = 1,  # Elder Sign
                                                'P' = 1,  # +1
                                                '0' = 2,
                                                '1' = 3,  # -1
                                                '2' = 2,  # -2
                                                '3' = 1,
                                                '4' = 1,
                                                'S' = 2,  # Skull
                                                'C' = 1,  # Cultist
                                                'T' = 1,  # Tablet
                                                'L' = 1,  # Elder Thing
                                                'A' = 1), # Autofail
                                 n = 24, replace = TRUE) {
  paste0(sample(names(dicebag), replace = replace,
                prob = as.vector(unlist(dicebag)), size = n),
         collapse = '')
}

# optparse: A package for handling command line arguments
if (!suppressPackageStartupMessages(require("optparse"))) {
  install.packages("optparse")
  require("optparse")
}

main <- function(pattern, average = FALSE, pulls = 24, replications = 10000,
                 no_replacement = FALSE, dicebag = "", help = FALSE) {
  if (dicebag != "") {
    dicebag_df <- read.csv(dicebag, stringsAsFactors = FALSE)
    dicebag_df$token <- as.character(dicebag_df$token)
    if (!is.numeric(dicebag_df$freq)) {stop("Dicebag freq must be integers.")}
    dicebag <- as.list(dicebag_df$freq)
    names(dicebag) <- dicebag_df$token
  } else {
    dicebag <- list('E' = 1,  # Elder Sign
                    'P' = 1,  # +1
                    '0' = 2,
                    '1' = 3,  # -1
                    '2' = 2,  # -2
                    '3' = 1,
                    '4' = 1,
                    'S' = 2,  # Skull
                    'C' = 1,  # Cultist
                    'T' = 1,  # Tablet
                    'L' = 1,  # Elder Thing
                    'A' = 1)  # Autofail
  }

  games <- replicate(replications,
                     {chaos_bag_string_sim(dicebag = dicebag, n = pulls,
                                           replace = !no_replacement)})

  if (average) {
    cat("Average occurrence of pattern:",
        mean(stringr::str_count(games, pattern)), "\n")
  } else {
    cat("Probability of occurrence of pattern:", mean(grepl(pattern, games)), "\n")
  }

  quit()
}

if (sys.nframe() == 0) {
  cl_args <- parse_args(OptionParser(
    description = "Simulates the Arkham Horror LCG chaos bag.",
    option_list = list(
      make_option(c("--pattern", "-r"), type = "character",
                  help = "Pattern (regular expression) to match"),
      make_option(c("--average", "-a"), type = "logical", action = "store_true",
                  default = FALSE,
                  help = "If set, computes average number of occurrences"),
      make_option(c("--pulls", "-p"), type = "integer", default = 24,
                  help = "The number of pulls from the chaos bag"),
      make_option(c("--replications", "-N"), type = "integer", default = 10000,
                  help = "Number of replications"),
      make_option(c("--no-replacement", "-n"), type = "logical",
                  action = "store_true", default = FALSE,
                  help = "Draw tokens without replacement"),
      make_option(c("--dicebag", "-d"), type = "character", default = "",
                  help = "(Optional) dice bag distribution CSV file")
    )
  ))

  names(cl_args)[which(names(cl_args) == "no-replacement")] <- "no_replacement"

  do.call(main, cl_args)
}

Below is an example of a CSV file specifying a chaos bag.

token,freq
E,1
P,1
0,2
1,3
2,2
3,1
4,1
S,2
C,1
T,1
L,1
A,1

Below is some example usage.

$ chmod +x AHChaosBagSimulator.R   # Only need to do this once
$ ./AHChaosBagSimulator.R -r AA -p 160
Probability of occurrence of pattern: 0.4074
$ ./AHChaosBagSimulator.R -r [34TLA]{2} -a -p 160
Average occurrence of pattern: 10.6225

To see documentation, type ./AHChaosBagSimulator.R --help. Note that something must always be passed to -r; this is the pattern needed.

Olive McBride

This tool was initially written as a way to explore chaos bag probabilities. I didn't consider the tool to be very useful. It simply helped make the point that unlikely events seem to happen frequently in Arkham Horror. However, I found a way to put the tool to practical use.

Recently, Arkham Horror's designers have been releasing cards that seem to demand more math to fully understand. My favorite Arkham Horror card reviewer, The Man from Leng, has fretted about this in some of his recent reviews about cards using the seal mechanic, which change the composition of the chaos bag by removing tokens from it.

I can only imagine how he would feel about a card like Olive McBride (one of the cards appearing in the upcoming Mythos Pack, "Heart of the Elders", to be released later this week.)

Olive McBride allows you to reveal three chaos tokens instead of one, and choose two of those tokens to resolve. This effect can be triggered any time you would reveal a chaos token.

I'll repeat that again: Olive can trigger any time a chaos token is revealed. Most of the time tokens are revealed during skill tests, but other events lead to dipping into the chaos bag too. Notably, The Dunwich Legacy leads to drawing tokens without taking skill tests, such as when gambling in "The House Always Wins", due to some encounter cards. This makes Olive McBride a "master gambler", since she can draw three tokens and pick the winning ones when gambling. (She almost breaks the scenario.) Additionally, Olive can be combined with cards like Ritual Candles and Grotesque Statue to further shift skill tests in your favor.

These are interesting situations but let's ignore combos like this for now. What does Olive do to your usual skill test? Specifically, what does she do to the modifiers?

Before going on, I need to address verbiage. When players are about to perform a skill test, they will typically make a statement like "I'm at +2 for this test". This means that the skill value of the investigator (after applying all modifiers) is two higher than the difficulty of the test. Thus, if the investigator draws a -2 or better, the investigator will pass the test; if the investigator draws a -3 or worse, the investigator fails. My play group does not say this; we say "I'm at -2 for this test," meaning that if the investigator sees a -2 or better from the bag, the investigator will pass. This is more intuitive to me, and also I think it's more directly translated to math.

Presumably when doing a skill test with Olive, if all we care about is passing the test, we will pick the two tokens drawn from the bag that have the best modifiers. We add the modifiers of those tokens together to get the final modifier. Whether this improves your odds of passing the test or not isn't immediately clear.

I've written an R function that simulates skill tests with Olive. With this we can estimate Olive's effects.

olive_sim <- function(translate = c("E" = 2, "P" = 1, "0" = 0, "1" = -1,
                                    "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                                    "C" = -3, "T" = -4, "L" = -5, "A" = -Inf),
                      N = 1000, dicebag = NULL) {
  dargs <- list(n = 3, replace = FALSE)
  if (!is.null(dicebag)) {
    dargs$dicebag <- dicebag
  }
  pulls <- replicate(N, {do.call(chaos_bag_string_sim, dargs)})
  vals <- sapply(pulls, function(p) {
    vecp <- strsplit(p, "")[[1]]
    vecnum <- translate[vecp]
    max(apply(combn(3, 2), 2, function(i) {sum(vecnum[i])}))
  })
  vals[which(is.nan(vals))] <- Inf
  return(vals)
}

The parameter translate gives a vector that translates code for chaos tokens to numerical modifiers. Notice that the auto-fail token is assigned -Inf since it will cause any test to fail. If we wanted, say, the elder sign token to auto-succeed (which is Mateo's elder sign effect), we could replace its translation with Inf. By default the function uses the dicebag provided with chaos_bag_string_sim(), but this can be changed.

Here's a single final modifier from a pull using Olive.

olive_sim(N = 1)
C01
 -1

Interestingly, the function returned a named vector, and the name corresponds to what was pulled. In this case, a cultist, 0, and -1 token were pulled; the resulting best modifier is -1. The function is already built to do this many times.

olive_sim(N = 5)
31S 2S1 LS1 120 C2E
 -3  -3  -3  -1   0

Below is a function that can simulate a lot of normal skill tests.

test_sim <- function(translate = c("E" = 2, "P" = 1, "0" = 0, "1" = -1,
                                   "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                                   "C" = -3, "T" = -4, "L" = -5, "A" = -Inf),
                     N = 1000, dicebag = NULL) {
  dargs <- list(n = N, replace = TRUE)
  if (!is.null(dicebag)) {
    dargs$dicebag <- dicebag
  }
  pulls <- do.call(chaos_bag_string_sim, dargs)
  vecp <- strsplit(pulls, "")[[1]]
  vecnum <- translate[vecp]
  return(vecnum)
}

Here's a demonstration.

test_sim(N = 5)
 2  S  3  1  0
-2 -2 -3 -1  0

Finally, below is a function that, given a vector of results like those above, produces a table of estimated probabilities of success at skill tests for a vector of values the tests must beat in order to pass (that is, using the "I'm at -2" type of language).

est_success_table_gen <- function(res, values = (-10):3) {
  r <- sapply(values, function(v) {mean(res >= v)})
  names(r) <- values
  return(rev(r))
}

Let's give it a test run.4

u <- test_sim()
est_success_table_gen(u)
    3     2     1     0    -1    -2    -3    -4    -5    -6    -7    -8 
0.000 0.057 0.127 0.250 0.420 0.657 0.759 0.873 0.931 0.931 0.931 0.931 
   -9   -10 
0.931 0.931 

as.matrix(est_success_table_gen(u))
      [,1]
3    0.000
2    0.057
1    0.127
0    0.250
-1   0.420
-2   0.657
-3   0.759
-4   0.873
-5   0.931
-6   0.931
-7   0.931
-8   0.931
-9   0.931
-10  0.931

This represents the base probability of success. Let's see what Olive does to this table.

y <- olive_sim()
as.matrix(est_success_table_gen(y))
      [,1]
3    0.024
2    0.063
1    0.140
0    0.238
-1   0.423
-2   0.531
-3   0.709
-4   0.829
-5   0.911
-6   0.960
-7   0.983
-8   0.996
-9   1.000
-10  1.000

Perhaps I can make the relationship more clear with a plot.

plot(stepfun((-10):3, rev(c(0, est_success_table_gen(u)))), verticals = FALSE,
     pch = 20, main = "Probability of Success", ylim = c(0, 1))
lines(stepfun((-10):3, rev(c(0, est_success_table_gen(y)))), verticals = FALSE,
      pch = 20, col = "blue")

The black line is the estimated probability of success without Olive, and the blue line is the same with Olive. (I tried reversing the x-axis, but for some reason I could not get good results.) What we see is:

  • Olive improves the chances that a "hail Mary" will succeed. If you need +1, +2, or more to succeed, Olive can help make that happen (though the odds still aren't great).
  • Olive can help you guarantee a skill test will succeed. If you boost your skill value to very high numbers, Olive can effectively neuter the auto-fail token. That's a good feeling.
  • Otherwise, Olive hurts your chances of success. Being at -2 is noticeably worse with Olive than without. Most of the time, though, she changes the probability of success too little to notice.

For most investigators, then, Olive doesn't do much to make her worth your while. But I think Olive makes a huge difference for two investigators: Father Mateo and Jim Culver.

Both of these investigators like some chaos bag tokens a lot. Father Mateo really likes the elder sign since it is an auto-success (in addition to other very good effects), while Jim Culver likes skulls since they are always 0.

What does Olive do for Father Mateo?

translate <- c("E" = 2, "P" = 1, "0" = 0, "1" = -1, "2" = -2, "3" = -3, "4" = -4, "S" = -2, "C" = -3, "T" = -4, "L" = -5, "A" = -Inf) # Mateo mateo_translate <- translate; mateo_translate["E"] <- Inf mateo_no_olive <- test_sim(translate = mateo_translate) mateo_olive <- olive_sim(translate = mateo_translate) plot(stepfun((-10):3, rev(c(0, est_success_table_gen(mateo_no_olive)))), verticals = FALSE, pch = 20, main = "Mateo's Probabilities", ylim = c(0, 1)) lines(stepfun((-10):3, rev(c(0, est_success_table_gen(mateo_olive)))), verticals = FALSE, col = "blue", pch = 20)

as.matrix(est_success_table_gen(mateo_no_olive))
      [,1]
3    0.049
2    0.049
1    0.101
0    0.226
-1   0.393
-2   0.627
-3   0.745
-4   0.862
-5   0.928
-6   0.928
-7   0.928
-8   0.928
-9   0.928
-10  0.928

as.matrix(est_success_table_gen(mateo_olive))
      [,1]
3    0.177
2    0.177
1    0.218
0    0.278
-1   0.444
-2   0.554
-3   0.746
-4   0.842
-5   0.924
-6   0.972
-7   0.989
-8   0.997
-9   1.000
-10  1.000

Next up is Jim Culver.

# Culver
culver_translate <- translate; culver_translate["S"] <- 0
culver_no_olive <- test_sim(translate = culver_translate)
culver_olive <- olive_sim(translate = culver_translate)

plot(stepfun((-10):3, rev(c(0, est_success_table_gen(culver_no_olive)))),
     verticals = FALSE, pch = 20, main = "Culver's Probabilities", ylim = c(0, 1))
lines(stepfun((-10):3, rev(c(0, est_success_table_gen(culver_olive)))),
      verticals = FALSE, col = "blue", pch = 20)

as.matrix(est_success_table_gen(culver_no_olive))
      [,1]
3    0.000
2    0.065
1    0.112
0    0.333
-1   0.530
-2   0.638
-3   0.761
-4   0.874
-5   0.936
-6   0.936
-7   0.936
-8   0.936
-9   0.936
-10  0.936

as.matrix(est_success_table_gen(culver_olive))
      [,1]
3    0.020
2    0.095
1    0.217
0    0.362
-1   0.564
-2   0.666
-3   0.803
-4   0.888
-5   0.951
-6   0.972
-7   0.992
-8   0.999
-9   1.000
-10  1.000

Olive helps these investigators succeed at skill tests more easily, especially Mateo. We haven't even taken account of the fact that good things happen when certain tokens appear for these investigators! Sealing tokens could also have a big impact on the distribution of the bag when combined with Olive McBride.
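For example, here is a rough sketch of how sealing could be explored with the functions above. It assumes chaos_bag_string_sim() accepts a named frequency vector through its dicebag argument (the way main() in the script below builds one from a CSV), and the token counts are invented rather than taken from a real scenario:

# a made-up bag: token codes as names, counts as values
bag <- c("E" = 1, "P" = 1, "0" = 2, "1" = 2, "2" = 2, "3" = 1, "4" = 1,
         "S" = 2, "C" = 1, "T" = 1, "L" = 1, "A" = 1)

# seal one skull (a token-sealing card removes it from the bag)
sealed <- bag
sealed["S"] <- sealed["S"] - 1

# compare Olive's success tables with and without the sealed skull
as.matrix(est_success_table_gen(olive_sim(dicebag = bag)))
as.matrix(est_success_table_gen(olive_sim(dicebag = sealed)))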

Again, there's a lot more that could be touched on that I won't cover here, so I will share with you a script allowing you to do some of these analyses yourself.

#!/usr/bin/Rscript
#######################################
# AHOliveDistributionEstimator.R
#######################################
# Curtis Miller
# 2018-08-03
# Simulates the chaos bag distribution when applying Olive McBride
#######################################

# optparse: A package for handling command line arguments
if (!suppressPackageStartupMessages(require("optparse"))) {
  install.packages("optparse")
  require("optparse")
}

olive_sim <- function(translate = c("E" = 2, "P" = 1, "0" = 0, "1" = -1,
                                    "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                                    "C" = -3, "T" = -4, "L" = -5, "A" = -Inf),
                      N = 1000, dicebag = NULL) {
  dargs <- list(n = 3, replace = FALSE)
  if (!is.null(dicebag)) {
    dargs$dicebag <- dicebag
  }

  pulls <- replicate(N, {do.call(chaos_bag_string_sim, dargs)})

  vals <- sapply(pulls, function(p) {
    vecp <- strsplit(p, "")[[1]]
    vecnum <- translate[vecp]
    max(apply(combn(3, 2), 2, function(i) {sum(vecnum[i])}))
  })

  vals[which(is.nan(vals))] <- Inf
  return(vals)
}

test_sim <- function(translate = c("E" = 2, "P" = 1, "0" = 0, "1" = -1,
                                   "2" = -2, "3" = -3, "4" = -4, "S" = -2,
                                   "C" = -3, "T" = -4, "L" = -5, "A" = -Inf),
                     N = 1000, dicebag = NULL) {
  dargs <- list(n = N, replace = TRUE)
  if (!is.null(dicebag)) {
    dargs$dicebag <- dicebag
  }

  pulls <- do.call(chaos_bag_string_sim, dargs)
  vecp <- strsplit(pulls, "")[[1]]
  vecnum <- translate[vecp]
  return(vecnum)
}

est_success_table_gen <- function(res, values = (-10):3) {
  r <- sapply(values, function(v) {mean(res >= v)})
  names(r) <- values
  return(rev(r))
}

# Main function
# See definition of cl_args for functionality (help does nothing)
main <- function(dicebag = "", translate = "", replications = 10000,
                 plotfile = "", width = 6, height = 4, basic = FALSE,
                 oliveless = FALSE, lowest = -10, highest = 3,
                 chaos_bag_script = "AHChaosBagSimulator.R",
                 title = "Probability of Success", pos = "topright",
                 help = FALSE) {
  source(chaos_bag_script)

  if (dicebag != "") {
    dicebag_df <- read.csv(dicebag, stringsAsFactors = FALSE)
    dicebag <- as.numeric(dicebag_df$freq)
    names(dicebag) <- dicebag_df$token
  } else {
    dicebag <- NULL
  }

  if (translate != "") {
    translate_df <- read.csv(translate, stringsAsFactors = FALSE)
    translate <- as.numeric(translate_df$mod)
    names(translate) <- translate_df$token
  } else {
    translate <- c("E" = 2, "P" = 1, "0" = 0, "1" = -1, "2" = -2, "3" = -3,
                   "4" = -4, "S" = -2, "C" = -3, "T" = -4, "L" = -5,
                   "A" = -Inf)
  }

  if (any(names(translate) != names(dicebag))) {
    stop("Name mismatch between translate and dicebag; check the token columns")
  }

  if (!oliveless) {
    olive_res <- olive_sim(translate = translate, dicebag = dicebag,
                           N = replications)
    cat("Table of success rate with Olive when at least X is needed:\n")
    print(as.matrix(est_success_table_gen(olive_res, values = lowest:highest)))
    cat("\n\n")
  }

  if (basic) {
    basic_res <- test_sim(translate = translate, dicebag = dicebag,
                          N = replications)
    cat("Table of success rate for basic test when at least X is needed:\n")
    print(as.matrix(est_success_table_gen(basic_res, values = lowest:highest)))
    cat("\n\n")
  }

  if (plotfile != "") {
    png(plotfile, width = width, height = height, units = "in", res = 300)
    if (basic) {
      plot(stepfun(lowest:highest,
                   rev(c(0, est_success_table_gen(basic_res)))),
           verticals = FALSE, pch = 20, main = title, ylim = c(0, 1))
      if (!oliveless) {
        lines(stepfun(lowest:highest,
                      rev(c(0, est_success_table_gen(olive_res)))),
              verticals = FALSE, pch = 20, col = "blue")
        legend(pos, legend = c("No Olive", "Olive"),
               col = c("black", "blue"), lty = 1, pch = 20)
      }
    }
    dev.off()
  }

  quit()
}

if (sys.nframe() == 0) {
  cl_args <- parse_args(OptionParser(
    description = "Estimate the chaos bag distribution with Olive McBride.",
    option_list = list(
      make_option(c("--dicebag", "-d"), type = "character", default = "",
                  help = "(Optional) dice bag distribution CSV file"),
      make_option(c("--translate", "-t"), type = "character", default = "",
                  help = "(Optional) symbol numeric translation CSV file"),
      make_option(c("--replications", "-r"), type = "integer", default = 10000,
                  help = "Number of replications to perform"),
      make_option(c("--plotfile", "-p"), type = "character", default = "",
                  help = paste("Where to save plot (if not set, no plot",
                               "made; -b/--basic must be set)")),
      make_option(c("--width", "-w"), type = "integer", default = 6,
                  help = "Width of plot (inches)"),
      make_option(c("--height", "-H"), type = "integer", default = 4,
                  help = "Height of plot (inches)"),
      make_option(c("--basic", "-b"), type = "logical", action = "store_true",
                  default = FALSE,
                  help = "Include results for test without Olive McBride"),
      make_option(c("--oliveless", "-o"), type = "logical",
                  action = "store_true", default = FALSE,
                  help = "Don't include results using Olive McBride"),
      make_option(c("--lowest", "-l"), type = "integer", default = -10,
                  help = "Lowest value to check"),
      make_option(c("--highest", "-i"), type = "integer", default = 3,
                  help = "Highest value to check"),
      make_option(c("--pos", "-s"), type = "character", default = "topright",
                  help = "Position of legend"),
      make_option(c("--title", "-m"), type = "character",
                  default = "Probability of Success", help = "Title of plot"),
      make_option(c("--chaos-bag-script", "-c"), type = "character",
                  default = "AHChaosBagSimulator.R",
                  help = paste("Location of the R file containing",
                               "chaos_bag_string_sim() definition; by",
                               "default, assumed to be in the working",
                               "directory in AHChaosBagSimulator.R"))
    )
  ))

  names(cl_args)[which(
    names(cl_args) == "chaos-bag-script")] = "chaos_bag_script"

  do.call(main, cl_args)
}

You've already seen an example CSV file for defining the dice bag; here's a file for defining what each token is worth.

token,mod
E,2
P,1
0,0
1,-1
2,-2
3,-3
4,-4
S,-2
C,-3
T,-4
L,-5
A,-Inf

Make the script executable and get help like so:

$ chmod +x AHOliveDistributionEstimator.R
$ ./AHOliveDistributionEstimator.R -h

Conclusion

If you've never heard of this game I love, hopefully you've heard of it now. Give the game a look! And if you have heard of this game before, I hope you learned something from this post. If you're a reviewer, perhaps I've given you some tools you could use to help evaluate some of Arkham Horror's more intractable cards.

My final thoughts on Olive: she's not going to displace Arcane Initiate's spot as the best Mystic ally, but she could do well in certain decks. Specifically, I think that Mateo decks and Jim Culver decks planning on running Song of the Dead will want to run her since there are specific chaos tokens those decks want to see; the benefits of Olive extend beyond changing the probability of success. Most of the time you will not want to make a hail Mary skill test and you won't have the cards to push your skill value to a point where anything but the auto-fail is a success, so most of the time Olive will hurt your chances rather than help you, if she has any effect at all. Thus a typical Mystic likely won't find Olive interesting… but some decks will love her.

By the way, if you are in the Salt Lake City area of Utah, I play Arkham Horror LCG at Mind Games, LLC. While I haven't been to many other stores, Mind Games seems to have the best stocking of Arkham Horror (as well as two other Fantasy Flight LCGs, A Game of Thrones and Legend of the Five Rings). Every Tuesday night (the store's late night) a group of card game players come in to play; consider joining us! In addition, Mind Games likely has the best deal when it comes to buying the game; whenever you spend $50 on product, you get an additional $10 off (or, alternatively, you take $10 off for every $60 you spend). Thus Mind Games could be the cheapest way to get the game (without going second-hand or Internet deal hunting).

(Big image credit: Aurore Folney and FFG.)

EDIT: WordPress.com does not like R code and garbled some of the code in the script for Olive simulation. It should be correct now. If anyone spots other errors, please notify me; I will fix them.

I have created a video course published by Packt Publishing entitled Applications of Statistical Learning with Python, the fourth volume in a four-volume set of video courses entitled, Taming Data with Python; Excelling as a Data Analyst. This course discusses how to use Python for data science, emphasizing application. It starts with introducing natural language processing and computer vision. It ends with two case studies; in one, I train a classifier to detect spam e-mail, while in the other, I train a computer vision system to detect emotions in faces. Viewers get a hands-on experience using Python for machine learning. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.

  1. You can also play scenarios stand-alone, which isn’t nearly as fun as playing in a campaign, especially one of the eight-scenario campaigns. 
  2. This is not right because sometimes you redraw tokens from the bag without replacement; we'll ignore that case for now. 
  3. Using the asymptotically appropriate confidence interval based on the Normal distribution (as described here), the margin of error will not exceed $\frac{z_{0.975}}{2\sqrt{n}}$ since $\sqrt{\hat{p}(1 - \hat{p})} \leq \frac{1}{2}$. Thus, we have $\text{MOE} \leq \frac{z_{0.975}}{2\sqrt{n}}$; solving this for $n$ yields $n \geq \left(\frac{z_{0.975}}{2\,\text{MOE}}\right)^2$. So if we want a 95% chance that the margin of error will not exceed the desired precision, we need a dataset of size at least $\left(\frac{z_{0.975}}{2\,\text{MOE}}\right)^2$. 
  4. Note that the earlier calculations arguing that the odds of being accurate up to two decimal places were high no longer hold, since many more numbers are being estimated and the estimates are not independent of one another. That said, the general principle that more simulations lead to better accuracy still holds. 

To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

IML and H2O: Machine Learning Model Interpretability And Feature Explanation

Mon, 08/13/2018 - 14:47

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

Model interpretability is critical to businesses. If you want to use high performance models (GLM, RF, GBM, Deep Learning, H2O, Keras, xgboost, etc), you need to learn how to explain them. With machine learning interpretability growing in importance, several R packages designed to provide this capability are gaining in popularity. We analyze the IML package in this article.

In recent blog posts we assessed LIME for model agnostic local interpretability functionality and DALEX for both local and global machine learning explanation plots. This post examines the iml package (short for Interpretable Machine Learning) to assess its functionality in providing machine learning interpretability to help you determine if it should become part of your preferred machine learning toolbox.

We again utilize the high performance machine learning library, h2o, implementing three popular black-box modeling algorithms: GLM (generalized linear models), RF (random forest), and GBM (gradient boosted machines). For those that want a deep dive into model interpretability, the creator of the iml package, Christoph Molnar, has put together a free book: Interpretable Machine Learning. Check it out.

Articles In The Model Interpretability Series

Articles related to machine learning and black-box model interpretability:

Awesome Data Science Tutorials with LIME for black-box model explanation in business:

Transform Your Data Science Abilities In 10 Weeks

If you’re interested in learning how to apply critical thinking and data science while solving a real-world business problem following an end-to-end data science project, check out Data Science For Business (DS4B 201-R). Over the course of 10 weeks you will solve an end-to-end Employee Churn data science project following our systematic Business Science Problem Framework.



Start Learning Today!

We have two new course offerings coming soon! Shiny Web Apps and Python And Spark For Customer Churn! Get started with Business Science University.

FREE BOOK: Interpretable Machine Learning

The creator of the IML (Interpretable Machine Learning) package has a great FREE resource for those interested in applying model interpretability techniques to complex, black-box models (high performance models). Check out the book: Interpretable Machine Learning by Christoph Molnar.



Interpretable Machine Learning Book by Christoph Molnar

Learning Trajectory

We’ll cover the following topics on iml, combining with the h2o high performance machine learning package:

IML and H2O: Machine Learning Model Interpretability And Feature Explanation

By Brad Boehmke, Director of Data Science at 84.51°

Advantages & disadvantages

The iml package is probably the most robust ML interpretability package available. It provides both global and local model-agnostic interpretation methods. Although the interaction functions are notably slow, the other functions are faster or comparable to existing packages we use or have tested. I definitely recommend adding iml to your preferred ML toolkit. The following provides a quick list of some of its pros and cons:

Advantages

  • ML model and package agnostic: can be used for any supervised ML model (many features are only relevant to regression and binary classification problems).
  • Variable importance: uses a permutation-based approach for variable importance, which is model agnostic, and accepts any loss function to assess importance.
  • Partial dependence plots: Fast PDP implementation and allows for ICE curves.
  • H-statistic: one of only a few implementations to allow for assessing interactions.
  • Local interpretation: provides both LIME and Shapley implementations.
  • Plots: built with ggplot2 which allows for easy customization

Disadvantages

  • Does not allow for easy comparisons across models like DALEX.
  • The H-statistic interaction functions do not scale well to wide data (many predictor variables).
  • Only provides permutation-based variable importance scores (which become slow as number of features increase).
  • The LIME implementation has less flexibility and fewer features than lime.
Replication Requirements

Libraries

I leverage the following packages:

# load required packages
library(rsample)   # data splitting
library(ggplot2)   # allows extension of visualizations
library(dplyr)     # basic data transformation
library(h2o)       # machine learning modeling
library(iml)       # ML interpretation

# initialize h2o session
h2o.no_progress()
h2o.init()
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\mdanc\AppData\Local\Temp\RtmpotQxdq/h2o_mdanc_started_from_r.out
##     C:\Users\mdanc\AppData\Local\Temp\RtmpotQxdq/h2o_mdanc_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: . Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         3 seconds 250 milliseconds 
##     H2O cluster version:        3.16.0.2 
##     H2O cluster version age:    8 months and 13 days !!! 
##     H2O cluster name:           H2O_started_from_R_mdanc_rkf359 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.53 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.4 (2018-03-15)

Data

To demonstrate iml’s capabilities we’ll use the employee attrition data that has been included in the rsample package. This demonstrates a binary classification problem (“Yes” vs. “No”) but the same process that you’ll observe can be used for a regression problem.
I perform a few house cleaning tasks on the data prior to converting to an h2o object and splitting.

NOTE: The surrogate tree function uses partykit::ctree, which requires all predictors to be numeric or factors. Consequently, we need to coerce any character predictors to factors (or ordinal encode), as in the one-liner sketched below.
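The attrition data handled below only needs its ordered factors adjusted, but if your own data contains character columns, a quick (hypothetical) coercion along these lines would satisfy that requirement before converting to an H2OFrame:

# coerce any character predictors to factors ("your_df" is a stand-in for your own data)
your_df <- dplyr::mutate_if(your_df, is.character, as.factor)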

# classification data
df <- rsample::attrition %>% 
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>% factor(levels = c("1", "0")))

# convert to h2o object
df.h2o <- as.h2o(df)

# create train, validation, and test splits
set.seed(123)
splits <- h2o.splitFrame(df.h2o, ratios = c(.7, .15),
                         destination_frames = c("train", "valid", "test"))
names(splits) <- c("train", "valid", "test")

# variable names for response & features
y <- "Attrition"
x <- setdiff(names(df), y)

H2O Models

We will explore how to visualize a few of the more common machine learning algorithms implemented with h2o. For brevity I train default models and do not emphasize hyperparameter tuning. The following produces a regularized logistic regression, random forest, and gradient boosting machine models.

# elastic net model
glm <- h2o.glm(
  x = x, 
  y = y, 
  training_frame = splits$train,
  validation_frame = splits$valid,
  family = "binomial",
  seed = 123
  )

# random forest model
rf <- h2o.randomForest(
  x = x, 
  y = y, 
  training_frame = splits$train,
  validation_frame = splits$valid,
  ntrees = 1000,
  stopping_metric = "AUC",    
  stopping_rounds = 10,         
  stopping_tolerance = 0.005,
  seed = 123
  )

# gradient boosting machine model
gbm <-  h2o.gbm(
  x = x, 
  y = y, 
  training_frame = splits$train,
  validation_frame = splits$valid,
  ntrees = 1000,
  stopping_metric = "AUC",    
  stopping_rounds = 10,         
  stopping_tolerance = 0.005,
  seed = 123
  )

All of the models provide AUCs ranging between 0.75 and 0.79.

# model performance
h2o.auc(glm, valid = TRUE)
## [1] 0.7870935

h2o.auc(rf, valid = TRUE)
## [1] 0.7681021

h2o.auc(gbm, valid = TRUE)
## [1] 0.7468242

Although these models have distinct AUC scores, our objective is to understand how these models come to this conclusion in similar or different ways based on underlying logic and data structure.

IML procedures

In order to work with iml, we need to adapt our data a bit so that we have the following three components:

  1. Create a data frame with just the features (must be of class data.frame, cannot be an H2OFrame or other class).

  2. Create a vector with the actual responses (must be numeric – 0/1 for binary classification problems).

  3. iml has internal support for some machine learning packages (i.e. mlr, caret, randomForest). However, to use iml with several of the more popular packages being used today (i.e. h2o, ranger, xgboost) we need to create a custom function that will take a data set (again must be of class data.frame) and provide the predicted values as a vector.

# 1. create a data frame with just the features
features <- as.data.frame(splits$valid) %>% select(-Attrition)

# 2. Create a vector with the actual responses
response <- as.numeric(as.vector(splits$valid$Attrition))

# 3. Create custom predict function that returns the predicted values as a
#    vector (probability of attrition in our example)
pred <- function(model, newdata) {
  results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
  return(results[[3L]])
}

# example of prediction output
pred(rf, features) %>% head()
## [1] 0.18181818 0.27272727 0.06060606 0.54545455 0.03030303 0.42424242

Once we have these three components we can create a predictor object. Similar to DALEX and lime, the predictor object holds the model, the data, and the class labels to be applied to downstream functions. A unique characteristic of the iml package is that it uses R6 classes, which is rather rare. The main differences between R6 classes and the normal S3 and S4 classes we typically work with are (a toy example follows the list below):

  • Methods belong to objects, not generics (we’ll see this in the next code chunk).
  • Objects are mutable: the usual copy-on-modify semantics do not apply (we’ll see this later in this tutorial).
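Here is a minimal sketch of those two points. The Counter class below is invented purely for demonstration (it has nothing to do with iml) and only requires the R6 package:

library(R6)

# a minimal R6 class: add() is a method that belongs to the object
Counter <- R6Class("Counter",
  public = list(
    n = 0,
    add = function(by = 1) {
      self$n <- self$n + by  # modifies the object in place
      invisible(self)
    }
  )
)

x <- Counter$new()  # construct the object with its new() method
y <- x              # y is a reference to the same object, not a copy
x$add(5)
y$n                 # also 5: no copy-on-modify
## [1] 5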

These properties make R6 objects behave more like objects in programming languages such as Python. So to construct a new Predictor object, you call the new() method which belongs to the R6 Predictor object and you use $ to access new():

# create predictor object to pass to explainer functions
predictor.glm <- Predictor$new(
  model = glm, 
  data = features, 
  y = response, 
  predict.fun = pred,
  class = "classification"
  )

predictor.rf <- Predictor$new(
  model = rf, 
  data = features, 
  y = response, 
  predict.fun = pred,
  class = "classification"
  )

predictor.gbm <- Predictor$new(
  model = gbm, 
  data = features, 
  y = response, 
  predict.fun = pred,
  class = "classification"
  )

# structure of predictor object
str(predictor.gbm)
## Classes 'Predictor', 'R6'
##   Public:
##     class: classification
##     clone: function (deep = FALSE) 
##     data: Data, R6
##     initialize: function (model = NULL, data, predict.fun = NULL, y = NULL, class = NULL) 
##     model: H2OBinomialModel
##     predict: function (newdata) 
##     prediction.colnames: NULL
##     prediction.function: function (newdata) 
##     print: function () 
##     task: unknown
##   Private:
##     predictionChecked: FALSE

Global interpretation

iml provides a variety of ways to understand our models from a global perspective:

  1. Feature Importance
  2. Partial Dependence
  3. Measuring Interactions
  4. Surrogate Model

We’ll go through each.

1. Feature importance

We can measure how important each feature is for the predictions with FeatureImp. The feature importance measure works by calculating the increase of the model’s prediction error after permuting the feature. A feature is “important” if permuting its values increases the model error, because the model relied on the feature for the prediction. A feature is “unimportant” if permuting its values keeps the model error unchanged, because the model ignored the feature for the prediction. This model agnostic approach is based on (Breiman, 2001; Fisher et al, 2018) and follows the given steps:

For any given loss function do
1: compute loss function for original model
2: for variable i in {1,...,p} do
     | randomize values
     | apply given ML model
     | estimate loss function
     | compute feature importance (permuted loss / original loss)
   end
3. Sort variables by descending feature importance
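Before handing this off to FeatureImp, a bare-bones version of that loop can be written directly. The helper below is invented for illustration only (it is not the iml implementation) and assumes a plain data frame of features, a numeric response, and a predict-style wrapper such as the pred() function defined earlier:

# toy permutation importance: permuted loss / original loss for each feature
perm_importance <- function(features, response, predict_fun, model,
                            loss = function(y, yhat) mean((y - yhat)^2)) {
  base_loss <- loss(response, predict_fun(model, features))
  ratios <- sapply(names(features), function(feat) {
    shuffled <- features
    shuffled[[feat]] <- sample(shuffled[[feat]])  # randomize one column
    loss(response, predict_fun(model, shuffled)) / base_loss
  })
  sort(ratios, decreasing = TRUE)
}

# e.g. perm_importance(features, response, pred, rf)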

We see that all three models find OverTime as the most influential variable; however, after that each model finds unique structure and signals within the data. Note: you can extract the results with imp.rf$results.

# compute feature importance with specified loss metric
imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")
imp.rf <- FeatureImp$new(predictor.rf, loss = "mse")
imp.gbm <- FeatureImp$new(predictor.gbm, loss = "mse")

# plot output
p1 <- plot(imp.glm) + ggtitle("GLM")
p2 <- plot(imp.rf) + ggtitle("RF")
p3 <- plot(imp.gbm) + ggtitle("GBM")

gridExtra::grid.arrange(p1, p2, p3, nrow = 1)

Permutation-based approaches can become slow as your number of predictors grows. To assess variable importance for all 3 models in this example takes only 8 seconds. However, performing the same procedure on a data set with 80 predictors (AmesHousing::make_ames()) takes roughly 3 minutes. Although this is slower, it is comparable to other permutation-based implementations (i.e. DALEX, ranger).

system.time({
  imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")
  imp.rf <- FeatureImp$new(predictor.rf, loss = "mse")
  imp.gbm <- FeatureImp$new(predictor.gbm, loss = "mse")
})
##   user  system elapsed 
##  2.982   0.112   8.164

The following lists some advantages and disadvantages to iml’s feature importance procedure.

Advantages:

  • Model agnostic
  • Simple interpretation that’s comparable across models
  • Can apply any loss function (accepts custom loss functions as well)
  • Plot output uses ggplot2; we can add to or use our internal branding packages with it

Disadvantages:

  • Permutation-based methods are slow
  • Default plot contains all predictors; however, we can subset $results data frame if desired
2. Partial dependence

The Partial class implements partial dependence plots (PDPs) and individual conditional expectation (ICE) curves. The procedure follows the traditional methodology documented in Friedman (2001) and Goldstein et al. (2014) where the ICE curve for a certain feature illustrates the predicted value for each observation when we force each observation to take on the unique values of that feature. The PDP curve represents the average prediction across all observations.

For a selected predictor (x)
1. Determine grid space of j evenly spaced values across distribution of x
2: for value i in {1,...,j} of grid space do
     | set x to i for all observations
     | apply given ML model
     | estimate predicted value
     | if PDP: average predicted values across all observations
   end

The following produces “ICE boxplots” and a PDP line (connecting the averages of all observations for each response class) for the most important variable in all three models (OverTime). All three models show a sizable increase in the probability of employees attriting when working overtime. However, you will notice the random forest model experiences less of an increase in probability compared to the other two models.

glm.ot <- Partial$new(predictor.glm, "OverTime") %>% plot() + ggtitle("GLM")
rf.ot <- Partial$new(predictor.rf, "OverTime") %>% plot() + ggtitle("RF")
gbm.ot <- Partial$new(predictor.gbm, "OverTime") %>% plot() + ggtitle("GBM")

gridExtra::grid.arrange(glm.ot, rf.ot, gbm.ot, nrow = 1)

For continuous predictors you can reduce the grid space to make computation time more efficient and center the ICE curves. Note: to produce the centered ICE curves (right plot) you use ice$center and provide it the value to center on. This will modify the existing object in place (recall this is a unique characteristic of R6 –> objects are mutable). The following compares the marginal impact of age on the probability of attriting. The regularized regression model shows a monotonic decrease in the probability (the log-odds probability is linear) while the two tree-based approaches capture the non-linear, non-monotonic relationship.

# GLM model
glm.age <- Partial$new(predictor.glm, "Age", ice = TRUE, grid.size = 50)
glm.age$center(min(features$Age))
p1 <- plot(glm.age) + ggtitle("GLM")

# RF model
rf.age <- Partial$new(predictor.rf, "Age", ice = TRUE, grid.size = 50)
rf.age$center(min(features$Age))
p2 <- plot(rf.age) + ggtitle("RF")

# GBM model
gbm.age <- Partial$new(predictor.gbm, "Age", ice = TRUE, grid.size = 50)
gbm.age$center(min(features$Age))
p3 <- plot(gbm.age) + ggtitle("GBM")

gridExtra::grid.arrange(p1, p2, p3, nrow = 1)

Similar to pdp you can also compute and plot 2-way interactions. Here we assess how the interaction of MonthlyIncome and OverTime influences the predicted probability of attrition for all three models.

p1 <- Partial$new(predictor.glm, c("MonthlyIncome", "OverTime")) %>%
  plot() + ggtitle("GLM") + ylim(c(0, .4))
p2 <- Partial$new(predictor.rf, c("MonthlyIncome", "OverTime")) %>%
  plot() + ggtitle("RF") + ylim(c(0, .4))
p3 <- Partial$new(predictor.gbm, c("MonthlyIncome", "OverTime")) %>%
  plot() + ggtitle("GBM") + ylim(c(0, .4))

gridExtra::grid.arrange(p1, p2, p3, nrow = 1)

The following lists some advantages and disadvantages to iml’s PDP and ICE procedures.

Advantages:

  • Provides PDPs & ICE curves (unlike DALEX)
  • Allows you to center ICE curves
  • Computationally efficient
  • grid.size allows you to increase/reduce grid space of xi values
  • Rug option illustrates distribution of all xi values
  • Provides convenient plot outputs for categorical predictors

Disadvantages:

  • Only provides heatmap plot of 2-way interaction plots
  • Does not allow for easy comparison across models like DALEX
3. Measuring Interactions

A wonderful feature provided by iml is to measure how strongly features interact with each other. To measure interaction, iml uses the H-statistic proposed by Friedman and Popescu (2008). The H-statistic measures how much of the variation of the predicted outcome depends on the interaction of the features. There are two approaches to measure this. The first measures if a feature (xi) interacts with any other feature. The algorithm performs the following steps:

1: for variable i in {1,...,p} do
     | f(x) = estimate predicted values with original model
     | pd(x) = partial dependence of variable i
     | pd(!x) = partial dependence of all features excluding i
     | upper = sum(f(x) - pd(x) - pd(!x))
     | lower = variance(f(x))
     | rho = upper / lower
   end
2. Sort variables by descending rho (interaction strength)

The interaction strength (rho) will be 0 when there is no interaction at all and 1 if all of the variation of the predicted outcome depends on a given interaction. All three models capture different interaction structures, although some commonalities exist across models (i.e. OverTime, Age, JobRole). The interaction effects are stronger in the tree-based models than in the GLM model, with the GBM model having the strongest interaction effect of 0.4.

# identify variables with largest interactions in each model
interact.glm <- Interaction$new(predictor.glm) %>% plot() + ggtitle("GLM")
interact.rf <- Interaction$new(predictor.rf) %>% plot() + ggtitle("RF")
interact.gbm <- Interaction$new(predictor.gbm) %>% plot() + ggtitle("GBM")

# plot
gridExtra::grid.arrange(interact.glm, interact.rf, interact.gbm, nrow = 1)

Considering OverTime exhibits the strongest interaction signal, the next question is: with which other variable is this interaction the strongest? The second interaction approach measures the 2-way interaction strength of features xi and xj and performs the following steps:

1: i = a selected variable of interest
2: for remaining variables j in {1,...,p} do
     | pd(ij) = interaction partial dependence of variables i and j
     | pd(i) = partial dependence of variable i
     | pd(j) = partial dependence of variable j
     | upper = sum(pd(ij) - pd(i) - pd(j))
     | lower = variance(pd(ij))
     | rho = upper / lower
   end
3. Sort interaction relationship by descending rho (interaction strength)

The following measures the two-way interactions of all variables with the OverTime variable. The two tree-based models show MonthlyIncome having the strongest interaction (although it is a weak interaction since rho < 0.13). Identifying these interactions can be useful to understand which variables create co-dependencies in our models' behavior. It also helps us identify interactions to visualize with PDPs (which is why I showed the example of the OverTime and MonthlyIncome interaction PDP earlier).

# identify variables with largest interactions w/OverTime
interact.glm <- Interaction$new(predictor.glm, feature = "OverTime") %>% plot()
interact.rf <- Interaction$new(predictor.rf, feature = "OverTime") %>% plot()
interact.gbm <- Interaction$new(predictor.gbm, feature = "OverTime") %>% plot()

# plot
gridExtra::grid.arrange(interact.glm, interact.rf, interact.gbm, nrow = 1)

The H-statistic is not widely implemented, so having this feature in iml is beneficial. However, it's important to note that as your feature set grows, the H-statistic becomes computationally slow. For this data set, measuring the interactions across all three models took only 45 seconds, and the two-way interactions took 68 seconds. However, for a wider data set such as AmesHousing::make_ames(), where there are 80 predictors, this will take upwards of an hour to compute.

4. Surrogate model

Another way to make the models more interpretable is to replace the “black box” model with a simpler model (aka “white box” model) such as a decision tree. This is known as a surrogate model in which we

1. apply original model and get predictions
2. choose an interpretable "white box" model (linear model, decision tree)
3. Train the interpretable model on the original dataset and its predictions
4. Measure how well the surrogate model replicates the prediction of the black box model
5. Interpret / visualize the surrogate model
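As a concept check, the same recipe can be sketched without iml by fitting an ordinary shallow regression tree to the black-box predictions. This is illustration only and uses rpart rather than the conditional inference trees iml relies on:

library(rpart)

# step 1: get the black-box (GBM) predictions for the features
surrogate_data <- features
surrogate_data$gbm_pred <- pred(gbm, features)

# steps 2-3: train a simple, interpretable tree on those predictions
surrogate_tree <- rpart(gbm_pred ~ ., data = surrogate_data,
                        control = rpart.control(maxdepth = 3))

# step 4: how well does the white box replicate the black box?
cor(predict(surrogate_tree, surrogate_data), surrogate_data$gbm_pred)^2

# step 5: interpret / visualize, e.g. with rpart.plot::rpart.plot(surrogate_tree)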

iml provides a simple decision tree surrogate approach, which leverages partykit::ctree. In this example we train a decision tree with a max depth of 3 on our GBM model. You can see that the white box model does not do a good job of explaining the black box predictions (R^2 = 0.438).

# fit surrogate decision tree model
tree <- TreeSurrogate$new(predictor.gbm, maxdepth = 3)

# how well does this model fit the original results
tree$r.squared
## [1] 0.4384319

The plot illustrates the distribution of the probability of attrition for each terminal node. We see an employee with JobLevel > 1 and DistanceFromHome <= 12 has a very low probability of attriting.

# Visualize tree
plot(tree)

When trying to explain a complicated machine learning model to decision makers, surrogate models can help simplify the process. However, it's important to only use surrogate models for simplified explanations when they are actually good representatives of the black box model (in this example, the surrogate is not).

Local interpretation

In addition to providing global explanations of ML models, iml also provides two newer, but well accepted methods for local interpretation.

Local interpretation techniques provide methods to explain why an individual prediction was made for a given observation.

To illustrate, let's get two observations. The first represents the observation for which our random forest model produces the highest probability of attrition (observation 154 has a 0.666 probability of attrition), and the second represents the observation with the lowest probability (observation 28 has a 0 probability of attrition).

# identify obs with highest and lowest probabilities
(high <- predict(rf, splits$valid) %>% .[, 3] %>% as.vector() %>% which.max()) 
## [1] 154
(low  <- predict(rf, splits$valid) %>% .[, 3] %>% as.vector() %>% which.min())  
## [1] 28

# get these observations
high_prob_ob <- features[high, ]
low_prob_ob  <- features[low, ]

1. Lime

iml implements its own version of local interpretable model-agnostic explanations (Ribeiro et al., 2016). Although it does not use the lime package directly, it does implement the same procedures (see lime tutorial).

A few notable items about iml implementation (see referenced tutorial above for these details within lime):

  • like lime, can change distance metric (default is gower but accepts all distance functions provided by ?dist),
  • like lime, can change kernel (neighborhood size),
  • like lime, computationally efficient –> takes about 5 seconds to compute,
  • like lime, can be applied to multinomial responses,
  • like lime, uses the glmnet package to fit the local model; however…
  • unlike lime, only implements a ridge model (lime allows ridge, lasso, and more),
  • unlike lime, can only do one observation at a time (lime can do multiple),
  • unlike lime, does not provide fit metric such as (R^2) for the local model.

The following fits a local model for the observation with the highest probability of attrition. In this example I look for the 10 variables in each model that are most influential in this observation's predicted value (k = 10). The results show that the Age of the employee reduces the probability of attrition within all three models. Moreover, all three models show that since this employee works OverTime, this causes a sizable increase in the probability of attrition. However, the tree-based models also identify the MaritalStatus and JobRole of this employee as contributing to his/her increased probability of attrition.

# fit local model
lime.glm <- LocalModel$new(predictor.glm, k = 10, x.interest = high_prob_ob) %>%
  plot() + ggtitle("GLM")
lime.rf <- LocalModel$new(predictor.rf, k = 10, x.interest = high_prob_ob) %>%
  plot() + ggtitle("RF")
lime.gbm <- LocalModel$new(predictor.gbm, k = 10, x.interest = high_prob_ob) %>%
  plot() + ggtitle("GBM")

gridExtra::grid.arrange(lime.glm, lime.rf, lime.gbm, nrow = 1)

Here, I reapply the same models to low_prob_ob. We see Age, JobLevel, and OverTime all having sizable influence on this employee's very low predicted probability of attrition (zero).

# fit local model
lime.glm <- LocalModel$new(predictor.glm, k = 10, x.interest = low_prob_ob) %>%
  plot() + ggtitle("GLM")
lime.rf <- LocalModel$new(predictor.rf, k = 10, x.interest = low_prob_ob) %>%
  plot() + ggtitle("RF")
lime.gbm <- LocalModel$new(predictor.gbm, k = 10, x.interest = low_prob_ob) %>%
  plot() + ggtitle("GBM")

gridExtra::grid.arrange(lime.glm, lime.rf, lime.gbm, nrow = 1)

Although LocalModel does not provide the fit metrics (i.e. R^2) for our local model, we can compare the local model's predicted probability to the global (full) model's predicted probability.

For the high probability employee, the local model predicts only a 0.34 probability of attrition, whereas for the low probability employee it predicts a more accurate 0.12 probability of attrition. This can help you gauge the trustworthiness of the local model.

High Probability:

# high probability observation
predict(rf, splits$valid) %>% .[high, 3]  # actual probability
## [1] 0.6666667

lime_high <- LocalModel$new(predictor.rf, k = 10, x.interest = high_prob_ob)
lime_high$predict(high_prob_ob)           # predicted probability
##   prediction
## 1  0.3371567

Low Probability:

# low probability observation
predict(rf, splits$valid) %>% .[low, 3]   # actual probability
## [1] 0

lime_low <- LocalModel$new(predictor.rf, k = 10, x.interest = low_prob_ob)
lime_low$predict(low_prob_ob)             # predicted probability
##   prediction
## 1  0.1224224

2. Shapley values

An alternative for explaining individual predictions is a method from coalitional game theory that produces what are called Shapley values (Lundberg & Lee, 2016). The idea behind Shapley values is to assess every combination of predictors to determine each predictor's impact. Focusing on feature xj, the approach will test the accuracy of every combination of features not including xj and then test how adding xj to each combination improves the accuracy. Unfortunately, computing Shapley values is very computationally expensive. Consequently, iml implements an approximate Shapley estimation algorithm that follows these steps:

ob = single observation of interest
1: for variables j in {1,...,p} do
     | m = random sample from data set
     | t = rbind(m, ob)
     | f(all) = compute predictions for t
     | f(!j) = compute predictions for t with feature j values randomized
     | diff = sum(f(all) - f(!j))
     | phi = mean(diff)
   end
2. sort phi in decreasing order

The Shapley value (phi) represents the contribution of each respective variable towards the predicted value compared to the average prediction for the data set.

We use Shapley$new to create a new Shapley object. For this data set it takes about 9 seconds to compute. The time to compute is largely driven by the number of predictors but you can also control the sample size drawn (see sample.size argument) to help reduce compute time. If you look at the results, you will see that the predicted value of 0.667 is 0.496 larger than the average prediction of 0.17. The plot displays the contribution each predictor played in this difference of 0.496.

# compute Shapley values
shapley.rf <- Shapley$new(predictor.rf, x.interest = high_prob_ob)

# look at summary of results
shapley.rf
shapley.rf <- readRDS("2018-08-13-iml/shapley_rf.rds")
shapley.rf
## Interpretation method:  Shapley 
## Predicted value: 0.666667, Average prediction: 0.170373 (diff = 0.496293)
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 233 rows and 30 columns.
## 
## Head of results:
##            feature          phi      phi.var
## 1              Age  0.031515152 0.0027626123
## 2   BusinessTravel  0.047575758 0.0031212956
## 3        DailyRate  0.011515152 0.0015170994
## 4       Department -0.006363636 0.0003579412
## 5 DistanceFromHome -0.011818182 0.0009256013
## 6        Education  0.001212121 0.0007220042
##                      feature.value
## 1                           Age=28
## 2 BusinessTravel=Travel_Frequently
## 3                    DailyRate=791
## 4  Department=Research_Development
## 5               DistanceFromHome=1
## 6                 Education=Master

# plot results
plot(shapley.rf)

We can compare the Shapley values across each model to see if common themes appear. Again, OverTime is a common theme across all three models. We also see MonthlyIncome influential for the tree-based methods and there are other commonalities for the mildly influential variables across all three models (i.e. StockOptionLevel, JobLevel, Age, MaritalStatus).

shapley.glm <- Shapley$new(predictor.glm, x.interest = high_prob_ob) %>%
  plot() + ggtitle("GLM")
shapley.rf <- plot(shapley.rf) + ggtitle("RF")
shapley.gbm <- Shapley$new(predictor.gbm, x.interest = high_prob_ob) %>%
  plot() + ggtitle("GBM")

gridExtra::grid.arrange(shapley.glm, shapley.rf, shapley.gbm, nrow = 1)

Similarly, we can apply the same process to the low probability employee. Some common themes pop out for this employee as well. It appears that the age, total number of working years, and the senior position (JobLevel, JobRole) play a large part in the low predicted probability of attrition for this employee.

shapley.glm <- Shapley$new(predictor.glm, x.interest = low_prob_ob) %>%
  plot() + ggtitle("GLM")
shapley.rf <- Shapley$new(predictor.rf, x.interest = low_prob_ob) %>%
  plot() + ggtitle("RF")
shapley.gbm <- Shapley$new(predictor.gbm, x.interest = low_prob_ob) %>%
  plot() + ggtitle("GBM")

gridExtra::grid.arrange(shapley.glm, shapley.rf, shapley.gbm, nrow = 1)

Shapley values are considered more robust than the results you will get from LIME. However, similar to the different ways you can compute variable importance, although you will see differences between the two methods, you will often see common variables being identified as highly influential in both approaches. Consequently, we should use these approaches to help indicate influential variables, but not to definitively label a variable as the most influential.

Next Steps (Transform Your Abilities)

It’s critical to understand that modeling is just one part of the overall data science project. Don’t mistake – it’s an important part, but other parts are equally if not more important, which is why our Data Science For Business With R (DS4B 201-R) course is successfully teaching data science students how to apply data science in the real world.



We teach end-to-end data science. This means you learn how to:

  • Chapter 1: Sizing The Problem, Custom Functions With Tidy Eval: Learn how to understand if a business problem exists by sizing the problem. In addition, create custom functions using Tidy Eval, a programming interface that is needed if you want to build custom functions using dplyr, ggplot2.

  • Chapter 2: Exploration with GGally, skimr – We show you how to explore data and identify relationships efficiently and effectively

  • Chapter 3: recipes, Premodeling Correlation Analysis: We teach you how to use recipes to develop data transformation workflow and we show you how to perform a pre-modeling correlation analysis so you do not move into modeling prematurely, which again saves you time

  • Chapters 4 and 5: H2O AutoML: You will first learn how to use all of the major h2o functions to perform automated machine learning for binary classification including working with the H2O leaderboard, visualizing results, and performing grid search and cross validation. In the next chapter, you learn how to evaluate multiple models against each other with ROC, Precision vs Recall, Gain and Lift, which is necessary to explain your choices for best model to the business stakeholders (your manager and key decision makers).

  • Chapter 6: LIME: You learn how to create local explanations to interpret black-box machine learning models. Explanations for the model predictions are critical because the business cannot do anything to improve unless they understand why. We show you how with lime.

  • Chapters 7 and 8: Expected Value Framework, Threshold Optimization, Sensitivity Analysis: These are my two favorite chapters because they show you how to link the churn classification model to financial results, and they teach purrr for iterative grid-search style optimization! These are POWERFUL CONCEPTS.

  • Chapter 9: Recommendation Algorithm: We again use a correlation analysis but in a different way. We discretize, creating a special visualization that enables us to develop a recommendation strategy.

Data Science For Business With R (DS4B 201-R)

Learn everything you need to know to complete a real-world, end-to-end data science project with the R programming language. Transform your abilities in 10 weeks.

Get Started Today!

Business Science University Course Roadmap

We have one course out and two courses coming soon!

Data Science For Business With R (DS4B 201-R)

Over the course of 10 weeks, we teach you how to solve an end-to-end data science project. Available now!

Transform your abilities by solving employee churn over a 10-week end-to-end data science project in DS4B 201-R



Get Started Today!

Building A Shiny Application (DS4B 301-R)

Our next course teaches you how to take the H2O Model, LIME explanation results, and the recommendation algorithm you develop in DS4B 201-R and turn it into a Shiny Web Application that predicts employee attrition! Coming in Q3 2018.

Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in DS4B 301-R (Building A Shiny Web App)

Kelly O’Briant is lead developer for the Shiny App Course coming soon. She’s a brilliant software engineer / data scientist that knows how to make a great looking and performing Shiny app.

Sign Up! Coming Q3!

Data Science For Business With Python (DS4B 201-P)

Did we mention we have a DS4B Python Course coming? Well we do! Coming in Q4 2018.

The problem changes: Customer Churn! The tools will be H2O, LIME, and a host of other tools implemented in Python + Spark.

Python Track: Data Science For Business With Python And Spark

Favio Vazquez, Principal Data Scientist at OXXO, is building the Python + Spark equivalent of DS4B 201-R. He's so talented, knowing Python, Spark, and R, along with a host of other data science tools.

Sign Up! Coming Q4!

Don’t Miss A Beat

Connect With Business Science

If you like our software (anomalize, tidyquant, tibbletime, timetk, and sweep), our courses, and our company, you can connect with us:


To leave a comment for the author, please follow the link and comment on their blog: business-science.io - Articles. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

A More Informative Approach to Movie Recommendation

Mon, 08/13/2018 - 06:46

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

A Deluge of Content

Over the last two decades, the accessibility of media has increased dramatically. One of the main reasons for this surge in available content is the evolution of online streaming platforms. Services like Netflix, Hulu, Amazon Prime, and others enable access for millions of consumers to a seemingly countless number of movies and television shows. With so many choices now available, it is more important than ever to have effective ways to filter all the possible titles and find options that best match your interests. Many streaming services have implemented functionality to suggest possible titles of interests for their users. Without these guiding tools, deciding what to watch next can seem like a monstrous undertaking.

 

The Goal

In my experience, the “you may also like”-style suggesting algorithms work well. However, I find that when I’m searching for something to watch I want to know more than just whether or not I will like it. What kind of movie is it? What parts of the movie were done better than others? In short, I want to know what aspects of the movie suggest that I will enjoy it. This information is not provided by the suggestion functionality of the streaming platforms I have encountered. So I thought of a different way to approach the problem.

Movie critics work hard to produce reviews that help the general audience decide whether a movie is worth watching. A critic review can provide a much more detailed look into a movie’s worth than an algorithm simply telling you “watch this.” Still, the issue of individual critics’ personal taste and preference exists. Although general trends usually do arise, movie reviews can be highly subjective. How do I know I can trust a particular critic? Do they value the same qualities in a movie that I do? These uncertainties can compromise the validity of a review on a person to person basis. A critic’s reviews would carry more weight if I knew that our interests were similar. With this in mind, I decided to create an application that would match a user with the most similarly-minded critics. The user could then follow their matched critics to explore a range of movies they could potentially enjoy.

 

The Data

I chose to extract movie review data from the movie and TV show review site RottenTomatoes (RT). RT is a great resource for anyone looking for a new TV show to watch or to choose between movies that are currently playing. The site provides ratings by both professional critics and average viewers to give a general idea of the quality of a film or show. RT also gives a certification of “Fresh” or “Rotten” so users can quickly get an idea if the title is worth checking out. I used the scrapy package in python to code a script that would extract certain values from the RT site. From RT, I scraped certified critics’ reviews of the top 100 movies per year from 2000 to 2018. From the top 100 movies list, I only inspected those movies that were reviewed by 100 or more unique critics. The 100 critic count mark seemed to be a threshold for widely recognizable movies. On the critic reviews page for each movie, if a critic gave a rating for the movie (some reviews only included a description) I scraped the critic’s name, the organization they belonged to, and the rating they gave. I also scraped the average rating for each of the movies.

 

Data Cleaning

I used python to clean the data that I scraped from RT. The main issue that I encountered was that there were many different rating formats used among the different critics. For example, here are some of the formats used:

  • 3/5
  • 3/4
  • 6/10
  • 60/100
  • 3 out of 5
  • A +
  • A plus
  • *** 1/2
  • 3 stars
  • Recommended vs. Not Recommended

In order to normalize the critics’ ratings so I could compare them, I decided to convert them all to fractional values between 0 and 1. The different rating formats made this conversion difficult. I wrote if-statements containing regular expressions for each format to match and extract the rating information. Then I was able to reassign each value as a fractional representation of the original rating (e.g. *** 1/2 = 3.5/5 = 0.7). Because there were many different format types, I only retained the ratings that had the most common formats. I also dropped the ratings that I did not think would add value to the matching algorithm, such as recommended vs. not recommended (too coarse of a rating). After the conversion, many ratings had more than five decimal places. For simplicity, I truncated them all to three decimal places.
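The actual cleaning was done in Python, but to give a flavor of the conversion, here is a small R sketch that handles two of the formats above (the helper and its regular expressions are illustrative, not the code used for the project):

# Illustration only: convert a couple of the rating formats to a 0-1 fraction
normalize_rating <- function(rating) {
  rating <- trimws(rating)
  if (grepl("^[0-9.]+\\s*/\\s*[0-9.]+$", rating)) {          # e.g. "3/5", "60/100"
    parts <- as.numeric(strsplit(rating, "/")[[1]])
    return(round(parts[1] / parts[2], 3))
  }
  if (grepl("^[0-9.]+\\s+out of\\s+[0-9.]+$", rating)) {     # e.g. "3 out of 5"
    parts <- as.numeric(strsplit(rating, "out of")[[1]])
    return(round(parts[1] / parts[2], 3))
  }
  NA_real_                                                   # other formats handled separately or dropped
}

normalize_rating("3/5")         # 0.6
normalize_rating("3 out of 5")  # 0.6
normalize_rating("*** 1/2")     # NA here; star and letter grades need their own rules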

 

The App

The main focus of this project was to use web scraping to collect unorganized data and then analyze that data. Instead of exploring the RT data to find trends among movies, such as comparing reviews among different genres or movie release dates, I decided to use the data to help people find movies they might like. This was my thought process for the application’s functionality:

  1. Have a user pick five movies they have seen
  2. Have the user rate each movie according to how much they enjoyed it
  3. Compare the user’s ratings to critic ratings and find the critics with the most similar ratings
  4. Show the top critics ordered by closest match and show other movies those critics rated

In order to transform this theoretical process into a reality, I used the Shiny package in R to code an application that could ask the user for input and return a table of top matched critics for them to follow.

 

The user selects five movies they have seen.

 

The user rates the movies they selected based on how much they enjoyed them.

 

The matching algorithm is pretty straightforward, although I utilized two different methods. The first method was to take the absolute value of the difference between the user scores and the critic scores and then sum the differences up. I called this match score the “Absolute Distance Score” (the lower the score, the better the match). This approach was simple, but I found that it created some issues. For example, a critic that rated all five movies exactly one point off from the user ratings would be given the same match score as a critic that rated four movies exactly the same as the user but one movie five points off. In this example, the first critic rated each movie almost exactly the same as the user, while the second critic rated four movies exactly the same and rated one very differently. I decided that in this type of situation the critics should not be given the same match score, so I implemented an additional method. I added a “Squared Distance Score”, which squares the difference between the user and critic ratings and sums them up. This weights larger differences more heavily, so big disagreements increase the match score more. I included both score types in the table that shows the top matches; however, I sorted the table by lowest squared distance score. I hoped that allowing the user to see both scores would provide more insight into the meaning behind the match score.
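As a minimal sketch of the two scores (the vectors below are made-up ratings for illustration, not data from the app):

user   <- c(0.8, 0.6, 0.9, 0.4, 0.7)   # the user's five normalised ratings
critic <- c(0.7, 0.6, 0.9, 0.4, 0.7)   # one critic's ratings for the same five movies

absolute_distance <- sum(abs(user - critic))   # lower is better
squared_distance  <- sum((user - critic)^2)    # penalises one large disagreement more than several small ones

absolute_distance   # 0.1
squared_distance    # 0.01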

An important complication arose in the case where a critic did not review all five of the user’s chosen movies. This could lead to misleading match scores in the match table. If one critic reviewed all five of the user’s movies, there is a greater chance that their match score will be higher than that of a critic that reviewed only three movies (five differences will be summed for the first critic and only three for the second). In the match table, two critics could have the same score, but perhaps only because data was missing for one or more of the movies. Because of this, I added a review count to the match table. I ordered the table by highest review count and selected the top five match scores for critics who rated five, four, or three of the user’s movies. I did not include critics that rated two or fewer of the user’s movies. I made it clear in the text above the tables that differences in match scores between review count groups should be taken with a grain of salt. Overall, the five-review-count group will give the most robust matches.
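A sketch of that grouping logic with dplyr is shown below; the data frame and column names (matches, n_reviews, squared_distance) are assumed for illustration and are not the app’s actual objects.

library(dplyr)

top_matches <- matches %>%
  filter(n_reviews >= 3) %>%            # drop critics who rated two or fewer of the user's movies
  arrange(squared_distance) %>%
  group_by(n_reviews) %>%
  slice(1:5) %>%                        # top five (lowest) squared distance scores per review-count group
  ungroup() %>%
  arrange(desc(n_reviews), squared_distance)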

 

This table shows the user’s top matched critics based on the movies and ratings chosen.

 

These tables show other movies that the currently selected critic has reviewed, specifically those movies that the critic rated high or low.

 

Below the match table, I included two tables that show high and low rated movies for the currently selected critic. My goal here was to recommend movies to the user based on the opinion of their matched critics. Here the user can browse the list of highly rated movies to see which titles they might enjoy. Conversely, the low rated movie table will allow the user to know which movies to avoid.

The last tables in the application are something I was very excited to implement. When searching for a movie to watch, I generally only stumble upon movies that are universally loved or universally disliked because those are the movies that receive the most publicity. Obviously there are many other movies out there that I might really enjoy, but it’s hard to find the diamonds in the rough. The final tables in my application show the user which movies their critics rated much differently from the general public. The left table shows movies that the critic rated highly but had a low average rating; these are the movies no one would ever think to recommend to you, but are ones you might really enjoy based on your matched critic. On the other hand, the right table shows movies that the critic rated low but had a high average rating; these are the movies that everyone keeps telling you to see, but you might be better off avoiding them. I think this is a very useful tool in finding movies that cater to your interests and can open up a whole new world of movies for you to see.

 

These tables show the movies that the currently selected critic rated much differently from the general public.

 

Going Forward

I am really happy with how this application turned out, but there are still a few kinks I would like to work out.

  1. In the match table, I include a column with a list of all the organisations that the critics have been associated with. I populate this column by aggregating all the unique organisations for a critic from the original data set and concatenating them with a “|” separator. While the matching calculation does not take very long, building this organisation column more than doubles the time required to create the table. I would like to add a few lines to the Python cleaning code to replace each organisation value in the original data set with all of the critic’s organisations so that this calculation does not need to be done each time a user submits a new request (see the sketch after this list).
  2. Due to my filtering criteria during scraping, I only included movies that were reviewed by 100 or more critics. Additionally, the list of movies only includes the top 100 movies from each year between 2000 and 2018.  I have not explored how loosening these criteria would affect the data set size or how it would affect the match calculation time, but I would like to add more options in the future for users to choose from (and also to potentially make the algorithm more accurate by adding more uncommon movies).
  3. Also due to the criteria above, the movies in the data set are necessarily popular. This gave rise to a complication in the “Critic Rating High / Average Rating Low” table. Because all of the movies are popular, there are few to no movies with average ratings below 6/10 or so. This somewhat defeats the purpose of the table, which is supposed to introduce the user to unknown or generally disliked movies. On RT, each critic has their own page with every review they have submitted along with RT’s T-Meter rating (a percentage of critics who gave the movie a “Fresh” rating). In the future, I would like to scrape each critic’s page and use this second data set to populate the two critic vs. average tables. This should give the user a more exhaustive list of movies they should consider seeing or avoiding.
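As a sketch of the pre-computation mentioned in the first point above, the same aggregation could also be done once ahead of time with dplyr; the data frame and column names (reviews, critic, organisation) are assumptions for illustration, not the app’s actual objects.

library(dplyr)

critic_orgs <- reviews %>%
  group_by(critic) %>%
  summarise(organisations = paste(unique(organisation), collapse = " | "))

# Joining critic_orgs onto the match table once replaces the per-request aggregation.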
Hey, Thanks!

Thank you for taking the time to read about my project! I was really excited to create this application because I think it serves a real purpose and can be helpful for a large group of people. Feel free to check out my application, and please let me know if it helped you find a movie you enjoyed! Here is a link to my GitHub repository if you would like to explore my code.


To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Statistics Sunday: Getting Started with the Russian Tweet Dataset

Sun, 08/12/2018 - 18:20

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


IRA Tweet Data

You may have heard that two researchers at Clemson University analyzed almost 3 million tweets from the Internet Research Agency (IRA) – a “Russian troll factory”. In partnership with FiveThirtyEight, they made all of their data available on GitHub. So of course, I had to read the files into R, which I was able to do with this code:

files <- c("IRAhandle_tweets_1.csv",
"IRAhandle_tweets_2.csv",
"IRAhandle_tweets_3.csv",
"IRAhandle_tweets_4.csv",
"IRAhandle_tweets_5.csv",
"IRAhandle_tweets_6.csv",
"IRAhandle_tweets_7.csv",
"IRAhandle_tweets_8.csv",
"IRAhandle_tweets_9.csv")
my_files <- paste0("~/Downloads/russian-troll-tweets-master/",files)

each_file <- function(file) {
tweet <- read_csv(file) }

library(tidyverse)
tweet_data <- NULL
for (file in my_files) {
temp <- each_file(file)
temp$id <- sub(".csv", "", file)
tweet_data <- rbind(tweet_data, temp)
}
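For what it’s worth, the same import can be written more compactly with purrr; this is an equivalent alternative, not the code used in the post.

library(tidyverse)

# Read each CSV, tag it with a file-derived id, and row-bind everything in one call.
tweet_data <- map_dfr(my_files, ~ read_csv(.x) %>%
                        mutate(id = sub(".csv", "", .x)))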

Note that this is a large file, with 2,973,371 observations of 16 variables. Let’s do some cleaning of this dataset first. The researchers, Darren Linvill and Patrick Warren, identified 5 major types of trolls:

  • Right Troll: These Trump-supporting trolls voiced right-leaning, populist messages, but “rarely broadcast traditionally important Republican themes, such as taxes, abortion, and regulation, but often sent divisive messages about mainstream and moderate Republicans…They routinely denigrated the Democratic Party, e.g. @LeroyLovesUSA, January 20, 2017, “#ThanksObama We’re FINALLY evicting Obama. Now Donald Trump will bring back jobs for the lazy ass Obamacare recipients,” the authors wrote.
  • Left Troll: These trolls mainly supported Bernie Sanders, derided mainstream Democrats, and focused heavily on racial identity, in addition to sexual and religious identity. The tweets were “clearly trying to divide the Democratic Party and lower voter turnout,” the authors told FiveThirtyEight.
  • News Feed: A bit more mysterious, news feed trolls mostly posed as local news aggregators who linked to legitimate news sources. Some, however, “tweeted about global issues, often with a pro-Russia perspective.”
  • Hashtag Gamer: Gamer trolls used hashtag games—a popular call/response form of tweeting—to drum up interaction from other users. Some tweets were benign, but many “were overtly political, e.g. @LoraGreeen, July 11, 2015, “#WasteAMillionIn3Words Donate to #Hillary.”
  • Fearmonger: These trolls, who were least prevalent in the dataset, spread completely fake news stories, for instance “that salmonella-contaminated turkeys were produced by Koch Foods, a U.S. poultry producer, near the 2015 Thanksgiving holiday.”

But a quick table of the variable account_category shows 8 categories in the dataset.

table(tweet_data$account_category)
##
## Commercial Fearmonger HashtagGamer LeftTroll NewsFeed
## 122582 11140 241827 427811 599294
## NonEnglish RightTroll Unknown
## 837725 719087 13905

The additional three are Commercial, Non-English, and Unknown. At the very least, we should drop the Non-English tweets, since those use Russian characters and any analysis I do will assume data are in English. I’m also going to keep only a few key variables. Then I’m going to clean up this dataset to remove links, because I don’t need those for my analysis – I certainly wouldn’t want to follow them to their destination. If I want to free up some memory, I can then remove the large dataset.

reduced <- tweet_data %>%
select(author,content,publish_date,account_category) %>%
filter(account_category != "NonEnglish")

library(qdapRegex)
##
## Attaching package: 'qdapRegex'
reduced$content <- rm_url(reduced$content)

rm(tweet_data)

Now we have a dataset of 2,135,646 observations of 4 variables. I’m planning on doing some analysis on my own of this dataset – and will of course share what I find – but for now, I thought I’d repeat a technique I’ve covered on this blog and demonstrate a new one.

library(tidytext)

tweetwords <- reduced %>%
unnest_tokens(word, content) %>%
anti_join(stop_words)
## Joining, by = "word"
wordcounts <- tweetwords %>%
count(account_category, word, sort = TRUE) %>%
ungroup()

head(wordcounts)
## # A tibble: 6 x 3
## account_category word n
##   <chr>            <chr>  <int>
## 1 NewsFeed news 124586
## 2 RightTroll trump 95794
## 3 RightTroll rt 86970
## 4 NewsFeed sports 47793
## 5 Commercial workout 42395
## 6 NewsFeed politics 38204

First, I’ll conduct a TF-IDF analysis of the dataset. This code is a repeat from a previous post.

tweet_tfidf <- wordcounts %>%
bind_tf_idf(word, account_category, n) %>%
arrange(desc(tf_idf))

tweet_tfidf %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(account_category) %>%
top_n(15) %>%
ungroup() %>%
ggplot(aes(word, tf_idf, fill = account_category)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~account_category, ncol = 2, scales = "free") +
coord_flip()
## Selecting by tf_idf

But another method of examining terms and topics in a set of documents is Latent Dirichlet Allocation (LDA), which can be conducted using the R package, topicmodels. The only issue is that LDA requires a document term matrix. But we can easily convert our wordcounts dataset into a DTM with the cast_dtm function from tidytext. Then we run our LDA with topicmodels. Note that LDA is a random technique, so we set a random number seed, and we specify how many topics we want the LDA to extract (k). Since there are 6 account types (plus 1 unknown), I’m going to try having it extract 6 topics. We can see how well they line up with the account types.

tweets_dtm <- wordcounts %>%
cast_dtm(account_category, word, n)

library(topicmodels)
tweets_lda <- LDA(tweets_dtm, k = 6, control = list(seed = 42))
tweet_topics <- tidy(tweets_lda, matrix = "beta")

Now we can pull out the top terms from this analysis, and plot them to see how they lined up.

top_terms <- tweet_topics %>%
group_by(topic) %>%
top_n(15, beta) %>%
ungroup() %>%
arrange(topic, -beta)

top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~topic, scales = "free") +
coord_flip()

Based on these plots, I’d say the topics line up very well with the account categories, showing, in order: news feed, left troll, fear monger, right troll, hashtag gamer, and commercial. One interesting observation, though, is that Trump is a top term in 5 of the 6 topics.
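One way to make that visual impression more concrete is to look at the per-document topic probabilities (the gamma matrix); since each account category was treated as a document in cast_dtm, the dominant topic per category can be read off directly. This is a small additional check, not part of the original analysis.

tweet_gamma <- tidy(tweets_lda, matrix = "gamma")

tweet_gamma %>%
  group_by(document) %>%
  top_n(1, gamma) %>%      # dominant topic for each account category
  arrange(document)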


To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Ceteris Paribus v0.3 is on CRAN

Sat, 08/11/2018 - 17:49

(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)

The Ceteris Paribus package is part of the DALEX family of model explainers. Version 0.3 has just reached CRAN. It is equipped with new functions for very flexible visual exploration of black-box models. Its grammar generalizes Partial Dependence Plots, Individual Conditional Expectations and Wangkardu Plots, and gives a lot of flexibility in model comparisons, group comparisons and so on.
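For readers who have not used it, a minimal sketch of the workflow looks roughly like this, using the apartments data shipped with DALEX; exact argument names may differ slightly between package versions.

library(DALEX)
library(ceterisParibus)

# Fit any model, wrap it in a DALEX explainer, then compute ceteris paribus
# profiles for a single observation and plot them.
model <- lm(m2.price ~ ., data = apartments)
explainer <- explain(model, data = apartments, y = apartments$m2.price)

cp_profile <- ceteris_paribus(explainer, apartments[1, ])
plot(cp_profile)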

See a 100-second introduction to the ceterisParibus package on YouTube.

Here you will find a one-pager cheat-sheet with selected use cases.

Here is a longer introduction with notation and some theory.

Here there is a vignette with examples for regression (housing prices).

And here for multiclass classification (HR models).

It’s a work in progress. Feel free to contribute!


To leave a comment for the author, please follow the link and comment on their blog: English – SmarterPoland.pl. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Mid Table Mediocrity

Sat, 08/11/2018 - 01:00

(This article was first published on HighlandR, and kindly contributed to R-bloggers)

Visualising potential winners in the Scottish Championship –

Last August, I was given a surprise gift of a season ticket to watch my local football team, Inverness Caledonian Thistle.

They’d been relegated from the top league in Scotland the season before, and needed all the local support they could get. In truth, the gift (from my father) was a sneaky attempt to get my eldest son hooked into supporting his local team (as every adult season ticket allows free entry for an under-12).

Long story short, by the end of the season, it was me that was hooked. After an appalling start to the season, the team went on a couple of great runs of results. In the late Autumn, from languishing in 9th position (out of 10 teams) they went on an unbeaten run which saw them break their record for the number of games without conceding a goal.

Then, towards the end of the season, they went on another unbeaten run, featuring a number of consecutive wins.
This saw supporters start to dream of finishing in 4th place, and with it, a chance of promotion back to the Premier League.

The league winners automatically get promoted, with 4th playing 3rd, the winners of that tie playing the 2nd-placed team, and the winners of THAT game playing the second-bottom team from the premier league.

So the hope was that ICT would finish 4th, and somehow navigate their way through 3 more matches against teams who finished higher in the league or were in the league above.

The league standings, prior to the latest round of matches, were as follows:

St Mirren were runaway leaders, and heading for automatic promotion. Livingston looked to have cemented 2nd place. At the other end, Brechin were doomed to be relegated, with Dumbarton also looking at a playoff to remain in the league. This left 6 teams scrapping for 3rd and 4th. The teams were reasonably evenly matched – on any given day, any team could beat any other.

By the way, QOTS is my abbreviation for Queen of the South, a team based in Dumfries, in the Scottish Borders, and not to be confused with Queens of the Stone Age, who are a hard rock group from California.
Although I’m pretty sure QOTSA frontman Josh Homme could do a job up front for any team as he’s a big lad.

Anyway – I started to think about ways to plot potential placings on a game-by-game basis. I wasn’t concerned with correct scores or goal difference. It was easy enough to calculate where one team might be in the table based on 3 points for a win, 1 for a draw and none for a loss. But then there were all the other games to take into account too.

A base R command that I have used in plotting in the past is expand.grid().
It’s especially useful for plotting heatmaps to ensure that there is an observation for every combination of the x and y axis.
It’s also really useful for working out possible combinations of football results.
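For instance, a single fixture and the three possible outcomes give (the teams and result labels here are just an illustration, not the objects used later in the post):

expand.grid(home   = "Inverness CT",
            away   = "Dunfermline",
            result = c("home win", "draw", "away win"))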

So with the above fixture list, and current league placings, we can start to figure out the potential outcomes.

first <- head(fixtures, 1)
permutations <- expand.grid(first$home, first$away, status, Results)
permutations

That seemed to work, so let’s create a function:

get_combo <- function(x, y){
  permutations <- expand.grid(x, y, status, Results) %>%
    rename(home = Var1, away = Var2, status = Var3, results = Var4) %>%
    mutate(home = as.character(home),
           away = as.character(away),
           results = as.character(results))
  return(permutations)
}

Now we can use purrr to apply the function to each fixture. This returns a list of dataframes, which can be combined with bind_rows, and then we can left join to another ancillary table to return the potential points gained for each combination of result:

permutations <- map2(fixtures$home, fixtures$away, get_combo)
permutations <- bind_rows(permutations) %>%
  left_join(lookup, by = "results")
permutations
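The lookup table itself is not shown in the post; a plausible version simply maps each result to the points gained by the home and away sides (this structure is an assumption, not the original object):

lookup <- data.frame(results     = c("home win", "draw", "away win"),
                     home_points = c(3, 1, 0),
                     away_points = c(0, 1, 3),
                     stringsAsFactors = FALSE)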

We can then take these outcomes, and with our current points totals, start to work out possible points for each team for each combination.

And of course we can visualise those too:

That’s a start, but a lot of these teams have similar colours and it’s hard to work out who they are.
Perhaps a dumbbell plot might help? We can use the ggalt package for that:

p <- ggplot(potential_summary,
            aes(y = reorder(team, xstart), x = xstart, xend = xend, group = team)) +
  geom_dumbbell(size = 1.25, aes(colour = team)) +
  scale_colour_manual(values = finalcolours) +
  theme_minimal() +
  ggtitle(label = "Potential Points - Scottish Championship",
          subtitle = "Range of possible points after games played 14th Apr 2018") +
  labs(y = "Team", x = "Possible Points") +
  theme(legend.position = "bottom")

Hmm, let’s zoom in a bit on our mid table teams. We can do that by using coord_cartesian().

And while we’re at it, let’s add some labels on the dumbbells, and remove some of the extraneous axis labels and tick marks:

## zoom in on mid table
q <- p + coord_cartesian(xlim = c(35, 56)) +
  scale_x_continuous(limits = c(35, 56), breaks = c(35:56)) +
  ggtitle(label = "Mid Table Mediocrity in the Scottish Championship",
          subtitle = "Range of possible points for teams who are \n outside promotion or relegation places") +
  labs(caption = "games to be played 14th Apr 2018")
q

q <- q + geom_text(aes(xstart - 1, label = xstart), size = 3)
q <- q + geom_text(aes(xend + 1, label = xend), size = 3)
q <- q + ggrepel::geom_text_repel(aes(xstart + 2, label = team), size = 3)
q

# get rid of labels etc. that we don't need
q <- q + coord_cartesian(ylim = c(3, 8)) +
  theme(legend.position = "none") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  ggExtra::removeGrid()
q

So now we can see that Inverness, Morton and Queen of the South are really closely bunched up, but with a bit to do to make it into 4th. They certainly won’t manage to get into those positions based on the next match alone.

We can split out the potential placings for each team, based on whether they win, lose or draw.

Or we can just show everything on one plot, and let the reader try and make sense of the resulting stramash:

Here’s the full gist if you want to mess around with it:

There are a few more combinations you could try, plus some more work with tidyr might allow you to create different (and better) plots.


To leave a comment for the author, please follow the link and comment on their blog: HighlandR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Redmonk Language Rankings, June 2018

Fri, 08/10/2018 - 20:16

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The latest Redmonk Language Rankings are out. These rankings are published twice a year, and the top three positions in the June 2018 rankings remain unchanged from last time: JavaScript at #1, Java at #2, and Python at #3. The Redmonk rankings are based on Github repositories (as a proxy for developer activity) and StackOverflow activity (as a proxy for user discussion), as shown in the chart below.

Compared to two quarters ago, R has dropped 2 places to #14. But as Stephen O'Grady notes, this ranking volatility is typical:

R is essentially a case study at this point for why we recommend against over-reacting to any particular quarter’s performance. R has dropped two spots previously, only to bounce back a quarter or two later. Its domain specific nature means that it is never likely to compete for the very top spots on this ranking, but it remains dominant and broadly used within its area of expertise.

You can see the complete rankings list, along with discussion of the languages covered, at the Redmonk blog linked below.

Tecosystems: The RedMonk Programming Language Rankings: June 2018


To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

RQuantLib 0.4.5: Windows is back, and small updates

Fri, 08/10/2018 - 18:51

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A brand new release of RQuantLib, now at version 0.4.5, just arrived on CRAN, and will get to Debian shortly. This release re-enables Windows builds thanks to a PR by Jeroen, who now supplies a QuantLib library build in his rwinlib repositories. (Sadly, though, it is already one QuantLib release behind, so it would be awesome if a volunteer could step forward to help Jeroen keep this current.) A few other smaller fixes were made; see below for more.

The complete set of changes is listed below:

Changes in RQuantLib version 0.4.5 (2018-08-10)
  • Changes in RQuantLib code:

    • The old rquantlib.h header is deprecated and moved to a subdirectory. (Some OS confuse it with RQuantLib.h which Rcpp Attributes like to be the same name as the package.) (Dirk in #100 addressing #99).

    • The files in src/ now include rquantlib_internal.h directly.

    • Several ‘unused variable’ warnings have been taken care of.

    • The Windows build has been updated, and now uses an external QuantLib library from ‘rwinlib’ (Jeroen Ooms in #105).

    • Three curve-building examples are no longer run by default as win32 has seen some numerical issues.

    • Two Rcpp::compileAttributes generated files have been updated.

Courtesy of CRANberries, there is also a diffstat report for this release. As always, more detailed information is on the RQuantLib page. Questions, comments etc should go to the rquantlib-devel mailing list off the R-Forge page. Issue tickets can be filed at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Installation of R 3.5 on Ubuntu 18.04 LTS and tips for spatial packages

Fri, 08/10/2018 - 17:17

(This article was first published on (en) The R Task Force, and kindly contributed to R-bloggers)

Do you plan to upgrade your server installation from Ubuntu 16.04 to Ubuntu 18.04 LTS? It is also the best time to migrate to R 3.5! By the way, if you have always found it difficult to install R packages for geographical data on Ubuntu, those days are over. Just follow the guide…

New Ubuntu version and new R version

If you use Ubuntu on your laptop or on your servers in your company, chances are high that you use a Long Term Support (LTS) version. Using an LTS version is safer for companies with stability constraints. This avoids possible breaking changes that would compromise your workflows and thus your business. Indeed, if you work with spatial data in R on Ubuntu, you probably also noticed that installation using recent versions of packages like gdal, geos, proj or netcdf was only possible if you were working on an LTS version of Ubuntu using the UbuntuGIS PPA. This month, the first point release of Ubuntu 18.04.1 Bionic Beaver for server or desktop became available for download. If you are still using Ubuntu 16.04, you will now be offered this upgrade.

During the last few months, you also heard about the release of R 3.5, as announced by the R Core team. This introduced lots of performance improvements thanks to ALTREP. Now that the first maintenance release, R 3.5.1, is out, you may also want to put it into production at work.

While you are playing with a series of sudo apt update commands, why not do both upgrades at the same time?

Upgrade to Ubuntu 18.04

As always, be sure to have backups of your files before starting the migration! If you are running Ubuntu 16.04 LTS, you may see the new release using the update manager core. Be sure it is installed:

sudo apt-get install update-manager-core

then run

sudo do-release-upgrade

This should work if you allowed only LTS versions to be installed. In my own case, maybe I was not patient enough, but the new version was not proposed. In that case, you can temporarily allow “normal” releases, which, here, will propose the latest version available, 18.04: the one we want. Be careful with this change: if you are using your computer at work in production, set it back to “lts” as soon as you have the LTS installed.

Check /etc/update-manager/release-upgrades and change the line:

Prompt=lts

to:

Prompt=normal

Then you can run sudo do-release-upgrade

Follow instructions and you’re done for the Ubuntu upgrade.

Delete previous version of R

Remove additional repositories

To be able to use R 3.5, you will have to delete everything about the previous R version.

We first need to check if you used one of the R mirror servers. Maybe you did, maybe you didn’t… The R mirror server will be one of these: https://cran.r-project.org/mirrors.html. The name of this server may be listed in the file “/etc/apt/sources.list”, which you can edit with:

In the terminal:

sudo vi /etc/apt/sources.list

Find the line(s) that looks like deb https://mirror.ibcp.fr/pub/CRAN/ xenial/ and comment it with a #.

If you did not find these lines, the server may be listed in an external file listed in “/etc/apt/sources.list.d”. Find the “name.of.file.with.a.mirror.name” (replace this path with the appropriate one) and delete the files.

In the terminal:

ls /etc/apt/sources.list.d
sudo rm -i /etc/apt/sources.list.d/name.of.file.for.mirror.list

If you did not find it, you probably installed R using the Ubuntu default servers. Then, you do not have anything else to do.

Remove Ubuntu packages for R

Now you need to remove all Ubuntu packages for R and clean your installation.

In the terminal:

sudo apt purge r-base r-recommended r-cran-*
sudo apt autoremove
sudo apt update

R should be totally removed after that.

Install new version of R (3.5)

Set new Ubuntu repositories for R 3.5

Detailed installation instructions are on https://cran.r-project.org/bin/linux/ubuntu/. Thanks to Dirk Eddelbuettel and Michael Rutter, installing R on Ubuntu is child’s play!
We will use the new mirror https://cloud.r-project.org, which will be automatically redirected to a nearby CRAN mirror.

In the terminal:

sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu xenial-cran35/'
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo apt update

Install R 3.5

Very few deb packages are necessary:

In the terminal:

sudo apt install r-base r-base-core r-recommended

Install other package binaries with Ubuntu

On a personal level, I have some R packages installed with an Ubuntu PPA (Personal Package Archives) because I trust maintainers and I am sure this R packages will work with my Ubuntu distribution. They require Linux dependencies that can be complicated to install. This is the case for mapping packages or those with java in them.
With R 3.5, Ubuntu packages for R needed to be rebuilt differently than for the previous version. Michael Rutter explains this on his dedicated blog. Hence, you’ll need to add his PPA before installing Ubuntu packages for R (those starting with “r-cran-”).

In the terminal (my personal selection):

sudo add-apt-repository ppa:marutter/c2d4u3.5
sudo apt-get update
sudo apt install r-cran-rgl r-cran-rjags r-cran-snow r-cran-ggplot2 r-cran-igraph r-cran-lme4 r-cran-rjava r-cran-devtools r-cran-roxygen2 r-cran-rjava r-cran-xlsx

Install packages for spatial data analyses in Ubuntu

Installation of R packages like {sf}, {sp}, {rgdal}, {rgeos}, … may require some external Ubuntu packages. This has been made easy with the UbuntuGIS PPA. The maintainer team is small and it may take some time before packages are released for your distribution. Only Ubuntu LTS distributions are supported. Also, as of today, packages for Ubuntu 18.04 are not available: if you have the skills, you can help them!

With Ubuntu 16.04, UbuntuGIS is necessary for the installation of the spatial R packages like {sf}. With Ubuntu 18.04, Ubuntu default versions of dependencies are sufficiently up-to-date, but you still need to install them.

Whatever Ubuntu LTS distribution you are using, you’d better install the UbuntuGIS PPA to be able to work with R packages for geographical data analyses. By the way, you will also be able to install the latest version of QGIS with it.

In the terminal:

sudo add-apt-repository 'deb http://ppa.launchpad.net/ubuntugis/ubuntugis-unstable/ubuntu bionic main '
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 314DF160
sudo apt update
sudo apt install libgdal1-dev libproj-dev libgeos-dev libudunits2-dev libv8-dev libcairo2-dev libnetcdf-dev
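Once the {sf} package itself is installed (see the next step), a quick way to check from R that these system libraries are being picked up is:

In R console:

library(sf)
sf::sf_extSoftVersion()   # reports the GDAL, GEOS and PROJ versions in use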

Then you can install your R packages using Ubuntu if you want them to always be up-to-date and correctly built for your distribution.

In the terminal (my personal selection):

sudo apt install r-cran-gstat r-cran-maps r-cran-mapdata r-cran-ncdf4 r-cran-sp r-cran-sf r-cran-sp r-cran-raster r-cran-geor r-cran-ggmap r-cran-leaflet r-cran-rosm

Update R packages

You’re done! You can now run R. Do not forget to re-install all your other R packages so that they are built with this new version of R. This requires checkBuilt = TRUE to force re-installation of packages even if you already have the latest version available.

In R console:

update.packages(ask = FALSE, checkBuilt = TRUE)

If you need help with your R server installation or with deploying packages and Shiny applications, you can contact the ThinkR team.

The post Installation of R 3.5 on Ubuntu 18.04 LTS and tips for spatial packages appeared first on (en) The R Task Force.


To leave a comment for the author, please follow the link and comment on their blog: (en) The R Task Force. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Gender diversity in the TV series industry

Fri, 08/10/2018 - 12:33

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

In a previous post, I studied gender diversity in the film industry by focusing on some key behind-the-camera roles and measuring the evolution of gender diversity over the last decade. The conclusion was not great: women are under-represented, especially in the most important roles of directors and writers, as these key roles determine the way women are portrayed in front of the camera.

I was curious about the TV series industry too: as it is faster paced than the movie industry, might it be more open to women? I decided to have a look.

In this post, as in the film industry post, the behind-the-camera roles I studied were: directors, writers, producers, sound teams, music teams, art teams, makeup teams and costume teams.

The whole code to reproduce the following results is available on GitHub.

Data Frame Creation – Web Scraping

All the data I used was gathered from the IMDb website: I went through the 100 Most Popular TV Shows (according to the IMDb ratings), and gathered some useful information about these 100 series: I built a data frame which contains the titles of these series, their years of release and their IMDb episode links – the link where we can find all the episodes of a series.

# IMDb 100 most popular TV shows ------------------------------
url <- "https://www.imdb.com/chart/tvmeter?sort=us,desc&mode=simple&page=1"
page <- read_html(url)
serie_nodes <- html_nodes(page, '.titleColumn') %>%
  as_list()

# Series details
serie_name <- c()
serie_link <- c()
serie_year <- c()
for (i in seq_along(serie_nodes)){
  serie_name <- c(serie_name, serie_nodes[[i]]$a[[1]])
  serie_link <- c(serie_link, attr(serie_nodes[[i]]$a, "href"))
  serie_year <- c(serie_year, serie_nodes[[i]]$span[[1]])
}
serie_link <- paste0("http://www.imdb.com", serie_link)
serie_year <- gsub("[()]", "", serie_year)
serie_episodelist <- sapply(strsplit(serie_link, split='?', fixed=TRUE), function(x) (x[1])) %>%
  paste0("episodes?ref_=tt_eps_yr_mr")

# Create dataframe ----------------------------------------------
top_series <- data.frame(serie_name, serie_year, serie_episodelist, stringsAsFactors = FALSE)

# series_year was the date of 1st release but we needed the years of release for all the episodes
# I did not manage to gather this information by doing some web scraping.
# I added it manually as it is available on the IMDb episodes links (column serie_episodelist)

top_series[20:30, ]

##                        serie_name       serie_year
## 20                         Legion             2017
## 21 A Series of Unfortunate Events       2017, 2018
## 22                       Timeless 2016, 2017, 2018
## 23                      Westworld       2016, 2018
## 24                      Luke Cage             2016
## 25                       MacGyver 2016, 2017, 2018
## 26                  Lethal Weapon 2016, 2017, 2018
## 27            Designated Survivor 2016, 2017, 2018
## 28                           Bull 2016, 2017, 2018
## 29                     This Is Us 2016, 2017, 2018
## 30                        Atlanta       2016, 2018
##                                                  serie_episodelist
## 20  http://www.imdb.com/title/tt5114356/episodes?ref_=tt_eps_yr_mr
## 21  http://www.imdb.com/title/tt4834206/episodes?ref_=tt_eps_yr_mr
## 22  http://www.imdb.com/title/tt5511582/episodes?ref_=tt_eps_yr_mr
## 23  http://www.imdb.com/title/tt0475784/episodes?ref_=tt_eps_yr_mr
## 24  http://www.imdb.com/title/tt3322314/episodes?ref_=tt_eps_yr_mr
## 25  http://www.imdb.com/title/tt1399045/episodes?ref_=tt_eps_yr_mr
## 26  http://www.imdb.com/title/tt5164196/episodes?ref_=tt_eps_yr_mr
## 27  http://www.imdb.com/title/tt5296406/episodes?ref_=tt_eps_yr_mr
## 28  http://www.imdb.com/title/tt5827228/episodes?ref_=tt_eps_yr_mr
## 29  http://www.imdb.com/title/tt5555260/episodes?ref_=tt_eps_yr_mr
## 30  http://www.imdb.com/title/tt4288182/episodes?ref_=tt_eps_yr_mr

The series_year column often contains several years. For example, for the series called “This is us”, it means that episodes have been released in 2016, 2017 and 2018. This column will allow me to split the episodes by year of release, and then visualise the gender diversity of the crew for each year.

List Creation – Web Scraping

At this stage, I just had some global information on the 100 series. The next step was to go through the IMDb links gathered in the column series_episodelist of my top_series data frame, which gives me access to all the series episodes split by year of release. I did some web scraping on these links and built a list which gathered:

  • the names of the 100 most popular TV shows
  • for each series, the different years of release
  • for each year, the names of the episodes which have been released
  • for each episode, the names of the people whose job was included in one of the categories I listed above (directors, writers, …, costume teams)
### Create series list
series_list <- list()

# FOCUS ON EACH SERIES -----------------------------------------------------------------
for (r in seq_len(nrow(top_series))) {
  serie_name <- top_series[r, "serie_name"]
  print(serie_name)

  # Years of release for each serie
  list_serieyear <- as.list(strsplit(top_series[r, "serie_year"], split = ", ")[[1]])

  # List of IMDb links where we find all the episodes per year of release
  link_episodelist_peryear <- list()
  episodes_list_peryear <- list()

  # FOCUS ON EACH YEAR OF REALEASE FOR THIS SERIE -------------------------------------
  for (u in seq_along(list_serieyear)){
    year <- list_serieyear[[u]]
    print(year)
    link_episodelist_yeari <- strsplit(top_series[r, "serie_episodelist"], split='?', fixed=TRUE)[[1]][1] %>%
      paste0("?year=", year, collapse = "")
    link_episodelist_peryear[[u]] <- link_episodelist_yeari

    # FOCUS ON EACH EPISODE FOR THIS YEAR OF RELEASE ----------------------------------
    for (l in seq_along(link_episodelist_peryear)){
      page <- read_html(link_episodelist_peryear[[l]])
      episodes_nodes <- html_nodes(page, '.info') %>%
        as_list()
      episode_name <- c()
      episode_link <- c()
      for (t in seq_along(episodes_nodes)){
        episode_name <- c(episode_name, episodes_nodes[[t]]$strong$a[[1]])
        episode_link <- c(episode_link, attr(episodes_nodes[[t]]$strong$a, "href"))
      }
      episode_link <- paste0("http://www.imdb.com", episode_link)
      episode_link <- sapply(strsplit(episode_link, split='?', fixed=TRUE), function(x) (x[1])) %>%
        paste0("fullcredits?ref_=tt_ql_1")
      episode_name <- sapply(episode_name, function(x) (gsub(pattern = "\\#", replacement = "", x))) %>% # some names = "Episode #1.1"
        as.character()

      # GATHER THE NAME OF THE EPISODE, ITS YEAR OF RELEASE AND ITS FULL CREW LINK ----
      episodes_details_peryear <- data.frame(year = year,
                                             episode_name = episode_name,
                                             episode_link = episode_link,
                                             stringsAsFactors = FALSE)
    }

    # FOCUS ON EACH FULL CREW LINK ----------------------------------------------------
    for (e in seq_len(nrow(episodes_details_peryear))){
      print(episodes_details_peryear[e, "episode_link"])
      episode_page <- read_html(episodes_details_peryear[e, "episode_link"])
      episode_name <- episodes_details_peryear[e, "episode_name"]

      # GATHER ALL THE CREW NAMES FOR THIS EPISODE -------------------------------------
      episode_allcrew <- html_nodes(episode_page, '.name , .dataHeaderWithBorder') %>%
        html_text()
      episode_allcrew <- gsub("[\n]", "", episode_allcrew) %>%
        trimws() # Remove white spaces

      # SPLIT ALL THE CREW NAMES BY CATEGORY -------------------------------------------
      episode_categories <- html_nodes(episode_page, '.dataHeaderWithBorder') %>%
        html_text()
      episode_categories <- gsub("[\n]", "", episode_categories) %>%
        trimws() # Remove white spaces

      ## MUSIC DEPT -----------------------------------------------------------------------
      episode_music <- c()
      for (i in 1:(length(episode_allcrew)-1)){
        if (grepl("Music by", episode_allcrew[i])){
          j <- 1
          while (! grepl(episode_allcrew[i], episode_categories[j])){
            j <- j+1
          }
          k <- i+1
          while (! grepl(episode_categories[j+1], episode_allcrew[k])){
            episode_music <- c(episode_music, episode_allcrew[k])
            k <- k+1
          }
        }
      }
      for (i in 1:(length(episode_allcrew)-1)){
        if (grepl("Music Department", episode_allcrew[i])){
          # Sometimes music dept is last category
          if (grepl("Music Department", episode_categories[length(episode_categories)])){
            first <- i+1
            for (p in first:length(episode_allcrew)) {
              episode_music <- c(episode_music, episode_allcrew[p])
            }
          } else {
            j <- 1
            while (! grepl(episode_allcrew[i], episode_categories[j])){
              j <- j+1
            }
            k <- i+1
            while (! grepl(episode_categories[j+1], episode_allcrew[k])){
              episode_music <- c(episode_music, episode_allcrew[k])
              k <- k+1
            }
          }
        }
      }
      if (length(episode_music) == 0){
        episode_music <- c("")
      }

      ## IDEM FOR OTHER CATEGORIES ----------------------------------------------------------

      ## EPISODE_INFO CONTAINS THE EPISODE CREW NAMES ORDERED BY CATEGORY -------------------
      episode_info <- list()
      episode_info$directors <- episode_directors
      episode_info$writers <- episode_writers
      episode_info$producers <- episode_producers
      episode_info$sound <- episode_sound
      episode_info$music <- episode_music
      episode_info$art <- episode_art
      episode_info$makeup <- episode_makeup
      episode_info$costume <- episode_costume

      ## EPISODES_LIST_PER_YEAR GATHERS THE INFORMATION FOR EVERY EPISODE OF THE SERIE ------
      ## SPLIT BY YEAR OF RELEASE --------------------------------------------------------
      episodes_list_peryear[[year]][[episode_name]] <- episode_info
    }

    ## SERIES_LIST GATHERS THE INFORMATION FOR EVERY YEAR AND EVERY SERIE -------------------
    series_list[[serie_name]] <- episodes_list_peryear
  }
}

Let’s have a look at the information gathered in series_list. Here are some of the names I collected:

## - Black Mirror, 2011
## Episode: The National Anthem
## Director: Otto Bathurst
## - Black Mirror, 2017
## Episode: Black Museum
## Director: Colm McCarthy
## - Game of Thrones, 2011
## Episode: Winter Is Coming
## Music team: Ramin Djawadi, Evyen Klean, David Klotz, Robin Whittaker, Michael K. Bauer, Brandon Campbell, Stephen Coleman, Janet Lopez, Julie Pearce, Joe Rubel, Bobby Tahouri
## - Game of Thrones, 2017
## Episode: Dragonstone
## Music team: Ramin Djawadi, Omer Benyamin, Evyen Klean, David Klotz, William Marriott, Douglas Parker, Stephen Coleman

What we can see is that for the same series the crew changes depending on the episode we consider.

Gender Determination

Now that I had all the names gathered in the series_list, I needed to determine the genders. I used the same package as in my previous post on the film industry: GenderizeR, which “uses genderize.io API to predict gender from first names”. More details on this package and the reasons why I decided to use it are available in my previous post.

With this R package, I was able to determine for each episode the number of males and females in each category of jobs:

  • the number of male directors,
  • the number of female directors,
  • the number of male producers,
  • the number of female producers,
  • the number of males in costume team,
  • the number of females in costume team.

Here is the code I wrote:

### Genderize our lists of names
# for each serie
for (s in seq_along(series_list)){
  print(names(series_list[s]))  # print serie name
  # for each year
  for (y in seq_along(series_list[[s]])){
    print(names(series_list[[s]][y]))  # print serie year
    # for each episode
    for (i in seq_along(series_list[[s]][[y]])){
      print(names(series_list[[s]][[y]][i]))  # print serie episode

      # Genderize directors -----------------------------------------------------
      directors <- series_list[[s]][[y]][[i]]$directors
      if (directors == ""){
        directors_gender <- list()
        directors_gender$male <- 0
        directors_gender$female <- 0
        series_list[[s]][[y]][[i]]$directors_gender <- directors_gender
      } else{
        # Split the firstnames and the lastnames
        # Keep the firstnames
        directors <- strsplit(directors, " ")
        l <- c()
        for (j in seq_along(directors)){
          l <- c(l, directors[[j]][1])
        }
        directors <- l
        serie_directors_male <- 0
        serie_directors_female <- 0
        # Genderize every firstname and count the number of males and females
        for (p in seq_along(directors)){
          directors_gender <- genderizeAPI(x = directors[p], apikey = "233b284134ae754d9fc56717fec4164e")
          gender <- directors_gender$response$gender
          if (length(gender)>0 && gender == "male"){
            serie_directors_male <- serie_directors_male + 1
          }
          if (length(gender)>0 && gender == "female"){
            serie_directors_female <- serie_directors_female + 1
          }
        }
        # Put the number of males and females in series_list
        directors_gender <- list()
        directors_gender$male <- serie_directors_male
        directors_gender$female <- serie_directors_female
        series_list[[s]][[y]][[i]]$directors_gender <- directors_gender
      }
      # Same code for the 7 other categories -----------------------------------
    }
  }
}

Here are some examples of numbers of male and female I collected:

## Black Mirror, 2011
## Episode: The National Anthem
## Number of male directors: 1
## Number of female directors: 0
##
## Black Mirror, 2017
## Episode: Black Museum
## Number of male directors: 1
## Number of female directors: 0
##
## Game of Thrones, 2011
## Episode: Winter Is Coming
## Number of male in music team: 8
## Number of female in music team: 3
##
## Game of Thrones, 2017
## Episode: Dragonstone
## Number of male in music team: 7
## Number of female in music team: 0

Percentages Calculation

With these numbers gathered in my list, I then calculated the percentages of women in each job category, for each year between 2007 and 2018. I gathered these figures in a data frame called percentages:

##    year directors  writers producers     sound    music      art   makeup
## 1  2018  22.69693 25.06514  27.87217 12.247212 23.25581 36.93275 73.10795
## 2  2017  20.51948 28.20016  27.28932 10.864631 25.46912 29.90641 71.41831
## 3  2016  17.13456 24.51189  27.93240 11.553444 25.03117 30.98003 71.74965
## 4  2015  16.14764 19.42845  26.43828 11.214310 22.16505 29.83354 69.50787
## 5  2014  18.38624 20.88644  27.59163 10.406150 22.21016 30.11341 69.97544
## 6  2013  14.94413 19.60432  28.15726 10.504896 23.29693 29.01968 69.01683
## 7  2012  15.60694 19.82235  29.66566 10.685681 21.45378 26.74160 67.47677
## 8  2011  13.95349 17.60722  26.73747 11.296882 17.11185 25.61805 64.81795
## 9  2010  15.95745 17.05882  27.38841 11.264644 16.51376 24.14815 65.33004
## 10 2009  16.49123 18.90496  28.79557  8.498350 21.72285 26.11128 68.15961
## 11 2008  17.87440 16.62088  29.05844  7.594264 18.74405 23.46251 68.39827
## 12 2007  21.15385 21.78771  30.12798  9.090909 19.23077 21.66124 63.03502
##     costume
## 1  77.24853
## 2  81.34648
## 3  79.35358
## 4  76.48649
## 5  76.62972
## 6  74.74791
## 7  77.35247
## 8  77.46315
## 9  77.67380
## 10 79.56332
## 11 80.53191
## 12 79.24720

Gender Diversity in 2017: TV Series Industry VS Film Industry

Based on this data frame, I created some bar plots to visualise the gender diversity of each job category for each year. Here is the code I wrote to create the bar plot for 2017, which compares the TV series industry to the film industry.

### Barplot 2017
# Data manipulation -------------------------------------------------------------
# Import our movies dataset
percentages_movies <- read.csv("percentages_movies.csv")
percentages_movies <- percentages_movies[ , -1]

# Change column names for movie and serie dataframes
colnames(percentages_movies) <- c("year", "directors", "writers", "producers", "sound", "music", "art", "makeup", "costume")
colnames(percentages) <- c("year", "directors", "writers", "producers", "sound", "music", "art", "makeup", "costume")

# From wide to long dataframes
percentages_movies_long <- percentages_movies %>%
  gather(key = category, value = percentage, -year)
percentages_long <- percentages %>%
  gather(key = category, value = percentage, -year)

# Add a column to these dataframes: movie or film ?
percentages_movies_long$industry <- rep("Film industry", 88)
percentages_long$industry <- rep("Series industry", 96)

# Combine these 2 long dataframes
percentages_movies_series <- bind_rows(percentages_long, percentages_movies_long)

# Filter with year=2017
percentages_movies_series_2017 <- percentages_movies_series %>%
  filter(year == 2017)

# Data visualisation -------------------------------------------------------------
percentages_movies_series_2017$percentage <- as.numeric(format(percentages_movies_series_2017$percentage, digits = 2))

bar_2017 <- ggplot(percentages_movies_series_2017,
                   aes(x = category, y = percentage, group = category, fill = category)) +
  geom_bar(stat = "identity") +
  facet_wrap(~industry) +
  coord_flip() + # Horizontal bar plot
  geom_text(aes(label = percentage), hjust=-0.1, size=3) +
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.title.y=element_blank(),
        plot.title = element_text(hjust = 0.5), # center the title
        legend.title=element_blank()) +
  labs(title = paste("Percentages of women in 2017"), x = "", y = "Percentages") +
  guides(fill = guide_legend(reverse=TRUE)) + # reverse the order of the legend
  scale_fill_manual(values = brewer.pal(8, "Spectral")) # palette used to fill the bars and legend boxs

I have built a simple shiny app which gives access to the bar plots for each year between 2007 and 2017.

Let’s analyse the graph of the year 2017. If we only focus on the TV series figures, we see that sound teams show the lowest female occupation, with less than 11%. They are followed by the role of director with 20.5%. Then, we can see that between 25% and 30% of the roles of writers, producers, music teams and art teams are taken by women. Thus, women are still under-represented in the TV series industry. However, even if series figures show little gender diversity in the above job categories, they are better than the film industry ones, especially for the key roles of directors, writers and producers, which are respectively 5.7, 3 and 1.2 times higher for the series industry than for the film industry. The last thing to notice is that, as in the film industry, the series industry graph shows a representativeness gap between the above roles and the jobs of make-up artists and costume designers, among which more than 70% of the roles are taken by women.

Evolution of the Gender Diversity: TV Series Industry VS Film Industry

Let’s have a look at the evolution of the gender diversity in these two industries in the last decade.

### Evolution plot
# year as date
percentages_movies_series_ymd <- percentages_movies_series %>%
  subset(year != 2018)
percentages_movies_series_ymd$year <- ymd(percentages_movies_series_ymd$year, truncated = 2L)

# Data visualisation
evolution <- ggplot(percentages_movies_series_ymd,
                    aes(x = year, y = percentage, group = category, colour = category)) +
  geom_line(size = 2) +
  facet_wrap(~industry) +
  theme(panel.grid.minor.x = element_blank(),
        plot.title = element_text(hjust = 0.5)) + # center the title
  scale_x_date(date_breaks = "2 year", date_labels = "%Y") +
  scale_color_manual(values = brewer.pal(8, "Set1")) +
  labs(title = "Percentages of women from 2007 to 2017\n Film industry VS serie industry",
       x = "", y = "Percentages")

The first thing I noticed is that for both the film and series industries, the representation gap between the roles of make-up artists and costume designers and the other roles has not decreased since 2007.

The fact that the roles of directors, writers and producers are more open to women in the TV series industry than in the film one is easy to visualise with this graph, and we can see that it has been the case at least since 2007 (and probably before). Besides, since 2007 the series industry has been more diversified in terms of gender for all the categories I studied, except for the sound roles.

I also noticed that since 2010/2011, in the TV series industry, almost all the categories have tended to become more diversified in terms of gender. The only exceptions are the roles of producers (percentages have generally been decreasing slightly since 2007), sound teams (no improvement has been achieved since 2010) and costume teams (the trend has been positive only since 2013). Apart from that, there is a positive trend for the TV series industry, which is not the case for the film industry.

This trend is significant for some roles: the percentages for writers, music teams, art teams and make-up teams in the series industry have increased by 5 to 10% in the last decade. For directors, the percentage of women has also increased by 5% since 2011, but the level reached in 2017 is essentially the same as the one reached in 2007, just as for the film industry. Let’s hope that the trend seen since 2011 for directors will continue.

Conclusion

This study has definitely shown that the TV series industry is more diversified in terms of gender than the film industry, especially for the key roles of directors and writers.

However, even if the series percentages are better than the film ones, women are still under-represented in the TV series industry, and the same regrettable pattern appears: the only jobs which seem open to women are the stereotyped female jobs of make-up artists and costume designers. In all the other categories, the percentage of women in the series industry never reaches more than 30%.

But contrary to the film industry, the TV series industry is actually evolving in the right direction: since 2011, a positive trend has been visible for directors and writers. This evolution is encouraging for the future and suggests that powerful female characters, such as Daenerys Targaryen from Game of Thrones, are coming to TV screens.

To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

How to (quickly) enrich a map with natural and anthropic details

Fri, 08/10/2018 - 04:13

(This article was first published on R – : Francesco Bailo :, and kindly contributed to R-bloggers)


In this post I show how to enrich a ggplot map with data obtained from the Open Street Map (OSM) API. After adding elevation details to the map, I add water bodies and elements identifying human activity. To highlight the more densely inhabited areas, I propose using a density-based clustering algorithm on OSM features.

Data

Let’s first start with some data: a few points somewhere in Italy.

my_points.df <- data.frame(lon = c(9.442366, 9.30222, 9.304297, 9.322077, 9.304432, 9.302683, 9.321543, 9.301541, 9.402329, 9.285115, 9.319564, 9.33959, 9.33367, 9.261872, 9.308274, 9.3011, 9.45355, 9.32527, 9.25892, 9.303647), lat = c(45.952427, 46.041, 45.958035, 45.922367, 45.953438, 45.966248, 45.919487, 45.959721, 46.05933, 46.009094, 45.917286, 46.081585, 45.997231, 45.98764, 46.076529, 45.959641, 45.972649, 45.9151, 46.055069, 46.07671))

I will use the great sf package so let’s create a sf spatial object with

library(sf) my_points.sf <- st_as_sf(my_points.df, coords = c("lon", "lat"), crs = 4326)

crs = 4326 sets the Coordinate Reference System using an integer EPSG code. EPSG:4326 uses the World Geodetic System 1984 (WGS 84) that is used in GPS and its unit is the degree (not the meter!).
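
To make the unit difference concrete, here is a quick check (a small illustrative aside, not part of the original post): the bounding box of the points is expressed in degrees under EPSG:4326 and in metres once transformed to EPSG:32632.

# Bounding box in degrees (EPSG:4326)...
st_bbox(my_points.sf)

# ...and in metres after transforming to a UTM zone (EPSG:32632)
st_bbox(st_transform(my_points.sf, crs = 32632))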

As a first step let’s define a bounding box that includes my points.

my_bbox <- c(xmin = min(my_points.df$lon), xmax = max(my_points.df$lon), ymin = min(my_points.df$lat), ymax = max(my_points.df$lat))

Once I have set the points, I can create a polygon with the function sf::st_polygon(), which takes a two-column matrix for each polygon and creates a simple feature (sf) object, and the function sf::st_geometry(), which takes a simple feature object and gets its geometry. There need to be five points because we always need to close our polygons! So the last (or 5th) row will have the same pair of coordinates as the first row. st_crs() sets the CRS for the geometry my_bbox.sf.

my_bbox.m <- matrix(c(my_bbox['xmin'], my_bbox['xmin'], my_bbox['xmax'], my_bbox['xmax'], my_bbox['xmin'], my_bbox['ymax'], my_bbox['ymin'], my_bbox['ymin'], my_bbox['ymax'], my_bbox['ymax']), ncol = 2) my_bbox.sf <- st_geometry(st_polygon(x = list(my_bbox.m))) st_crs(my_bbox.sf) <- 4326

To give some margin to my points, I create an additional buffered bbox polygon around my_bbox.sf. I want to specify the buffer distance in meters, so I need to transform my polygon to a meter-based coordinate system such as EPSG:32632 and then back to EPSG:4326. EPSG:32632 uses the same geodetic CRS of EPSG:4326 (WGS 84) but crucially its unit is the meter (not the degree!) so all calculations are meter-based. EPSG:32632 is the 32N zone of the Universal Transverse Mercator (UTM) coordinate system. The world grid that can help you find the best UTM zone for your project is here.
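
If you prefer computing the zone to looking it up, the standard formula derives the UTM zone from the longitude. The helper below is a small sketch of my own (not from the original post) returning the matching northern-hemisphere EPSG code (the 326xx series):

# Sketch: UTM zone and northern-hemisphere EPSG code from a longitude in degrees
utm_epsg_north <- function(lon) {
  zone <- floor((lon + 180) / 6) + 1
  32600 + zone # 326xx = WGS 84 / UTM zone xxN
}

utm_epsg_north(9.344197)
# [1] 32632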

my_bbox_buff_2500.sf <- my_bbox.sf %>%
  st_transform(crs = 32632) %>%
  st_buffer(dist = 2500) %>% # 2.5 kilometers
  st_transform(crs = 4326)

my_bbox_buff_5000.sf <- my_bbox.sf %>%
  st_transform(crs = 32632) %>%
  st_buffer(dist = 5000) %>% # 5 kilometers
  st_transform(crs = 4326)

my_bbox_buff_25000.sf <- my_bbox.sf %>%
  st_transform(crs = 32632) %>%
  st_buffer(dist = 25000) %>% # 25 kilometers
  st_transform(crs = 4326)

This is what my point data look like in essence.

library(ggplot2) my_world_map <- map_data('world') my_world_map <- my_world_map[my_world_map$region %in% c("Italy","Switzerland"),] ggplot() + geom_sf(data = my_points.sf) + geom_sf(data = my_bbox_buff_2500.sf, fill = NA) + coord_sf(xlim = c(st_bbox(my_bbox_buff_25000.sf)['xmin'], st_bbox(my_bbox_buff_25000.sf)['xmax']), ylim = c(st_bbox(my_bbox_buff_25000.sf)['ymin'], st_bbox(my_bbox_buff_25000.sf)['ymax'])) + geom_polygon(data = my_world_map, aes(x=long, y = lat, group = group, fill = region), colour = 'black', alpha = .4) + theme_bw()

To give some geographic context to the data, I also plotted the country polygons of Italy and Switzerland from ggplot2::map_data('world'). coord_sf allows you to set the limits of the plot, which are extracted from the bounding box of my_bbox_buff_25000.sf with the function sf::st_bbox().

Add elevation details

Depending on the specific geography of the place you need to map, it might be helpful to provide an idea of the elevation. Elevation differences (that is, hills, mountains and valleys) are crucial constraints on the development of human and animal activities. To map the elevation around our points, I need to download data from a Digital Elevation Model (DEM). There are different options; here I will use data obtained by the Shuttle Radar Topography Mission (SRTM). The raster package ships with a function (raster::getData()) to download the data directly from the R console. The arguments lat and lon are used to get the data file that is needed for the project.

library(raster) dem.raster <- getData("SRTM", lat = 46.0146, lon = 9.344197, download = TRUE)

The SRTM data file is actually pretty large, so it is probably a good idea to crop it to reduce plotting time. The function raster::crop() takes an sp object (from the sp package) to be used as a mask to determine the extent of the resulting (cropped) raster. I need to convert my bbox first with as(my_bbox_buff_25000.sf, 'Spatial').

dem.raster <- crop(dem.raster, as(my_bbox_buff_25000.sf, 'Spatial'), snap='out')

Because I need to add my raster as a ggplot layer, I need to transform it first into a matrix of points with raster::rasterToPoints() and finally into a data.frame object. The third column of the dem.df contains the altitude value (alt).

dem.m <- rasterToPoints(dem.raster) dem.df <- data.frame(dem.m) colnames(dem.df) = c("lon", "lat", "alt")

I can already plot dem.df with

ggplot() + geom_raster(data = dem.df, aes(lon, lat, fill = alt), alpha = .45) + scale_fill_gradientn(colours = terrain.colors(100)) + geom_sf(data = my_bbox_buff_2500.sf, fill = NA) + coord_sf(xlim = c(st_bbox(my_bbox_buff_25000.sf)['xmin'], st_bbox(my_bbox_buff_25000.sf)['xmax']), ylim = c(st_bbox(my_bbox_buff_25000.sf)['ymin'], st_bbox(my_bbox_buff_25000.sf)['ymax'])) + geom_polygon(data = my_world_map, aes(x=long, y = lat, group = group), fill = NA, colour = 'black') + theme_bw()

Now, instead of colouring each cell of the raster based on its altitude, I want to convey the same information with the hill shade. First, I extract the terrain characteristics (the slope and the aspect in this case) with raster::terrain(). Then, I compute the hill shade with raster::hillShade(), setting the elevation angle (of the sun) to 40 and the direction angle of the light to 270. I then transform the resulting hill.raster into a data.frame as before. (Thanks to tim riffe for this)

slope.raster <- terrain(dem.raster, opt='slope') aspect.raster <- terrain(dem.raster, opt='aspect') hill.raster <- hillShade(slope.raster, aspect.raster, 40, 270) hill.m <- rasterToPoints(hill.raster) hill.df <- data.frame(hill.m) colnames(hill.df) <- c("lon", "lat", "hill")

And here we go

ggplot() + geom_raster(data = hill.df, aes(lon, lat, fill = hill), alpha = .45) + scale_fill_gradientn(colours = grey.colors(100)) + geom_sf(data = my_bbox_buff_2500.sf, fill = NA) + coord_sf(xlim = c(st_bbox(my_bbox_buff_25000.sf)['xmin'], st_bbox(my_bbox_buff_25000.sf)['xmax']), ylim = c(st_bbox(my_bbox_buff_25000.sf)['ymin'], st_bbox(my_bbox_buff_25000.sf)['ymax'])) + geom_polygon(data = my_world_map, aes(x=long, y = lat, group = group), fill = NA, colour = 'black') + theme_bw()

Add bodies of water

As you might have guessed from the presence of a few flat surfaces, there are lakes in the region. We can use the excellent osmdata package to query the Open Street Map (OSM) Overpass API to download information about lakes and rivers. With the function osmdata::opq() we define the bbox for the query and with osmdata::add_osm_feature() we extract specific OSM features using the tags (key and value).

library(osmdata) osm_lakes.sf <- opq(bbox = st_bbox(my_bbox_buff_25000.sf)) %>% add_osm_feature(key = 'water', value = 'lake') %>% osmdata_sf() osm_lakes.sf <- osm_lakes.sf$osm_multipolygons osm_rivers.sf <- opq(bbox = st_bbox(my_bbox_buff_25000.sf)) %>% add_osm_feature(key = 'waterway', value = 'river') %>% osmdata_sf() osm_rivers.sf <- osm_rivers.sf$osm_lines ggplot() + geom_raster(data = hill.df, aes(lon, lat, fill = hill), alpha = .45) + scale_fill_gradientn(colours = grey.colors(100)) + geom_sf(data = osm_lakes.sf, fill = '#9ecae1', colour = NA) + geom_sf(data = osm_rivers.sf, colour = '#9ecae1', size = 0.05) + geom_sf(data = my_bbox_buff_2500.sf, fill = NA) + coord_sf(xlim = c(st_bbox(my_bbox_buff_25000.sf)['xmin'], st_bbox(my_bbox_buff_25000.sf)['xmax']), ylim = c(st_bbox(my_bbox_buff_25000.sf)['ymin'], st_bbox(my_bbox_buff_25000.sf)['ymax'])) + geom_polygon(data = my_world_map, aes(x=long, y = lat, group = group), fill = NA, colour = 'black') + theme_bw()

Add roads

Similarly we can download details about roads. The road network on OSM might be very dense, so it is better to do some filtering using the OSM tags.

osm_roads_primary.sf <- opq(bbox = st_bbox(my_bbox_buff_5000.sf)) %>% add_osm_feature(key = 'highway', value = 'trunk') %>% osmdata_sf() osm_roads_primary.sf <- osm_roads_primary.sf$osm_lines osm_roads_secondary.sf <- opq(bbox = st_bbox(my_bbox_buff_5000.sf)) %>% add_osm_feature(key = 'highway', value = 'secondary') %>% osmdata_sf() osm_roads_secondary.sf <- osm_roads_secondary.sf$osm_lines osm_roads_tertiary.sf <- opq(bbox = st_bbox(my_bbox_buff_5000.sf)) %>% add_osm_feature(key = 'highway', value = 'tertiary') %>% osmdata_sf() osm_roads_tertiary.sf <- osm_roads_tertiary.sf$osm_lines ggplot() + geom_raster(data = hill.df, aes(lon, lat, fill = hill), alpha = .45) + scale_fill_gradientn(colours = grey.colors(100)) + geom_sf(data = osm_lakes.sf, fill = '#9ecae1', colour = NA) + geom_sf(data = osm_rivers.sf, colour = '#9ecae1', size = 0.05) + geom_sf(data = osm_roads_primary.sf, colour = '#636363', size = 0.1) + geom_sf(data = osm_roads_secondary.sf, colour = '#636363', size = 0.05) + geom_sf(data = osm_roads_tertiary.sf, colour = '#636363', size = 0.02) + geom_sf(data = my_bbox_buff_2500.sf, fill = NA) + coord_sf(xlim = c(st_bbox(my_bbox_buff_5000.sf)['xmin'], st_bbox(my_bbox_buff_5000.sf)['xmax']), ylim = c(st_bbox(my_bbox_buff_5000.sf)['ymin'], st_bbox(my_bbox_buff_5000.sf)['ymax'])) + geom_polygon(data = my_world_map, aes(x=long, y = lat, group = group), fill = NA, colour = 'black') + theme_bw()

Identify densely populated areas

Densely populated areas are certainly very important in every map that wants to describe human activities. Again, I use OSM data to identify these areas. First, I query for features that are usually associated with human residency: residential roads and buildings. On OSM, roads and buildings are not mapped with the same completeness everywhere, but the density of their presence might provide a good enough approximation of where people actually live. The number of such features (polygons and lines) is usually very large. But I don't actually need the individual features, I just need the geography of densely populated areas.

First, I query the OSM API for all the buildings (osm_polygons), residential roads and unclassified roads (osm_lines). Second, I transform polygons and lines into points with sf::st_centroid() (which requires a CRS transformation). Finally, I store all the point simple feature objects into osm_residential_areas_pnt.df, with two columns respectively for longitude and latitude, and create from it the spatial object osm_residential_areas_pnt.sf.

osm_buildings.sf <- opq(bbox = st_bbox(my_bbox_buff_2500.sf)) %>% add_osm_feature(key = 'building') %>% osmdata_sf() osm_buildings.sf <- osm_buildings.sf$osm_polygons osm_buildings_pnt.sf <- osm_buildings.sf %>% st_transform(crs = 32632) %>% st_centroid() %>% st_transform(crs = 4326) osm_roads_residential.sf <- opq(bbox = st_bbox(my_bbox_buff_5000.sf)) %>% add_osm_feature(key = 'highway', value = 'residential') %>% osmdata_sf() osm_roads_residential.sf <- osm_roads_residential.sf$osm_lines osm_roads_residential_pnt.sf <- osm_roads_residential.sf %>% st_transform(crs = 32632) %>% st_centroid() %>% st_transform(crs = 4326) osm_roads_unclassified.sf <- opq(bbox = st_bbox(my_bbox_buff_5000.sf)) %>% add_osm_feature(key = 'highway', value = 'unclassified') %>% osmdata_sf() osm_roads_unclassified.sf <- osm_roads_unclassified.sf$osm_lines osm_roads_unclassified_pnt.sf <- osm_roads_unclassified.sf %>% st_transform(crs = 32632) %>% st_centroid() %>% st_transform(crs = 4326) osm_residential_areas_pnt.df <- rbind( do.call(rbind, st_geometry(osm_buildings_pnt.sf)), do.call(rbind, st_geometry(osm_roads_residential_pnt.sf)), do.call(rbind, st_geometry(osm_roads_unclassified_pnt.sf))) %>% as.data.frame() colnames(osm_residential_areas_pnt.df) <- c('lon', 'lat') osm_residential_areas_pnt.sf <- st_as_sf(osm_residential_areas_pnt.df, coords = c("lon", "lat"), crs = 4326) osm_residential_areas_pnt.sf <- st_transform(osm_residential_areas_pnt.sf, 32632)

Once I have the coordinate data.frame, with one row for each OSM feature, I pass it to the function dbscan::dbscan() from the dbscan package, which implements a density-based spatial clustering algorithm to identify areas with the highest density of points. eps and minPts, respectively the size of the neighbourhood and the minimum number of points for each region, should be adjusted according to the local geography. Once the clusters are identified, I can create a polygon from the convex hull of the clustered points with the function grDevices::chull(). Each created polygon is stored in the spatial object pop_dense_areas.sf.

library(dbscan) res <- dbscan(osm_residential_areas_pnt.df, eps = 0.0005, minPts = 10) pop_dense_areas.sf <- st_sf(id = 1:max(res$cluster), geometry = st_sfc(lapply(1:max(res$cluster), function(x) st_geometrycollection()))) st_crs(pop_dense_areas.sf) <- 4326 pop_dense_areas.sf <- st_transform(pop_dense_areas.sf, 32632) for (cl in 1:max(res$cluster)) { these_points <- osm_residential_areas_pnt.df[which(res$cluster == cl),] this_chull <- chull(these_points) this_mat <- as.matrix(these_points[this_chull,]) this_mat <- rbind(this_mat, this_mat[1,]) this_polygon <- st_geometry(st_polygon(x = list(this_mat))) st_crs(this_polygon) <- 4326 this_polygon <- st_transform(this_polygon, 32632) pop_dense_areas.sf$geometry[cl] <- this_polygon } pop_dense_areas.sf <- st_transform(pop_dense_areas.sf, 4326)

To give an idea of the actual result, you see below a densely populated area inferred from the cluster of points derived from buildings and residential roads. The algorithm successfully identified the relatively more anthropised regions. Tweaking the two arguments of the density-based clustering function might be necessary depending on the size and human geography of the final map.
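
If you do need to tune eps and minPts, two quick diagnostics can help. The snippet below is a small illustrative sketch (not part of the original workflow): it inspects the cluster sizes returned above and uses dbscan's k-nearest-neighbour distance plot, where a sharp "knee" suggests a reasonable eps.

library(dbscan)

# Cluster sizes; cluster 0 is the noise (unclustered) points
table(res$cluster)

# Sorted k-NN distances: look for a sharp bend ("knee") to pick eps.
# k is commonly set to minPts - 1.
kNNdistplot(osm_residential_areas_pnt.df, k = 9)
abline(h = 0.0005, lty = 2) # the eps value used above, for comparison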

The final enrichment of my map is the addition of a few names of populated places (towns and villages in the OSM tagging system). Again I use the OSM API. I filter the results for the more important places based on their relative population and, to avoid overcrowding the map with labels, I only select the 10 largest places. geom_sf doesn't allow labelling points (yet), so it is necessary to convert the spatial object into a simple data.frame with two coordinate columns. Names might be too long, so I add a \n (newline) after at most 10 characters with gsub('(.{1,10})(\\s|$)', '\\1\n', str), as illustrated below. (Thanks to xiechao for this)
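
A quick illustration of what that substitution does (the place name here is made up, just to show the wrapping):

gsub('(.{1,10})(\\s|$)', '\\1\n', "Santa Maria Maggiore")
# [1] "Santa\nMaria\nMaggiore\n"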

library(dplyr) osm_villages.sf <- opq(bbox = st_bbox(my_bbox_buff_2500.sf)) %>% add_osm_feature(key = 'place', value = 'village') %>% osmdata_sf() osm_villages.sf <- osm_villages.sf$osm_points osm_towns.sf <- opq(bbox = st_bbox(my_bbox_buff_2500.sf)) %>% add_osm_feature(key = 'place', value = 'town') %>% osmdata_sf() osm_towns.sf <- osm_towns.sf$osm_points osm_larger_places.df <- rbind(cbind(as.data.frame(osm_villages.sf)[,c('name','population')], st_coordinates(osm_villages.sf)), cbind(as.data.frame(osm_towns.sf)[,c('name','population')], st_coordinates(osm_towns.sf))) %>% mutate(population = as.numeric(as.character(population))) %>% top_n(10, population) osm_larger_places.df$name <- gsub('(.{1,10})(\\s|$)', '\\1\n', osm_larger_places.df$name)

And now let’s plot everything along with my initial data points:

ggplot() + geom_raster(data = hill.df, aes(lon, lat, fill = hill), alpha = .45) + scale_fill_gradientn(colours = grey.colors(100)) + geom_sf(data = pop_dense_areas.sf, fill = '#fee08b', colour = NA, alpha = 0.9) + geom_sf(data = osm_lakes.sf, fill = '#9ecae1', colour = NA) + geom_sf(data = osm_rivers.sf, colour = '#9ecae1', size = 0.4) + geom_sf(data = osm_roads_primary.sf, colour = '#636363', size = 0.1) + geom_sf(data = osm_roads_secondary.sf, colour = '#636363', size = 0.08) + geom_sf(data = osm_roads_tertiary.sf, colour = '#636363', size = 0.08) + geom_sf(data = my_points.sf, shape = 5, colour = 'black', size = .5) + geom_text(data = osm_larger_places.df, aes(x=X,y=Y,label=name), size = 2.5, alpha = .65) + coord_sf(xlim = c(st_bbox(my_bbox_buff_2500.sf)['xmin'], st_bbox(my_bbox_buff_2500.sf)['xmax']), ylim = c(st_bbox(my_bbox_buff_2500.sf)['ymin'], st_bbox(my_bbox_buff_2500.sf)['ymax'])) + guides(fill=FALSE)+ labs(x=NULL, y=NULL) + theme_bw()

References

Hahsler, Michael, and Matthew Piekenbrock. 2018. Dbscan: Density Based Clustering of Applications with Noise (Dbscan) and Related Algorithms. https://CRAN.R-project.org/package=dbscan.

Hijmans, Robert J. 2017. Raster: Geographic Data Analysis and Modeling. https://CRAN.R-project.org/package=raster.

Padgham, Mark, Bob Rudis, Robin Lovelace, and Maëlle Salmon. 2018. Osmdata: Import ’Openstreetmap’ Data as Simple Features or Spatial Objects. https://CRAN.R-project.org/package=osmdata.

Pebesma, Edzer. 2018. Sf: Simple Features for R. https://CRAN.R-project.org/package=sf.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, and Kara Woo. 2018. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.

Wickham, Hadley, Romain Francois, Lionel Henry, and Kirill Müller. 2017. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Xie, Yihui. 2018. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://CRAN.R-project.org/package=knitr.

To leave a comment for the author, please follow the link and comment on their blog: R – : Francesco Bailo :. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Exploring ggplot2 boxplots – Defining limits and adjusting style

Fri, 08/10/2018 - 02:00

(This article was first published on The USGS OWI blog , and kindly contributed to R-bloggers)

Boxplots are often used to show data distributions, and ggplot2 is often used to visualize data. A question that comes up is what exactly do the box plots represent? The ggplot2 box plots follow standard Tukey representations, and there are many references of this online and in standard statistical text books. The base R function to calculate the box plot limits is boxplot.stats. The help file for this function is very informative, but it’s often non-R users asking what exactly the plot means. Therefore, this blog post breaks down the calculations into (hopefully!) easy-to-follow chunks of code for you to make your own box plot legend if necessary. Some additional goals here are to create boxplots that come close to USGS style. Features in this blog post take advantage of enhancements to ggplot2 in version 3.0.0 or later.
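
As a quick reminder of what boxplot.stats returns (a small illustrative aside, not part of the original post): the $stats element holds the lower whisker, lower hinge, median, upper hinge and upper whisker, and $out holds the points drawn individually as outliers.

x <- c(1:10, 25) # one value well above the rest
boxplot.stats(x)$stats # lower whisker, lower hinge, median, upper hinge, upper whisker
## [1]  1.0  3.5  6.0  8.5 10.0
boxplot.stats(x)$out # values beyond 1.5 * IQR from the hinges
## [1] 25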

First, let’s get some data that might be typically plotted in a USGS report using a boxplot. Here we’ll use chloride data (parameter code “00940”) measured at a USGS station on the Fox River in Green Bay, WI (station ID “04085139”). We’ll use the package dataRetrieval to get the data (see this tutorial for more information on dataRetrieval), and plot a simple boxplot by month using ggplot2:

library(dataRetrieval)
library(ggplot2)

# Get chloride data using dataRetrieval:
chloride <- readNWISqw("04085139", "00940")

# Add a month column:
chloride$month <- month.abb[as.numeric(format(chloride$sample_dt, "%m"))]
chloride$month <- factor(chloride$month, labels = month.abb)

# Pull out the official parameter and site names for labels:
cl_name <- attr(chloride, "variableInfo")[["parameter_nm"]]
cl_site <- attr(chloride, "siteInfo")[["station_nm"]]

# Create basic ggplot graph:
ggplot(data = chloride, aes(x = month, y = result_va)) +
  geom_boxplot() +
  xlab("Month") +
  ylab(cl_name) +
  labs(title = cl_site)

Is that graph great? YES! And for presentations and/or journal publications, that graph might be appropriate. However, for an official USGS report, USGS employees need to get the graphics approved to assure they follow specific style guidelines. The approving officer would probably come back from the review with the following comments:

Reviewer’s comments, and the ggplot2 element to modify for each:

  • Remove background color, grid lines: Adjust theme
  • Add horizontal bars to the upper and lower whiskers: Add stat_boxplot
  • Have tick marks go inside the plot: Adjust theme
  • Tick marks should be on both sides of the y axis: Add sec.axis to scale_y_continuous
  • Remove tick marks from discrete data: Adjust theme
  • y-axis needs to start exactly at 0: Add expand_limits
  • y-axis labels need to be shown at 0 and at the upper scale: Add breaks and limits to scale_y_continuous
  • Add very specific legend: Create function ggplot_box_legend
  • Add the number of observations above each boxplot: Add custom stat_summary
  • Change text size: Adjust geom_text defaults
  • Change font (we'll use "serif" in this blog, although that is not the official USGS font): Adjust geom_text defaults

As you can see, it will not be as simple as creating a single custom ggplot theme to comply with the requirements. However, we can string together ggplot commands in a list for easy re-use. This blog is not going to get you perfect compliance with the USGS standards, but it will get much closer. Also, while these style adjustments are tailored to USGS requirements, the process described here may be useful for other graphic guidelines as well.
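
The approach below leans on the fact that ggplot2 lets you add a plain list of layers and settings with +, so a whole bundle of styling can be reused across plots. A tiny illustrative sketch (the object names here are made up):

library(ggplot2)

# A reusable bundle of styling: just a list of ggplot2 components
usgs_ish_style <- list(
  theme_bw(),
  theme(panel.grid = element_blank()),
  expand_limits(y = 0)
)

# Adding the list applies every element in it
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  usgs_ish_style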

So, let’s skip to the exciting conclusion and use some code that will be described later (boxplot_framework and ggplot_box_legend) to create the same plot, now closer to those USGS style requirements:

library(cowplot)

# NOTE! This is a preview of the FUTURE!
# We'll create the functions ggplot_box_legend and boxplot_framework
# later in this blog post.
# So....by the end of this post, you will be able to:
legend_plot <- ggplot_box_legend()

chloride_plot <- ggplot(data = chloride, aes(x = month, y = result_va)) +
  boxplot_framework(upper_limit = 70) +
  xlab(label = "Month") +
  ylab(label = cl_name) +
  labs(title = cl_site)

plot_grid(chloride_plot, legend_plot, nrow = 1, rel_widths = c(.6,.4))

As can be seen in the code chunk, we are now using a function ggplot_box_legend to make a legend, boxplot_framework to accommodate all of the style requirements, and the cowplot package to plot them together.

boxplot_framework: Styling the boxplot

Let’s get our style requirements figured out. First, we can set some basic plot elements for a theme. We can start with the theme_bw and add to that. Here we remove the grid, set the size of the title, bring the y-ticks inside the plotting area, and remove the x-ticks:

theme_USGS_box <- function(base_family = "serif", ...){ theme_bw(base_family = base_family, ...) + theme( panel.grid = element_blank(), plot.title = element_text(size = 8), axis.ticks.length = unit(-0.05, "in"), axis.text.y = element_text(margin=unit(c(0.3,0.3,0.3,0.3), "cm")), axis.text.x = element_text(margin=unit(c(0.3,0.3,0.3,0.3), "cm")), axis.ticks.x = element_blank(), aspect.ratio = 1, legend.background = element_rect(color = "black", fill = "white") ) }

Next, we can change the defaults of the geom_text to a smaller size and font.

update_geom_defaults("text", list(size = 3, family = "serif"))

We also need to figure out what other ggplot2 functions need to be added. The basic ggplot code for the chloride plot would be:

n_fun <- function(x){ return(data.frame(y = 0.95*70, label = length(x))) } ggplot(data = chloride, aes(x = month, y = result_va)) + stat_boxplot(geom ='errorbar', width = 0.6) + geom_boxplot(width = 0.6, fill = "lightgrey") + stat_summary(fun.data = n_fun, geom = "text", hjust = 0.5) + expand_limits(y = 0) + theme_USGS_box() + scale_y_continuous(sec.axis = dup_axis(label = NULL, name = NULL), expand = expand_scale(mult = c(0, 0)), breaks = pretty(c(0,70), n = 5), limits = c(0,70))

Breaking that code down:

  • stat_boxplot(geom = 'errorbar'): The "errorbars" are used to make the horizontal lines on the upper and lower whiskers. This needs to happen first so it is in the back of the plot.
  • geom_boxplot: Regular boxplot.
  • stat_summary(fun.data = n_fun, geom = "text", hjust = 0.5): The stat_summary function is very powerful for adding specific summary statistics to the plot. In this case, we are adding a geom_text that is calculated with our custom n_fun. That function comes back with the count of the boxplot, and puts it at 95% of the hard-coded upper limit.
  • expand_limits: This forces the plot to include 0.
  • theme_USGS_box: Theme created above to help with grid lines, tick marks, axis size/fonts, etc.
  • scale_y_continuous: A tricky part of the USGS requirements involves 4 parts: add ticks to the right side, have at least 4 "pretty" labels on the left axis, remove padding, and have the labels start and end at the beginning and end of the plot. Breaking that down further:
  • scale_y_continuous(sec.axis = dup_axis): Handy way to add tick marks to the right side of the graph.
  • scale_y_continuous(expand = expand_scale(mult = c(0, 0))): Remove padding.
  • scale_y_continuous(breaks = pretty(c(0,70), n = 5)): Make pretty label breaks, assuring 5 pretty labels if the graph went from 0 to 70.
  • scale_y_continuous(limits = c(0,70)): Assure the graph goes from 0 to 70.

Let’s look at a few other common boxplots to see if there are other ggplot2 elements that would be useful in a common boxplot_framework function.

Logarithmic boxplot

For another example, we might need to make a boxplot with a logarithmic scale. This data is for phosphorus measurements on the Pheasant Branch Creek in Middleton, WI.

site <- "05427948" pCode <- "00665" # Get phosphorus data using dataRetrieval: phos_data <- readNWISqw(site, pCode) # Create a month column: phos_data$month <- month.abb[as.numeric(format(phos_data$sample_dt, "%m"))] phos_data$month <- factor(phos_data$month, labels = month.abb) # Get site name and paramter name for labels: phos_name <- attr(phos_data, "variableInfo")[["parameter_nm"]] phos_site <- attr(phos_data, "siteInfo")[["station_nm"]] n_fun <- function(x){ return(data.frame(y = 0.95*log10(50), label = length(x))) } prettyLogs <- function(x){ pretty_range <- range(x[x > 0]) pretty_logs <- 10^(-10:10) log_index <- which(pretty_logs < pretty_range[2] & pretty_logs > pretty_range[1]) log_index <- c(log_index[1]-1,log_index, log_index[length(log_index)]+1) pretty_logs_new <- pretty_logs[log_index] return(pretty_logs_new) } fancyNumbers <- function(n){ nNoNA <- n[!is.na(n)] x <-gsub(pattern = "1e",replacement = "10^", x = format(nNoNA, scientific = TRUE)) exponents <- as.numeric(sapply(strsplit(x, "\\^"), function(j) j[2])) base <- ifelse(exponents == 0, "1", ifelse(exponents == 1, "10","10^")) exponents[base == "1" | base == "10"] <- "" textNums <- rep(NA, length(n)) textNums[!is.na(n)] <- paste0(base,exponents) textReturn <- parse(text=textNums) return(textReturn) } phos_plot <- ggplot(data = phos_data, aes(x = month, y = result_va)) + stat_boxplot(geom ='errorbar', width = 0.6) + geom_boxplot(width = 0.6, fill = "lightgrey") + stat_summary(fun.data = n_fun, geom = "text", hjust = 0.5) + theme_USGS_box() + scale_y_log10(limits = c(0.001, 50), expand = expand_scale(mult = c(0, 0)), labels=fancyNumbers, breaks=prettyLogs) + annotation_logticks(sides = c("rl")) + xlab("Month") + ylab(phos_name) + labs(title = phos_site) phos_plot

What are the new features we have to consider for log scales?

  • stat_boxplot: The stat_boxplot call is the same, but our custom function to calculate counts needs to be adjusted so the position is in log units.
  • scale_y_log10: This is used instead of scale_y_continuous.
  • annotation_logticks(sides = c("rl")): Adds nice log ticks to the right ("r") and left ("l") side.
  • prettyLogs: This function forces the y-axis breaks to be on every 10^x. This could be adjusted if a finer scale was needed.
  • fancyNumbers: This is a custom formatting function for the log axis. This function could be adjusted if other formatting was needed.

Grouped boxplots

We might also want to make grouped boxplots. In ggplot, it’s pretty easy to add a “fill” to the aes argument. Here we’ll plot temperature distributions at 4 USGS stations. We’ll group the measurements by a “daytime” and “nighttime” factor. Temperature might be a parameter that would not be required to start at 0.

library(dplyr) # Get water temperature data for a variety of USGS stations temp_q_data <- readNWISuv(siteNumbers = c("04026561", "04063700", "04082400", "05427927"), parameterCd = '00010', startDate = "2018-06-01", endDate = "2018-06-03") temperature_name <- attr(temp_q_data, "variableInfo")[["variableName"]] # add an hour of day to create groups (daytime or nighttime) temp_q_data <- temp_q_data %>% renameNWISColumns() %>% mutate(hourOfDay = as.numeric(format(dateTime, "%H")), timeOfDay = case_when(hourOfDay < 20 & hourOfDay > 6 ~ "daytime", TRUE ~ "nighttime" # catchall )) n_fun <- function(x){ return(data.frame(y = 0.95*30, label = length(x))) } temperature_plot <- ggplot(data = temp_q_data, aes(x=site_no, y=Wtemp_Inst, fill=timeOfDay)) + stat_boxplot(geom ='errorbar', width = 0.6) + geom_boxplot(width = 0.6) + stat_summary(fun.data = n_fun, geom = "text", aes(group=timeOfDay), hjust = 0.5, position = position_dodge(0.6)) + expand_limits(y = 0) + scale_y_continuous(sec.axis = dup_axis(label = NULL, name = NULL), expand = expand_scale(mult = c(0, 0)), breaks = pretty(c(10,30), n = 5), limits = c(10,30)) + theme_USGS_box() + xlab("Station ID") + ylab(temperature_name) + scale_fill_discrete(name = "EXPLANATION") + theme(legend.position = c(0.175, 0.78)) temperature_plot

What are the new features we have to consider for grouped boxplots?

  • stat_summary(position): We need to move the counts to above the boxplots. This is done by shifting them the same amount as the width.
  • stat_summary(aes(group=timeOfDay)): We need to include how the boxplots are grouped.
  • scale_fill_discrete: Needed to include a fill legend.

Additionally, the parameter name that comes back from dataRetrieval could use some formatting. The following function can fix that for both ggplot2 and base R graphics:

unescape_html <- function(str){ fancy_chars <- regmatches(str, gregexpr("&#\\d{3};",str)) unescaped <- xml2::xml_text(xml2::read_html(paste0("", fancy_chars, ""))) fancy_chars <- gsub(pattern = "&#\\d{3};", replacement = unescaped, x = str) fancy_chars <- gsub("Â","", fancy_chars) return(fancy_chars) }

We’ll use this function in the next section.

Framework function

Finally, we can bring all of those elements together into a single list for ggplot2 to use. While we're at it, we can create a function that is flexible for both linear and logarithmic scales, as well as grouped boxplots. It's a bit clunky because you need to specify the upper and lower limits of the plot. Many of the USGS style requirements depend on specific upper and lower limits, so I decided this was an acceptable solution for this blog post. There's almost certainly a slicker way to do that, but for now, it works:

boxplot_framework <- function(upper_limit, family_font = "serif", lower_limit = 0, logY = FALSE, fill_var = NA, fill = "lightgrey", width = 0.6){ update_geom_defaults("text", list(size = 3, family = family_font)) n_fun <- function(x, lY = logY){ return(data.frame(y = ifelse(logY, 0.95*log10(upper_limit), 0.95*upper_limit), label = length(x))) } prettyLogs <- function(x){ pretty_range <- range(x[x > 0]) pretty_logs <- 10^(-10:10) log_index <- which(pretty_logs < pretty_range[2] & pretty_logs > pretty_range[1]) log_index <- c(log_index[1]-1,log_index, log_index[length(log_index)]+1) pretty_logs_new <- pretty_logs[log_index] return(pretty_logs_new) } fancyNumbers <- function(n){ nNoNA <- n[!is.na(n)] x <-gsub(pattern = "1e",replacement = "10^", x = format(nNoNA, scientific = TRUE)) exponents <- as.numeric(sapply(strsplit(x, "\\^"), function(j) j[2])) base <- ifelse(exponents == 0, "1", ifelse(exponents == 1, "10","10^")) exponents[base == "1" | base == "10"] <- "" textNums <- rep(NA, length(n)) textNums[!is.na(n)] <- paste0(base,exponents) textReturn <- parse(text=textNums) return(textReturn) } if(!is.na(fill_var)){ basic_elements <- list(stat_boxplot(geom ='errorbar', width = width), geom_boxplot(width = width), stat_summary(fun.data = n_fun, geom = "text", position = position_dodge(width), hjust =0.5, aes_string(group=fill_var)), expand_limits(y = lower_limit), theme_USGS_box()) } else { basic_elements <- list(stat_boxplot(geom ='errorbar', width = width), geom_boxplot(width = width, fill = fill), stat_summary(fun.data = n_fun, geom = "text", hjust =0.5), expand_limits(y = lower_limit), theme_USGS_box()) } if(logY){ return(c(basic_elements, scale_y_log10(limits = c(lower_limit, upper_limit), expand = expand_scale(mult = c(0, 0)), labels=fancyNumbers, breaks=prettyLogs), annotation_logticks(sides = c("rl")))) } else { return(c(basic_elements, scale_y_continuous(sec.axis = dup_axis(label = NULL, name = NULL), expand = expand_scale(mult = c(0, 0)), breaks = pretty(c(lower_limit,upper_limit), n = 5), limits = c(lower_limit,upper_limit)))) } } Examples with our framework

Let’s see if it works! Let’s build the last set of example figures using our new function boxplot_framework. I’m also going to use the ‘cowplot’ package to print them all together. I’ll also include the ggplot_box_legend which will be described in the next section.

legend_plot <- ggplot_box_legend()

chloride_plot <- ggplot(data = chloride, aes(x = month, y = result_va)) +
  boxplot_framework(upper_limit = 70) +
  xlab(label = "Month") +
  ylab(label = cl_name) +
  labs(title = cl_site)

phos_plot <- ggplot(data = phos_data, aes(x = month, y = result_va)) +
  boxplot_framework(upper_limit = 50, lower_limit = 0.001, logY = TRUE) +
  xlab(label = "Month") +
  # Shortened label since the graph area is smaller
  ylab(label = "Phosphorus in milligrams per liter") +
  labs(title = phos_site)

temperature_plot <- ggplot(data = temp_q_data, aes(x = site_no, y = Wtemp_Inst, fill = timeOfDay)) +
  boxplot_framework(upper_limit = 30, lower_limit = 10, fill_var = "timeOfDay") +
  xlab(label = "Station ID") +
  ylab(label = unescape_html(temperature_name)) +
  labs(title = "Daytime vs Nighttime Temperature Distribution") +
  scale_fill_discrete(name = "EXPLANATION") +
  theme(legend.position = c(0.225, 0.72),
        legend.title = element_text(size = 7))

plot_grid(chloride_plot, phos_plot, temperature_plot, legend_plot, nrow = 2)

ggplot_box_legend: What is a boxplot?

A non-trivial requirement of the USGS boxplot style guidelines is a detailed, prescribed legend. In this section we'll first verify that ggplot2 boxplots use the same definitions for the lines and dots, and then we'll make a function that creates the prescribed legend. To start, let's set up random data using the R function sample and then create a function to calculate each value.

set.seed(100)
sample_df <- data.frame(parameter = "test", values = sample(500))

# Extend the top whisker a bit:
sample_df$values[1:100] <- 701:800

# Make sure there's only 1 lower outlier:
sample_df$values[1] <- -350

Next, we’ll create a function that calculates the necessary values for the boxplots:

ggplot2_boxplot <- function(x){ quartiles <- as.numeric(quantile(x, probs = c(0.25, 0.5, 0.75))) names(quartiles) <- c("25th percentile", "50th percentile\n(median)", "75th percentile") IQR <- diff(quartiles[c(1,3)]) upper_whisker <- max(x[x < (quartiles[3] + 1.5 * IQR)]) lower_whisker <- min(x[x > (quartiles[1] - 1.5 * IQR)]) upper_dots <- x[x > (quartiles[3] + 1.5*IQR)] lower_dots <- x[x < (quartiles[1] - 1.5*IQR)] return(list("quartiles" = quartiles, "25th percentile" = as.numeric(quartiles[1]), "50th percentile\n(median)" = as.numeric(quartiles[2]), "75th percentile" = as.numeric(quartiles[3]), "IQR" = IQR, "upper_whisker" = upper_whisker, "lower_whisker" = lower_whisker, "upper_dots" = upper_dots, "lower_dots" = lower_dots)) } ggplot_output <- ggplot2_boxplot(sample_df$values)

What are those calculations?

  • Quartiles (25th, 50th and 75th percentiles); the 50th percentile is the median
  • Interquartile range is the difference between the 75th and 25th percentiles
  • The upper whisker is the maximum value of the data that is within 1.5 times the interquartile range over the 75th percentile.
  • The lower whisker is the minimum value of the data that is within 1.5 times the interquartile range under the 25th percentile.
  • Outlier values are considered any values over 1.5 times the interquartile range over the 75th percentile or any values under 1.5 times the interquartile range under the 25th percentile.

Let’s check that the output matches boxplot.stats:

# Using base R:
base_R_output <- boxplot.stats(sample_df$values)

# Some checks:

# Outliers:
all(c(ggplot_output[["upper_dots"]], ggplot_output[["lower_dots"]]) %in% c(base_R_output[["out"]]))
## [1] TRUE

# whiskers:
ggplot_output[["upper_whisker"]] == base_R_output[["stats"]][5]
## [1] TRUE
ggplot_output[["lower_whisker"]] == base_R_output[["stats"]][1]
## [1] TRUE

Boxplot Legend

Let’s use this information to generate a legend, and make the code reusable by creating a standalone function that we used in earlier code (ggplot_box_legend). There is a lot of ggplot2 code to digest here. Most of it is style adjustments to approximate the USGS style guidelines for a boxplot legend.

ggplot_box_legend <- function(family = "serif"){ # Create data to use in the boxplot legend: set.seed(100) sample_df <- data.frame(parameter = "test", values = sample(500)) # Extend the top whisker a bit: sample_df$values[1:100] <- 701:800 # Make sure there's only 1 lower outlier: sample_df$values[1] <- -350 # Function to calculate important values: ggplot2_boxplot <- function(x){ quartiles <- as.numeric(quantile(x, probs = c(0.25, 0.5, 0.75))) names(quartiles) <- c("25th percentile", "50th percentile\n(median)", "75th percentile") IQR <- diff(quartiles[c(1,3)]) upper_whisker <- max(x[x < (quartiles[3] + 1.5 * IQR)]) lower_whisker <- min(x[x > (quartiles[1] - 1.5 * IQR)]) upper_dots <- x[x > (quartiles[3] + 1.5*IQR)] lower_dots <- x[x < (quartiles[1] - 1.5*IQR)] return(list("quartiles" = quartiles, "25th percentile" = as.numeric(quartiles[1]), "50th percentile\n(median)" = as.numeric(quartiles[2]), "75th percentile" = as.numeric(quartiles[3]), "IQR" = IQR, "upper_whisker" = upper_whisker, "lower_whisker" = lower_whisker, "upper_dots" = upper_dots, "lower_dots" = lower_dots)) } # Get those values: ggplot_output <- ggplot2_boxplot(sample_df$values) # Lots of text in the legend, make it smaller and consistent font: update_geom_defaults("text", list(size = 3, hjust = 0, family = family)) # Labels don't inherit text: update_geom_defaults("label", list(size = 3, hjust = 0, family = family)) # Create the legend: # The main elements of the plot (the boxplot, error bars, and count) # are the easy part. # The text describing each of those takes a lot of fiddling to # get the location and style just right: explain_plot <- ggplot() + stat_boxplot(data = sample_df, aes(x = parameter, y=values), geom ='errorbar', width = 0.3) + geom_boxplot(data = sample_df, aes(x = parameter, y=values), width = 0.3, fill = "lightgrey") + geom_text(aes(x = 1, y = 950, label = "500"), hjust = 0.5) + geom_text(aes(x = 1.17, y = 950, label = "Number of values"), fontface = "bold", vjust = 0.4) + theme_minimal(base_size = 5, base_family = family) + geom_segment(aes(x = 2.3, xend = 2.3, y = ggplot_output[["25th percentile"]], yend = ggplot_output[["75th percentile"]])) + geom_segment(aes(x = 1.2, xend = 2.3, y = ggplot_output[["25th percentile"]], yend = ggplot_output[["25th percentile"]])) + geom_segment(aes(x = 1.2, xend = 2.3, y = ggplot_output[["75th percentile"]], yend = ggplot_output[["75th percentile"]])) + geom_text(aes(x = 2.4, y = ggplot_output[["50th percentile\n(median)"]]), label = "Interquartile\nrange", fontface = "bold", vjust = 0.4) + geom_text(aes(x = c(1.17,1.17), y = c(ggplot_output[["upper_whisker"]], ggplot_output[["lower_whisker"]]), label = c("Largest value within 1.5 times\ninterquartile range above\n75th percentile", "Smallest value within 1.5 times\ninterquartile range below\n25th percentile")), fontface = "bold", vjust = 0.9) + geom_text(aes(x = c(1.17), y = ggplot_output[["lower_dots"]], label = "Outside value"), vjust = 0.5, fontface = "bold") + geom_text(aes(x = c(1.9), y = ggplot_output[["lower_dots"]], label = "-Value is >1.5 times and"), vjust = 0.5) + geom_text(aes(x = 1.17, y = ggplot_output[["lower_dots"]], label = "<3 times the interquartile range\nbeyond either end of the box"), vjust = 1.5) + geom_label(aes(x = 1.17, y = ggplot_output[["quartiles"]], label = names(ggplot_output[["quartiles"]])), vjust = c(0.4,0.85,0.4), fill = "white", label.size = 0) + ylab("") + xlab("") + theme(axis.text = element_blank(), axis.ticks = element_blank(), panel.grid = element_blank(), aspect.ratio = 
4/3, plot.title = element_text(hjust = 0.5, size = 10)) + coord_cartesian(xlim = c(1.4,3.1), ylim = c(-600, 900)) + labs(title = "EXPLANATION") return(explain_plot) } ggplot_box_legend()

Bring it together

What’s nice about leaving this in the world of ggplot2 is that it is still possible to use other ggplot2 elements on the plot. For example, let’s add a reporting limit as horizontal lines to the phosphorus graph:

phos_plot_with_DL <- phos_plot + geom_hline(linetype = "dashed", yintercept = 0.01) explain_plot_DL <- ggplot_box_legend() + geom_segment(aes(y = -650, yend = -650, x = 0.6, xend = 1.6), linetype="dashed") + geom_text(aes(y = -650, x = 1.8, label = "Reporting Limit")) plot_grid(phos_plot_with_DL, explain_plot_DL, nrow = 1, rel_widths = c(.6,.4))

I hope you liked my “deep dive” into ggplot2 boxplots. Many of the techniques here can be used to modify other ggplot2 plots.

To leave a comment for the author, please follow the link and comment on their blog: The USGS OWI blog . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

rstudio::conf(2019) diversity scholarships

Fri, 08/10/2018 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

rstudio::conf(2019) continues our tradition of diversity scholarships, and this year we’re doubling the program (again!) to 38 recipients:

  • 32 domestic diversity scholarships available to anyone living in the US or Canada who is a member of a group that is under-represented at rstudio::conf.

  • 6 international scholarships available to residents of Africa, South or Central America, or Mexico.

Both groups will receive complimentary conference and workshop registration, and funds for travel and accommodation (up to $1000 for domestic candidates and $3000 for international). At the conference, scholars will also have networking opportunities with past diversity scholarship recipients as well as with leaders in the field.

In the long run, we hope that the rstudio::conf participants reflect the full diversity of the world around us. We believe in building on-ramps so that people from diverse backgrounds can learn R, build their knowledge, and then contribute back to the community. We also recognise that there are many parts of the world that do not offer easy access to R events. This year we identified Africa and South America / Mexico as regions that, broadly speaking, have had fewer R events in the recent past. We will continue to re-evaluate the regions where our scholarships can have the greatest impact and will adjust this program as rstudio::conf grows.

Scholarship applications will be evaluated on two main criteria:

  • How will attending the conference impact you? What important problems will you be able to tackle that you couldn’t before? You will learn the most if you already have some experience with R, so show us what you’ve achieved so far.

  • How will you share your knowledge with others? We can’t help everyone, so we’re particularly interested in helping those who will go back to their communities and spread the love. Show us how you’ve shared your skills and knowledge in the past and tell us what you plan to do in the future.

The scholarships are competitive, so please don’t waste words on generalities. Instead get right to specifics about you and your achievements as they correspond to these criteria.

Apply now!

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
