R news and tutorials contributed by (750) R bloggers

### Availability of Microsoft R Open 3.5.2 and 3.5.3

Fri, 05/03/2019 - 01:52

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It's taken a little bit longer than usual, but Microsoft R Open 3.5.2 (MRO) is now available for download for Windows and Linux. This update is based on R 3.5.2, and accordingly fixes a few minor bugs compared to MRO 3.5.1. The main change you will note is that new CRAN packages released since R 3.5.1 can now be used with this version of MRO.

Microsoft R Open 3.5.3, based on R 3.5.3, will be available next week, on May 10. Microsoft R Open 3.6.0, based on the recently-released R 3.6.0, is currently under development, and we'll make an announcement here when it's available too.

One thing to note: as of version 3.5.2, Microsoft R Open is no longer distributed for MacOS systems. If you want to continue to take advantage of the Accelerate framework on MacOS, it's not too difficult to tweak the CRAN binary for MacOS to enable multi-threaded computing, and the additional open-source bundled packages (like checkpoint and iterators) are available to install from CRAN or GitHub.

As always, we hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. You can follow the development of Microsoft R Open at the MRO Github repository. To download Microsoft R Open, simply follow the link below.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Deep (learning) like Jacques Cousteau – Part 1 – Sets

Thu, 05/02/2019 - 22:30

(This article was first published on Embracing the Random | R, and kindly contributed to R-bloggers)

(TL;DR: I’m going to go deep into deep learning. Sets are collections of things.)

I will be using a lot of LaTeX rendered with MathJax which doesn’t show up in the RSS feed. Please visit my site directly to see equations and all that goodness!

Here I go, deep type flow
Jacques Cousteau could never get this low

Ol’ Dirty Bastard from Wu-Tang Clan’s “Da Mystery of Chessboxin’”

Motivation for this series

I love deep learning. But a lot of the time I don’t feel comfortable with it at a foundational level. I need to do something about this! I’d describe my learning style as one of ‘obsessive detail orientation’. So let’s get into the detail together!

Our aim

Our aim is to develop an understanding of the foundations of deep learning before moving onto deep learning itself. This means we will be starting with mathematics! We will also learn how to apply these ideas in R.

What I write may not be perfectly rigorous in the academic sense. However, to make sure that what I write is reasonably correct, I will be referring to these great books:

• Goodfellow, Ian, et al. Deep Learning
• Strang, Gilbert. Linear Algebra and Its Applications
• Shilov, Georgii Evgen’evich. Linear Algebra
• Lipschutz, Seymour. Schaum’s Outlines – Beginning Linear Algebra
• Stewart, Ian, and David Tall. The Foundations of Mathematics.

I will follow the notation outlined in Goodfellow, Ian, et al.

Let’s get started on our adventure!

Today’s topic: Sets

Before we touch any linear algebra, let’s (very) briefly describe what a set is in maths. Sets will become important when we encounter scalars!

A set is a collection of ‘things’

Here are some examples of sets:

• the integers between 1 and 10
• the letters in the English alphabet

The things inside our sets are called elements or members of their sets. Some sets may not contain any elements. This is the empty set, which is depicted using the symbol \emptyset.

The above two sets are finite sets. However sets can also be infinite.

What notation is used to depict sets?

Sets are normally descibed using curly braces. For example, the integers between 1 and 10 can be written like this:

\{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}

where each element of our set is explicitly listed.

But sometimes it may be easier to use an ellipsis so that we don’t have to write out all of the elements of our set. For example, we could write the previous set like this:

\{1, 2, 3, \dots, 10\}

Sometimes it may be impossible to write out all members of our set because it is an infinite set. For example – how can we depict all positive, even numbers using our set notation? We can do this!

\{2, 4, 6, \dots \}
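R obviously can't hold an infinite set in memory, but we can generate as many elements of it as we like. A minimal sketch (the helper name `evens` is my own, not from the original post) that produces the first k elements of the set of positive even numbers:

```r
# First k elements of the infinite set {2, 4, 6, ...}
evens <- function(k) 2 * seq_len(k)

evens(5)
## [1]  2  4  6  8 10
```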

What are some important, infinite sets?

Natural numbers

The set of natural numbers is the set of all 'whole', positive numbers, starting with 1 and increasing with no upper bound.

This set is depicted using an uppercase ‘N’:

\mathbb{N}

Examples of some natural numbers are 1, 2, 3.

Integers

The set of integers is the set of:

• all natural numbers,
• all natural numbers preceded with a negative sign, and
• zero.

This set is depicted using an uppercase ‘Z’:

\mathbb{Z}

Examples of some integers are -2, -1, 0, 1, 2.

Rational numbers

The set of rational numbers consists of numbers that can be described by dividing one integer by another (except for dividing an integer by zero).

This set is depicted using an uppercase ‘Q’ for ‘quotient’:

\mathbb{Q}

Examples of some rational numbers are -\frac{1}{2}, \frac{0}{4}, \frac{2}{3}

Real numbers

The set of real numbers consists of all rational numbers, together with those numbers that cannot be expressed by dividing two integers (leaving aside 'imaginary' numbers). This additional set of numbers is called the irrational numbers.

(Let’s ignore imaginary numbers as they aren’t important to us in achieving our goal!)

The set of real numbers is depicted using an upper case ‘R’:

\mathbb{R}

Examples of some real numbers are -1, 2, \frac{2}{5}, \pi, \sqrt{2}

How can we create sets in R?

One way is to use the sets package:

library(sets)

Let’s define a set:

set_one <- set(1, 2, 3)
print(set_one)
## {1, 2, 3}

The order in which the elements of a set are listed does not distinguish one set from another. For example, these two sets are equal:

set_two <- set(3, 2, 1)
print(set_one == set_two)
## [1] TRUE

We also discover that listing an element of a set multiple times does not change the set:

set_three <- set(1, 1, 1)
set_four <- set(1)
print(set_three == set_four)
## [1] TRUE

We could also use some base R functions to emulate sets and their operations, but let’s leave it at this.
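As a quick sketch of that base R route: the built-in functions union(), intersect(), setdiff() and setequal() treat plain vectors as sets, ignoring order and duplicates:

```r
a <- c(1, 2, 3)
b <- c(3, 2, 1, 1)

# Order and repetition don't matter when comparing as sets
setequal(a, b)            # TRUE

union(a, c(3, 4))         # 1 2 3 4
intersect(a, c(2, 3, 4))  # 2 3
setdiff(a, 2)             # 1 3
```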

Conclusion

The area of set theory is huge and I could easily get lost in it. But we have covered enough to talk about scalars, so let's move on.

WU-TANG!!!

Justin


### R for Data Science in a Day

Thu, 05/02/2019 - 17:55

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Hi everyone,

Do you want to know more about R, get hands-on experience and build your first Data Science project?

Free up your schedule and save this date, 13th May!

R for Data Science in a Day

A 1-day long workshop during which you’ll learn the basics of R language, so you’ll feel confident and ready to approach different Data Science projects.

Furthermore, we’ll show you how to use R inside some of the most interesting applications of the Microsoft data platform:  Azure Machine Learning Studio and SQL Server.

Program

09.00 – 09.30 Registration and Welcome

09.30 – 13.00 Introduction to Data Manipulation in R with dplyr

• A bit of R history and online resources
• R and R-Studio installation and configuration
• Your first R markdown document
• R objects: data and functions
• Data import from external sources
• Data manipulation with tidyverse

13.00 – 14.00 Networking Lunch

14.00 – 17.30 Application of Machine Learning problems using R on platforms made available by Microsoft

• Introduction to Machine Learning
• Using R code in Azure Machine Learning Studio
• Introduction to Advanced Analytics in SQL Server
• Predictive modeling in SQL Server using R code

17.30 – 18.00 Q&A and Closure of the session

Prerequisites

• A basic knowledge of T-SQL programming on SQL Server is advised for successful learning
• To execute the code shown during the session you'll need to install:
• RStudio Desktop
• Microsoft R Open
• SQL Server 2017 Developer Edition with the In-database R Services options (choose only R, not Python)

Speakers

• Andrea Spanò, Andrea Melloncelli and Mariachiara Fortuna – Quantide
• Luca Zavarella – SolidQ
What’s the price?

The event is free and open to everybody. For logistical reasons, the meeting is limited to a maximum of 70 participants, and registration is required.

You can sign up on this event page; we can't wait to meet you!

Where is it?

Microsoft House Italia, Viale Pasubio 21, Milan. The venue is near Moscova M2 underground station and Porta Garibaldi railway station.

The post R for Data Science in a Day appeared first on MilanoR.


### Under Pi : gganimate test around quadrature of the circle

Thu, 05/02/2019 - 15:05

(This article was first published on Guillaume Pressiat, and kindly contributed to R-bloggers)

An updated look at squaring the circle using gganimate and R code.

It gives a geometric and visual construction, a good and practical representation of what Pi is. As n becomes larger, the segments become smaller and smaller; Pi can then be seen as perfection, and we almost intuit infinity.

require(sp)
require(rgeos)
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)

x <- 0.5 # center x
y <- 0.5 # center y
n <- 1000 # nr of pts
r <- 0.5 # radius

poly <- function(n){
  library(dplyr, warn.conflicts = FALSE)
  j <- n %% 2
  theta <- (2 - j):(2 * (n) + 1)
  u <- tibble(theta = theta,
              x = cos(theta * 2 * pi / n),
              y = sin(theta * 2 * pi / n))
  v <- SpatialPolygons(list(Polygons(list(Polygon(cbind(u$x, u$y))), "polygon")))
  v <- ggplot2::fortify(v) %>% mutate(n = n)
  return(list(df = u, poly = v))
}

pts <- seq(0, 2 * pi, length.out = n)
plot(sin(pts), cos(pts), type = 'l', asp = 1) # test

nmax <- 60

alls <- 3:nmax %>%
  purrr::map(function(x) poly(x)$poly) %>%
  bind_rows() %>%
  mutate(estimate = round(sqrt(2 - 2 * cos(pi / n)) * n, 9))

circles <- bind_rows(lapply(tibble(x = cos(pts), y = sin(pts)), rep, nmax - 2)) %>%
  mutate(n = rep(3:nmax, each = 1000)) %>%
  as_tibble() %>%
  mutate(estimate = round(sqrt(2 - 2 * cos(pi / n)) * n, 9),
         estimate = paste0('Pi estimate : ', format(estimate, digits = 10), '\n',
                           'Difference : ', format(round(pi - estimate, 9), digits = 9)))

library(gganimate)

p <- ggplot() +
  geom_point(data = circles, aes(x = x, y = y),
             col = 'cornflowerblue', lwd = 3, alpha = 0.5) +
  theme_void() +
  geom_polygon(data = alls, aes(x = long, y = lat, group = n),
               col = 'grey30', lwd = 1.3, fill = 'cornflowerblue', alpha = 0.5) +
  geom_text(data = distinct(circles, n, .keep_all = TRUE),
            aes(x = 0, y = 0, label = estimate), size = 6) +
  labs(title = 'Segments number : {frame_time}') +
  transition_time(n)
p

More information and some links of inspiration are in the code: https://github.com/Guillaumepressiat/under_pi

And I have made a really long gif here: https://raw.githubusercontent.com/GuillaumePressiat/under_pi/master/under_pi_patient_geek.gif

### Queensland road accidents mapped with Shiny and leaflet in R

Thu, 05/02/2019 - 14:27

(This article was first published on R – Daniel Oehm | Gradient Descending, and kindly contributed to R-bloggers)

The Queensland government collects data on road accidents dating back to 1st January 2001, detailing characteristics of each incident including:

• Location of the crash (lat / long coordinates)
• ABS statistical area codes (SA2-4, LGA, remoteness)
• Atmospheric and road conditions (weather, lighting, sealed / unsealed roads, speed limit zone, etc)
• Severity of the incident (minor injury to fatality)
• Types of vehicles involved (car, bus, truck, bike, etc) and
• Description of the incident

Mapping this data highlights hot spots where car accidents occur more often: in particular, the dangerous areas in wet conditions, problematic intersections, and the areas of Queensland which are more dangerous than others in terms of fatality rates.

I developed a Shiny app utilising leaflet to easily explore the data (and just for fun). It features:

• A collapsible panel for filtering the data by selecting statistical areas and other features.
• An insights panel showing the breakdown of severity, vehicles involved, accidents over time and a Bayesian estimate of the fatality rate for the selected area.
• A data explorer tab.

This data is of road accidents, so the estimate of the fatality rate in this case is the fatality rate given the vehicle was involved in an accident, rather than the fatality rate by road accident in the population. It is a slightly different take on how this statistic is usually published, but a useful one.

The best way to view the app is to run the following code. First, check that you have the packages installed by running:

check_packages <- function(packages){
  if(all(packages %in% rownames(installed.packages()))){
    TRUE
  } else {
    cat("Install the following packages before proceeding\n",
        packages[!(packages %in% rownames(installed.packages()))], "\n")
  }
}

packages_needed <- c("tidyverse", "shiny", "leaflet", "leaflet.extras", "magrittr",
                     "htmltools", "htmlwidgets", "showtext", "data.table")
check_packages(packages_needed)

If all good, now run the line below and it will load the app.

runGitHub("doehm/road-accidents/", "doehm", launch.browser = TRUE)

This will launch it directly on your machine. Or you can follow the link directly to the Shiny app. There are a lot of neat things we can do with this data and I'll be adding to the app over time.

Brisbane Inner

A subset of the app focuses on the "Brisbane Inner" SA3 area to give a taste of what to expect. It shows car accidents in the city since 1st January 2013. When zooming in, hover over the marker to get a short description of the crash. View the full screen map here.

Code bits

Below is the underlying code of the leaflet map in the example above, but I strongly recommend running the code above to view the Shiny app. See Github for the full code.

# queensland road accident data

# libraries
library(tidyverse)
library(shiny)
library(leaflet)
library(leaflet.extras)
library(magrittr)
library(htmltools)
library(htmlwidgets)
library(showtext)
library(data.table)

# font
try({
  font_add_google(name = "Montserrat", family = "mont")
  showtext_auto()
}, TRUE)

# load data
# or if it doesn't work grab the Rdata file from Github - see link above
load_data <- function(){
  if(!file.exists("locations.csv")){
    cat('\n Download may take a few minutes...\n')
    url <- "http://www.tmr.qld.gov.au/~/media/aboutus/corpinfo/Open%20data/crash/locations.csv"
    download.file(url, destfile = "locations.csv", method = "libcurl")
  }
  accidents_raw <- read_csv("locations.csv")
  return(accidents_raw)
}

accidents_raw <- load_data() %>%
  filter(Crash_Severity != "Property damage only")

# sample of brisbane inner
accidents <- accidents_raw %>%
  filter(
    Loc_ABS_Statistical_Area_3 == "Brisbane Inner",
    Crash_Year > 2013
  ) %>%
  mutate(fatality = Count_Casualty_Fatality > 0)

# basic leaflet
m <- leaflet(accidents) %>%
  addProviderTiles(providers$Stamen.Toner, group = "Black and white") %>%
  addTiles(options = providerTileOptions(noWrap = TRUE), group = "Colour") %>%
  addMarkers(
    lng = ~Crash_Longitude_GDA94,
    lat = ~Crash_Latitude_GDA94,
    clusterOptions = markerClusterOptions(),
    label = ~htmlEscape(Crash_DCA_Description)
  ) %>%
  addCircleMarkers(
    lng = ~Crash_Longitude_GDA94[accidents$fatality],
    lat = ~Crash_Latitude_GDA94[accidents$fatality],
    color = "#8B0000",
    stroke = FALSE,
    fillOpacity = 0.8,
    group = "Fatalities"
  ) %>%
  addHeatmap(
    lng = ~Crash_Longitude_GDA94,
    lat = ~Crash_Latitude_GDA94,
    radius = 17,
    blur = 25,
    cellSize = 25
  ) %>%
  addLayersControl(
    overlayGroups = c("Fatalities"),
    baseGroups = c("Black and white", "Colour"),
    options = layersControlOptions(collapsed = FALSE)
  )

The post Queensland road accidents mapped with Shiny and leaflet in R appeared first on Daniel Oehm | Gradient Descending.

### [R]eady for production: The Data Science Event 2019 with eoda and RStudio

Thu, 05/02/2019 - 11:41

(This article was first published on eoda english R news – Der Datenanalyse Blog von eoda, and kindly contributed to R-bloggers)

eoda and RStudio invite the German speaking R-community to the Data Science Event 2019 in Frankfurt on June 13th – the event for the productive use of R. Learn how you can seamlessly implement your analysis solutions with the optimal IT infrastructure into your business processes.

Discover best practice approaches in productive data science architectures and RStudio solutions for the professional application of R, and gather important ideas for your own analysis environment from experienced experts from RStudio and eoda.

Inspiring networking, exclusive insights into RStudio's leading products and the know-how of eoda as one of the leading R integrators make this event a must for Data Scientists and Solution Engineers.

Look forward to a practical success story, first-hand RStudio news and the BarCamp in the afternoon, where you will have the opportunity to discuss exciting infrastructure topics of your choice with eoda and RStudio experts in small groups.

What you can expect: Agenda of the Data Science Event 2019

09:30-10:00 | Registration
10:00-10:45 | R in production environments: the right interplay between Data Science and IT
10:45-11:00 | Success story: building an analysis environment at REWE International AG
12:00-13:00 | Lunch
13:00-14:00 | RStudio: what's next, with Andrie de Vries (Solutions Engineer | RStudio)
14:00-14:30 | Coffee & snacks
14:30-16:30 | BarCamp: [R]eady for production

Location: Historischer Festsaal Frankfurt am Main

Register now and secure your free participation in the Data Science Event 2019 of eoda and RStudio.

Go to the registration page.


### Earthquake Analysis (2/4): Categorical Variables Exploratory Analysis

Thu, 05/02/2019 - 02:59

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

This is the second part of our post series about the exploratory analysis of a publicly available dataset reporting earthquakes and similar events within a specific time window of 30 days. In the following, we are going to analyze the categorical variables of our dataset. A categorical variable can take on one of a limited, and usually fixed, number of possible values. Factor variables are categorical variables that can be either numeric or string variables. R stores categorical variables in factors. Their analysis may require statistical tools different from the ones used for quantitative variables.
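As a quick illustration of this (the variable name below is my own, for demonstration only): a character vector becomes an explicit categorical variable in R once converted to a factor, with a fixed set of levels:

```r
# A character vector holds strings; as.factor() turns it into a
# categorical variable with a fixed set of levels.
status <- c("reviewed", "automatic", "reviewed")
status_f <- as.factor(status)

class(status_f)    # "factor"
levels(status_f)   # "automatic" "reviewed" (levels are sorted alphabetically)
table(status_f)    # counts per level: automatic 1, reviewed 2
```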

Packages

I am going to take advantage of the following packages.

suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(Hmisc))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(vcd))
suppressPackageStartupMessages(library(vcdExtra))
suppressPackageStartupMessages(library(gmodels))

The package versions are listed below.

packages <- c("ggplot2", "dplyr", "Hmisc", "lubridate", "vcd", "vcdExtra", "gmodels")
version <- lapply(packages, packageVersion)
version_c <- do.call(c, version)
data.frame(packages = packages, version = as.character(version_c))
##    packages version
## 1   ggplot2   3.1.0
## 2     dplyr 0.8.0.1
## 3     Hmisc   4.2.0
## 4 lubridate   1.7.4
## 5       vcd   1.4.4
## 6  vcdExtra   0.7.1
## 7   gmodels  2.18.1

I am running the following R version on Windows 10.

R.version
##                _
## platform       x86_64-w64-mingw32
## arch           x86_64
## os             mingw32
## system         x86_64, mingw32
## status
## major          3
## minor          5.2
## year           2018
## month          12
## day            20
## svn rev        75870
## language       R
## version.string R version 3.5.2 (2018-12-20)
## nickname       Eggshell Igloo

Getting Data

As shown in the first post, we start our analysis by downloading the earthquake dataset from the earthquake.usgs.gov site, specifically the last-30-days flavor. Please note that this earthquake dataset is updated daily to cover the last 30 days of data collection. Furthermore, it is not the most recent dataset available, as I collected it some weeks ago. If the dataset is not already present in our workspace, we download it and load it into the quakes local variable.

if ("all_week.csv" %in% dir(".") == FALSE) { url <- "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv" download.file(url = url, destfile = "all_week.csv") } quakes <- read.csv("all_month.csv", header=TRUE, sep=',', stringsAsFactors = FALSE) quakes$time <- ymd_hms(quakes$time) quakes$updated <- ymd_hms(quakes$updated) quakes$magType <- as.factor(quakes$magType) quakes$net <- as.factor(quakes$net) quakes$type <- as.factor(quakes$type) quakes$status <- as.factor(quakes$status) quakes$locationSource <- as.factor(quakes$locationSource) quakes$magSource <- as.factor(quakes$magSource) Exploratory Analysis – Categorical Variables

The categorical variables can be detected by testing if their class is factor.

(factor_vars <- names(which(sapply(quakes, class) == "factor")))
## [1] "magType"        "net"            "type"           "status"
## [5] "locationSource" "magSource"
length(factor_vars)
## [1] 6

The describe() function from the Hmisc package can be useful for categorical variables as well.

describe(quakes[,factor_vars])
## quakes[, factor_vars]
##
## 6 Variables 8407 Observations
## ---------------------------------------------------------------------------
## magType
##        n  missing distinct
##     8407        0       10
##
## Value             mb mb_lg    md    mh    ml   mun    mw   mwr   mww
## Frequency      2   604    47  2423    14  5203     2     4    19    89
## Proportion 0.000 0.072 0.006 0.288 0.002 0.619 0.000 0.000 0.002 0.011
## ---------------------------------------------------------------------------
## net
##        n  missing distinct
##     8407        0       14
##
## ak (2469, 0.294), ci (1344, 0.160), hv (253, 0.030), ismpkansas (8,
## 0.001), ld (4, 0.000), mb (157, 0.019), nc (1435, 0.171), nm (28, 0.003),
## nn (604, 0.072), pr (427, 0.051), se (15, 0.002), us (897, 0.107), uu
## (588, 0.070), uw (178, 0.021)
## ---------------------------------------------------------------------------
## type
##        n  missing distinct
##     8407        0        7
##
## chemical explosion (2, 0.000), earthquake (8232, 0.979), explosion (58,
## 0.007), ice quake (16, 0.002), other event (3, 0.000), quarry blast (95,
## 0.011), rock burst (1, 0.000)
## ---------------------------------------------------------------------------
## status
##        n  missing distinct
##     8407        0        2
##
## Value      automatic  reviewed
## Frequency       1691      6716
## Proportion     0.201     0.799
## ---------------------------------------------------------------------------
## locationSource
##        n  missing distinct
##     8407        0       15
##
## Value        ak    ci    hv  ismp    ld    mb    nc    nm    nn    ok
## Frequency  2470  1344   253     8     4   157  1435    28   604     6
## Proportion 0.294 0.160 0.030 0.001 0.000 0.019 0.171 0.003 0.072 0.001
##
## Value        pr    se    us    uu    uw
## Frequency   427    15   890   588   178
## Proportion 0.051 0.002 0.106 0.070 0.021
## ---------------------------------------------------------------------------
## magSource
##        n  missing distinct
##     8407        0       15
##
## Value        ak    ci    hv  ismp    ld    mb    nc    nm    nn    ok
## Frequency  2480  1344   253     8     4   157  1435    28   604     5
## Proportion 0.295 0.160 0.030 0.001 0.000 0.019 0.171 0.003 0.072 0.001
##
## Value        pr    se    us    uu    uw
## Frequency   427    15   881   588   178
## Proportion 0.051 0.002 0.105 0.070 0.021
## ---------------------------------------------------------------------------

We notice from the magType description that two records have an empty-string magType. We then replace them with the NA value.

quakes$magType[quakes$magType == ""] <- NA
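One caveat worth flagging (my addition, not part of the original analysis): replacing the values with NA does not remove the now-empty "" level from the factor; droplevels() is needed if you want the unused level gone. A minimal sketch with a toy factor:

```r
f <- factor(c("", "ml", "md"))
f[f == ""] <- NA      # values replaced, but the "" level remains

levels(f)             # still contains ""
f <- droplevels(f)    # drop unused levels
levels(f)             # "" level is now gone
```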

To understand relationships or dependencies among categorical variables, we take advantage of various types of tables and graphical methods. Stratifying variables can also be included in order to highlight whether the relationship between two primary variables is the same or different across all levels of the stratifying variable under consideration.

A contingency table is said to be one-way when it involves just one categorical variable, two-way when it involves two categorical variables, and so on (N-way).

For example, here is the one-way contingency table for the magType variable.

(tbl <- table(quakes$magType))
##
##          mb mb_lg    md    mh    ml   mun    mw   mwr   mww
##    0    604    47  2423    14  5203     2     4    19    89

A graphical representation of the same as a bar plot is shown.

ggplot(data = quakes, aes(x = magType, fill = magType)) +
  geom_bar(stat = 'count') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = FALSE)

One-way contingency table of events based on the network that registered the event.

table(quakes$net)
##
##         ak         ci         hv ismpkansas         ld         mb
##       2469       1344        253          8          4        157
##         nc         nm         nn         pr         se         us
##       1435         28        604        427         15        897
##         uu         uw
##        588        178

A graphical representation of the same as bar plot is shown.

ggplot(data = quakes, aes(x = net, fill = net)) + geom_bar(stat = 'count') + guides(fill = FALSE)

One-way contingency table of the events based on their type.

table(quakes$type)
##
## chemical explosion         earthquake          explosion
##                  2               8232                 58
##          ice quake        other event       quarry blast
##                 16                  3                 95
##         rock burst
##                  1

A graphical representation of the same as a bar plot is shown.

ggplot(data = quakes, aes(x = type, fill = type)) +
  geom_bar(stat = 'count') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = FALSE)

One-way contingency table of events based on their status.

table(quakes$status)
##
## automatic  reviewed
##      1691      6716

A graphical representation of the same as bar plot is shown.

ggplot(data = quakes, aes(x = status, fill = status)) + geom_bar(stat='count') + guides(fill=FALSE)

One-way contingency table of events based on the location source.

table(quakes$locationSource)
##
##   ak   ci   hv ismp   ld   mb   nc   nm   nn   ok   pr   se   us   uu   uw
## 2470 1344  253    8    4  157 1435   28  604    6  427   15  890  588  178

A graphical representation of the same as a bar plot is shown.

ggplot(data = quakes, aes(x = locationSource, fill = locationSource)) +
  geom_bar(stat = 'count') +
  guides(fill = FALSE)

One-way contingency table of the events based on the network that originally authored the reported magnitude for this event.

table(quakes$magSource)
##
##   ak   ci   hv ismp   ld   mb   nc   nm   nn   ok   pr   se   us   uu   uw
## 2480 1344  253    8    4  157 1435   28  604    5  427   15  881  588  178

A graphical representation of the same as bar plot is shown.

ggplot(data = quakes, aes(x = magSource, fill = magSource)) + geom_bar(stat='count') + guides(fill=FALSE)

Two-way contingency table based upon the network as data contributor and the magType, i.e. the method or algorithm used to calculate the preferred magnitude for the event.

table(quakes$net, quakes$magType)
##
##                   mb mb_lg   md   mh ml   ml   mun mw mwr mww
##   ak         0   1     0    0    0 2467    0  0   0   1
##   ci         0   0     0    0    3 1339    2  0   0   0
##   hv         0   0     0  145    0  108    0  0   0   0
##   ismpkansas 0   0     0    0    0    8    0  0   0   0
##   ld         0   0     0    0    0    4    0  0   0   0
##   mb         0   0     0    6    0  151    0  0   0   0
##   nc         0   0     0 1423    0    7    0  3   0   0
##   nm         0   0     0   28    0    0    0  0   0   0
##   nn         0   0     0    0    0  604    0  0   0   0
##   pr         0   0     0  425    0    2    0  0   0   0
##   se         0   0     0   15    0    0    0  0   0   0
##   us         0 603    47    0    0  140    0  0  19  88
##   uu         0   0     0  365    0  222    0  1   0   0
##   uw         0   0     0   16   11  151    0  0   0   0

A graphical representation of the same as bar plot is shown.

ggplot(data = quakes, aes(x=net, fill = magType)) + geom_bar(stat='count')

Starting from this two-way contingency table:

(tbl <- table(quakes$net, quakes$status))
##
##              automatic reviewed
##   ak              1105     1364
##   ci                29     1315
##   hv                90      163
##   ismpkansas         0        8
##   ld                 0        4
##   mb                 0      157
##   nc               460      975
##   nm                 0       28
##   nn                 6      598
##   pr                 0      427
##   se                 0       15
##   us                 0      897
##   uu                 0      588
##   uw                 1      177

its corresponding row-proportions table is shown below. Each row shows the fraction of automatic/reviewed earthquakes for a given network. Each row sums up to one.

prop.table(tbl, 1)
##
##                automatic    reviewed
##   ak         0.447549615 0.552450385
##   ci         0.021577381 0.978422619
##   hv         0.355731225 0.644268775
##   ismpkansas 0.000000000 1.000000000
##   ld         0.000000000 1.000000000
##   mb         0.000000000 1.000000000
##   nc         0.320557491 0.679442509
##   nm         0.000000000 1.000000000
##   nn         0.009933775 0.990066225
##   pr         0.000000000 1.000000000
##   se         0.000000000 1.000000000
##   us         0.000000000 1.000000000
##   uu         0.000000000 1.000000000
##   uw         0.005617978 0.994382022

Column proportions table. Each column shows, for a given status (automatic/reviewed), the fraction of events attributed to each network. Each column sums up to one.

prop.table(tbl, 2) ## ## automatic reviewed ## ak 0.6534594914 0.2030970816 ## ci 0.0171496156 0.1958010721 ## hv 0.0532229450 0.0242703990 ## ismpkansas 0.0000000000 0.0011911852 ## ld 0.0000000000 0.0005955926 ## mb 0.0000000000 0.0233770101 ## nc 0.2720283856 0.1451756998 ## nm 0.0000000000 0.0041691483 ## nn 0.0035481963 0.0890410959 ## pr 0.0000000000 0.0635795116 ## se 0.0000000000 0.0022334723 ## us 0.0000000000 0.1335616438 ## uu 0.0000000000 0.0875521144 ## uw 0.0005913661 0.0263549732

Let us consider a bar plot representing the per-network event counts while, at the same time, highlighting their status with different fill colors.

ggplot(data = quakes, aes(x = net, fill = status)) + geom_bar(stat = 'count')

We want a graphical representation in which the different status proportions can be perceived more easily. We therefore build a data frame collecting our network + status earthquake counts in frequency form and, starting from it, show the resulting spineplot.

tbl_df <- as.data.frame(tbl) colnames(tbl_df) <- c("net", "status", "Freq") tbl_df$net <- factor(tbl_df$net) tbl_df ## net status Freq ## 1 ak automatic 1105 ## 2 ci automatic 29 ## 3 hv automatic 90 ## 4 ismpkansas automatic 0 ## 5 ld automatic 0 ## 6 mb automatic 0 ## 7 nc automatic 460 ## 8 nm automatic 0 ## 9 nn automatic 6 ## 10 pr automatic 0 ## 11 se automatic 0 ## 12 us automatic 0 ## 13 uu automatic 0 ## 14 uw automatic 1 ## 15 ak reviewed 1364 ## 16 ci reviewed 1315 ## 17 hv reviewed 163 ## 18 ismpkansas reviewed 8 ## 19 ld reviewed 4 ## 20 mb reviewed 157 ## 21 nc reviewed 975 ## 22 nm reviewed 28 ## 23 nn reviewed 598 ## 24 pr reviewed 427 ## 25 se reviewed 15 ## 26 us reviewed 897 ## 27 uu reviewed 588 ## 28 uw reviewed 177

The spineplot makes the status count differences across networks more evident by providing a common scale in the range [0, 1].

(xtabs_res <- xtabs(Freq ~ net + status, data = tbl_df)) ## status ## net automatic reviewed ## ak 1105 1364 ## ci 29 1315 ## hv 90 163 ## ismpkansas 0 8 ## ld 0 4 ## mb 0 157 ## nc 460 975 ## nm 0 28 ## nn 6 598 ## pr 0 427 ## se 0 15 ## us 0 897 ## uu 0 588 ## uw 1 177 spineplot(xtabs_res)

The xtabs() function creates cross-tabulations of data using formula-style input. Further, applying summary() to the xtabs() result, we get a chi-squared test of independence of all factors, along with the number of cases and dimensions.

type_net_count <- quakes %>%
  group_by(status, net) %>%
  dplyr::summarise(Freq = n())
xtabs_res <- xtabs(Freq ~ net + status, data = type_net_count)
summary(xtabs_res)
## Call: xtabs(formula = Freq ~ net + status, data = type_net_count)
## Number of cases in table: 8407
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 2082.2, df = 13, p-value = 0
## Chi-squared approximation may be incorrect

The same spineplot results from this slightly different approach.

spineplot(xtabs_res)

Margin tables are another way to summarise categorical data.

(mt <- margin.table(tbl, 2:1)) ## ## ak ci hv ismpkansas ld mb nc nm nn pr se ## automatic 1105 29 90 0 0 0 460 0 6 0 0 ## reviewed 1364 1315 163 8 4 157 975 28 598 427 15 ## ## us uu uw ## automatic 0 0 1 ## reviewed 897 588 177

Row and column sums can be added by means of the addmargins() function.

addmargins(mt) ## ## ak ci hv ismpkansas ld mb nc nm nn pr se ## automatic 1105 29 90 0 0 0 460 0 6 0 0 ## reviewed 1364 1315 163 8 4 157 975 28 598 427 15 ## Sum 2469 1344 253 8 4 157 1435 28 604 427 15 ## ## us uu uw Sum ## automatic 0 0 1 1691 ## reviewed 897 588 177 6716 ## Sum 897 588 178 8407

The Cross Tabulation result is shown.

mt <- margin.table(table(quakes$status, quakes$type), 2:1) CrossTable(mt, prop.chisq = FALSE, prop.c = TRUE, prop.r = TRUE, format = "SPSS") ## ## Cell Contents ## |-------------------------| ## | Count | ## | Row Percent | ## | Column Percent | ## | Total Percent | ## |-------------------------| ## ## Total Observations in Table: 8407 ## ## | ## | automatic | reviewed | Row Total | ## -------------------|-----------|-----------|-----------| ## chemical explosion | 0 | 2 | 2 | ## | 0.000% | 100.000% | 0.024% | ## | 0.000% | 0.030% | | ## | 0.000% | 0.024% | | ## -------------------|-----------|-----------|-----------| ## earthquake | 1688 | 6544 | 8232 | ## | 20.505% | 79.495% | 97.918% | ## | 99.823% | 97.439% | | ## | 20.079% | 77.840% | | ## -------------------|-----------|-----------|-----------| ## explosion | 3 | 55 | 58 | ## | 5.172% | 94.828% | 0.690% | ## | 0.177% | 0.819% | | ## | 0.036% | 0.654% | | ## -------------------|-----------|-----------|-----------| ## ice quake | 0 | 16 | 16 | ## | 0.000% | 100.000% | 0.190% | ## | 0.000% | 0.238% | | ## | 0.000% | 0.190% | | ## -------------------|-----------|-----------|-----------| ## other event | 0 | 3 | 3 | ## | 0.000% | 100.000% | 0.036% | ## | 0.000% | 0.045% | | ## | 0.000% | 0.036% | | ## -------------------|-----------|-----------|-----------| ## quarry blast | 0 | 95 | 95 | ## | 0.000% | 100.000% | 1.130% | ## | 0.000% | 1.415% | | ## | 0.000% | 1.130% | | ## -------------------|-----------|-----------|-----------| ## rock burst | 0 | 1 | 1 | ## | 0.000% | 100.000% | 0.012% | ## | 0.000% | 0.015% | | ## | 0.000% | 0.012% | | ## -------------------|-----------|-----------|-----------| ## Column Total | 1691 | 6716 | 8407 | ## | 20.114% | 79.886% | | ## -------------------|-----------|-----------|-----------| ## ##

With the help of structable() from the vcd package, we show a three-way contingency table.

structable(type ~ status + magType, data = quakes) ## type chemical explosion earthquake explosion ice quake other event quarry blast rock burst ## status magType ## automatic 0 0 0 0 0 0 0 ## mb 0 1 0 0 0 0 0 ## mb_lg 0 0 0 0 0 0 0 ## md 0 515 0 0 0 0 0 ## mh 0 0 0 0 0 0 0 ## ml 0 1172 3 0 0 0 0 ## mun 0 0 0 0 0 0 0 ## mw 0 0 0 0 0 0 0 ## mwr 0 0 0 0 0 0 0 ## mww 0 0 0 0 0 0 0 ## reviewed 0 0 0 0 0 0 0 ## mb 0 603 0 0 0 0 0 ## mb_lg 0 47 0 0 0 0 0 ## md 0 1893 2 0 0 13 0 ## mh 0 14 0 0 0 0 0 ## ml 0 3873 53 16 3 82 1 ## mun 2 0 0 0 0 0 0 ## mw 0 4 0 0 0 0 0 ## mwr 0 19 0 0 0 0 0 ## mww 0 89 0 0 0 0 0

Geographical Regions vs. Depth Category Earthquakes Analysis

Based on ref. [5] definitions we create new factor variables.

earthquakes <- quakes %>% filter(type == "earthquake")
earthquakes <- earthquakes[complete.cases(earthquakes[, c("latitude", "longitude", "depth")]), ]
emisphere_ns <- as.factor(ifelse(earthquakes$latitude >= 0, "north", "south"))
emisphere_ew <- as.factor(ifelse(earthquakes$longitude >= 0, "east", "west"))
earthquakes$region <- ifelse(earthquakes$latitude > 0 & earthquakes$longitude > 0, "NE",
                      ifelse(earthquakes$latitude > 0 & earthquakes$longitude < 0, "NW",
                      ifelse(earthquakes$latitude < 0 & earthquakes$longitude > 0, "SE", "SW")))
earthquakes$depth_type <- ifelse(earthquakes$depth <= 70, "shallow",
                          ifelse(earthquakes$depth <= 300, "intermediate", "deep"))
earthquakes$region <- factor(earthquakes$region)
earthquakes$depth_type <- factor(earthquakes$depth_type)

Resulting tables are shown.

(tbl <- table(earthquakes$region, earthquakes$depth_type))
##
##      deep intermediate shallow
##   NE    9           65     197
##   NW    0          526    7109
##   SE   10           40     109
##   SW   31           45      91
prop.table(tbl, 1)
##
##            deep intermediate    shallow
##   NE 0.03321033   0.23985240 0.72693727
##   NW 0.00000000   0.06889325 0.93110675
##   SE 0.06289308   0.25157233 0.68553459
##   SW 0.18562874   0.26946108 0.54491018
prop.table(tbl, 2)
##
##            deep intermediate    shallow
##   NE 0.18000000   0.09615385 0.02624567
##   NW 0.00000000   0.77810651 0.94710898
##   SE 0.20000000   0.05917160 0.01452172
##   SW 0.62000000   0.06656805 0.01212363

We wonder if the region plays a role in the depth type classification. We run the following chi-squared test to find out.

chisq.test(earthquakes$region, earthquakes$depth_type, simulate.p.value = TRUE) ## ## Pearson's Chi-squared test with simulated p-value (based on 2000 ## replicates) ## ## data: earthquakes$region and earthquakes$depth_type ## X-squared = 1322.4, df = NA, p-value = 0.0004998

It is statistically significant, as the reported p-value highlights.

The Cross Tabulation is as follows.

mt <- margin.table(tbl, 2:1) CrossTable(mt, prop.chisq = FALSE, prop.c = TRUE, prop.r = TRUE, format = "SPSS") ## ## Cell Contents ## |-------------------------| ## | Count | ## | Row Percent | ## | Column Percent | ## | Total Percent | ## |-------------------------| ## ## Total Observations in Table: 8232 ## ## | ## | NE | NW | SE | SW | Row Total | ## -------------|-----------|-----------|-----------|-----------|-----------| ## deep | 9 | 0 | 10 | 31 | 50 | ## | 18.000% | 0.000% | 20.000% | 62.000% | 0.607% | ## | 3.321% | 0.000% | 6.289% | 18.563% | | ## | 0.109% | 0.000% | 0.121% | 0.377% | | ## -------------|-----------|-----------|-----------|-----------|-----------| ## intermediate | 65 | 526 | 40 | 45 | 676 | ## | 9.615% | 77.811% | 5.917% | 6.657% | 8.212% | ## | 23.985% | 6.889% | 25.157% | 26.946% | | ## | 0.790% | 6.390% | 0.486% | 0.547% | | ## -------------|-----------|-----------|-----------|-----------|-----------| ## shallow | 197 | 7109 | 109 | 91 | 7506 | ## | 2.625% | 94.711% | 1.452% | 1.212% | 91.181% | ## | 72.694% | 93.111% | 68.553% | 54.491% | | ## | 2.393% | 86.358% | 1.324% | 1.105% | | ## -------------|-----------|-----------|-----------|-----------|-----------| ## Column Total | 271 | 7635 | 159 | 167 | 8232 | ## | 3.292% | 92.748% | 1.931% | 2.029% | | ## -------------|-----------|-----------|-----------|-----------|-----------| ## ##

The spineplot is able to highlight differences in proportions better than the barplot. We show the resulting spineplot herein below.

type_region_count <- earthquakes %>%
  group_by(depth_type, region) %>%
  dplyr::summarise(Freq = n())
xtabs_res <- xtabs(Freq ~ depth_type + region, data = type_region_count)
spineplot(xtabs_res)

Under the independence hypothesis, we may compute the expected frequencies as follows.

(exp <- independence_table(mt)) ## ## NE NW SE SW ## deep 1.646016 46.37391 0.9657434 1.014334 ## intermediate 22.254130 626.97522 13.0568513 13.713800 ## shallow 247.099854 6961.65087 144.9774052 152.271866
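As a sanity check, the same expected frequencies can be computed directly in base R. This is a sketch assuming mt is the margin table built earlier with margin.table(tbl, 2:1):

```r
## Under independence, expected count = (row total * column total) / grand total.
## outer() builds the full matrix of products in one step.
exp_base <- outer(rowSums(mt), colSums(mt)) / sum(mt)
round(exp_base, 2)
## e.g. the expected deep/NE count is 50 * 271 / 8232, roughly 1.65
```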

We now bring our contingency table into frequency form.

tbl_df <- as.data.frame(tbl) colnames(tbl_df) <- c("region", "depth_type", "Freq") tbl_df$region <- factor(tbl_df$region) tbl_df$depth_type <- factor(tbl_df$depth_type) tbl_df ## region depth_type Freq ## 1 NE deep 9 ## 2 NW deep 0 ## 3 SE deep 10 ## 4 SW deep 31 ## 5 NE intermediate 65 ## 6 NW intermediate 526 ## 7 SE intermediate 40 ## 8 SW intermediate 45 ## 9 NE shallow 197 ## 10 NW shallow 7109 ## 11 SE shallow 109 ## 12 SW shallow 91

Patterns of association can be revealed by CMHtest() from the vcdExtra package or by the sieve plot.

CMHtest(Freq ~ region + depth_type, data = earthquakes) ## Cochran-Mantel-Haenszel Statistics for region by depth_type ## ## AltHypothesis Chisq Df Prob ## cor Nonzero correlation 271.56 1 5.1951e-61 ## rmeans Row mean scores differ 817.17 3 8.1840e-177 ## cmeans Col mean scores differ 609.18 2 5.2283e-133 ## general General association 1322.22 6 1.6789e-282

The sieve plot highlights frequencies that differ from the ones expected under independence. The area of each rectangle is always proportional to the expected frequency, while the observed frequency is shown by the number of squares in each rectangle.

sieve(Freq ~ region + depth_type, data = earthquakes, shade = TRUE, legend = TRUE, labeling = labeling_values)

Further, the color and intensity of the shading, together with the Pearson residuals legend, highlight deviations from the expected frequencies.

The association plot puts deviation from independence in the foreground, in the sense that the area of each box is made proportional to the (observed – expected) frequency. Cells with observed > expected frequency rise above the baseline representing independence, while cells that contain less than the expected frequency fall below it.

assoc(~ region + depth_type, data = earthquakes, shade = TRUE, legend = TRUE, labeling = labeling_values)

The mosaic plot can be applied to N-way contingency tables. For a 2-way table as herein shown, the mosaic display is like a grouped barchart, where the heights (or widths) of the bars show the relative frequencies of one variable and widths (or heights) of the sections in each bar show the conditional frequencies of the second variable given the first.

mosaic(~ region + depth_type, data = earthquakes, shade = TRUE, legend = TRUE, labeling = labeling_residuals)

Please note that in the above mosaic plot, the zero frequency for NW-region deep earthquakes does not get an explicit graphical representation. There are several labeling options for the mosaic plot; see Table 5.1 in ref. [4]. In our example, we choose to label the cells with the Pearson residual values, which can be computed as shown herein.

resids <- (mt - exp)/sqrt(exp)
round(resids, 1)
##                NE   NW   SE   SW
## deep          5.7 -6.8  9.2 29.8
## intermediate  9.1 -4.0  7.5  8.4
## shallow      -3.2  1.8 -3.0 -5.0

Here, mt is the margin table and exp the expected value under the independence hypothesis, as computed above.

If you want to go into more detail with categorical data exploratory analysis, ref. [4] is a valuable source of technical content and examples.

If you have any questions, please feel free to comment below.

References
1. Earthquake dataset
2. Earthquake dataset terms
3. Introductory Statistics with R, 2nd Edition, P. Dalgaard, Springer
4. Discrete Data Analysis with R, M. Friendly D. Meyer, CRC press
5. Determining the Depth of an Earthquake



To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Practical Introduction to Market Basket Analysis – Association Rules

Thu, 05/02/2019 - 02:00

Introduction

Ever wondered why items are displayed in a particular way in retail/online
stores? Why certain items are suggested to you based on what you have added to
the cart? Blame it on market basket analysis or association rule mining.

Resources

Below are the links to all the resources related to this post:

What?

Market basket analysis uses association rule mining under the hood to identify
products frequently bought together. Before we get into the nitty gritty of
market basket analysis, let us get a basic understanding of association rule
mining. It finds associations between different objects in a set. In the case
of market basket analysis, the objects are the products purchased by a customer
and the set is the transaction. In short, market basket analysis

• is an unsupervised data mining technique
• that uncovers products frequently bought together
• and creates if-then scenario rules
Why?

Market basket analysis creates actionable insights for:

• designing store layout
• online recommendation engines
• targeted marketing campaign/sales promotion/email campaign
• cross/up selling
• catalogue design

Market basket analysis works on transaction data available through electronic
point of sale systems. It generates actionable insights for product placement,
cross/up selling strategies, targeted marketing campaigns, catalogue design,
pricing strategies, inventory control etc.

Use Cases

Association rule mining has applications in several industries including
retail, telecommunications, banking, insurance, manufacturing and medical.
Let us look at its applications in more detail in the following industries:

Retail

The introduction of electronic point of sale systems has allowed the
collection of immense amounts of data, and retail organizations make prolific
use of market basket analysis for

• designing store layout so that consumers can more easily find items that are
frequently purchased together
• recommending associated products that are frequently bought together,
“Customers who purchased this product also viewed this product…”
• emailing customers who bought specific products with offers on other
products that are likely to be of interest to them
• grouping products that customers purchase frequently together in the store’s
product placement
• designing special promotions that combine or discount certain products
• optimizing the layout of the catalog of an eCommerce site
• controlling inventory based on product demands and what products sell better
together
Banks

Banks and financial institutions use market basket analysis to analyze credit
card purchases for fraud detection and to cross-sell insurance products,
investment products (mutual funds etc.), tax preparation, retirement planning,
wealth management etc. It can also be used for next best offer, sequence and
seasonal offers.

Telecommunications

The telecommunications industry is characterized by high volatility and low
customer loyalty due to lucrative offers for new customers from other service
providers. The more services a customer uses from a particular operator, the
harder it gets for him/her to switch to another operator. Market basket
analysis is used to bundle mobile, landline, TV and internet services to
customers to increase stickiness and reduce churn.

Simple Example

Before we move on to the case study, let us use a simple example to understand
the important terminologies that we will come across in the rest of the
tutorial. In the example, the transactions include the following products:

• mobile phones
• ear phones
• USB cable
• power bank
• screen guard
• mobile case cover
• modem/router
• mouse
• external hard drive
Steps

The two important steps in market basket analysis are:

• frequent itemset generation
• rules generation

We will discuss these steps in more detail in the case study.

Itemset

An itemset is the collection of items purchased by a customer. In our example,
mobile phone and screen guard form a frequent itemset; they are present in
3 out of 5 transactions.

Antecedent & Consequent

The antecedent is the item(s) on the left hand side of the rule and the
consequent is the right hand side of the rule. In our example, mobile phone
is the antecedent and screen guard is the consequent.

Support

Support is the probability of the antecedent event occurring. It is the relative
frequency of the itemset. If it is less than 50%, the association is
considered less fruitful. In our example, support is the relative frequency of
transactions that include both mobile phone and screen guard.

Confidence

Confidence is the probability that the consequent will co-occur with the
antecedent. It expresses the operational efficiency of the rule. In our example,
it is the probability that a customer will purchase the screen guard provided
that he has purchased the mobile phone.

Lift

The lift ratio measures the efficiency of the rule in finding consequents,
compared to a random selection of transactions. Generally, a lift ratio
greater than one suggests some applicability of the rule. To compute the lift
for a rule, divide the support of the itemset by the product of the supports
of the antecedent and the consequent. Now, let us understand how to interpret lift.

Interpretation
• Lift = 1: implies no relationship between mobile phone and screen guard
(i.e., mobile phone and screen guard occur together only by chance)
• Lift > 1: implies that there is a positive relationship between mobile
phone and screen guard (i.e., mobile phone and screen guard occur together
more often than random)
• Lift < 1: implies that there is a negative relationship between mobile
phone and screen guard (i.e., mobile phone and screen guard occur together
less often than random)
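To make these definitions concrete, here is a minimal base R sketch. The five transactions below are hypothetical, made up only so that mobile phone and screen guard appear together in 3 out of 5 baskets, as in the example above; the arithmetic, not the data, is the point.

```r
## Hypothetical toy data: five transactions (3 of 5 contain both items)
transactions <- list(
  c("mobile phone", "screen guard"),
  c("mobile phone", "screen guard", "ear phones"),
  c("mobile phone", "screen guard", "power bank"),
  c("mobile phone", "USB cable"),
  c("ear phones", "mouse")
)
n <- length(transactions)

## relative frequencies (supports)
supp_phone <- sum(sapply(transactions, function(t) "mobile phone" %in% t)) / n  # 0.8
supp_guard <- sum(sapply(transactions, function(t) "screen guard" %in% t)) / n  # 0.6
supp_both  <- sum(sapply(transactions,
                         function(t) all(c("mobile phone", "screen guard") %in% t))) / n  # 0.6

## confidence of {mobile phone} => {screen guard}
confidence <- supp_both / supp_phone                 # 0.75

## lift: support of the itemset over the product of the individual supports
lift <- supp_both / (supp_phone * supp_guard)        # 1.25, i.e. > 1: positive association
```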

Data

Two public data sets are available for users to explore and learn market basket
analysis:

The groceries data set is available in the arules package as well. In this
tutorial, we will use the UCI data set as it closely resembles real world data
sets, giving us a chance to reshape the data and restructure it into the format
required by the arules package.

Data Dictionary
• invoice number
• stock code
• description
• quantity
• invoice date
• unit price
• customer id
• country

any loss of continuity.

As shown above, the data set has one row per item. We have created a tiny R
package, mbar, for data pre-processing. It can be installed from GitHub as
shown below:
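A minimal install sketch follows; the GitHub repository path used here is an assumption, so check the package's actual location before running it.

```r
## install.packages("devtools")  # if devtools is not already installed
## NOTE: the repository path below is assumed, verify it before use
devtools::install_github("rsquaredacademy/mbar")
```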

We will use mbar_prep_data() from the mbar package to reshape the data so
that there is one row per transaction with items across columns excluding
the column names.

library(mbar) mba_data <- read_excel("online-retail.xlsx") transactions <- mbar_prep_data(mba_data, InvoiceNo, Description) head(transactions) ## # A tibble: 6 x 1,114 ## item_1 item_2 item_3 item_4 item_5 item_6 item_7 item_8 item_9 item_10 ## ## 1 WHITE~ WHITE~ CREAM~ KNITT~ RED W~ SET 7~ GLASS~ "" "" "" ## 2 HAND ~ HAND ~ "" "" "" "" "" "" "" "" ## 3 ASSOR~ POPPY~ POPPY~ FELTC~ IVORY~ BOX O~ BOX O~ BOX O~ HOME ~ LOVE B~ ## 4 JAM M~ RED C~ YELLO~ BLUE ~ "" "" "" "" "" "" ## 5 BATH ~ "" "" "" "" "" "" "" "" "" ## 6 ALARM~ ALARM~ ALARM~ PANDA~ STARS~ INFLA~ VINTA~ SET/2~ ROUND~ SPACEB~ ## # ... with 1,104 more variables: item_11 , item_12 , ## # item_13 , item_14 , item_15 , item_16 , ## # item_17 , item_18 , item_19 , item_20 , ## # item_21 , item_22 , item_23 , item_24 , ## # item_25 , item_26 , item_27 , item_28 , ## # item_29 , item_30 , item_31 , item_32 , ## # item_33 , item_34 , item_35 , item_36 , ## # item_37 , item_38 , item_39 , item_40 , ## # item_41 , item_42 , item_43 , item_44 , ## # item_45 , item_46 , item_47 , item_48 , ## # item_49 , item_50 , item_51 , item_52 , ## # item_53 , item_54 , item_55 , item_56 , ## # item_57 , item_58 , item_59 , item_60 , ## # item_61 , item_62 , item_63 , item_64 , ## # item_65 , item_66 , item_67 , item_68 , ## # item_69 , item_70 , item_71 , item_72 , ## # item_73 , item_74 , item_75 , item_76 , ## # item_77 , item_78 , item_79 , item_80 , ## # item_81 , item_82 , item_83 , item_84 , ## # item_85 , item_86 , item_87 , item_88 , ## # item_89 , item_90 , item_91 , item_92 , ## # item_93 , item_94 , item_95 , item_96 , ## # item_97 , item_98 , item_99 , item_100 , ## # item_101 , item_102 , item_103 , item_104 , ## # item_105 , item_106 , item_107 , item_108 , ## # item_109 , item_110 , ... EDA

Before we generate the rules using the arules package, let us explore
the data set a bit.

What time of day do people purchase?

purchase_time <- mba_data %>%
  group_by(InvoiceDate) %>%
  slice(1) %>%
  mutate(time_of_day = hour(InvoiceDate)) %>%
  pull(time_of_day) %>%
  as.factor() %>%
  fct_count()
purchase_time %>%
  ggplot() +
  geom_col(aes(x = f, y = n), fill = "blue") +
  xlab("Hour of Day") +
  ylab("Transactions") +
  ggtitle("Hourly Transaction Distribution")

How many items are purchased on average?

items <- mba_data %>%
  group_by(InvoiceNo) %>%
  summarize(count = n()) %>%
  pull(count)
mean(items)
## [1] 20.92313
median(items)
## [1] 10

Most Purchased Items

mba_data %>%
  group_by(Description) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 4,212 x 2
##    Description                        count
##  1 WHITE HANGING HEART T-LIGHT HOLDER  2369
##  2 REGENCY CAKESTAND 3 TIER            2200
##  3 JUMBO BAG RED RETROSPOT             2159
##  4 PARTY BUNTING                       1727
##  5 LUNCH BAG RED RETROSPOT             1638
##  6 ASSORTED COLOUR BIRD ORNAMENT       1501
##  7 SET OF 3 CAKE TINS PANTRY DESIGN    1473
##  8                                     1454
##  9 PACK OF 72 RETROSPOT CAKE CASES     1385
## 10 LUNCH BAG BLACK SKULL.              1350
## # ... with 4,202 more rows

Average Order Value

total_revenue <- mba_data %>%
  group_by(InvoiceNo) %>%
  summarize(order_sum = sum(UnitPrice)) %>%
  pull(order_sum) %>%
  sum()
total_transactions <- mba_data %>%
  group_by(InvoiceNo) %>%
  summarize(n()) %>%
  nrow()
total_revenue / total_transactions
## [1] 96.47892

Read Data

It is now time to read the data into R. We will use read.transactions() rather
than read_csv(), owing to the way the data is structured. We will read the
transaction_data.csv file, as it contains the data we had modified
in the previous step. We need to specify the following in order to read the data:

• name of the data set within quotes (single or double)
• the format of the data: if each line represents a transaction, use basket,
and if each line represents an item in the transaction, use single
• the separator used to separate the items in a transaction

In our data set, each line represents a transaction and the items in the
transaction are separated by a ,.

basket_data <- read.transactions("transaction_data.csv", format = "basket", sep = ",") basket_data ## transactions in sparse format with ## 25901 transactions (rows) and ## 10085 items (columns)

If each line of the file represents an item and not a transaction, the format
argument should be set to the value single and the cols argument should specify
the names or positions of the columns that represent the transaction id and the
item id. We tried to read the data in this way as well but failed to do so.
However, the code is available below for other users to try; let us know if
you find a way to get it to work or spot any mistakes we may have made.

get_data <- read.transactions("retail.csv", format = "single", sep = ",", cols = c("InvoiceNo", "item"))

We were able to read the data when we removed the sep argument from the above
code, but the result from the summary() function was very different from what
we see in the next section, i.e. it showed a higher number of transactions and
items.

Data Summary

To get a quick overview of the data, use summary(). It will return the
following:

• number of transactions
• number of items
• most frequent items
• distribution of items
• five number summary
summary(basket_data) ## transactions as itemMatrix in sparse format with ## 25901 rows (elements/itemsets/transactions) and ## 10085 columns (items) and a density of 0.001660018 ## ## most frequent items: ## WHITE HANGING HEART T-LIGHT HOLDER REGENCY CAKESTAND 3 TIER ## 1999 1914 ## JUMBO BAG RED RETROSPOT PARTY BUNTING ## 1806 1488 ## LUNCH BAG RED RETROSPOT (Other) ## 1404 425005 ## ## element (itemset/transaction) length distribution: ## sizes ## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ## 1454 4578 1727 1208 942 891 781 715 696 683 612 642 547 530 543 ## 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 ## 555 537 479 459 491 428 405 328 311 280 248 261 235 221 233 ## 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 ## 224 175 174 145 149 139 122 119 100 117 98 94 102 93 72 ## 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 ## 73 74 71 69 68 59 70 49 49 54 57 42 32 42 39 ## 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 ## 34 40 22 27 30 24 34 28 25 21 23 26 14 17 24 ## 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 ## 11 18 14 13 10 16 18 15 10 9 16 13 16 13 7 ## 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 ## 8 12 12 8 7 7 4 7 9 5 8 8 4 5 7 ## 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 ## 2 3 7 9 4 7 4 2 7 1 1 4 7 6 2 ## 120 121 122 123 124 125 126 127 129 130 131 132 133 134 135 ## 3 5 4 4 2 5 6 2 1 4 3 6 6 3 4 ## 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 ## 3 2 1 1 3 8 5 3 4 4 6 2 3 1 4 ## 151 152 153 154 155 156 157 158 159 160 162 163 164 167 168 ## 3 2 4 7 3 3 5 2 4 5 1 2 1 3 5 ## 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 ## 2 2 4 3 1 3 5 1 2 2 2 2 1 2 1 ## 184 185 186 187 189 190 192 193 194 196 197 198 201 202 204 ## 2 1 1 2 2 1 1 5 1 2 3 2 1 1 2 ## 205 206 207 208 209 212 213 215 219 220 224 226 227 228 230 ## 2 1 3 3 2 1 2 2 7 1 3 3 1 1 2 ## 232 234 236 238 240 241 244 248 249 250 252 256 257 258 260 ## 1 2 1 2 2 2 1 1 2 2 1 1 1 1 2 ## 261 263 265 266 270 272 281 284 285 298 299 301 303 304 305 ## 1 2 1 1 1 1 1 1 2 
1 2 1 1 1 3 ## 312 314 316 320 321 326 327 329 332 333 338 339 341 344 348 ## 2 1 1 2 1 1 1 1 1 1 1 1 1 2 1 ## 350 360 365 367 375 391 394 398 400 402 405 411 419 422 429 ## 1 2 1 1 3 1 1 1 1 1 1 1 2 1 1 ## 431 442 447 460 468 471 477 509 514 530 587 627 1114 ## 2 1 1 1 1 1 1 1 1 1 1 1 1 ## ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.00 2.00 8.00 16.74 20.00 1114.00 ## ## includes extended item information - examples: ## labels ## 1 *Boombox Ipod Classic ## 2 *USB Office Mirror Ball ## 3 ? Item Frequency Plot

The most frequent items in the data set can be plotted using
itemFrequencyPlot(). The topN argument specifies the number of items to be
plotted, and the type argument specifies whether the Y axis represents the
absolute or relative number of transactions that include the item.

itemFrequencyPlot(basket_data, topN = 10, type = 'absolute')

In the plot below, the Y axis represents the relative frequency of the items
plotted.

itemFrequencyPlot(basket_data, topN = 10, type = 'relative')

Generate Rules

Finally, to the part you all have been waiting for, rules generation. The
apriori() function is used for generating the rules. We will first learn the
different inputs that must be specified and later on play around with them and
see how the rules generated change.

The first input is the data set, which in our case is basket_data. Next, we
will supply the mining parameters using the parameter argument:

• supp: minimum support for an itemset
• conf: minimum confidence
• maxlen: maximum number of items the antecedent may include
• target: the type of association mined i.e. rules

The parameter argument takes several additional inputs but to get started, it
is sufficient to know those mentioned above. All the inputs are supplied using
a list().

For our case study, we will specify the following:

• support: 0.009
• confidence: 0.8
• maxlen: 4

Keep in mind that mining association rules with very low values for support
will result in a large number of rules being generated, long execution times,
and a process that may eventually run out of memory.

rules <- apriori(basket_data, parameter = list(supp=0.009, conf=0.8, target = "rules", maxlen = 4)) ## Apriori ## ## Parameter specification: ## confidence minval smax arem aval originalSupport maxtime support minlen ## 0.8 0.1 1 none FALSE TRUE 5 0.009 1 ## maxlen target ext ## 4 rules FALSE ## ## Algorithmic control: ## filter tree heap memopt load sort verbose ## 0.1 TRUE TRUE FALSE TRUE 2 TRUE ## ## Absolute minimum support count: 233 ## ## set item appearances ...[0 item(s)] done [0.00s]. ## set transactions ...[10085 item(s), 25901 transaction(s)] done [1.16s]. ## sorting and recoding items ... [508 item(s)] done [0.03s]. ## creating transaction tree ... done [0.05s]. ## checking subsets of size 1 2 3 4 ## Warning in apriori(basket_data, parameter = list(supp = 0.009, conf = ## 0.8, : Mining stopped (maxlen reached). Only patterns up to a length of 4 ## returned! ## done [0.06s]. ## writing ... [22 rule(s)] done [0.00s]. ## creating S4 object ... done [0.02s].

Change the values of supp, conf and maxlen, and observe how the rules
generated change.

Rules Summary

Once the rules have been generated by apriori(), we can use summary() to
get some basic information such as rule length distribution.

summary(rules) ## set of 22 rules ## ## rule length distribution (lhs + rhs):sizes ## 2 3 4 ## 11 9 2 ## ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 2.000 2.000 2.500 2.591 3.000 4.000 ## ## summary of quality measures: ## support confidence lift count ## Min. :0.009034 Min. :0.8035 Min. :22.59 Min. :234.0 ## 1st Qu.:0.010453 1st Qu.:0.8530 1st Qu.:25.02 1st Qu.:270.8 ## Median :0.013223 Median :0.8868 Median :55.94 Median :342.5 ## Mean :0.012760 Mean :0.9120 Mean :48.55 Mean :330.5 ## 3rd Qu.:0.014362 3rd Qu.:1.0000 3rd Qu.:61.23 3rd Qu.:372.0 ## Max. :0.018339 Max. :1.0000 Max. :71.30 Max. :475.0 ## ## mining info: ## data ntransactions support confidence ## basket_data 25901 0.009 0.8

The output from summary() does not display the rules though. To view the
rules, we have to use inspect().

Inspect Rules

The inspect() function will display the rules along with:

• support
• confidence
• lift
• count

Before you inspect the rules, you can sort them by support, confidence or
lift. In the output below, we sort the rules by confidence in descending order
before inspecting them.

basket_rules <- sort(rules, by = 'confidence', decreasing = TRUE)
inspect(basket_rules[1:10])
## lhs rhs support confidence lift count
## [1] {BACK DOOR} => {KEY FOB} 0.009613528 1.0000000 61.23168 249
## [2] {SET 3 RETROSPOT TEA} => {SUGAR} 0.014362380 1.0000000 69.62634 372
## [3] {SUGAR} => {SET 3 RETROSPOT TEA} 0.014362380 1.0000000 69.62634 372
## [4] {SET 3 RETROSPOT TEA} => {COFFEE} 0.014362380 1.0000000 55.94168 372
## [5] {SUGAR} => {COFFEE} 0.014362380 1.0000000 55.94168 372
## [6] {SHED} => {KEY FOB} 0.011273696 1.0000000 61.23168 292
## [7] {SET 3 RETROSPOT TEA,
## SUGAR} => {COFFEE} 0.014362380 1.0000000 55.94168 372
## [8] {COFFEE,
## SET 3 RETROSPOT TEA} => {SUGAR} 0.014362380 1.0000000 69.62634 372
## [9] {COFFEE,
## SUGAR} => {SET 3 RETROSPOT TEA} 0.014362380 1.0000000 69.62634 372
## [10] {PINK REGENCY TEACUP AND SAUCER,
## REGENCY CAKESTAND 3 TIER,
## ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.009999614 0.8900344 25.16679 259

Redundant & Non-Redundant Rules

Redundant Rules

A rule is redundant if a more general rule with the same or a higher
confidence exists. That is, a more specific rule is redundant if it is only
equally or even less predictive than a more general rule. A rule is more
general if it has the same RHS but one or more items removed from the LHS.
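A hypothetical numeric example (counts invented for illustration): suppose {TEA} => {SUGAR} holds in 90 of the 100 tea transactions, while the more specific {TEA, MUG} => {SUGAR} holds in 36 of the 40 tea-and-mug transactions. Both confidences are 0.9, so the extra item buys nothing:

```r
# Invented counts over a hypothetical store
n_tea           <- 100  # transactions containing {TEA}
n_tea_sugar     <- 90   # transactions containing {TEA, SUGAR}
n_tea_mug       <- 40   # transactions containing {TEA, MUG}
n_tea_mug_sugar <- 36   # transactions containing {TEA, MUG, SUGAR}

conf_general  <- n_tea_sugar / n_tea          # {TEA} => {SUGAR}
conf_specific <- n_tea_mug_sugar / n_tea_mug  # {TEA, MUG} => {SUGAR}

# The specific rule is not more predictive than the general one,
# which is exactly the situation is.redundant() flags
conf_specific <= conf_general
```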

Example 1

In the above example, the first rule has the same support, confidence and lift
as the next two rules. The second item on the left hand side of the rule does
not add any value and as such makes the rule redundant.

Example 2

In the above example, the first two rules have the same support, confidence and
lift. The third rule differs only with respect to lift.

Example 3

In the above example, the first and third rules have the same support,
confidence and lift. The second rule differs with respect to confidence
and lift.

Now that we have understood what redundant rules are and how to identify them,
let us use the below R code to inspect them.

inspect(rules[is.redundant(rules)])
## lhs rhs support
## [1] {SET 3 RETROSPOT TEA,SUGAR} => {COFFEE} 0.01436238
## [2] {COFFEE,SET 3 RETROSPOT TEA} => {SUGAR} 0.01436238
## [3] {COFFEE,SUGAR} => {SET 3 RETROSPOT TEA} 0.01436238
## confidence lift count
## [1] 1 55.94168 372
## [2] 1 69.62634 372
## [3] 1 69.62634 372

Non-redundant Rules

Now let us look at the non-redundant rules.

inspect(rules[!is.redundant(rules)])
## lhs rhs support confidence lift count
## [1] {REGENCY TEA PLATE PINK} => {REGENCY TEA PLATE GREEN} 0.009034400 0.8863636 71.29722 234
## [2] {BACK DOOR} => {KEY FOB} 0.009613528 1.0000000 61.23168 249
## [3] {SET 3 RETROSPOT TEA} => {SUGAR} 0.014362380 1.0000000 69.62634 372
## [4] {SUGAR} => {SET 3 RETROSPOT TEA} 0.014362380 1.0000000 69.62634 372
## [5] {SET 3 RETROSPOT TEA} => {COFFEE} 0.014362380 1.0000000 55.94168 372
## [6] {COFFEE} => {SET 3 RETROSPOT TEA} 0.014362380 0.8034557 55.94168 372
## [7] {SUGAR} => {COFFEE} 0.014362380 1.0000000 55.94168 372
## [8] {COFFEE} => {SUGAR} 0.014362380 0.8034557 55.94168 372
## [9] {REGENCY TEA PLATE GREEN} => {REGENCY TEA PLATE ROSES} 0.010347091 0.8322981 55.99313 268
## [10] {SHED} => {KEY FOB} 0.011273696 1.0000000 61.23168 292
## [11] {SET/6 RED SPOTTY PAPER CUPS} => {SET/6 RED SPOTTY PAPER PLATES} 0.012084476 0.8087855 44.38211 313
## [12] {SET/20 RED RETROSPOT PAPER NAPKINS,
## SET/6 RED SPOTTY PAPER CUPS} => {SET/6 RED SPOTTY PAPER PLATES} 0.009111617 0.8872180 48.68609 236
## [13] {PINK REGENCY TEACUP AND SAUCER,
## ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.018339060 0.8828996 24.96505 475
## [14] {GREEN REGENCY TEACUP AND SAUCER,
## PINK REGENCY TEACUP AND SAUCER} => {ROSES REGENCY TEACUP AND SAUCER} 0.018339060 0.8512545 22.59051 475
## [15] {PINK REGENCY TEACUP AND SAUCER,
## REGENCY CAKESTAND 3 TIER} => {ROSES REGENCY TEACUP AND SAUCER} 0.011235087 0.8584071 22.78033 291
## [16] {PINK REGENCY TEACUP AND SAUCER,
## REGENCY CAKESTAND 3 TIER} => {GREEN REGENCY TEACUP AND SAUCER} 0.011312305 0.8643068 24.43931 293
## [17] {STRAWBERRY CHARLOTTE BAG,
## WOODLAND CHARLOTTE BAG} => {RED RETROSPOT CHARLOTTE BAG} 0.010771785 0.8110465 23.65644 279
## [18] {PINK REGENCY TEACUP AND SAUCER,
## REGENCY CAKESTAND 3 TIER,
## ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.009999614 0.8900344 25.16679 259
## [19] {GREEN REGENCY TEACUP AND SAUCER,
## PINK REGENCY TEACUP AND SAUCER,
## REGENCY CAKESTAND 3 TIER} => {ROSES REGENCY TEACUP AND SAUCER} 0.009999614 0.8839590 23.45843 259

What influenced purchase of product X?

So far, we have learnt how to generate, inspect and prune rules. Now, how do
we use these rules? To make business sense, we need to come up with a set of
rules that can be used for product placement in physical stores, as
recommendations in an online store, or for targeted marketing via email
campaigns. To achieve that, we need to know two things:

• what products influenced the purchase of product X?
• what purchases did product X influence?

For our case study, we can modify the above questions as:

What influenced the purchase of sugar?

To view the products which influenced the purchase of sugar, we will
continue to use the apriori() function but add one more argument, appearance,
which restricts the appearance of the items. Since we want the right hand side
of the rules to have only one value, sugar, we will set the rhs argument to
sugar. The left hand side of the rules should include all the products that
influenced the purchase of sugar, i.e. it will exclude sugar itself. We will
use the default argument and supply it the value lhs, i.e. all items excluding
sugar can appear on the left hand side of a rule by default.

• default
• rhs
sugar_rules <- apriori(basket_data, parameter = list(supp = 0.009, conf = 0.8), appearance = list(default = "lhs", rhs = "SUGAR"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.009 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 233
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[10085 item(s), 25901 transaction(s)] done [1.26s].
## sorting and recoding items ... [508 item(s)] done [0.03s].
## creating transaction tree ... done [0.05s].
## checking subsets of size 1 2 3 4 done [0.08s].
## writing ... [3 rule(s)] done [0.02s].
## creating S4 object ... done [0.01s].

rules_sugar <- sort(sugar_rules, by = "confidence", decreasing = TRUE)
inspect(rules_sugar)
## lhs rhs support confidence lift
## [1] {SET 3 RETROSPOT TEA} => {SUGAR} 0.01436238 1.0000000 69.62634
## [2] {COFFEE,SET 3 RETROSPOT TEA} => {SUGAR} 0.01436238 1.0000000 69.62634
## [3] {COFFEE} => {SUGAR} 0.01436238 0.8034557 55.94168
## count
## [1] 372
## [2] 372
## [3] 372

For the support and confidence we have mentioned, we know the following
products influenced the purchase of sugar:

• COFFEE
• SET 3 RETROSPOT TEA
What purchases did product X influence?

Now that we know what products influenced the purchase of sugar, let us turn the question around.

What purchases did sugar influence?

In this case, we want sugar to be on the left hand side of the rule and all
the products it influenced to be on the right hand side. We set the lhs
argument to sugar and the default argument to rhs, as all the products whose
purchase was influenced by sugar should appear on the right hand side of the
rule by default.

sugar_rules <- apriori(basket_data, parameter = list(supp = 0.009, conf = 0.8), appearance = list(default = "rhs", lhs = "SUGAR"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.009 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 233
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[10085 item(s), 25901 transaction(s)] done [1.25s].
## sorting and recoding items ... [508 item(s)] done [0.01s].
## creating transaction tree ... done [0.06s].
## checking subsets of size 1 2 done [0.02s].
## writing ... [2 rule(s)] done [0.00s].
## creating S4 object ... done [0.03s].

rules_sugar <- sort(sugar_rules, by = "confidence", decreasing = TRUE)
inspect(rules_sugar)
## lhs rhs support confidence lift count
## [1] {SUGAR} => {SET 3 RETROSPOT TEA} 0.01436238 1 69.62634 372
## [2] {SUGAR} => {COFFEE} 0.01436238 1 55.94168 372

For the support and confidence we have specified, we know the purchase of
the following products was influenced by sugar:

• COFFEE
• SET 3 RETROSPOT TEA
Top Rules

Let us take a look at the top rules by support, confidence and lift.

Support

supp_rules <- sort(rules, by = 'support', decreasing = TRUE)
top_rules <- supp_rules[1:10]
inspect(top_rules)
## lhs rhs support confidence lift count
## [1] {PINK REGENCY TEACUP AND SAUCER,
## ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.01833906 0.8828996 24.96505 475
## [2] {GREEN REGENCY TEACUP AND SAUCER,
## PINK REGENCY TEACUP AND SAUCER} => {ROSES REGENCY TEACUP AND SAUCER} 0.01833906 0.8512545 22.59051 475
## [3] {SET 3 RETROSPOT TEA} => {SUGAR} 0.01436238 1.0000000 69.62634 372
## [4] {SUGAR} => {SET 3 RETROSPOT TEA} 0.01436238 1.0000000 69.62634 372
## [5] {SET 3 RETROSPOT TEA} => {COFFEE} 0.01436238 1.0000000 55.94168 372
## [6] {COFFEE} => {SET 3 RETROSPOT TEA} 0.01436238 0.8034557 55.94168 372
## [7] {SUGAR} => {COFFEE} 0.01436238 1.0000000 55.94168 372
## [8] {COFFEE} => {SUGAR} 0.01436238 0.8034557 55.94168 372
## [9] {SET 3 RETROSPOT TEA,
## SUGAR} => {COFFEE} 0.01436238 1.0000000 55.94168 372
## [10] {COFFEE,
## SET 3 RETROSPOT TEA} => {SUGAR} 0.01436238 1.0000000 69.62634 372

Confidence

conf_rules <- sort(rules, by = 'confidence', decreasing = TRUE)
top_rules <- conf_rules[1:10]
inspect(top_rules)
## lhs rhs support confidence lift count
## [1] {BACK DOOR} => {KEY FOB} 0.009613528 1.0000000 61.23168 249
## [2] {SET 3 RETROSPOT TEA} => {SUGAR} 0.014362380 1.0000000 69.62634 372
## [3] {SUGAR} => {SET 3 RETROSPOT TEA} 0.014362380 1.0000000 69.62634 372
## [4] {SET 3 RETROSPOT TEA} => {COFFEE} 0.014362380 1.0000000 55.94168 372
## [5] {SUGAR} => {COFFEE} 0.014362380 1.0000000 55.94168 372
## [6] {SHED} => {KEY FOB} 0.011273696 1.0000000 61.23168 292
## [7] {SET 3 RETROSPOT TEA,
## SUGAR} => {COFFEE} 0.014362380 1.0000000 55.94168 372
## [8] {COFFEE,
## SET 3 RETROSPOT TEA} => {SUGAR} 0.014362380 1.0000000 69.62634 372
## [9] {COFFEE,
## SUGAR} => {SET 3 RETROSPOT TEA} 0.014362380 1.0000000 69.62634 372
## [10] {PINK REGENCY TEACUP AND SAUCER,
## REGENCY CAKESTAND 3 TIER,
## ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.009999614 0.8900344 25.16679 259

Lift

lift_rules <- sort(rules, by = 'lift', decreasing = TRUE)
top_rules <- lift_rules[1:10]
inspect(top_rules)
## lhs rhs support confidence lift count
## [1] {REGENCY TEA PLATE PINK} => {REGENCY TEA PLATE GREEN} 0.009034400 0.8863636 71.29722 234
## [2] {SET 3 RETROSPOT TEA} => {SUGAR} 0.014362380 1.0000000 69.62634 372
## [3] {SUGAR} => {SET 3 RETROSPOT TEA} 0.014362380 1.0000000 69.62634 372
## [4] {COFFEE,
## SET 3 RETROSPOT TEA} => {SUGAR} 0.014362380 1.0000000 69.62634 372
## [5] {COFFEE,
## SUGAR} => {SET 3 RETROSPOT TEA} 0.014362380 1.0000000 69.62634 372
## [6] {BACK DOOR} => {KEY FOB} 0.009613528 1.0000000 61.23168 249
## [7] {SHED} => {KEY FOB} 0.011273696 1.0000000 61.23168 292
## [8] {REGENCY TEA PLATE GREEN} => {REGENCY TEA PLATE ROSES} 0.010347091 0.8322981 55.99313 268
## [9] {SET 3 RETROSPOT TEA} => {COFFEE} 0.014362380 1.0000000 55.94168 372
## [10] {COFFEE} => {SET 3 RETROSPOT TEA} 0.014362380 0.8034557 55.94168 372

Visualization

To visualize the rules, the authors of arules have created a companion
package, arulesViz. It offers several options for visualizing the rules
generated by apriori().

Scatter Plot

We can use the default plot() method to create a scatter plot. It plots
support on the X axis and confidence on the Y axis, while lift is
represented by the opacity/alpha of the colour of the points.
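A minimal sketch of the call follows. Since basket_data is not bundled with arules, this uses the package's built-in Groceries dataset; swap in your own rule set as needed.

```r
library(arules)
library(arulesViz)

# Built-in stand-in data, since the post's basket_data is not bundled
data(Groceries)
grocery_rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.5))

# Scatter plot: support on X, confidence on Y, points shaded by lift
plot(grocery_rules, measure = c("support", "confidence"), shading = "lift")
```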

Network Plot

We can create a network plot using the method argument and supplying it the
value graph. You can see the directionality of the rules in the plot below.
For example, people who buy a shed also buy a key fob and, similarly, people
who buy a back door also buy a key fob. It becomes difficult to identify
the directionality of the rules when we try to plot too many rules at once.
The method argument takes several other values as well.

plot(top_rules, method = 'graph')

Things to Keep in Mind

Directionality of rule is lost while using lift

The directionality of a rule is lost when using lift. In the below example,
the lift is the same for both of the following rules:

• {Mobile Phone} => {Screen Guard}
• {Screen Guard} => {Mobile Phone}

It is clear that the lift is the same irrespective of the direction of the rule.
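This symmetry follows from the definition lift(A => B) = supp(A, B) / (supp(A) × supp(B)), which does not change when A and B swap sides. A quick check with invented counts (these numbers are purely illustrative):

```r
# Invented counts over 1000 hypothetical transactions
n_total <- 1000
n_phone <- 200  # {Mobile Phone}
n_guard <- 100  # {Screen Guard}
n_both  <- 80   # {Mobile Phone, Screen Guard}

supp_both <- n_both / n_total
lift_pg <- supp_both / ((n_phone / n_total) * (n_guard / n_total))  # {Mobile Phone} => {Screen Guard}
lift_gp <- supp_both / ((n_guard / n_total) * (n_phone / n_total))  # {Screen Guard} => {Mobile Phone}

c(lift_pg, lift_gp)  # identical in both directions
```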

Confidence as a measure can be misleading

If you look at the below example, the confidence for the second rule,
{Screen Guard} => {Mobile Phone}, is greater than that of the first rule,
{Mobile Phone} => {Screen Guard}. That does not mean we can recommend
a mobile phone to a customer who is purchasing a screen guard. It is important
to ensure that we do not use rules just because they have a high confidence
associated with them.
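A hypothetical example with invented counts: say 200 customers bought a mobile phone, 100 bought a screen guard, and 80 bought both. Confidence is then direction-dependent:

```r
# Invented counts for illustration
n_phone <- 200
n_guard <- 100
n_both  <- 80

conf_pg <- n_both / n_phone  # {Mobile Phone} => {Screen Guard}
conf_gp <- n_both / n_guard  # {Screen Guard} => {Mobile Phone}

# The higher-confidence direction is the commercially useless one:
# a screen-guard buyer almost certainly already owns the phone
c(conf_pg, conf_gp)
```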

Summary
• market basket analysis is an unsupervised data mining technique
• uncovers products frequently bought together
• creates if-then scenario rules
• cost-effective, insightful and actionable
• association rule mining has applications in several industries
• directionality of rule is lost while using lift
• confidence as a measure can be misleading

Your knowledge of the domain/business matters a lot when you are trying to
use market basket analysis, especially when you are trying to select or
shortlist rules and use them for product placement in a store or for
recommending products online. It is good practice to critically review the
rules before implementing them.


### Process Mining (Part 3/3): More analysis and visualizations

Thu, 05/02/2019 - 02:00

(This article was first published on R on notast, and kindly contributed to R-bloggers)

Intro

A week ago, Harvard Business Review published an article on process mining and provided reasons for companies to adopt it. If you need a refresher on the concepts of process mining, you can refer to my first post. Conducting process mining is easy with R’s bupaR package. bupaR allows you to create a variety of visualizations as you analyse event logs. It includes visualizations of workflow on the ground, which you can then compare against theoretical models to discover deviations. However, there are some analyses and visualizations which are not included in bupaR. I will cover these in this last post on process mining. (The datasets used in this post are healthcare related.)

Interruption Index

The interruption index measures how much a resource has to toggle between cases before completing a case, instead of completing all the activities for one case before proceeding to the next. The toggling between incomplete cases could be due to many reasons. However, if the reason is disruption in the workflow, it will result in inefficiency and loss of productivity. (If there is a proper term for what I’m doing, please let me know in the comments below)

Index components

The interruption index is a ratio of two parameters, time block and case load.

A time block is formed by arranging all the activity instances executed by a specific resource in chronological order within a defined time window; for this post, the window is a day. Consecutive activity instances for a particular case are grouped together to form a time block. In the example below, there are four time blocks.

Activity   Resource  Time   Day  Case  Time Block
Admission  Z         10:40  1    AA    1
Admission  Z         11:11  1    BB    2
X-ray      Z         11:30  1    BB    2
X-ray      Z         12:00  1    AA    3
Scans      Z         12:44  1    AA    3
Pre-op     Z         13:59  1    BB    4

The caseload component of the ratio refers to the maximum number of cases seen by a specific resource on a particular day. In the example above, there is a maximum of 2 cases. Hence, the index is 2 (4/2 = 2). The lowest possible score is 1, which means there was no toggling between cases and the day was the least interrupted.
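Counting time blocks is just counting runs of consecutive identical cases, so the index for the example day can be sketched in base R with rle():

```r
# Case sequence for resource Z on day 1, in chronological order (from the table above)
cases <- c("AA", "BB", "BB", "AA", "AA", "BB")

timeblock <- length(rle(cases)$lengths)  # runs of consecutive identical cases
caseload  <- length(unique(cases))       # distinct cases seen that day

interrupt_index <- timeblock / caseload
interrupt_index  # 4 time blocks / 2 cases = 2
```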

Calculating the index

I will be using the sepsis eventlog from the bupaR package to illustrate the interruption index.

There are 26 resource providers in the dataset, but we will only compare the index between 2 resources for simplicity.

# libraries
library(plyr)
library(tidyverse)
library(bupaR)
theme_set(theme_light())

n_resources(sepsis)
## [1] 26

Let’s do some wrangling to derive a dataframe which we will work with.

#derive desired df
sepsis_df <- sepsis %>%
  filter_resource(c("A", "B")) %>% # filter 2 resources for our example
  data.frame() %>% # convert event log object
  select(case_id, activity, lifecycle, resource, timestamp) %>% # select relevant variables
  drop_na() # drop na observations

In this post, the timeframe for the index is a day, so we will create a day identifier, ID_day. The index requires us to calculate the caseload for each day for each resource.

sepsis_df <- sepsis_df %>%
  mutate(Date = as.Date(timestamp)) %>%
  mutate("ID_day" = group_indices_(., .dots = c("resource","Date"))) %>% # add ID_day
  group_by(ID_day) %>%
  arrange(ID_day, timestamp) %>%
  select(resource, ID_day, case_id) %>%
  mutate(caseload = n_distinct(case_id)) %>% # caseload
  ungroup()

sepsis_df %>% head()
## # A tibble: 6 x 4
## resource ID_day case_id caseload
##
## 1 A 1 XJ 1
## 2 A 1 XJ 1
## 3 A 1 XJ 1
## 4 A 1 XJ 1
## 5 A 2 WEA 1
## 6 A 2 WEA 1

We now calculate the time blocks and then the interruption index.

#remove duplicate rows
ix <- c(TRUE, rowSums(tail(sepsis_df, -1) == head(sepsis_df, -1)) != ncol(sepsis_df))
sepsis_df <- sepsis_df[ix,]

#transpose case_id column
sepsis_df <- ddply(sepsis_df, .(ID_day), transform, idx = paste("TB", 1:length(case_id), sep = "")) %>%
  spread(idx, case_id)

#calculate timeblocks
sepsis_df <- sepsis_df %>%
  mutate(timeblock = rowSums(!is.na(select(., starts_with("TB"))))) %>%
  select(-starts_with("TB")) # remove redundant TB variables

#calculate index
sepsis_df$interupt_index <- sepsis_df$timeblock / sepsis_df$caseload

# sample size of index for each resource
sepsis_df <- sepsis_df %>% add_count(resource, interupt_index)

head(sepsis_df)
## # A tibble: 6 x 6
## resource ID_day caseload timeblock interupt_index n
##
## 1 A 1 1 1 1 282
## 2 A 2 1 1 1 282
## 3 A 3 1 1 1 282
## 4 A 4 1 1 1 282
## 5 A 5 2 5 2.5 2
## 6 A 6 1 1 1 282

Visualizing the index

Let’s plot the interruption index for Resource A and Resource B.

sepsis_df %>% ggplot(aes(interupt_index, n, size = n)) + geom_point() + facet_grid(~resource)

Resource A and Resource B experienced an interruption index of 1 on most days. In other words, they completed all the activities for a specific case before moving to the next case. Resource B has more days with an index of 1 than Resource A. There is greater variation in the interruption index for Resource A, with a maximum ratio of 3.

Timeseries Heatmap

bupaR has many fantastic built-in functions to create various visualizations that address different questions about workflow. Unfortunately, bupaR does not have a function to create a time series heat map. I find time series heat maps useful for identifying peak and lull periods for each activity. Discovering peak periods allows managers to allocate more resources to prevent the accumulation of bottlenecks. Understanding lull periods can free up excess capacity.

Data wrangling

We will use a different dataset, patients, to demonstrate how to create a time series heat map from an event log. The patients dataset is an event log of a series of activities conducted from when patients are admitted till they are discharged. However, the current factor levelling for the activity variable, handling, is not reflective of an expected workflow when patients are admitted. For instance, "Blood test" is the first activity and "Registration" is the fifth activity.

patients_df <- data.frame(patients) # convert the eventlog object to a dataframe object
levels(patients_df$handling)
## [1] "Blood test" "Check-out" "Discuss Results"
## [4] "MRI SCAN" "Registration" "Triage and Assessment"
## [7] "X-Ray"

We’ll need to relevel the activities in handling to a sequence likely seen in a hospital admission.

patients_df <- patients_df %>% mutate(handling = fct_relevel(handling, "Registration", "Triage and Assessment", "Blood test", "X-Ray", "MRI SCAN", "Discuss Results", "Check-out"))
levels(patients_df$handling)
## [1] "Registration" "Triage and Assessment" "Blood test"
## [4] "X-Ray" "MRI SCAN" "Discuss Results"
## [7] "Check-out"

Plotting the heatmap

Traditionally, I would use the plasma colour palette in scale_fill_viridis_c for heatmaps. However, since I encountered this TidyTuesday tweet, I have adopted the Zissou1 colour palette from the wesanderson package for my heatmaps.

patients_df %>%
  dplyr::mutate(
    time = format(time, format = "%H:%M:%S") %>% as.POSIXct(format = "%H:%M:%S"), # standardize the date for plotting
    hour = lubridate::floor_date(time, "hour")) %>% # round down time to nearest hour
  count(handling, hour) %>% # total instances of each activity at each hour
  add_count(handling, wt = n) %>% # total instances of each activity
  mutate(percent = ((n/nn)*100)) %>% # relative freq for each activity
  ggplot(aes(hour, handling, fill = percent)) +
  geom_tile(size = .5, color = "white") +
  scale_fill_gradientn(colours = wesanderson::wes_palette("Zissou1", 20, type = "continuous")) +
  theme_classic() +
  labs(x = "24hour Clock", y = "", title = "Peak and Lull Period of Patient Activities", subtitle = "percentage calculated is the relative frequency for a specific activity", fill = "%") +
  scale_y_discrete(limits = rev(levels(patients_df$handling))) + # reverse display of y-axis variables
  scale_x_datetime(date_breaks = ("1 hour"), date_labels = "%H") # display only 24H clock values

The heatmap reveals that the most common time to check out is 1700 hours and the most common time to register is 1800 hours. This makes logical sense, as existing patients need to be discharged before the hospital can admit new patients. There is consecutive alternation between warm and cool colours for the blood test activity from 0900 hours to 1600 hours. I hypothesize that the hourly fluctuation is due to insufficient machinery/manpower during office hours. During the lull periods, manpower and machinery are operating at maximum capacity to process blood collected during the previous peak period. There is inadequate machinery/manpower to accommodate more blood tests, thus fewer blood tests are conducted.

To Sum Up

In this last post on process mining, we looked at analyses and visualizations that are not included in the bupaR package but are still useful for understanding the workflow of business operations. We calculated the interruption index to examine the extent of disruption when a resource attends to a specific case. We also plotted a heat map to visualize busy and quiet periods of the various activities in an event log.



### Bayesian models in R

Wed, 05/01/2019 - 19:41

(This article was first published on poissonisfish, and kindly contributed to R-bloggers)

Greater Ani (Crotophaga major) is a cuckoo species whose females occasionally lay eggs in conspecific nests, a form of parasitism recently explored [source]

If there was one thing that always frustrated me, it was not fully understanding Bayesian inference. Sometime last year, I came across an article about a TensorFlow-supported R package for Bayesian analysis, called greta. Back then, I searched for greta tutorials and stumbled on this blog post that praised a textbook called Statistical Rethinking: A Bayesian Course with Examples in R and Stan by Richard McElreath. I had found a solution to my lingering frustration, so I bought a copy straight away.

I spent the last few months reading it cover to cover and solving the proposed exercises, which are heavily based on the rethinking package. I cannot recommend it highly enough to whoever seeks a solid grip on Bayesian statistics, both in theory and application. This post ought to be my most gratifying blogging experience so far, in that I am essentially reporting my own recent learning. I am convinced this will make the storytelling all the more effective.

As a demonstration, the female cuckoo reproductive output data recently analysed by Riehl et al., 2019 [1] will be modelled using

• Poisson and zero-inflated Poisson regressions, based on the rethinking package;
• A logistic regression, based on the greta package.

In the process, we will conduct the MCMC sampling, visualise posterior distributions, generate predictions and ultimately assess the influence of social parasitism in female reproductive output. You should have some familiarity with standard statistical models. If you need to refresh some basics of probabilities using R have a look into my first post. All materials are available under https://github.com/monogenea/cuckooParasitism. I hope you enjoy as much as I did!

NOTE that most code chunks containing pipes are corrupted. PLEASE refer to the materials from the repo.

Introduction Frequentist perspective

It is human nature to try to reduce complexity in learning things, to discretise quantities, and this is especially true in modern statistics. When we need to estimate any given unknown parameter we usually produce the single most plausible value. Think of flipping a coin a thousand times, not knowing whether it is biased and by how much. Let p̂ be the proportion of heads observed in the thousand trials. If I ask you to estimate p, the probability of having heads in any given trial, what would your answer be?

Chances are you will say p̂, which is a sensible choice (the hat means ‘estimate’). The observed frequency of heads is the maximum-likelihood estimate (MLE) of p in our experiment. This is the intuitive frequentist perspective endorsed by most people.

Bayesian perspective

The Bayesian perspective is more comprehensive. It produces no single value, but rather a whole probability distribution for the unknown parameter θ conditional on your data X. This probability distribution, P(θ|X), is called the posterior. The posterior comes from one of the most celebrated works of Rev. Thomas Bayes that you have probably met before,

P(θ|X) = P(X|θ) P(θ) / P(X)

or, in plain words,

posterior = likelihood × prior / average likelihood

The posterior can be computed from three key ingredients:

• A likelihood distribution, P(X|θ);
• A prior distribution, P(θ);
• The ‘average likelihood’, P(X).

All Bayes’ theorem does is update some prior belief by accounting for the observed data, while ensuring the resulting probability distribution integrates to exactly one.

The following reconstruction of the theorem in three simple steps will bridge the gap between the frequentist and Bayesian perspectives.

Step 1. All possible ways (likelihood distribution)

Some five years ago, my brother and I were playing roulette in the casino of Portimão, Portugal. Among other things, you can bet on hitting either black (B) or red (r) with supposedly equal probability. For simplification, we assume P(B) = P(r) and that I remember the ten past draws before we placed a bet:

Having observed eight blacks in these ten draws, my brother argued we should go for black. His reasoning was that there would be a greater chance of hitting black than red, to which I kind of agreed. We eventually placed a bet on black and won. Knowing nothing of the true chances of hitting either colour in this example, p̂ = 8/10 = 0.8 is the MLE of P(B). This is the frequentist approach. But wouldn’t you rather assume P(B) = 0.5?

A different way of thinking is to consider the likelihoods obtained using different estimates of P(B). If we compute the likelihood over 100 estimates of P(B) ranging from 0 to 1, we can confidently approximate its distribution. Here, the probability mass function of the binomial distribution with eight successes in ten trials provides the likelihood of all the different estimates of P(B). We can demonstrate it with a few lines of R code.

rangeP <- seq(0, 1, length.out = 100)
plot(rangeP, dbinom(x = 8, prob = rangeP, size = 10), type = "l", xlab = "P(Black)", ylab = "Density")

As the name indicates, the MLE in the roulette problem is the peak of the likelihood distribution. However, here we uncover an entire spectrum comprising all the possible ways the eight blacks in ten draws could have been produced.
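We can confirm this numerically: the grid value that maximises the binomial likelihood sits at the observed frequency of black, 0.8, up to the resolution of the grid.

```r
rangeP <- seq(0, 1, length.out = 100)
lik <- dbinom(x = 8, size = 10, prob = rangeP)

# The grid point with the highest likelihood is the (approximate) MLE
mle <- rangeP[which.max(lik)]
round(mle, 2)  # 0.8
```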

Step 2. Update your belief (prior distribution)

We are not even half-way through our Bayesian excursion. The omission of a prior, which is the same as passing a uniform prior, dangerously gives the likelihood free rein in inference. Such priors are also called ‘flat’. On the other hand, informative priors constrain parameter estimation, more so the narrower they are. This should resonate with those familiar with Lasso and ridge regularisation. Also, note that multiplying a likelihood distribution by a constant does not change its shape, even if it changes its density.

Going back to the roulette example, assume I intervened and expressed my belief to my brother that P(B) must be 0.5 or close to it, e.g. P(B) ~ Normal(0.5, 0.1). For comparison, overlay this prior distribution with the likelihood from the previous step.

lines(rangeP, dnorm(x = rangeP, mean = .5, sd = .1) / 15, col = "red")

The prior is now shown in red. In the code above, I divided the prior by a constant solely for scaling purposes. Keep in mind that distribution density only matters for the posterior.

Computing the product between the likelihood and my prior is straightforward, and gives us the numerator from the theorem. The next bit will compute and overlay the unstandardised posterior of P(B), i.e. likelihood × prior. The usage of a sequence of estimates of P(B) to reconstruct probability distributions is called grid approximation.

lik <- dbinom(x = 8, prob = rangeP, size = 10)
prior <- dnorm(x = rangeP, mean = .5, sd = .1)
lines(rangeP, lik * prior, col = "green")

In short, we have successfully used the ten roulette draws (black) to update my prior (red) into the unstandardised posterior (green). Why am I calling it ‘unstandardised’? The answer comes with the denominator from the theorem.

Step 3. Make it sum up to one (standardising the posterior)

An important property of any probability density or mass function is that it integrates to one. This is the the role of that ugly denominator we simply called ‘average likelihood’. It standardises into the actual posterior with density of one. Knowing density is the sole difference, then the posterior is always proportional to the unstandardised posterior:

P(P|X) ∝ P(X|P) × P(P)

The ∝ symbol means ‘proportional to’. We will now finalise the roulette example by standardising the posterior computed above and comparing all pieces of the theorem.

unstdPost <- lik * prior
stdPost <- unstdPost / sum(unstdPost)
lines(rangeP, stdPost, col = "blue")
legend("topleft", legend = c("Lik", "Prior", "Unstd Post", "Post"),
       text.col = 1:4, bty = "n")

Note how shape is preserved between the unstandardised and the actual posterior distributions. In this instance we could use the unstandardised form for various things such as simulating draws. However, when additional parameters and competing models come into play you should stick to the actual posterior.
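The point about shape preservation can be checked directly: sampling from the grid, as sketched below, gives identical draws either way, because R's sample() rescales the probability weights internally. The grid objects are redefined here so the snippet stands alone.

```r
# Redefine the grid posterior so this snippet is self-contained
rangeP <- seq(0, 1, length.out = 100)
lik <- dbinom(x = 8, prob = rangeP, size = 10)
prior <- dnorm(x = rangeP, mean = .5, sd = .1)
stdPost <- lik * prior / sum(lik * prior)

# Because sample() rescales the prob weights internally, the
# unstandardised posterior would yield the exact same draws
set.seed(1)
draws <- sample(rangeP, size = 1e4, prob = stdPost, replace = TRUE)
mean(draws)                      # posterior mean of P
quantile(draws, c(.025, .975))   # a crude 95% probability interval
```

With a tight N(0.5, 0.1) prior against a likelihood peaking at 0.8, the posterior mean lands between the two.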

We have finally reached the final form of Bayes’ theorem, P(P|X) = P(X|P) × P(P) / P(X). The posterior of P can now be used to draw probability intervals or simulate new roulette draws. And there, we moved from a frequentist perspective to a fully-fledged Bayesian one.

Note that in this one example there was a single datum, the number of successes in a total of ten trials. We will see that with multiple data, the single-datum likelihoods and the prior probabilities are all multiplied together. Moreover, when multiple parameters enter the model, the separate priors are all multiplied together as well. This might help with digesting the following example. In any case, remember it all goes into the numerator of the theorem.
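A minimal sketch of the multiple-data case, using three hypothetical roulette outings (8, 7 and 9 successes out of 10 each, for illustration only): the joint likelihood over the grid is the product of the single-datum likelihoods, or equivalently the exponentiated sum of their logs.

```r
rangeP <- seq(0, 1, length.out = 100)  # redefined so the snippet stands alone
obs <- c(8, 7, 9)  # hypothetical repeated draws, for illustration

# Product of single-datum likelihoods...
jointLik <- sapply(rangeP, function(p) prod(dbinom(obs, size = 10, prob = p)))

# ...matches the sum of log-likelihoods, which is numerically safer
jointLogLik <- sapply(rangeP, function(p) sum(dbinom(obs, size = 10, prob = p, log = TRUE)))
```

The prior then multiplies this joint likelihood exactly as in the single-datum case.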

Now it is time to step up to a more sophisticated analysis involving not one, but two parameters.

Simulation

We will quickly cover all three steps in a simple simulation, demonstrating inference over the two parameters μ and σ of a normal distribution. The purpose of this example is two-fold: i) to make clear that adding more and more parameters makes posterior estimation increasingly inefficient under the grid approximation, and ii) to showcase the ability of Bayesian models to capture the true underlying parameters.

Take a sample of 100 observations from the distribution N(5, 2). Only you and I know the true parameters, μ = 5 and σ = 2. You can then use this sample to recover the original parameters using the following Bayesian pseudo-model,

X ~ N(μ, σ)
μ ~ N(0, 5)
σ ~ Exp(1)

with the last two lines corresponding to the priors of μ and σ, respectively. All you need is

1. To make all possible combinations of 200 values of μ spanning 0 to 10 with 200 values of σ spanning 1 to 3. These are the candidate estimates of both μ and σ to explain how the sample above was generated;
2. Compute the likelihood for every one of the 200 × 200 combinations (grid approximation). This amounts to ∏_j P(x_j | μ_i, σ_i), with i and j indexing parameter combinations and data points, respectively, or in log-scale ∑_j log P(x_j | μ_i, σ_i), which turns out to be a lot easier in R;
3. Compute the product between the likelihood (or the sum, in log-scale) and the corresponding priors of μ, N(0, 5), and of σ, Exp(1);
4. Standardise the resulting product and recover the original units if using log-scale. This standardisation, as you will note, divides the product of prior and likelihood distributions by its maximum value, unlike the total-density standardisation mentioned earlier. This is a more pragmatic way of obtaining probability values later used in posterior sampling.

We now have a joint posterior distribution of μ and σ that can be sampled from. How closely does a sample of size 1,000 match the true parameters, μ = 5 and σ = 2?

# Define real pars mu and sigma, sample 100x
trueMu <- 5
trueSig <- 2
set.seed(100)
randomSample <- rnorm(100, trueMu, trueSig)

# Grid approximation, mu %in% [0, 10] and sigma %in% [1, 3]
grid <- expand.grid(mu = seq(0, 10, length.out = 200),
                    sigma = seq(1, 3, length.out = 200))

# Compute likelihood
lik <- sapply(1:nrow(grid), function(x){
  sum(dnorm(x = randomSample, mean = grid$mu[x],
            sd = grid$sigma[x], log = T))
})

# Multiply (sum logs) likelihood and priors
prod <- lik + dnorm(grid$mu, mean = 0, sd = 5, log = T) +
  dexp(grid$sigma, 1, log = T)

# Standardize the lik x prior products to sum up to 1, recover unit
prob <- exp(prod - max(prod))

# Sample from posterior dist of mu and sigma, plot
postSample <- sample(1:nrow(grid), size = 1e3, prob = prob)
plot(grid$mu[postSample], grid$sigma[postSample],
     xlab = "Mu", ylab = "Sigma",
     pch = 16, col = rgb(0, 0, 0, .2))
abline(v = trueMu, h = trueSig, col = "red", lty = 2)

The figure above displays a sample of size 1,000 from the joint posterior distribution P(μ, σ | X). The true parameter values are highlighted by red dashed lines on the corresponding axes. Considering the use of a zero-centred prior for μ, it is satisfying to observe that the true value lands right in the centre of its marginal posterior. An equivalent observation can be made regarding σ. This is essentially the impact of the data on the inference. Try again with smaller sample sizes or more conservative, narrower priors. Can you anticipate the results?

Bayesian models & MCMC

Bayesian models are a departure from what we have seen above in that explanatory variables are plugged in. As in traditional MLE-based models, each explanatory variable is associated with a coefficient, which for consistency we will call a parameter. Because the target outcome is also characterised by a prior and a likelihood, the model approximates the posterior by finding a compromise between all sets of priors and corresponding likelihoods. This is in clear contrast to linear-algebra techniques, such as the QR decomposition used in OLS. Finally, the introduction of link functions widens the range of problems that can be modelled, e.g. binary or multi-label classification, ordinal categorical regression, Poisson regression and binomial regression, to name a few. Such models are commonly called generalised linear models (GLMs).

One of the most attractive features of Bayesian models is that uncertainty with respect to the model parameters trickles down all the way to the target outcome level. No residuals. Even the uncertainty associated with outcome measurement error can be accounted for, if you suspect there is some.

So, why the current hype around Bayesian models? To a great extent, the major limitation to Bayesian inference has historically been posterior sampling. For most models, the analytical solution to the posterior distribution is intractable, if not impossible. Numerical methods, such as the grid approximation introduced above, might give a crude approximation. The inclusion of more parameters and different distribution families, though, has made the alternative Markov chain Monte Carlo (MCMC) sampling methods the method of choice.

However, that comes with a heavy computational burden. In high-dimensional settings, the heuristic MCMC methods chart the multivariate posterior by jumping from point to point. Those jumps are random in direction, but – and here is the catch – they make smaller jumps the larger the density in current coordinates. As a consequence, they focus on the maxima of the joint posterior distribution, adding enough resolution to reconstruct it sufficiently well. The issue is that every single jump requires updating everything, and everything interacts with everything. Hence, posterior approximation has always been the main obstacle to scaling up Bayesian methods to larger dimensions. The revival of MCMC methods in recent years is largely due to the advent of more powerful machines and efficient frameworks we will soon explore.

Metropolis-Hastings, Gibbs and Hamiltonian Monte Carlo (HMC) are some of the most popular MCMC methods. From these we will be working with HMC, widely regarded as the most robust and efficient.
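To make the jumping mechanism concrete, here is a toy random-walk Metropolis sampler (far simpler than HMC, but built on the same principle of density-guided jumps) applied to the roulette posterior from earlier. The proposal standard deviation of 0.05 is an arbitrary choice for this sketch.

```r
# Log-posterior of the roulette example: Binomial likelihood x N(.5, .1) prior
logPost <- function(p) {
  if (p <= 0 || p >= 1) return(-Inf)  # reject proposals outside (0, 1)
  dbinom(8, size = 10, prob = p, log = TRUE) + dnorm(p, .5, .1, log = TRUE)
}

set.seed(1)
chain <- numeric(5000)
chain[1] <- .5
for (i in 2:5000) {
  proposal <- chain[i - 1] + rnorm(1, 0, .05)  # random jump
  # Accept with probability min(1, posterior ratio), on the log-scale
  accept <- log(runif(1)) < logPost(proposal) - logPost(chain[i - 1])
  chain[i] <- if (accept) proposal else chain[i - 1]
}
mean(chain)  # sits between the prior mean (.5) and the MLE (.8)
```

Each jump only needs the unstandardised posterior, which is precisely why MCMC sidesteps the intractable denominator.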

greta vs. rethinking

Both greta and rethinking are popular R packages for conducting Bayesian inference, and they complement each other. I find it unfair to pit the two against each other, and hope future releases will further enhance their compatibility. In any case, here is my impression of their pros and cons at the time of writing:

Imputation

Missing value imputation is only available in rethinking. This is an important yet simple feature, as in Bayesian models imputation works just like parameter sampling. It goes without saying that it helps rescue additional information otherwise unavailable. As we will see, the cuckoo reproductive output data contain a large number of NAs.

Syntax

The syntax of rethinking and greta is very different. Because greta relies on TensorFlow, every model operation must be fully supported by tensors, meaning custom operations require hand-coded functions built from TensorFlow primitives. Moreover, greta models are built bottom-up, whereas rethinking models are built top-down. I find the top-down flow slightly more intuitive and compact.

Models available

When it comes to model types, the two packages offer different options. I noticed some models from rethinking are currently unavailable in greta and vice versa. Nick Golding, one of the maintainers of greta, was kind enough to implement an ordinal categorical regression upon my forum inquiry. Also, and relevant to what we are doing next, zero-inflated Poisson regression is not available in greta. Overall this does not mean greta has less to choose from. If you are interested in reading more, refer to the corresponding CRAN documentation.

HMC sampling

The two packages deploy HMC sampling supported by two of the most powerful libraries out there. Stan (also discussed in Richard’s book) is a statistical programming language famous for its MCMC framework. It has been around for a while and was eventually adapted to R via Rstan, which is implemented in C++. TensorFlow, on the other hand, is far more recent. Its cousin, TensorFlow Probability is a rich resource for Bayesian analysis. Both TensorFlow libraries efficiently distribute computation in your CPU and GPU, resulting in substantial speedups.

Plotting methods

The two packages come with different visualisation tools. For posterior distributions, I preferred the bayesplot support for greta, whilst for simulation and counterfactual plots, I resorted to the more flexible rethinking plotting functions.

Let’s get started with R

Time to put all into practice using the rethinking and greta R packages. We will be looking into the data from a study by Riehl et al. 2019 [1] on female reproductive output in Crotophaga major, also known as the Greater Ani cuckoo.

The females of this cuckoo species display both cooperative and parasitic nesting behaviour. At the start of the season, females are more likely to engage in cooperative nesting than in either solitary nesting or parasitism. While cooperative nesting involves sharing nests between two or more females, parasitism involves laying eggs in conspecific nests, to be cared for by the hosts. However, parasitic behaviour can be favoured under certain conditions, such as nest predation. This leads us to three main hypotheses for why Greater Ani females engage in parasitism:

• The ‘super mother’ hypothesis, whereby females simply have too many eggs for their own nest, therefore parasitising other nests;
• The ‘specialised parasites’ hypothesis, whereby females engage in a lifelong parasitic behaviour;
• The ‘last resort’ hypothesis, whereby parasitic behaviour is elicited after own nest or egg losses, such as by nest predation.

This study found the strongest support for the third hypothesis. The authors fitted a mixed-effects logistic regression of parasitic behaviour, using both female and group identities as nested random effects. Among the key findings:

• Parasitic females laid more eggs than solely cooperative females;
• Parasitic eggs were significantly smaller than non-parasitic eggs;
• Loss rate was higher for parasitic eggs compared to non-parasitic ones, presumably due to host rejection;
• Exclusive cooperative behaviour and a mixed strategy between cooperative and parasitic behaviours yielded similar numbers in fledged offspring.

If you find the paper overly technical or cannot access it, you can read more about it in the corresponding News and Views piece, an open-access synthesis article from Nature.

The variables are the following:

• Year, year the record was made (2007-2017)
• Female_ID_coded, female unique identifier (n = 209)
• Group_ID_coded, group unique identifier (n = 92)
• Parasite, binary encoding for parasitism (1 / 0)
• Min_age, minimum age (3-13 years)
• Successful, binary encoding for reproductive success / fledging (1 / 0)
• Group_size, number of individuals per group (2-8)
• Mean_eggsize, mean egg size (24-37g)
• Eggs_laid, number of eggs laid (1-12)
• Eggs_incu, number of eggs incubated (0-9)
• Eggs_hatch, number of eggs hatched (0-5)
• Eggs_fledged, number of eggs fledged (0-5)

For modelling purposes, some of these variables will be mean-centered and scaled to unit variance. These will be subsequently identified using the Z suffix.

From the whole dataset, only 57% of the records are complete. Missing values are present in Mean_eggsize (40.6%), Eggs_laid (14.8%), Eggs_incu (10.7%), Eggs_hatch (9.2%), Eggs_fledged (5.2%), Group_size (4.1%) and Successful (1.8%). Needless to say, laid, incubated, hatched and fledged egg counts, in this order, cover successive reproductive stages and are all inter-dependent. Naturally, there is a carry-over of egg losses that impacts the counts in successive stages.

The following models will incorporate main intercepts, varying intercepts (also known as random effects), multiple effects, interaction terms and imputation of missing values. Hopefully the definitions are sufficiently clear.

Zero-inflated Poisson regression of fledged egg counts

I found modelling the number of eggs fledged an interesting problem. As a presumable count variable, Eggs_fledged could be considered Poisson-distributed. However, if you summarise the counts you will note there is an excessively large number of zeros for a Poisson variable. This is unsurprising, since many eggs do not make it through, as detailed above. Fortunately, the zero-inflated Poisson (ZIPoisson) regression available from rethinking accommodates an additional probability parameter p from a binomial distribution, which relocates part of the zero counts out of the Poisson component.
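To see the zero-inflation mechanism at work, here is a quick simulation with illustrative values p = 0.3 and λ = 2 (arbitrary choices for this sketch, not estimates from the data):

```r
# With probability p the count is forced to zero; otherwise it is Poisson
set.seed(42)
n <- 1e4
p <- .3
lambda <- 2
zip <- ifelse(rbinom(n, 1, p) == 1, 0, rpois(n, lambda))

mean(zip == 0)              # observed zero rate, well above plain Poisson
p + (1 - p) * exp(-lambda)  # theoretical ZIP zero probability
exp(-lambda)                # a plain Poisson's zero probability, for contrast
```

The extra mass at zero is exactly what the ZIPoisson's binomial component absorbs.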

Load the relevant tab from the spreadsheet (“Female Reproductive Output”) and discard records with missing counts in Eggs_fledged. You should have a total of 575 records. The remaining missing values will be imputed by the model. Then, re-encode Female_ID_coded, Group_ID_coded and Year. This is because grouping factors must be numbered in order to incorporate varying intercepts with rethinking. This will help us rule out systematic differences among females, groups and years from the main effects. Finally, add the standardised versions of Min_age, Group_size and Mean_eggsize to the dataset.

(allTabs <- excel_sheets("data.xlsx")) # list tabs

# Read female reproductive output, discard NAs in Eggs_fledged
fro <- read_xlsx("data.xlsx", sheet = allTabs[2]) %>%
  filter(!is.na(Eggs_fledged)) %>%
  as.data.frame()

fro <- fro %>%
  mutate(female_id = as.integer(factor(Female_ID_coded)),
         year_id = as.integer(factor(Year)),
         group_id = as.integer(factor(Group_ID_coded)),
         Min_age_Z = scale(Min_age),
         Group_size_Z = scale(Group_size),
         Mean_eggsize_Z = scale(Mean_eggsize))

The log-link is a convenient way to restrict the model rate λ of a Poisson regression to non-negative values. The logit-link, on the other hand, can be used to restrict model probabilities to values between zero and one. The logit-link is reverted by the logistic function. Here is my proposed model of fledged egg counts:

In terms of code, this is how it looks. The HMC will be run with 5,000 iterations, 1,000 of which for warmup, over four independent chains, each on its own CPU core. Finally, the precis call shows the 95% highest posterior density interval (HPDI) of all marginal posterior distributions. You can visualise these using plot(precis(...)).

eggsFMod <- map2stan(alist(
  Eggs_fledged ~ dzipois(p, lambda),
  logit(p) <- ap,
  log(lambda) <- a + a_fem[female_id] + a_year[year_id] + a_group[group_id] +
    Parasite*bP + Min_age_Z*bA + Group_size_Z*bGS +
    Mean_eggsize_Z*bES + Parasite*Min_age_Z*bPA,
  Group_size_Z ~ dnorm(0, 3),
  Mean_eggsize_Z ~ dnorm(0, 3),
  a_fem[female_id] ~ dnorm(0, sigma1),
  a_year[year_id] ~ dnorm(0, sigma2),
  a_group[group_id] ~ dnorm(0, sigma3),
  c(sigma1, sigma2, sigma3) ~ dcauchy(0, 1),
  c(ap, a) ~ dnorm(0, 3),
  c(bP, bA, bGS, bES, bPA) ~ dnorm(0, 2)),
  data = fro, iter = 5e3, warmup = 1e3,
  chains = 4, cores = 4)

# Check posterior dists
precis(eggsFMod, prob = .95) # use depth = 2 for varying intercepts

There are no discernible effects on the number of fledged eggs, as zero can be found inside all reported 95% HPDIs. The effect of parasitism (bP) is slightly negative on the log-scale, which suggests an overall modest reduction in the Poisson rate.
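Because the linear predictor lives on the log-scale, a coefficient translates into a multiplicative change in λ once exponentiated. A quick sketch with a hypothetical coefficient value of -0.2 (for illustration only, not the fitted estimate):

```r
# Hypothetical log-scale effect, for illustration only
bP <- -0.2
exp(bP)             # multiplicative rate change: a ~18% reduction

baseRate <- 1.5     # some hypothetical baseline Poisson rate lambda
baseRate * exp(bP)  # rate with the effect applied
```

This is why a "slightly negative" coefficient on the log-scale reads as a modest percentage drop in the expected count.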

The following code will extract a sample of size 16,000 from the joint posterior of the model we have just created. The samples of ap, in particular, will be passed to the logistic function to recover the respective probabilities p.

Then, the code produces a counterfactual plot from Poisson rate predictions. We will use counterfactual predictions to compare parasitic and non-parasitic females of varying age, with everything else held fixed. It tells us how many fledged eggs, on average, some hypothetical females produce in the eyes of the model. For convenience, we consider the ‘average’ female, with average mean egg size and average group size, parasitic or not, and with varying standardised age. The calculations from the marginal parameter posteriors are straightforward. Finally, a simple exponentiation returns the predicted Poisson rates.

In summary, from the joint posterior sample of size 16,000 we i) passed the marginal posterior of ap to the logistic function to obtain the corresponding probabilities, and ii) predicted λ from the marginal posteriors of its constituent parameters by plugging in hand-picked values.

We can now examine the distribution of the sampled probabilities and predicted Poisson rates. In the case of the latter, both the mean and the 95% HPDI over a range of standardised age need to be calculated.

# Sample posterior
post <- extract.samples(eggsFMod)

# PI of P(no clutch at all)
dens(logistic(post$ap), show.HPDI = T, xlab = "ZIP Bernoulli(p)")

# Run simulations w/ averages of all predictors, except parasite 0 / 1
lambdaNoP <- exp(post$a + 0*post$bP + 0*post$bA + 0*post$bGS +
                   0*post$bES + 0*0*post$bPA)
simFledgeNoPar <- rpois(n = length(lambdaNoP), lambda = lambdaNoP)
lambdaP <- exp(post$a + 1*post$bP + 0*post$bA + 0*post$bGS +
                 0*post$bES + 1*0*post$bPA)
simFledgePar <- rpois(n = length(lambdaP), lambda = lambdaP)
table(simFledgeNoPar)
table(simFledgePar)

# Simulate with varying age
rangeA <- seq(-3, 3, length.out = 100)

# No parasite
predictions <- sapply(rangeA, function(x){
  exp(post$a + 0*post$bP + x*post$bA + 0*post$bGS +
        0*post$bES + 0*x*post$bPA)
})
hdpiPois <- apply(predictions, 2, HPDI, prob = .95)
meanPois <- colMeans(predictions)
plot(rangeA, meanPois, type = "l", ylim = c(0, 3), yaxp = c(0, 3, 3),
     xlab = "Min Age (standardized)", ylab = expression(lambda))
shade(hdpiPois, rangeA)

# Parasite
predictionsP <- sapply(rangeA, function(x){
  exp(post$a + 1*post$bP + x*post$bA + 0*post$bGS +
        0*post$bES + x*post$bPA)
})
hdpiPoisP <- apply(predictionsP, 2, HPDI, prob = .95)
meanPoisP <- colMeans(predictionsP)
lines(rangeA, meanPoisP, lty = 2, col = "red")
shade(hdpiPoisP, rangeA, col = rgb(1,0,0,.25))

The left panel shows the posterior probability distribution of p, the parameter that goes into the binomial component of the model. It spans the interval between 0.20 and 0.50. Take it as the probability of producing a zero instead of following a Poisson distribution in any single Bernoulli trial. We now turn to the counterfactual plot in the right panel. If for a moment we denote predictions made assuming parasitic and non-parasitic behaviour as λ_P and λ_NP, respectively, then the plot shows the mean λ_NP as a full black line, with the dark grey shading representing the 95% HPDI of λ_NP, and the mean λ_P as a dashed red line, with the light red shading representing the 95% HPDI of λ_P.
It seems that the age of a non-parasitic ‘average’ female does not associate with major changes in the number of fledged eggs, whereas the parasitic ‘average’ female does seem to have a modest increase the older she is. Note how uncertainty increases at more extreme ages, towards both sides of the panel. From my perspective, parasitic and non-parasitic C. major females are indistinguishable with respect to fledged egg counts over most of their reproductive life.

Poisson regression of laid egg counts

The following model, also based on rethinking, aims at predicting laid egg counts instead. Unlike Eggs_fledged, Eggs_laid has a minimum value of one, whose relative frequency is small, unlike the zero-inflated situation we met before. The problem, however, is that we need to further filter the records to clear the missing values left in Eggs_laid. You should be left with 514 records in total. For consistency, re-standardise the variables standardised in the previous exercise.

# Try Eggs_laid ~ dpois
froReduced <- fro %>%
  filter(!is.na(Eggs_laid)) %>%
  as.data.frame()

# Re-do the variable scaling, otherwise the sampling throws an error
froReduced <- froReduced %>%
  mutate(female_id = as.integer(factor(Female_ID_coded)),
         year_id = as.integer(factor(Year)),
         group_id = as.integer(factor(Group_ID_coded)),
         Min_age_Z = scale(Min_age),
         Group_size_Z = scale(Group_size),
         Mean_eggsize_Z = scale(Mean_eggsize))

Except for the target outcome, the model is identical to the Poisson component of the previous ZIPoisson regression:

And the corresponding implementation, with the same HMC sampling settings as before.
eggsLMod <- map2stan(alist(
  Eggs_laid ~ dpois(lambda),
  log(lambda) <- a + a_fem[female_id] + a_year[year_id] + a_group[group_id] +
    Parasite*bP + Min_age_Z*bA + Group_size_Z*bGS +
    Mean_eggsize_Z*bES + Parasite*Min_age_Z*bPA,
  Group_size_Z ~ dnorm(0, 3),
  Mean_eggsize_Z ~ dnorm(0, 3),
  a_fem[female_id] ~ dnorm(0, sigma1),
  a_year[year_id] ~ dnorm(0, sigma2),
  a_group[group_id] ~ dnorm(0, sigma3),
  c(sigma1, sigma2, sigma3) ~ dcauchy(0, 1),
  a ~ dnorm(0, 3),
  c(bP, bA, bGS, bES, bPA) ~ dnorm(0, 2)),
  data = froReduced, iter = 5e3, warmup = 1e3,
  chains = 4, cores = 4)

# Check posterior dists
precis(eggsLMod, prob = .95) # use depth = 2 for varying intercepts

Here, you will note that the 95% HPDI of the bP posterior sits entirely to the left of zero, suggesting an overall negative effect of parasitism on the number of eggs laid. The interaction term bPA, too, displays a strong negative effect. Now apply the same recipe as above: produce a sample of size 16,000 from the joint posterior; predict Poisson rates for the ‘average’ female, parasitic or not, with varying standardised age; exponentiate the calculations to retrieve the predicted λ; compute the mean and 95% HPDI of the predicted rates over the range of standardised age.
# Sample posterior
post <- extract.samples(eggsLMod)

# Run simulations w/ averages of all predictors, except parasite 0 / 1
lambdaNoP <- exp(post$a + 0*post$bP + 0*post$bA + 0*post$bGS +
                   0*post$bES + 0*0*post$bPA)
simFledgeNoPar <- rpois(n = length(lambdaNoP), lambda = lambdaNoP)
lambdaP <- exp(post$a + 1*post$bP + 0*post$bA + 0*post$bGS +
                 0*post$bES + 1*0*post$bPA)
simFledgePar <- rpois(n = length(lambdaP), lambda = lambdaP)
table(simFledgeNoPar)
table(simFledgePar)

# Sim with varying age
# No parasite
predictions <- sapply(rangeA, function(x){
  exp(post$a + 0*post$bP + x*post$bA + 0*post$bGS +
        0*post$bES + 0*x*post$bPA)
})
hdpiPois <- apply(predictions, 2, HPDI, prob = .95)
meanPois <- colMeans(predictions)
plot(rangeA, meanPois, type = "l", ylim = c(0, 7), yaxp = c(0, 7, 7),
     xlab = "Min Age (standardized)", ylab = expression(lambda))
shade(hdpiPois, rangeA)

# Parasite
predictionsP <- sapply(rangeA, function(x){
  exp(post$a + 1*post$bP + x*post$bA + 0*post$bGS +
        0*post$bES + x*post$bPA)
})
hdpiPoisP <- apply(predictionsP, 2, HPDI, prob = .95)
meanPoisP <- colMeans(predictionsP)
lines(rangeA, meanPoisP, lty = 2, col = "red")
shade(hdpiPoisP, rangeA, col = rgb(1,0,0,.25))

The colour scheme is the same: the mean λ_NP is shown as a full black line, with the dark grey shading representing its 95% HPDI, and the mean λ_P as a dashed red line, with the light red shading representing its 95% HPDI. Compared to the previous one, this counterfactual plot displays a starker contrast between parasitising and non-parasitising females. The young ‘average’ parasitic female lays more eggs than the young ‘average’ non-parasitic female, and this difference seems to reverse with age, i.e. the old ‘average’ parasitic female lays fewer eggs than the old ‘average’ non-parasitic female. Essentially, whilst strictly cooperative females have a constant clutch size over their reproductive life, parasitic behaviour leads to a steady decline the older the female bird is.
This time you will go one step further and simulate laid egg counts from the predicted λ_P, with varying age. In a nutshell, you established above the mean predicted λ_P and the corresponding 95% HPDI, and now those rate predictions will be used to sample counts from the corresponding Poisson distributions. To make interpretation easier, plot the mean and 95% HPDI over the same range as above. Then, simply overlay the 95% HPDI region of the resulting sampled laid egg counts.

# Bonus! Sample counts from predictionsP, take 95% HPDI
hdpiPoisP <- apply(predictionsP, 2, HPDI, prob = .95)
meanPoisP <- colMeans(predictionsP)
plot(rangeA, meanPoisP, type = "l", ylim = c(0, 15), yaxp = c(0, 15, 5),
     xlab = "Min Age (standardized)",
     ylab = expression(paste(lambda, " / no. eggs laid")))
shade(hdpiPoisP, rangeA)
poisample <- sapply(1:100, function(k){
  rpois(nrow(predictionsP), predictionsP[,k])
})
hdpiSample <- apply(poisample, 2, HPDI, prob = .95)
shade(hdpiSample, rangeA)

The mean λ_P is shown as a full line, the dark grey shading represents the 95% HPDI of λ_P, and the light grey shading represents the 95% HPDI of the resulting count samples.

Logistic regression of female reproductive success

We will now turn to a logistic regression of female reproductive success, using greta. Since greta limits the input to complete cases, we need to select complete records, which gives a total of 346 records. Also, this being a different model, I used a different set of explanatory variables. The pre-processing, as you will note, is very much in line with that for the previous models.
library(tensorflow)
use_condaenv("greta")
library(greta)
library(tidyverse)
library(bayesplot)
library(readxl)

# Read female reproductive output and discard records w/ NAs
fro <- read_xlsx("data.xlsx", sheet = allTabs[2])
fro <- fro[complete.cases(fro),]

# Use cross-classified varying intercepts for year, female ID and group ID
female_id <- as.integer(factor(fro$Female_ID_coded))
year <- as.integer(factor(fro$Year))
group_id <- as.integer(factor(fro$Group_ID_coded))

# Define and standardize model vars
Age <- as_data(scale(fro$Min_age))
Eggs_laid <- as_data(scale(fro$Eggs_laid))
Mean_eggsize <- as_data(scale(fro$Mean_eggsize))
Group_size <- as_data(scale(fro$Group_size))
Parasite <- as_data(fro$Parasite)

To be consistent, I have again re-encoded Female_ID_coded, Group_ID_coded and Year, as done with the rethinking models above. The logistic regression is defined as follows:

And finally the model implementation. We will once again produce a posterior sample of size 16,000, split over four chains running on up to ten CPU cores, with 1,000 iterations for warmup. Note the difference in how Female_ID_coded, Group_ID_coded and Year are incorporated as varying intercepts.
# Define model effects
sigmaML <- cauchy(0, 1, truncation = c(0, Inf), dim = 3)
a_fem <- normal(0, sigmaML[1], dim = max(female_id))
a_year <- normal(0, sigmaML[2], dim = max(year))
a_group <- normal(0, sigmaML[3], dim = max(group_id))
a <- normal(0, 5)
bA <- normal(0, 3)
bEL <- normal(0, 3)
bES <- normal(0, 3)
bGS <- normal(0, 3)
bP <- normal(0, 3)
bPA <- normal(0, 3)

# Model setup
mu <- a + a_fem[female_id] + a_year[year] + a_group[group_id] +
  Age*bA + Eggs_laid*bEL + Mean_eggsize*bES +
  Parasite*bP + Group_size*bGS + Parasite*Age*bPA
p <- ilogit(mu)
distribution(fro$Successful) <- bernoulli(p)
cuckooModel <- model(a, bA, bEL, bES, bP, bGS, bPA)

# Plot
plot(cuckooModel)

# HMC sampling
draws <- mcmc(cuckooModel, n_samples = 4000, warmup = 1000,
              chains = 4, n_cores = 10)

# Trace plots
mcmc_trace(draws)

# Parameter posterior
mcmc_intervals(draws, prob = .95)

We can visualise the marginal posteriors via bayesplot::mcmc_intervals or, alternatively, bayesplot::mcmc_areas. The figure above reflects a case similar to the ZIPoisson model: while parasitism displays a clear negative effect on reproductive success, note how strongly it interacts with age to improve reproductive success.

Proceeding to the usual counterfactual plot, note that the estimates above are on the logit-scale, so we need the logistic function once again to recover the probability values.

# Simulation with average eggs laid, egg size and group size, w/ and w/o parasitism
seqX <- seq(-3, 3, length.out = 100)
probsNoPar <- sapply(seqX, function(x){
  scenario <- ilogit(a + x*bA)
  probs <- calculate(scenario, draws)
  return(unlist(probs))
})
probsPar <- sapply(seqX, function(x){
  scenario <- ilogit(a + x*bA + bP + x*bPA)
  probs <- calculate(scenario, draws)
  return(unlist(probs))
})
plot(seqX, apply(probsNoPar, 2, mean), type = "l", ylim = 0:1,
     xlab = "Min age (standardized)", ylab = "P(Successful)",
     yaxp = c(0, 1, 2))
rethinking::shade(apply(probsNoPar, 2, rethinking::HPDI, prob = .95), seqX)
lines(seqX, apply(probsPar, 2, mean), lty = 2, col = "red")
rethinking::shade(apply(probsPar, 2, rethinking::HPDI, prob = .95), seqX,
                  col = rgb(1,0,0,.25))

As with the previous predicted Poisson rates, the mean success probability for non-parasitic females is shown as a full black line, with the dark grey shading representing its 95% HPDI, and the mean for parasitic females as a dashed red line, with the light red shading representing its 95% HPDI. This new counterfactual plot shows how parasitic females tend to be more successful the older they are, compared to non-parasitic females. Nonetheless, one could argue the increase in uncertainty makes this a weak case.

Wrap-up

That was it. You should now have a basic idea of Bayesian models and of the inherent probabilistic inference that prevents the misuse of hypothesis testing, commonly referred to as P-hacking in many scientific areas. Altogether, the models above suggest that

• Parasitism in older C. major females has a modest, positive contribution to the number of fledged eggs. The minor negative effect of parasitism is counterbalanced by its interaction with age, which displays a small positive effect on the number of fledged eggs. Non-parasitic females, on the other hand, show no discernible change in fledged egg counts with varying age. These observations are supported by the corresponding counterfactual plot;
• In the case of laid eggs, there seems to be a strong negative effect exerted by both parasitism and its interaction with age. The counterfactual plot shows that whereas young parasitic females lay more eggs than young non-parasitic females, there is a turning point in age where parasitism leads to comparatively fewer eggs. The additional simulation of laid egg counts further supports this last observation;
• Notably, reproductive success seems to be also affected by the interaction between age and parasitism status. The results are similar to that from the ZIPoisson model, showing an increase in success probability with increasing age that outpaces that for non-parasitic females.

Thus, relative to non-parasitic females, the older the parasitic females the fewer eggs laid, and vice versa; yet, the older the parasitic females, the more eggs fledged. Reproductive success also seems to be boosted by parasitism in older females. There are many plausible explanations for this set of observations, and causation is nowhere implied. There could well be masking effects from unknown factors. Nonetheless, I find it interesting that older parasitising females can render as many or more fledglings from smaller clutches, compared to non-parasitising females. Could older parasitic females simply be more experienced? Could they lay fewer eggs due to nest destruction? How to explain the similar rate of reproductive success?

Much more could be done, and I am listing some additional considerations for a more rigorous analysis:

• We haven’t formally addressed model comparison. In most cases, models can be easily compared on the basis of information criteria, such as the deviance information criterion (DIC) and the widely applicable information criterion (WAIC), which assess the trade-off between model fit and the effective number of parameters;
• We haven’t looked into the MCMC chains. During the model sampling, you probably read some warnings regarding divergent transitions and failure to converge. The rethinking package helps you tame the chains by reporting the Gelman-Rubin convergence statistic Rhat and the effective number of samples n_eff whenever you invoke precis. If Rhat is not approximately one or n_eff is very small for any particular parameter, it might warrant careful reparameterisation. User-provided initial values to seed the HMC sampling can also help;
• We haven’t looked at measurement error or over-dispersed outcomes. These come in handy when the target outcome has a very large variance or deviates from theoretical distributions;
• We haven’t considered mixed versus exclusively cooperative or parasitic behaviour, so any comparison with the original study [1] is unfounded.

Finally, a word of appreciation to Christina Riehl for clarifying some aspects of the dataset, and to Nick Golding for his restless support in the greta forum.

This is me writing up the introduction to this post in Santorini, Greece. My next project will be about analysing Twitter data, so stay tuned. Please leave a comment, a correction or a suggestion!

References

[1] Riehl, Christina and J. Strong, Meghan (2019). Social parasitism as an alternative reproductive tactic in a cooperatively breeding cuckoo. Nature, 567(7746), 96-99.


To leave a comment for the author, please follow the link and comment on their blog: poissonisfish. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Detailed Guide to the Bar Chart in R with ggplot

Wed, 05/01/2019 - 11:40

(This article was first published on Learn R Programming & Build a Data Science Career | Michael Toth, and kindly contributed to R-bloggers)

When it comes to data visualization, flashy graphs can be fun. Believe me, I’m as big a fan of flashy graphs as anybody. But if you’re trying to convey information, especially to a broad audience, flashy isn’t always the way to go.

Whether it’s the line graph, scatter plot, or bar chart (the subject of this guide!), choosing a well-understood and common graph style is usually the way to go for most audiences, most of the time. And if you’re just getting started with your R journey, it’s important to master the basics before complicating things further.

So in this guide, I’m going to talk about creating a bar chart in R. Specifically, I’ll show you exactly how you can use the ggplot geom_bar function to create a bar chart.

A bar chart is a graph that is used to show comparisons across discrete categories. One axis–the x-axis throughout this guide–shows the categories being compared, and the other axis–the y-axis in our case–represents a measured value. The heights of the bars are proportional to the measured values.

For example, in this extremely scientific bar chart, we see the level of life threatening danger for three different actions. All dangerous, to be sure, but I think we can all agree this graph gets things right in showing that Game of Thrones spoilers are most dangerous of all.

Introduction to ggplot

Before diving into the ggplot code to create a bar chart in R, I first want to briefly explain ggplot and why I think it’s the best choice for graphing in R.

ggplot is a package for creating graphs in R, but it’s also a method of thinking about and decomposing complex graphs into logical subunits.

ggplot takes each component of a graph–axes, scales, colors, objects, etc–and allows you to build graphs up sequentially one component at a time. You can then modify each of those components in a way that’s both flexible and user-friendly. When components are unspecified, ggplot uses sensible defaults. This makes ggplot a powerful and flexible tool for creating all kinds of graphs in R. It’s the tool I use to create nearly every graph I make these days, and I think you should use it too!

To accompany this guide, I’ve created a free workbook that you can work through to apply what you’re learning as you read.

The workbook is an R file that contains all the code shown in this post as well as additional guided questions and exercises to help you understand the topic even deeper.

If you want to really learn how to create a bar chart in R so that you’ll still remember weeks or even months from now, you need to practice.

Investigating our dataset

Throughout this guide, we’ll be using the mpg dataset that’s built into ggplot. This dataset contains data on fuel economy for 38 popular car models. Let’s take a look:
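The inspection code itself isn’t reproduced above, but a minimal way to take that look (using base R’s head() and dim()) would be something like this:

```r
library(ggplot2)

# Preview the first rows of the built-in mpg dataset
head(mpg)

# Its dimensions: 234 observations of 11 variables
dim(mpg)
```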

The mpg dataset contains 11 columns:

• manufacturer: Car Manufacturer Name
• model: Car Model Name
• displ: Engine Displacement (liters)
• year: Year of Manufacture
• cyl: Number of Cylinders
• trans: Type of Transmission
• drv: f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
• cty: City Miles per Gallon
• hwy: Highway Miles per Gallon
• fl: Fuel Type
• class: Type of Car

How to create a simple bar chart in R using geom_bar

ggplot uses geoms, or geometric objects, to form the basis of different types of graphs. Previously I have talked about geom_line for line graphs and geom_point for scatter plots. Today I’ll be focusing on geom_bar, which is used to create bar charts in R.

library(tidyverse)

ggplot(mpg) +
  geom_bar(aes(x = class))

Here we are starting with the simplest possible ggplot bar chart we can create using geom_bar. Let’s review this in more detail:

First, we call ggplot, which creates a new ggplot graph. Basically, this creates a blank canvas on which we’ll add our data and graphics. Here we pass mpg to ggplot to indicate that we’ll be using the mpg data for this particular ggplot bar chart.

Next, we add the geom_bar call to the base ggplot graph in order to create this bar chart. In ggplot, you use the + symbol to add new layers to an existing graph. In this second layer, I told ggplot to use class as the x-axis variable for the bar chart.

You’ll note that we don’t specify a y-axis variable here. Later on, I’ll tell you how we can modify the y-axis for a bar chart in R. But for now, just know that if you don’t specify anything, ggplot will automatically count the occurrences of each x-axis category in the dataset, and will display the count on the y-axis.

And that’s it, we have our bar chart! We see that SUVs are the most prevalent in our data, followed by compact and midsize cars.
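You can confirm this reading of the chart without plotting at all; since ggplot is simply counting occurrences of each class, a quick table() reproduces the bar heights:

```r
library(ggplot2)

# Class frequencies, largest first -- suv (62) tops the list,
# followed by compact (47) and midsize (41)
sort(table(mpg$class), decreasing = TRUE)
```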

Changing bar color in a ggplot bar chart

Expanding on this example, let’s change the colors of our bar chart!

ggplot(mpg) + geom_bar(aes(x = class), fill = 'blue')

You’ll note that this geom_bar call is identical to the one before, except that we’ve added the modifier fill = 'blue' to the end of the line. Experiment a bit with different colors to see how this works on your machine. You can use most color names you can think of, or you can use specific hex color codes to get more granular.

If you’re familiar with line graphs and scatter plots in ggplot, you’ve seen that in those cases we changed the color by specifying color = 'blue', while in this case we’re using fill = 'blue'.

In ggplot, color is used to change the outline of an object, while fill is used to fill the inside of an object. For objects like points and lines, there is no inside to fill, so we use color to change the color of those objects. With bar charts, the bars can be filled, so we use fill to change the color with geom_bar.

This distinction between color and fill gets a bit more complex, so stick with me to hear more about how these work with bar charts in ggplot!

Mapping bar color to a variable in a ggplot bar chart

Now, let’s try something a little different. Compare the ggplot code below to the code we just executed above. There are 2 differences. See if you can find them and guess what will happen, then scroll down to take a look at the result. If you’ve read my previous ggplot guides, this bit should look familiar!

ggplot(mpg) + geom_bar(aes(x = class, fill = drv))

This graph shows the same data as before, but now instead of showing solid-colored bars, we now see that the bars are stacked with 3 different colors! The red portion corresponds to 4-wheel drive cars, the green to front-wheel drive cars, and the blue to rear-wheel drive cars. Did you catch the 2 changes we used to change the graph? They were:

1. Instead of specifying fill = 'blue', we specified fill = drv
2. We moved the fill parameter inside of the aes() parentheses

Before, we told ggplot to change the color of the bars to blue by adding fill = 'blue' to our geom_bar() call.

What we’re doing here is a bit more complex. Instead of specifying a single color for our bars, we’re telling ggplot to map the data in the drv column to the fill aesthetic.

This means we are telling ggplot to use a different color for each value of drv in our data! This mapping also lets ggplot know that it also needs to create a legend to identify the drive types, and it places it there automatically!

More Details on Stacked Bar Charts in ggplot

As we saw above, when we map a variable to the fill aesthetic in ggplot, it creates what’s called a stacked bar chart. A stacked bar chart is a variation on the typical bar chart where a bar is divided among a number of different segments.

In this case, we’re dividing the bar chart into segments based on the levels of the drv variable, corresponding to the front-wheel, rear-wheel, and four-wheel drive cars.

For a given class of car, our stacked bar chart makes it easy to see how many of those cars fall into each of the 3 drv categories.

The main flaw of stacked bar charts is that they become harder to read the more segments each bar has, especially when trying to make comparisons across the x-axis (in our case, across car class). To illustrate, let’s take a look at this next example:

# Note we convert the cyl variable to a factor to fill properly
ggplot(mpg) +
  geom_bar(aes(x = class, fill = factor(cyl)))

As you can see, even with four segments it starts to become difficult to make comparisons between the different categories on the x-axis. For example, are there more 6-cylinder minivans or 6-cylinder pickups in our dataset? What about 5-cylinder compacts vs. 5-cylinder subcompacts? With stacked bars, these types of comparisons become challenging. My recommendation is to generally avoid stacked bar charts with more than 3 segments.
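If you do need exact answers to questions like these, it is often easier to tabulate than to squint at stacked segments; a quick cross-tabulation (not part of the original walkthrough) settles them directly:

```r
library(ggplot2)

# Cross-tabulate car class against cylinder count;
# each cell is the height of one stacked segment in the chart above
with(mpg, table(class, cyl))
```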

Dodged Bars in ggplot

Instead of stacked bars, we can use side-by-side (dodged) bar charts. In ggplot, this is accomplished by using the position = position_dodge() argument as follows:

# Note we convert the cyl variable to a factor here in order to fill by cylinder
ggplot(mpg) +
  geom_bar(aes(x = class, fill = factor(cyl)),
           position = position_dodge(preserve = 'single'))

Now, the different segments for each class are placed side-by-side instead of stacked on top of each other.

Revisiting the comparisons from before, we can quickly see that there are an equal number of 6-cylinder minivans and 6-cylinder pickups. There are also an equal number of 5-cylinder compacts and subcompacts.

While these comparisons are easier with a dodged bar graph, comparing the total count of cars in each class is far more difficult.

Which brings us to a general point: different graphs serve different purposes! You shouldn’t try to accomplish too much in a single graph. If you’re trying to cram too much information into a single graph, you’ll likely confuse your audience, and they’ll take away exactly none of the information.

Scaling bar size to a variable in your data

Up to now, all of the bar charts we’ve reviewed have scaled the height of the bars based on the count of a variable in the dataset. First we counted the number of vehicles in each class, and then we counted the number of vehicles in each class with each drv type.

What if we don’t want the height of our bars to be based on count? What if we already have a column in our dataset that we want to be used as the y-axis height? Let’s say we wanted to graph the average highway miles per gallon by class of car, for example. How can we do that in ggplot?

There are two ways we can do this, and I’ll be reviewing them both. To start, I’ll introduce stat = 'identity':

# Use dplyr to calculate the average hwy_mpg by class
by_hwy_mpg <- mpg %>%
  group_by(class) %>%
  summarise(hwy_mpg = mean(hwy))

ggplot(by_hwy_mpg) +
  geom_bar(aes(x = class, y = hwy_mpg), stat = 'identity')

Now we see a graph by class of car where the y-axis represents the average highway miles per gallon of each class. How does this work, and how is it different from what we had before?

Before, we did not specify a y-axis variable and instead let ggplot automatically populate the y-axis with a count of our data. Now, we’re explicitly telling ggplot to use hwy_mpg as our y-axis variable. And there’s something else here also: stat = 'identity'. What does that mean?

We saw earlier that if we omit the y-variable, ggplot will automatically scale the heights of the bars to a count of cases in each group on the x-axis. If we instead want the values to come from a column in our data frame, we need to change two things in our geom_bar call:

1. Add stat = 'identity' to geom_bar()
2. Map a variable to the y-axis within aes()

If you try to map a y-variable without also adding stat = 'identity', you’ll get an error. Why the error? If you don’t specify stat = 'identity', then under the hood, ggplot is automatically passing a default value of stat = 'count', which graphs the counts by group. A y-variable is not compatible with this, so you get the error message.

If this is confusing, that’s okay. For now, all you need to remember is that if you want to use geom_bar to map the heights of a column in your dataset, you need to add BOTH a y-variable mapping AND stat = 'identity'.

I’ll be honest, this was highly confusing for me for a long time. I hope this guidance helps to clear things up for you, so you don’t have to suffer the same confusion that I did. But if you have a hard time remembering this distinction, ggplot also has a handy function that does this work for you. Instead of using geom_bar with stat = 'identity', you can simply use the geom_col function to get the same result. Let’s see:

# Use dplyr to calculate the average hwy_mpg by class
by_hwy_mpg <- mpg %>%
  group_by(class) %>%
  summarise(hwy_mpg = mean(hwy))

ggplot(by_hwy_mpg) +
  geom_col(aes(x = class, y = hwy_mpg))

You’ll notice the result is the same as the graph we made above, but we’ve replaced geom_bar with geom_col and removed stat = 'identity'. geom_col is the same as geom_bar with stat = 'identity', so you can use whichever you prefer or find easier to understand. For me, I’ve gotten used to geom_bar, so I prefer to use that, but you can do whichever you like!
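If you’d like to convince yourself the two are equivalent, ggplot2’s layer_data() helper exposes the data each plot actually draws; comparing the two plots this way is a quick sanity check (not from the original post), and the results should match:

```r
library(tidyverse)

by_hwy_mpg <- mpg %>%
  group_by(class) %>%
  summarise(hwy_mpg = mean(hwy))

p_bar <- ggplot(by_hwy_mpg) +
  geom_bar(aes(x = class, y = hwy_mpg), stat = 'identity')
p_col <- ggplot(by_hwy_mpg) +
  geom_col(aes(x = class, y = hwy_mpg))

# Compare the computed bar geometry of the two plots
all.equal(layer_data(p_bar), layer_data(p_col))
```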

Revisiting color in geom_bar

Above, we showed how you could change the color of bars in ggplot using the fill option. I mentioned that color is used for line graphs and scatter plots, but that we use fill for bars because we are filling the inside of the bar with color. That said, color does still work here, though it affects only the outline of the graph in question. Take a look:

ggplot(mpg) + geom_bar(aes(x = class), color = 'blue')

This creates a graph with bars filled with the standard gray, but outlined in blue. That outline is what color affects for bar charts in ggplot!

I personally only use color for one specific thing: modifying the outline of a bar chart where I’m already using fill to create a better looking graph with a little extra pop. The standard fill is fine for most purposes, but you can step things up a bit with a carefully selected color outline:

ggplot(mpg) + geom_bar(aes(x = class), fill = '#003366', color = '#add8e6')

It’s subtle, but this graph uses a darker navy blue for the fill of the bars and a lighter blue for the outline that makes the bars pop a little bit.

This is the only time when I use color for bar charts in R. Do you have a use case for this? I’d love to hear it, so let me know in the comments!

A deeper review of aes() (aesthetic) mappings in ggplot

We saw above how we can create graphs in ggplot that use the fill argument to map the cyl variable or the drv variable to the color of bars in a bar chart. ggplot refers to these mappings as aesthetic mappings, and they include everything you see within the aes() in ggplot.

Aesthetic mappings are a way of mapping variables in your data to particular visual properties (aesthetics) of a graph.

I know this can sound a bit theoretical, so let’s review the specific aesthetic mappings you’ve already seen as well as the other mappings available within geom_bar.

Reviewing the list of geom_bar aesthetic mappings

The main aesthetic mappings for a ggplot bar graph include:

• x: Map a variable to a position on the x-axis
• y: Map a variable to a position on the y-axis
• fill: Map a variable to a bar color
• color: Map a variable to a bar outline color
• linetype: Map a variable to a bar outline linetype
• alpha: Map a variable to a bar transparency

From the list above, we’ve already seen the x and fill aesthetic mappings. We’ve also seen color applied as a parameter to change the outline of the bars in the prior example.

I’m not going to review the additional aesthetics in this post, but if you’d like more details, check out the free workbook which includes some examples of these aesthetics in more detail!

Aesthetic mappings vs. parameters in ggplot

I often hear from my R training clients that they are confused by the distinction between aesthetic mappings and parameters in ggplot. Personally, I was quite confused by this when I was first learning about graphing in ggplot as well. Let me try to clear up some of the confusion!

Above, we saw that we could use fill in two different ways with geom_bar. First, we were able to set the color of our bars to blue by specifying fill = 'blue' outside of our aes() mappings. Then, we were able to map the variable drv to the color of our bars by specifying fill = drv inside of our aes() mappings.

What is the difference between these two ways of working with fill and other aesthetic mappings?

When you include fill, color, or another aesthetic inside the aes() of your ggplot code, you’re telling ggplot to map a variable to that aesthetic in your graph. This is what we did when we said fill = drv above to fill different drive types with different colors.

Each of the aesthetic mappings you’ve seen can also be used as a parameter, that is, a fixed value defined outside of the aes() aesthetic mappings. You saw how to do this with fill when we made the bar chart bars blue with fill = 'blue'. You also saw how we could outline the bars with a specific color when we used color = '#add8e6'.

Whenever you’re trying to map a variable in your data to an aesthetic to your graph, you want to specify that inside the aes() function. And whenever you’re trying to hardcode a specific parameter in your graph (making the bars blue, for example), you want to specify that outside the aes() function. I hope this helps to clear up any confusion you have on the distinction between aesthetic mappings and parameters!
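To make the contrast concrete, here are the two placements side by side, using the same mpg data as throughout this guide; only the position of fill relative to aes() differs:

```r
library(ggplot2)

# Aesthetic mapping: fill INSIDE aes() -- one color per drive type, legend added
ggplot(mpg) +
  geom_bar(aes(x = class, fill = drv))

# Parameter: fill OUTSIDE aes() -- a single fixed color, no legend
ggplot(mpg) +
  geom_bar(aes(x = class), fill = 'blue')
```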

Common errors with aesthetic mappings and parameters in ggplot

When I was first learning R and ggplot, this difference between aesthetic mappings (the values included inside your aes()), and parameters (the ones outside your aes()) was constantly confusing me. Luckily, over time, you’ll find that this becomes second nature. But in the meantime, I can help you speed along this process with a few common errors that you can keep an eye out for.

Trying to include aesthetic mappings outside your aes() call

If you’re trying to map the drv variable to fill, you should include fill = drv within the aes() of your geom_bar call. What happens if you include it outside accidentally, and instead run ggplot(mpg) + geom_bar(aes(x = class), fill = drv)? You’ll get an error message that looks like this:

Whenever you see this error about object not found, be sure to check that you’re including your aesthetic mappings inside the aes() call!

Trying to specify parameters inside your aes() call

On the other hand, if we try including a specific parameter value (for example, fill = 'blue') inside of the aes() mapping, the error is a bit less obvious. Take a look:

ggplot(mpg) + geom_bar(aes(x = class, fill = 'blue'))

In this case, ggplot actually does produce a bar chart, but it’s not what we intended.

For starters, the bars in our bar chart are all red instead of the blue we were hoping for! Also, there’s a legend to the side of our bar graph that simply says ‘blue’.

What’s going on here? Under the hood, ggplot has taken the string ‘blue’ and created a new hidden column of data where every value simply says ‘blue’. Then, it’s mapped that column to the fill aesthetic, like we saw before when we specified fill = drv. This results in the legend label being ‘blue’ and the color of all the bars being set, not to blue, but to the default color in ggplot.

If this is confusing, that’s okay for now. Just remember: when you run into issues like this, double check to make sure you’re including the parameters of your graph outside your aes() call!

You should now have a solid understanding of how to create a bar chart in R using the ggplot bar chart function, geom_bar!

I’ve found that working through code on my own is the best way for me to learn new topics so that I’ll actually remember them when I need to do things on my own in the future.


To leave a comment for the author, please follow the link and comment on their blog: Learn R Programming & Build a Data Science Career | Michael Toth.

### Set Analysis: A face off between Venn diagrams and UpSet plots

Wed, 05/01/2019 - 05:37

(This article was first published on Little Miss Data, and kindly contributed to R-bloggers)

It’s time for me to come clean about something; I think Venn diagrams are fun! Yes, that’s right, I like them. They’re pretty, they’re often funny, and they convey the straightforward overlap between a few sets somewhat easily. Because I like making nerd comedy graphs, I considered sharing with y’all how to create Venn diagrams in R. But I couldn’t do that in good conscience without showing an alternative for larger and more complex set analysis. A few weeks ago, when I saw Matthew Hendrickson and Mara Averick’s excitement over the UpSetR plot, I knew what I should do.

Folks, what you are about to witness is a set analysis face off! We will be pairing off Venn diagrams and UpSet plots in a variety of scenarios for a true battle royale. Winner takes all and is able to claim the prize of set analysis master.

Working Environment

For this tutorial, we are going to be using R as our programming language. The entire code is hosted in my github repo, and you can also copy and paste to follow along below. If you are looking to understand your options for an R working environment, I recommend checking out IBM Watson Studio to run hosted R notebooks, or RStudio.

Round 1: Tiny and Fun Set Intersections

Kind folks, this is our warm up. In this round, we will be creating some fun and simple set intersections. Specifically, we will just be creating a very important graph which describes why I love Twitter.

To get started, we are going to install and load the packages required for this tutorial. If you do not have the packages already installed, please uncomment the install.packages() commands by removing the hashtag (#).

Install and Load Packages

# install.packages("rJava")
# install.packages("UpSetR")
# install.packages("tidyverse")
# install.packages("venneuler")
# install.packages("grid")
library(rJava)
library(UpSetR)
library(tidyverse)
library(venneuler)
library(grid)

Format the data

We will create a basic list which specifies the values of each of the circles and their overlap.

# Set the chart data
expressionInput <- c(`#rstats` = 5, memes = 5, `#rstats&memes` = 3)

Create a Venn diagram

To create a simple Venn diagram, you can just pass in the list with the specified set and overlap values into the venneuler() function. The remaining code is just formatting to set the font size, title and subtitle.

# Create the Venn diagram
# Note on setup: this requires the Java v11 JDK (v12 does not work with this)
myExpVenn <- venneuler(expressionInput)
par(cex = 1.2)
plot(myExpVenn, main = "Why I Love Twitter")
grid.text(
  "@littlemissdata",
  x = 0.52, y = 0.2,
  gp = gpar(fontsize = 10, fontface = 3)
)

Create an UpSet Plot

The great thing is that we can also create an UpSet plot using the same basic expression list. You simply pass the expression list through the fromExpression() function into the upset() function. The remaining code is to format the labels and font size.

How to read an UpSet plot: UpSet plots offer a straightforward way for us to view set data by frequency. The horizontal bar chart on the bottom left hand side shows the entire size of each set. In this case, each set is of size 8. The vertical bar chart on the upper right hand side shows the sizes of isolated set participation. In the example, 5 values belong only to the #rstats set, 5 belong only to the memes set, and 3 values belong to both sets.

# Create an UpsetR Plot
upset(fromExpression(expressionInput), order.by = "freq")
grid.text(
  "Why I Love Twitter @littlemissdata",
  x = 0.80, y = 0.05,
  gp = gpar(fontsize = 10, fontface = 3)
)
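As a quick arithmetic check on the "each set is of size 8" claim above, the full size of each set is its exclusive count plus the shared count:

```r
# Same expression list as before; backticks let R accept '#' inside a name
expressionInput <- c(`#rstats` = 5, memes = 5, `#rstats&memes` = 3)

# Full set sizes: exclusive members plus the overlap
rstats_size <- unname(expressionInput["#rstats"] + expressionInput["#rstats&memes"])
memes_size  <- unname(expressionInput["memes"]   + expressionInput["#rstats&memes"])

c(rstats = rstats_size, memes = memes_size)  # both sets total 8
```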

While the UpSet graph is an exciting new addition to our set analysis, I’m going to have to give this round to Venn diagrams. When trying to represent simple and easy to understand information, Venn diagrams are more visually appealing.

Round 2: Complicated Sets

Coming off of the round 1 win, Venn diagram may be feeling quite confident. However, the stakes are getting higher and we need to expect more of our visualizations in this round. We have more sets and interactions to visualize and more data to work with.

Data Introduction

The data is created using the 2017 Toronto Senior Survey from the Toronto Open Data Catalogue. I feel proud that my current city (Austin) and my previous city (Toronto) both have high quality open data catalogs. I feel strongly that data should be available to the people that pay for it.

This data set shows the output of a 2017 senior citizen survey to identify various needs of Toronto’s seniors’ population, in order to better inform decision making. To make our data processing easier, I have stripped down the columns that we will use and have performed a little pre-formatting. Please see below for a data dictionary and outline of what was changed.

Column: Source

• ID: Not previously included. This is a new unique key column.
• physicalActivity: Survey Question: “1. In the past 3 months, how often did you participate in physical activities like walking?”
• physicalActivityPerMonth: Same as physicalActivity, transformed into numerical format.
• volunteerParticipation: Survey Question: “5. During the past 3 months, how often did you participate in volunteer or charity work?”
• volunteerPerMonth: Same as volunteerParticipation, transformed into numerical format.
• difficultFinancial: Survey Question: “9. In the last year, have you had difficulty paying your rent, mortgage, Hydro bill, or other housing costs? For example, have you had to go without groceries to pay for rent or other monthly housing expenses?”
• supportSystem: Survey Question: “13. Do you have people in your life who you can call on for help if you need it?”
• postalCode: Survey Question: “14. What are the first three characters of your postal code?”
• employmentStatus: Survey Question: “15. What is your current employment status?”
• sex: Survey Question: “16. What is your sex/gender?”
• primaryLanguage: Survey Question: “18. In what language(s) would you feel most comfortable to receive services?” (first option listed)
• ageRange: Survey Question: “19. Which age category do you belong to?”
• minAgeRange: Same as ageRange, converted to numerical format, taking the lowest age as the value.

The following columns all come from Survey Question: “6. To get around Toronto, what modes of transportation do you use frequently?”, one column per mode:

• ttcTransportation: [TTC (bus, subway, or streetcar)]
• walkTransportation: [Walk]
• driveTransportation: [Drive]
• cycleTransportation: [Cycle]
• taxiTransportation: [Taxi or Uber]
• communityRideTransportation: [Community Transportation Program, for example Toronto Ride or iRIDE]
• wheelTransTransportation: [Wheel-Trans]
• friendsTransportation: [Rides from family, friends or neighbours]

Bring in the Data

We will start by bringing in the data, replacing the NA’s and renaming the columns for easier display.

rawSets <- read.csv(
  file = "https://raw.githubusercontent.com/lgellis/MiscTutorial/master/sets/seniorTransportation.csv",
  header = TRUE, sep = ",",
  stringsAsFactors = FALSE
)

# Replace the NA's
rawSets[is.na(rawSets)] <- 0

# Rename the columns for easier display
sets <- rawSets %>%
  rename(TTC = ttcTransportation,
         Walk = walkTransportation,
         Drive = driveTransportation,
         Cycle = cycleTransportation,
         Taxi = taxiTransportation,
         `Community Ride` = communityRideTransportation,
         `Wheel Trans` = wheelTransTransportation,
         Friends = friendsTransportation)

dim(sets)
head(sets)

The data comes with the sets in the form of a binary matrix.

Create a Venn Diagram

Now it’s time to create our Venn diagram. The data is currently in the form of a binary matrix, but to pass it into the venneuler() function, we need to get it into a list of set, ID pairs.

# Prep the data for a Venn diagram
vennSets <- sets %>%
  gather(transportation, binary, 6:13) %>% # convert the binary columns into set indicators
  filter(binary == 1) %>% # only include set matches
  select(ID, transportation) %>% # only include ID and set category
  mutate(transportation = factor(transportation)) # set the transportation column as a factor

dim(vennSets)

The data has been transformed to have one set column and one ID column. An ID can be repeated for every set it belongs to.

Create the Venn diagram by passing the data frame into the venneuler() function. The rest of the code is for labelling and formatting.

v <- venneuler(data.frame(vennSets)) # Note that if you need to move the labels around so that they are not overlapping, you can use new line breaks like the example below. # vlabels <- c("TTC", "Walk", "Drive", "Cycle\n\n\n", "\nTaxi", "Community Ride", "Wheel Trans", "Friends") par(cex = 0.7) plot(v, main = "Modes of Senior Transportation (Toronto 2017 Survey)", cex.main = 1.5) grid.text( "@littlemissdata", x = 0.52, y = 0.15, gp = gpar( fontsize = 10, fontface = 3 ) )

Create an UpSet Plot

Create an UpSet plot by passing the original binary matrix into the upset() function. You can specify a number of parameters, as outlined in this very clear vignette, but it also works very well out of the box. Other than the upset() function, the rest of the code is for labels and formatting.

upset(sets, nsets = 10, number.angles = 30, point.size = 3.5, line.size = 2, mainbar.y.label = "Modes of Senior Transportation (Toronto 2017 Survey)", sets.x.label = "Total Participants" ) grid.text( "@littlemissdata", x = 0.90, y = 0.05, gp = gpar( fontsize = 10, fontface = 3 ) )

Unfortunately, I think when the stakes got higher, Venn diagrams just could not keep up. While I think the Venn diagram is quite pretty, I really can't make much sense out of it. The clarity provided by the UpSet plot can't be matched. Round 2 goes to UpSet plots!

Round 3: Explore In-Context Set Information

We are all tied up as we enter round 3, and it's time to raise the stakes. In this round, we want to explore information about other variables in the data set within the context of the sets.

Provide Context with Plot Highlighting

We will start by using colors to highlight specific areas of the graph that we care about.

Highlight Seniors Who Both Walk and Cycle Using "Query=Intersects"

UpSet plots have a very cool parameter called queries. Queries can be used to define a subset of the data that you would like to highlight in your graph.
The queries parameter takes in a list of query lists, which means that you can pass multiple queries into the same graph. Each query list allows you to set a number of properties about how the query should function. In this example we are viewing the Cycle and Walk set intersection (query and params). We want the query to be highlighted in a nice pink (color), displayed as a highlighted overlap (active), and given a name that is added to the chart legend (query.name).

upset(sets, query.legend = "bottom", nsets = 10, number.angles = 30, point.size = 3.5, line.size = 2, mainbar.y.label = "Modes of Senior Transportation (Toronto 2017 Survey)", sets.x.label = "Total Participants", queries = list( list( query = intersects, params = list("Cycle", "Walk"), color = "#Df5286", active = T, query.name = "Physically Active Transportation" ) ) ) grid.text( "@littlemissdata", x = 0.90, y = 0.05, gp = gpar( fontsize = 10, fontface = 3 ) )

Highlight Seniors Who Exercise 1x/Week or Less Using "Query=Elements"

In our next example, we are looking to highlight other data in the data frame within the context of the sets. In the normal UpSet graph, we want to highlight the rows identified as physically active 1x/week or less (query, params) across all sets. We want the query to be highlighted in a nice pink (color), displayed as a highlighted overlap (active), and given a name that is added to the chart legend (query.name).

upset(sets, query.legend = "bottom", nsets = 10, number.angles = 30, point.size = 3.5, line.size = 2, mainbar.y.label = "Modes of Senior Transportation (Toronto 2017 Survey)", sets.x.label = "Total Participants", queries = list( list( query = elements, params = list("physicalActivityPerMonth", 0, 4), color = "#Df5286", active = T, query.name = "Physically Active 1x/Week or Less" ) ) ) grid.text( "@littlemissdata", x = 0.90, y = 0.05, gp = gpar( fontsize = 10, fontface = 3 ) )

Provide Context with Additional Graphs Called "Attribute Plots"

Beyond highlighting within the main UpSet graph, we also have the option of bringing in additional plots which can display information about other variables within the context of the sets.

Display an In-Context Box Plot of Age for Each Set Using the boxplot.summary Parameter

In our next example, we are looking to display a box plot of minAgeRange for every single set. We can do this very easily by just passing in the boxplot.summary parameter with the variable that we would like to summarize.

upset(sets, query.legend = "bottom", nsets = 10, number.angles = 30, point.size = 3.5, line.size = 2, queries = list( list( query = elements, params = list("physicalActivityPerMonth", 0, 4), color = "#Df5286", active = T, query.name = "Physically Active 1x/Week or Less" ) ), boxplot.summary = c("minAgeRange") ) grid.text( "@littlemissdata", x = 0.90, y = 0.05, gp = gpar( fontsize = 10, fontface = 3 ) )

Using "Attribute Plots" Display In-Context Aggregate Statistics for Other Columns

Like queries, UpSet plots also allow you to pass in a list of attribute.plots which can display additional graphs depicting the full data frame within the context of your sets. In the example below, we keep our "Physically Active 1x/Week or Less" query and add three attribute plots: two histograms and a scatter plot. All have been set to also carry the query highlighting throughout these new plots.

upset(sets, query.legend = "bottom", nsets = 10, number.angles = 30, point.size = 3.5, line.size = 2, mainbar.y.label = "Modes of Senior Transportation (Toronto 2017 Survey)", sets.x.label = "Total Participants", queries = list( list( query = elements, params = list("physicalActivityPerMonth", 0, 4), color = "#Df5286", active = T, query.name = "Physically Active 1x/Week or Less" ) ), attribute.plots = list(gridrows = 50, plots = list(list(plot = histogram, x = "volunteerPerMonth", queries = T), list(plot = histogram, x = "minAgeRange", queries = T), list(plot = scatter_plot, x = "minAgeRange", y = "volunteerPerMonth", queries = F) ), ncols = 3 ) ) grid.text( "@littlemissdata", x = 0.9, y = 0.02, gp = gpar( fontsize = 10, fontface = 3 ) )

Display Information About the Categories With "Set Metadata"

Finally, we can use the set.metadata parameter to display aggregate statistics about the core sets. It is quite simple to implement. We start by creating a data frame with summarized set statistics. We need to convert from binary format to list format, and then we aggregate and summarize the variable values grouping by the sets. In this example we are going to display the average physical activity per month of each set.

aggregate <- sets %>% gather(transportation, binary, 6:13) %>% filter(binary == 1) %>% # only include set matches group_by(transportation) %>% # get summary stats per transportation category summarize(physicalActivityPerMonth = mean(physicalActivityPerMonth)) aggregate

Now that the hard part is done, we simply point the set.metadata parameter at the aggregate data set and we are ready to get our set summary data on the bottom left-hand plot.

upset(sets, set.metadata = list(data = aggregate, plots = list( list( type = "hist", column = "physicalActivityPerMonth", assign = 50 ) )))

By now you may be wondering why we haven't been talking about Venn diagrams in round 3.
Simply put, they had to sit this round out. While you could use some creative ideas to display context through color, it really isn't on a comparable level to UpSet charts. As such, Venn diagrams are disqualified and I need to give this round to UpSet charts!

Thank You

Thank you for exploring set analysis visualization options with me. Please comment below if you enjoyed this blog, have questions, or would like to see something different in the future. Note that the full code is available on my github repo. If you have trouble downloading the files or cloning the repo from github, please go to the main page of the repo and select "Clone or Download" and then "Download Zip". Alternatively, you can execute the following R commands to download the whole repo through R:

install.packages("usethis") library(usethis) use_course("https://github.com/lgellis/MiscTutorial/archive/master.zip")

To leave a comment for the author, please follow the link and comment on their blog: Little Miss Data.

### Aviron, course des impressionnistes

Wed, 05/01/2019 - 02:00

(This article was first published on NEONIRA, and kindly contributed to R-bloggers)

On 1 May 2019, I took part in the Course des Impressionnistes under the colours of our club, [CERAMM](https://www.ceramm.fr/). We were entered in the men's coxless quad sculls.
# Our crew

It is made up of 4 men, arranged as follows in the boat:

seat | first name | special role
:---:|:---:|:---
1 | Fabrice | stroke
2 | Fabien |
3 | Javier |
4 | Frédéric | steering

```{r figa, echo = FALSE, out.width='45%', fig.asp=.5, fig.align='center'}
knitr::include_graphics('/images/aviron/a/b.png')
knitr::include_graphics('/images/aviron/a/c.png')
```

# Our boat

It is a **4-** (coxless quad scull) made by Zelani, fitted with hatchet blades.

# Our goal

It is twofold: finish the race, and benchmark ourselves against the competition. Personally, I think we should be able to finish the race in 1:55:00 +/- 00:05:00, which corresponds to 11 km/h +/- 0.5 km/h.

## Raw results

The full [results](/images/aviron/a/Classement_General_21_Km_A.pdf) allow a variety of analyses, even though we are dealing with very limited numerical data. Our raw results are:

ranking | position | comment
---:|:---|:---
overall | 47 / 49 | among the last finishers
category | 9 / 9 | last in our category

## Performance in our category

```{r echo=FALSE}
suppressMessages(require('lubridate', quietly = TRUE))
suppressMessages(require('data.table', quietly = TRUE))
suppressMessages(require('magrittr', quietly = TRUE))
suppressMessages(require('kableExtra', quietly = TRUE))

dt <- data.table(
  equipe = c('CN VERSAILLES', 'CN NOGENT', 'CN MEAUX 1', 'ANFA 1',
             'CSX POLYTECHNIQUE', 'CN MEAUX 2', 'RC PORT-MARLY', 'ANFA 2', 'CERAMM'),
  temps = c('1:27:03', '1:30:59', '1:31:49', '1:32:05', '1:34:25',
            '1:37:36', '1:49:36', '1:54:45', '2:13:24')
)

dh <- data.table(
  an = 2019:2015,
  equipe = c('Frédéric BARCZA', 'De la Laurence Amaury', 'Poncelet Paul',
             'De la Laurence Amaury', 'Sacotte Mathieu'),
  temps = c('2:13:24', '1:47:29', '2:13:53', '1:39:18', '2:05:40')
)

timeToNum <- function(t) {
  s <- strsplit(t, ':')
  unlist(lapply(s, function(e) as.numeric(e[1]) * 3600 +
                  as.numeric(e[2]) * 60 + as.numeric(e[3])))
}

dt[, `:=`(vitesse_ms = round(21000 / timeToNum(temps), 2))]
dt[, `:=`(vitesse_kmh = round(vitesse_ms * 3.6, 2))]
dh[, `:=`(vitesse_ms = round(21000 / timeToNum(temps), 2))]
dh[, `:=`(vitesse_kmh = round(vitesse_ms * 3.6, 2))]
dh <- dh[order(-vitesse_ms)][, `:=`(performance = 1:5)][order(-an)]

capitalize <- function(text_s) {
  if (missing(text_s)) return(NA_character_)
  sapply(text_s, function(e) {
    if (length(e) == 0) return(e)
    if (is.na(e)) return(e)
    if (!is.character(e)) return(e)
    n <- nchar(e)
    if (n == 0) return(e)
    if (n == 1) return(toupper(e))
    paste0(toupper(substr(e, 1, 1)), substring(e, 2))
  }, USE.NAMES = !is.null(names(text_s)))
}

dm <- data.table(
  equipier = c('Fabrice', 'Javier', 'Frédéric', 'Fabien'),
  taille_cm = c(185, 180, 179, 184),
  poids_kg = c(98, 91, 84, 79)
)

# 75 = boat weight, 5 = on-board gear, 2 = clothing weight per rower
tw <- 75 + 5 + 2 * 4 + sum(dm$poids_kg)

ec <- function(m, v) .5 * m * v^2
energy <- function(equipe_s) .5 * tw * dt[equipe == equipe_s]$vitesse_ms^2

checkboxes <- c(empty = '\U2610', check = '\U2611', cross = '\U2612')
generateCheckBoxes <- function(truth_b) {
  v <- checkboxes[ifelse(truth_b, 2, 1)]
  names(v) <- names(truth_b)
  v
}

showTable <- function(dx, title_s = NA, align_s = 'l') {
  colnames(dx) <- capitalize(colnames(dx))
  x <- c(ncol(dx))
  if (!is.na(title_s)) {
    names(x) <- toupper(title_s)
    dx %>% kable(align = align_s) %>% kable_styling() %>%
      add_header_above(x, align = 'l')
  } else {
    dx %>% kable(align = align_s) %>% kable_styling()
  }
}

prtab <- function(mp, title_s) {
  dx <- data.table(t(mp))
  setnames(dx, names(mp))
  showTable(dx, title_s)
}

buildManualList <- function(x) {
  paste0(x, collapse = ', ')
}

showGoals <- function(dt, title = NA) showTable(dt, title) %>%
  column_spec(1, bold = T, border_right = T, extra_css = 'font-size: 20px')

showGoals(dt, 'PERFORMANCE')
```

# Our race

leg | comment
---:|:---
1/4 | Decent start, first leg clean and focused, stroke rate a little slow. We are overtaken a few times.
2/4 | Average buoy turn, second leg clean. Our stroke seems to be overheating.
3/4 | Refreshment stop. Our stroke is overheating. Difficult restart, boat skittish and chaotic. Good presence in the boat to keep everyone focused on the objective.
4/4 | Very poor buoy turn. A stop is necessary, as our stroke has cramps. Laborious final leg; our stroke does his best at half slide. Fatigue sets in and causes errors (light collision between blades and a buoy, a crab, ...).

# Performance comparison with CERAMM history

```{r echo=FALSE}
showGoals(dh[, -c('equipe'), with = FALSE], 'HISTORIQUE DES PERFORMANCES')
```

# In conclusion

I oscillate between frustration and satisfaction. Frustration at not having been able to do better: not frustrated by the manner (good spirit from everyone in the boat, a focused atmosphere despite changing weather conditions, a shared will to go all the way) but rather by the result. The final time results from too many stops (whose total duration I estimate at more than 10 minutes, though they were essential given the crew's condition), from a heavy load (more than 440 kg including the boat), and also from technical imperfections further amplified by the lack of automatic reflexes and synchronicity of our improbable crew. Satisfaction at having gone all the way. I believe it is shared.

```{r figcover, echo = FALSE, out.width='80%', fig.asp=1.25, fig.align='center'}
knitr::include_graphics('/images/aviron/a/a.png')
```

Physically I felt very good during the effort; my recent high-intensity training sessions are bearing fruit. A few aches appeared after our return to shore: pain in the adductors (gone by the next day) and numerous blisters on the second phalanges (still present and sensitive). Finally, excellent organisation by the RCPM.
They are well practised at the exercise and clearly master it very well.

To leave a comment for the author, please follow the link and comment on their blog: NEONIRA.

### Risk and return for B3

Wed, 05/01/2019 - 02:00

(This article was first published on R on msperlin, and kindly contributed to R-bloggers)

One of the subjects I teach in my undergraduate finance class is the relationship between risk and expected returns. In short, the riskier the investment, the higher the returns expected by the investor. It is not a difficult argument to make. All you need to understand is that people are not naive in financial markets. Whenever they make a big gamble, the rewards should also be large. Rational investors, in theory, would not invest in risky stocks that are likely to yield low returns.

Going further, one of the arguments I make to support this idea is looking at historical data. By assuming that the expected return is the average yearly return rate on a stock, and the risk is the standard deviation of the same returns, we can check for a positive relationship by plotting the data in a scatter plot.

In this post I'll show how you can do it easily in R using BatchGetSymbols, GetBCBData and the tidyverse. First, we will gather and organize all the data sets.
Here I'm using the stock components of Ibovespa, the Brazilian market index, and also the CDI, a common risk-free rate in Brazil. The next code will:

1. Import the data
2. Organize it in the same structure (same columns)
3. Bind it all together

# get stock data library(tidyverse) library(BatchGetSymbols) library(GetBCBData) first.date <- '2008-01-01' # last date is Sys.Date by default # get stock data df.ibov <- GetIbovStocks() mkt.idx <- c('^BVSP') my.tickers <- c(mkt.idx, paste0(df.ibov$tickers, '.SA') ) df.prices <- BatchGetSymbols(tickers = my.tickers, first.date = first.date, freq.data = 'yearly', be.quiet = TRUE)[[2]] tab.stocks <- df.prices %>% na.omit() %>% group_by(ticker) %>% summarise(mean.ret = mean(ret.adjusted.prices), sd.ret = sd(ret.adjusted.prices)) %>% mutate(ticker = str_replace_all(ticker, fixed('.SA'), '') ) tab.mkt.idx <- tab.stocks %>% filter(ticker %in% mkt.idx) tab.stocks <- tab.stocks %>% filter(!(ticker %in% mkt.idx)) # get CDI (risk free rate) my.id <- c(CDI = 4389) tab.CDI <- gbcbd_get_series(my.id, first.date = first.date) %>% rename(ticker = series.name ) %>% mutate(ref.date = format(ref.date, '%Y'), value = value/100) %>% group_by(ref.date, ticker) %>% summarise(ret = mean(value)) %>% group_by(ticker) %>% summarise(mean.ret = mean(ret), sd.ret = sd(ret))

Now that we have the data, let's use ggplot to build our graph.

library(ggplot2) p <- ggplot(tab.stocks, aes(x = sd.ret, y = mean.ret, group = ticker)) + geom_point() + geom_text(data = tab.stocks, aes(x = sd.ret, y = mean.ret, label = ticker), nudge_y = 0.03, check_overlap = TRUE, nudge_x = 0.05 ) + geom_point(data = tab.CDI, aes(x = sd.ret, y = mean.ret, color = ticker), size =5) + geom_point(data = tab.mkt.idx, aes(x = sd.ret, y = mean.ret, color = ticker), size =5) + labs(x = 'Risk (standard deviation)', y ='Expected Returns (average)', title = 'Mean X Variance map for B3', subtitle = paste0(nrow(tab.stocks), ' stocks, ', lubridate::year(min(df.prices$ref.date)), ' - ', lubridate::year(max(df.prices$ref.date)))) + scale_x_continuous(labels = scales::percent) + scale_y_continuous(labels = scales::percent) print(p)

Looks pretty! What do we learn?

• Overall, most of the stocks did better than the risk free rate (CDI);

• There is a positive relationship between risk and return. The higher the standard deviation (x-axis), the higher the mean of returns (y-axis). Notice, however, that it is not a perfect relationship. If we followed the mean-variance gospel, there would be lots of opportunities for arbitrage: we would mostly invest in those stocks in the upper-left part of the plot;

• Surprisingly, the market index, Ibovespa (^BVSP), is not well positioned in the graph. Since it is a diversified portfolio, I expected it to be closer to the frontier, around stock EQTL3.


To leave a comment for the author, please follow the link and comment on their blog: R on msperlin.

### Building a Shiny App as a Package

Tue, 04/30/2019 - 16:24

Shiny App as a Package

In a previous post, I’ve introduced the {golem} package, which is an opinionated framework for building production-ready Shiny Applications. This framework starts by creating a package skeleton waiting to be filled.

But, in a world where Shiny Applications are mostly created as a series of files, why bother with a package? This is the question I’ll be focusing on in this blog post.

What’s in a Shiny App?

OK, so let's ask the question the other way round. Think about your last Shiny app, which was created either as a single-file app (app.R) or a two-file app (ui.R and server.R). You've got these files, and you put them into a folder.

So, let’s have a review of what you’ll need next for a robust application.

First, metadata. In other words, the name of the app, the version number (which is crucial to any serious, production-level project), what the application does, who to contact if something goes wrong.

Then, you need to find a way to handle the dependencies. Because you know, when you want to push your app into production, you can’t have this conversation with IT:

IT: Hey, I tried to ‘source(“app.R”)’ but I’ve got an error.

R-dev: What’s the error?

IT: It says “could not find package ‘shiny’”.

R-dev: Ah yes, you need to install {shiny}. Try to run ‘install.packages(“shiny”)’.

IT: Ok nice. What else?

R-dev: Let me think, try also ‘install.packages(“DT”)’… good? Now try ‘install.packages(“ggplot2”)’, and …

[…]

IT: Ok, now I source the ‘app.R’, right?

R-dev: Sure!

IT: Ok so it says ‘could not find function runApp()’

R-dev: Ah, you’ve got to do library(shiny) at the beginning of your script. And library(purrr), and library(jsonlite)*.

* Which will lead to a namespace conflict on the flatten() function that can cause you some debugging headaches. So, hey, it would be cool if we could have a Shiny app that only imports specific functions from a package, right?
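As an aside, this kind of selective import is exactly what roxygen2 tags give you inside a package. A minimal sketch (the imported functions are just illustrative; pick the ones your app actually uses):

```r
# In a file such as R/myapp-package.R: import only flatten() from {purrr}
# and fromJSON() from {jsonlite}, instead of attaching whole packages
# with library() calls at the top of app.R.
#' @importFrom purrr flatten
#' @importFrom jsonlite fromJSON
NULL
```

After running roxygen2 (e.g. devtools::document()), the NAMESPACE file will contain the corresponding importFrom() directives, so only those two functions are imported and the flatten() conflict goes away.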

So yes, dependencies matter. You need to handle them, and handle them correctly.

Now, let's say you're building a big app. Something with thousands of lines of code. Handling a one-file or two-file shiny app with that many lines is just a nightmare. So, what to do? Let's split everything into smaller files that we can call!

And finally, we want our app to live long and prosper, which means we need to document it: each small piece of code should have a comment explaining what those specific lines do. The other thing we need for our application to be successful in the long run is tests, so that we are sure we're not introducing any regression.

Oh, and that would be nice if people can get a tar.gz and install it on their computer and have access to a local copy of the app!

OK, so let's sum up: we want to build an app. This app needs to have metadata and to handle dependencies correctly, which is what you get from the DESCRIPTION + NAMESPACE files of a package. Even more practical is the fact that you can do "selective namespace extraction" inside a package, i.e. you can say "I want this function from this package". Also, a big app needs to be split into smaller .R files, which is the way a package is organized. And I don't need to emphasize how documentation is a vital part of any package, so we solved this question too. So is the testing toolkit. And of course, the "install everywhere" wish comes to life when a Shiny App is in a package.
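To make the metadata point concrete, a minimal DESCRIPTION for a hypothetical app package could look like this (every field value here is a placeholder):

```
Package: myuberapp
Title: An Example Production Shiny Application
Version: 0.1.0
Authors@R: person("Jane", "Doe", email = "jane@example.com", role = c("aut", "cre"))
Description: Name, version number, purpose and contact point, all in one file,
    plus the dependency list that install.packages() will resolve for IT.
Imports:
    shiny,
    DT,
    ggplot2
License: MIT + file LICENSE
```

The Imports field is what replaces the painful "install this, now install that" conversation: installing the package pulls in every declared dependency automatically.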

The other plus side of Shiny as a Package

Testing

Nothing should go to production without being tested. Nothing. Testing production apps is a wide question, and I’ll just stick to tests inside a Package here.

Frameworks for package testing are robust and widely documented. So you don’t have to put any extra-effort here: just use a canonical testing framework like {testthat}. Learning how to use it is not the subject of this post, so feel free to refer to the documentation. See also Chapter 5 of “Building a package that lasts”.

What should you test?

• First of all, as we’ve said before, the app should be split between the UI part and the backend (or ‘business logic’) part. These backend functions are supposed to run without any interactive context, just as plain old functions. So for these ones, you can do classical tests. As they are backend functions (so specific to a project), {golem} can’t provide any helpers for that.
• For the UI part, remember that any UI function is designed to render an HTML element. So you can save a file as HTML, and then compare it to a UI object with golem::expect_html_equal().
library(shiny) ui <- tagList(h1("Hello world!")) htmltools::save_html(ui, "ui.html") golem::expect_html_equal(ui, "ui.html") # Changes ui <- tagList(h2("Hello world!")) golem::expect_html_equal(ui, "ui.html")

This can for example be useful if you need to test a module. A UI module function returns an HTML tag list, so once your modules are set you can save them and use them inside tests.

my_mod_ui <- function(id){ ns <- NS(id) tagList( selectInput(ns("this"), "that", choices = LETTERS[1:4]) ) } my_mod_ui_test <- tempfile(fileext = "html") htmltools::save_html(my_mod_ui("test"), my_mod_ui_test) # Some time later, and of course saved in the test folder, # not as a temp file golem::expect_html_equal(my_mod_ui("test"), my_mod_ui_test)

{golem} also provides two functions, expect_shinytag() and expect_shinytaglist(), that test whether an object is of class "shiny.tag" or "shiny.tag.list".

• Testing package launch: when launching golem::use_recommended_tests(), you'll find a test built on top of {processx} that allows you to check that the application is launchable. Here's a short description of what happens:

# Standard testthat things context("launch") library(processx) testthat::test_that( "app launches",{ # We're creating a new process that runs the app x <- process$new( "R", c( "-e", # As we are in the tests/testthat dir, we're moving # two steps back before launching the whole package # and we try to launch the app "setwd('../../'); pkgload::load_all(); run_app()" ) ) # We leave some time for the app to launch # Configure this according to your needs Sys.sleep(5) # We check that the app is alive expect_true(x$is_alive()) # We kill it x$kill() } )

Note: this specific configuration will possibly fail on continuous integration platforms such as GitLab CI or Travis. A workaround is, inside your CI yml, to first install the package with remotes::install_local(), and then replace the setwd(...) run_app() command with myuberapp::run_app(). For example:

• in .gitlab-ci.yml:

test: stage: test script: - echo "Running tests" - R -e 'remotes::install_local()' - R -e 'devtools::check()'

• in test-golem.R:

testthat::test_that( "app launches",{ x <- process$new( "R", c( "-e", "myuberapp::run_app()" ) ) Sys.sleep(5) expect_true(x$is_alive()) x$kill() } )

Documenting

Documenting packages is a natural thing for any R developer. Any exported function should have its own documentation, hence you are "forced" to document any user-facing function.

Also, building a Shiny App as a package allows you to write standard R documentation:

• Vignettes that explain how to use your app
• A {pkgdown} site that can be used as an external link for your application.
Deploy

Local deployment

As your Shiny App is a standard package, it can be built as a tar.gz, sent to your colleagues, friends, and family, and even to CRAN. It can also be installed in any R package repository. Then, if you've built your app with {golem}, you'll just have to do:

library(myuberapp) run_app()

RStudio Connect & Shiny Server

Both these platforms expect a file-based app configuration, i.e. an app.R file or ui.R / server.R files. So how can we integrate this "Shiny App as Package" into Connect or Shiny Server?

• Using an internal package manager like RStudio Package Manager, where the app package is installed; then you simply have to create an app.R with the small piece of code from the section just before.
• Uploading the package folder to the server. In that scenario, you use the package folder as the app package, and upload the whole thing. Then, write an app.R that does:
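For that second scenario, the app.R at the root of the uploaded folder can be very small. A sketch under the assumption that the package is called myuberapp, as in the earlier examples, and that run_app() is the exported launcher:

```r
# app.R -- entry point read by RStudio Connect / Shiny Server.
# Load the package from the uploaded source folder, then launch the app.
pkgload::load_all(".")
run_app()
```
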

And of course, don’t forget to list this file in the .Rbuildignore!

Docker containers

In order to dockerize your app, simply install the package as any other package, and use as a CMD R -e 'options("shiny.port"=80,shiny.host="0.0.0.0");myuberapp::run_app()'. Of course change the port to the one you need.
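Put together, a minimal Dockerfile might look like the following sketch (the base image and the package name myuberapp are assumptions; adapt them to your setup):

```dockerfile
FROM rocker/r-base:latest
# Copy the package source into the image and install it like any other package
COPY . /app
RUN R -e "install.packages('remotes'); remotes::install_local('/app')"
EXPOSE 80
CMD R -e "options('shiny.port' = 80, shiny.host = '0.0.0.0'); myuberapp::run_app()"
```
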

You’ll get the Dockerfile you need with golem::add_dockerfile().

Resources

The post Building a Shiny App as a Package appeared first on (en) The R Task Force.


### Machine Learning explained for statistics folk

Tue, 04/30/2019 - 15:58

(This article was first published on R – Dataviz – Stats – Bayes, and kindly contributed to R-bloggers)

I’m running a one-day workshop called “From Statistics To Machine Learning” in central London on 28 October, for anyone who learnt some statistics and wants to find out about machine learning methods.

I guess you might feel frustrated. There’s a lot of interest and investment in machine learning, but it’s hard to know where to start learning if you have a stats background. Writing is either aimed squarely at experts in computer science, with a lot of weird jargon that looks almost familiar, but isn’t quite (features? loss functions? out-of-bag?), or is written for gullible idiots with more money than sense.

This workshop is a mixture of about 25% talks, 25% demonstration with R, and 50% small-group activities where you will consider scenarios detailing realistic data analysis needs for public sector, private sector and NGOs. Your task will be to discuss possible machine learning approaches and to propose an analysis plan, detailing pros and cons and risks to the organisation from data problems and unethical consequences. You’ll also have time to practice communicating the analysis plan in plain English to decision-makers. With only 18 seats, it’ll be a small enough session for everyone to learn from each other and really get involved. This is not one where you sit at the back and catch up on your emails!

I’ve built this out of training I’ve done for clients who wanted machine learning and AI demystified. The difference here is that you’re expected to have some statistical know-how, so we can start talking about the strengths and weaknesses of machine learning in a rigorous way that is actually useful to you in your work straight away.

Here are some topics we’ll cover:

• Terminology and jargon
• Supervised and unsupervised learning
• Ensembles, bagging and boosting
• Neural networks, image data and adversarial thinking
• AI and ethical concerns
• Reinforcement and imitation learning
• Big data’s challenges, opportunities and hype
• Speed and memory efficiency
• Concepts of model building such as cross-validation and feature engineering
• Options for software, outsourcing and software-as-a-service
• Data science workplaces combining statistical expertise with machine learning: what makes them happy and healthy

You can bring a laptop to try out some of the examples in R, but this is not essential.

It’s £110 all inclusive, which is about as cheap as I can make it with a central London venue. I’m mindful that a lot of people interested in this might be students or academics or public sector analysts, and I know you don’t exactly have big training budgets to dip into. Having said that, you can haggle with me if you’re a self-funding student or out of work or whatever.

Venue is TBC but I’ll confirm soon. If I get my way it’ll be one where you get breakfast (yes!), bottomless coffee and tea (yes!!) and a genuinely nice lunch (yes!!!), all included in the cost. Then, we can repair to the craft beer company afterwards and compare laptop stickers.

Book ’em quick here! I’ve had a lot of interest in this and it might go fast.


To leave a comment for the author, please follow the link and comment on their blog: R – Dataviz – Stats – Bayes. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Spatial data and maps conference – FOSS4G

Tue, 04/30/2019 - 12:36

(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

I’m helping organise a conference on (geo)spatial open source software – FOSS4G. We’re hosting it in the great city of Edinburgh, Scotland in September 2019.

Abstract submissions: https://uk.osgeo.org/foss4guk2019/talks_workshops.html

We’re very interested in hearing your tales of R, QGIS, Python, GRASS, PostGIS, etc.!


To leave a comment for the author, please follow the link and comment on their blog: R – scottishsnow.

### The Rich didn’t earn their Wealth, they just got Lucky

Tue, 04/30/2019 - 12:00

(This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)

Tomorrow, on the First of May, many countries celebrate the so-called International Workers’ Day (or Labour Day): time to talk about the unequal distribution of wealth again!

A few months ago I posted a piece with the title “If wealth had anything to do with intelligence…” where I argued that ability, e.g. intelligence, as an input has nothing to do with wealth as an output. It drew a lot of criticism (as expected), most of it unfounded in my opinion, but one piece merits some discussion: the fact that the intelligence quotient (IQ) is normally distributed by construction. The argument goes that intelligence per se may follow a fat-tailed distribution too, but the way the IQ is constructed transforms the metric into a well-formed Gaussian distribution. To a degree this is certainly true, yet I would still argue that the distribution of intelligence, and of all other human abilities, is far better behaved than the extremely unequal distribution of wealth. I wrote in a comment:

There are many aspects in your comment that are certainly true. Obviously there are huge problems in measuring “true” mental abilities, which is the exact reason why people came up with a somewhat artificial “intelligence quotient” with all its shortcomings.

What would be interesting to see is (and I don’t know if you perhaps have a source about this) what the outcome of an intelligence test would look like without the “quotient” part, i.e. without subsequently normalizing the results.

I guess the relationship wouldn’t be strictly linear but it wouldn’t be as extreme as the wealth distribution either.

What I think is true in any case, independent of the distributions, is that when you rank all people by intelligence and by wealth respectively, you wouldn’t really see any stable connection – and that spirit was the intention of my post in the first place. I still stand by it, although some of the technicalities are obviously debatable.

So, if you have a source, Dear Reader, you are more than welcome to share it in the comments – I am always eager to learn!

I ended my post with:

But if it is not ability/intelligence that determines the distribution of wealth what else could account for the extreme inequality we perceive in the world?

In this post I will argue that luck is a good candidate, so read on…

In 2014 there was a special issue of the renowned magazine Science titled “The science of inequality”. In one of the articles (Cho, A.: “Physicists say it’s simple”) the following thought experiment is being proposed:

Suppose you randomly divide $500 million in income among 10,000 people. There’s only one way to give everyone an equal, $50,000 share. So if you’re doling out earnings randomly, equality is extremely unlikely. But there are countless ways to give a few people a lot of cash and many people a little or nothing. In fact, given all the ways you could divvy out income, most of them produce an exponential distribution of income.

So, the basic idea is to randomly throw 9,999 darts at a scale ranging from zero to 500 million and study the resulting distribution of intervals:

library(MASS)

w <- 5e8  # wealth
p <- 1e4  # no. of people

set.seed(123)
d <- diff(c(0, sort(runif(p-1, max = w)), w))  # wealth distribution

h <- hist(d, col = "red", main = "Exponential decline", freq = FALSE, breaks = 45, xlim = c(0, quantile(d, 0.99)))
fit <- fitdistr(d, "exponential")
curve(dexp(x, rate = fit$estimate), col = "black", type = "p", pch = 16, add = TRUE)

The resulting distribution fits an exponential distribution very well. You can read some interesting discussions concerning this result on CrossValidated StackExchange: How can I analytically prove that randomly dividing an amount results in an exponential distribution (of e.g. income and wealth)?

Just to give you an idea of how unfair this distribution is: the richest seven persons have more wealth than the poorest ten percent together:

sum(sort(d)[9994:10000]) - sum(sort(d)[0:1000])
## [1] 183670.8

If you think that this is ridiculous, just look at the real global wealth distribution: here it is not seven but three persons who own more than the poorest ten percent!

Now, what does that mean? Well, equality seems to be the exception and (extreme) inequality the rule. The intervals were found randomly; no interval had any special skills, just luck – and the result is (extreme) inequality, as in the real world! If you can reproduce the wealth distribution of a society stochastically, this could imply that it wasn’t so much the extraordinary skills of the rich that made them rich – they just got lucky.

Some rich people are decent enough to admit this. In his impressive essay “Why Poverty Is Like a Disease”, Christian H. Cooper, a hillbilly turned investment banker, writes:

So how did I get out? By chance.

It’s easy to attach a post-facto narrative of talent and hard work to my story, because that’s what we’re fed by everything from Hollywood to political stump speeches. But it’s the wrong story. My escape was made up of a series of incredibly unlikely events, none of which I had real control over. […] I am the exception that proves the rule – but that rule is that escape from poverty is a matter of chance, and not a matter of merit.

A consequence would be that you cannot really learn much from the rich, so throw away all of your self-help books on how to become successful. I will end with a cartoon which brings home this message, on a closely related concept, the so-called survivorship bias (which is also important to keep in mind when backtesting trading strategies in quantitative finance, the topic of an upcoming post… so stay tuned!):

Source: xkcd.com/1827

To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.

### Google Next 2019 – observations from the Mango viewing deck

Tue, 04/30/2019 - 11:01

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

The first thing that hit me about this year’s Google Next was the scale. Thousands of people wandering around the Moscone Centre in San Fran with their name badges proudly displayed really brought home that this is a platform of which its users are proud.
I was told by several people that the conference has doubled in size year on year which, given there were 35,000 people at this year’s event, may well prove a logistical nightmare next year.

I was really struck by the massive enthusiasm of the Google team, the way in which the company’s leadership is aware of Google’s market position, and how keen they seem to be to take pragmatic and strategic decisions based on where they are, versus where they might like to be. The opening up to other cloud platforms via the Anthos announcement seems a good way for Google to position itself as the honest broker in the field – they have identified legacy apps and codebases as difficult to turn over, something with which I think many organisations will feel comfortable. There were rafts of customer testimonials, and whilst many of them did not seem to contain much in the way of ROI details, the mere fact that Google could call these C-level Fortune 500 companies onto the stage speaks to a clarity of intent and purpose.

The nature of many of these types of events is fairly broad, and considering that Mango is a relatively niche player, it is sometimes difficult to find the areas and talks that resonate with our posture. That was true of these sessions, but not entirely surprising. The widescale abuse of terms like AI and Machine Learning carries on apace, and we at Mango need to find ways to gently persuade people that when they think of AI, they actually mean Machine Learning, and when they talk about Machine Learning they might well be talking about, well, models. The current push in the industry is to be able to add these complex components at the click of a button, by an untrained analyst who can then roll it into production without proper validation or testing.
These are dangerous situations, and they reminded me of the importance of doing some of the hard yards first, i.e. creating an analytic culture and community to ensure that the “right” analytics are used with the “right” data.

It’s clear, however, that the opening up of cloud platforms is creating an arena in which advanced analytics will play a crucial role, and presents massive opportunities for Mango in working with Google and our customers. We’ve loved being back in San Francisco and it’s always lovely to be around passionate and energetic advocates. Hopefully London Next later in the year will be equally energetic and fun.

To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.

### Zooming in on maps with sf and ggplot2

Tue, 04/30/2019 - 10:58

(This article was first published on r-bloggers – WZB Data Science Blog, and kindly contributed to R-bloggers)

When working with geo-spatial data in R, I usually use the sf package for manipulating spatial data as Simple Features objects and ggplot2 with geom_sf for visualizing these data. One thing that comes up regularly is “zooming in” on a certain region of interest, i.e. displaying a certain map detail. There are several ways to do so. Three common options are:

• selecting only certain areas of interest from the spatial dataset (e.g. only certain countries / continent(s) / etc.)
• cropping the geometries in the spatial dataset using st_crop()
• restricting the display window via coord_sf()

I will show the advantages and disadvantages of these options and especially focus on how to zoom in on a certain point of interest at a specific “zoom level”. We will see how to calculate the coordinates of the display window or “bounding box” around this zoom point.

Loading the required libraries and data

We’ll need to load the following packages:

library(ggplot2)
library(sf)
library(rnaturalearth)

Using ne_countries() from the rnaturalearth package, we can get a spatial dataset with the borders of all countries in the world. We load this data as a “simple feature collection” by specifying returnclass = 'sf'. The result is a spatial dataset that can be processed with tools from the sf package.

worldmap <- ne_countries(scale = 'medium', type = 'map_units', returnclass = 'sf')
# have a look at these two columns only
head(worldmap[c('name', 'continent')])

Simple feature collection with 6 features and 2 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -180 ymin: -18.28799 xmax: 180 ymax: 83.23324
epsg (SRID):    4326
proj4string:    +proj=longlat +datum=WGS84 +no_defs
                      name     continent                       geometry
0                     Fiji       Oceania MULTIPOLYGON (((180 -16.067...
1                 Tanzania        Africa MULTIPOLYGON (((33.90371 -0...
2                W. Sahara        Africa MULTIPOLYGON (((-8.66559 27...
3                   Canada North America MULTIPOLYGON (((-122.84 49,...
4 United States of America North America MULTIPOLYGON (((-122.84 49,...
5               Kazakhstan          Asia MULTIPOLYGON (((87.35997 49...

We see that we have a spatial dataset consisting of several attributes (only name and continent are shown above) for each spatial feature, which is stored in the geometry column. The header above the table displays further information such as the coordinate reference system (CRS), which will be important later. For now, it is important to know that the coordinates in the plot are longitude / latitude coordinates in degrees in a CRS called WGS84. The longitude ranges from -180° (180° west of the prime meridian) to +180° (180° east of the prime meridian). The latitude ranges from -90° (90° south of the equator) to +90° (90° north of the equator).

Let’s plot this data to confirm what we have:

ggplot() + geom_sf(data = worldmap) + theme_bw()

Selecting certain areas of interest from a spatial dataset

This is quite straightforward: we can plot a subset of a larger spatial dataset, and ggplot2 automatically sets the display window accordingly, so that it only includes the selected geometries. Let’s select and plot France only:

france <- worldmap[worldmap$name == 'France',]
ggplot() + geom_sf(data = france) + theme_bw()

Let’s try to select and plot all countries from Europe:

europe <- worldmap[worldmap\$continent == 'Europe',] ggplot() + geom_sf(data = europe) + theme_bw()

Well, this is probably not the section that you want to display when you make a plot of Europe. It includes the whole of Russia (since its continent attribute is set to "Europe"), which makes the plot span the whole northern globe. At the same time, nearby countries such as Turkey or countries from northern Africa are missing, which you might want to include to show “context” around your region of interest.
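One quick workaround, before turning to the more robust options below, is to simply drop Russia from the selection (a sketch; it assumes the country is named 'Russia' in the rnaturalearth attribute table):

```r
# select Europe but leave out Russia, whose geometry
# stretches the display window across the whole northern globe
europe_wo_rus <- worldmap[worldmap$continent == 'Europe' & worldmap$name != 'Russia', ]
ggplot() + geom_sf(data = europe_wo_rus) + theme_bw()
```

Even then, overseas territories attached to some countries’ geometries can still widen the axes, which is why cropping or restricting the display window is usually the better choice.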

Cropping geometries in the spatial dataset or restricting the display window

To fix this, you may crop the geometries in your spatial dataset, as if you took a knife and sliced through the shapes on the map to take out a rectangular region of interest. The alternative is to restrict the display window of your plot, which is like folding the map so that you only see the region of interest. The rest of the map still exists; you just don’t see it.

So both approaches are quite similar, only that the first actually changes the data by cutting the geometries, whereas the latter only “zooms” to a detail but keeps the underlying data untouched.

Let’s try out the cropping method first, which can be done with st_crop(). It works by passing the spatial dataset (the whole worldmap dataset in our case) and specifying a rectangular region which we’d like to cut. Our coordinates are specified as longitude / latitude in degrees, as mentioned initially. We set the minimum longitude (i.e. the left border of the crop area) as xmin and the maximum longitude (i.e. the right border) as xmax. The minimum / maximum latitude (i.e. the lower and upper borders) are set analogously as ymin and ymax. For a cropping area of Europe, we can set the longitude range to be in [-20, 45] and the latitude range to be in [30, 73]:

europe_cropped <- st_crop(worldmap, xmin = -20, xmax = 45, ymin = 30, ymax = 73) ggplot() + geom_sf(data = europe_cropped) + theme_bw()

You can see that the shapes were actually cropped at the specified ranges when you look at the gaps near the plot margins. If you want to remove these gaps, you can add + coord_sf(expand = FALSE) to the plot.
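Putting that tip into code, the gap-free version of the plot above:

```r
# redraw the cropped map without padding around the limits
ggplot() + geom_sf(data = europe_cropped) + coord_sf(expand = FALSE) + theme_bw()
```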

The clipping also becomes visible when we project the shapes to a different CRS. In the default projection used by ggplot2, the grid of longitudes and latitudes, which is called the graticule, is made of perpendicular lines (displayed as light gray lines). When we use a Mollweide projection, which is a “pseudo-cylindrical” projection, we see how the clipped shapes are warped along the meridians, which do not run in parallel in this projection but converge to the polar ends of the prime meridian:
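The Mollweide plot described above can be produced along these lines (a sketch; the post shows only the resulting figure):

```r
# project the cropped shapes to a Mollweide projection for display;
# the cropped edges are warped along the converging meridians
ggplot() + geom_sf(data = europe_cropped) + coord_sf(crs = st_crs('+proj=moll')) + theme_bw()
```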

Restricting the display window for plotting achieves a very similar result, except that we don’t need to modify the spatial dataset. We can use coord_sf() to specify a range of longitudes (an xlim vector with minimum and maximum) and latitudes (a ylim vector) to display, similar to what we did for st_crop():

ggplot() + geom_sf(data = worldmap) + coord_sf(xlim = c(-20, 45), ylim = c(30, 73), expand = FALSE) + theme_bw()

You see that the result is very similar, but we use the whole worldmap dataset without cropping it. Notice that I used expand = FALSE which removes the padding around the limits of the display window (this is what produced the “gaps” at the plot margins before).

Now what if we wanted to use the Mollweide projection again as target projection for this section? Let’s try:

ggplot() + geom_sf(data = worldmap) + coord_sf(crs = st_crs('+proj=moll'), xlim = c(-20, 45), ylim = c(30, 73), expand = FALSE) + theme_bw()

What happened here? The problem is that we specified xlim and ylim as WGS84 coordinates (longitude / latitude in degrees) but coord_sf needs these coordinates to be in the same CRS as the target projection, which is the Mollweide projection in our case. Mollweide uses meters as unit, so we actually specified a range of -20m to 45m on the x-scale and 30m to 73m on the y-scale, which is a really tiny patch near the center of projection.

So in order to fix that, we need to transform the coordinates of the display window to the target CRS first. Before doing so, let’s get a visual overview of what we want to do:

What we see is the whole worldmap dataset projected to our target CRS (standard Mollweide projection). There are two points, A = (-20°, 30°) and B = (45°, 73°), which define our display window, highlighted with red lines. The display window is a trapezoid because the originally rectangular display window defined by longitude / latitude coordinates is warped along the meridians, just as our cropped shape of Europe before. When you have a look at the axes, you’ll see that there are no degrees displayed here anymore. Instead, the graticule shows the target CRS, which has the unit meters (this is why the numbers on the axes are so large). Our corner points of the display window A and B must be converted to this CRS, too, and we can do that as follows:

At first, we set the target CRS and transform the whole worldmap dataset to that CRS.

target_crs <- '+proj=moll' worldmap_trans <- st_transform(worldmap, crs = target_crs)

Next, we specify the display window in WGS84 coordinates as longitude / latitude degrees. We only specify the bottom left and top right corners (points A and B). The CRS is set to WGS84 by using the EPSG code 4326.

disp_win_wgs84 <- st_sfc(st_point(c(-20, 30)), st_point(c(45, 73)), crs = 4326)
disp_win_wgs84

Geometry set for 2 features
geometry type:  POINT
dimension:      XY
bbox:           xmin: -20 ymin: 30 xmax: 45 ymax: 73
epsg (SRID):    4326
proj4string:    +proj=longlat +datum=WGS84 +no_defs
POINT (-20 30)
POINT (45 73)

These coordinates can be transformed to our target CRS via st_transform.

disp_win_trans <- st_transform(disp_win_wgs84, crs = target_crs)
disp_win_trans

Geometry set for 2 features
geometry type:  POINT
dimension:      XY
bbox:           xmin: -1833617 ymin: 3643854 xmax: 2065900 ymax: 8018072
epsg (SRID):    NA
proj4string:    +proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs
POINT (-1833617 3643854)
POINT (2065900 8018072)

We can extract the coordinates from these points and pass them as limits for the x and y scale respectively. Note also that I set datum = target_crs which makes sure that the graticule is displayed in the target CRS. Otherwise ggplot2 by default displays a WGS84 graticule.

disp_win_coord <- st_coordinates(disp_win_trans) ggplot() + geom_sf(data = worldmap_trans) + coord_sf(xlim = disp_win_coord[,'X'], ylim = disp_win_coord[,'Y'], datum = target_crs, expand = FALSE) + theme_bw()

This works, but if you look closely, you’ll notice that the display window is slightly different than the one specified in WGS84 coordinates before (the red trapezoid). For example, Turkey is completely missing. This is because the coordinates describe the rectangle between A and B in Mollweide projection, and not the red trapezoid that comes from the original projection.

Calculating the display window for a given center point and zoom level

Instead of specifying a rectangular region of interest, I found it more comfortable to set a center point of interest and specify a “zoom level” which specifies how much area around this spot should be displayed.

I oriented myself on how zoom levels are defined for OpenStreetMap: a level of 0 shows the whole world, a level of 1 shows a quarter of the world map, a level of 2 shows 1/16th of the world map, and so on, up to a zoom level of 19. So if z is the zoom level, we see a region that covers 1/4^z of the world. This means the range on the longitude axis that is displayed is 360°/2^z (because the longitude spans the full horizontal circle around the world from -180° to +180°) and the range on the latitude axis is 180°/2^z (latitude spans the half vertical circle from -90° to +90°).
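As a quick check of these formulas, the displayed degree ranges for the first few zoom levels:

```r
# longitude / latitude range (in degrees) shown at zoom levels 0 to 3
z <- 0:3
data.frame(zoom = z, lon_range = 360 / 2^z, lat_range = 180 / 2^z)
# zoom 0: 360 x 180, zoom 1: 180 x 90, zoom 2: 90 x 45, zoom 3: 45 x 22.5
```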

Let’s set a center point zoom_to, a zoom_level and calculate the ranges for longitude and latitude (lon_span and lat_span).

zoom_to <- c(13.38, 52.52)  # Berlin
zoom_level <- 3
lon_span <- 360 / 2^zoom_level
lat_span <- 180 / 2^zoom_level

Now we can calculate the longitude and latitude ranges for the display window by subtracting/adding half of the above ranges from/to the zoom center coordinates respectively.

lon_bounds <- c(zoom_to[1] - lon_span / 2, zoom_to[1] + lon_span / 2) lat_bounds <- c(zoom_to[2] - lat_span / 2, zoom_to[2] + lat_span / 2)

We can now plot the map with the specified display window for zoom level 3, highlighting the zoom center as red dot:

ggplot() + geom_sf(data = worldmap) + geom_sf(data = st_sfc(st_point(zoom_to), crs = 4326), color = 'red') + coord_sf(xlim = lon_bounds, ylim = lat_bounds) + theme_bw()

So this works quite well when using the default projection and WGS84 coordinates. How about another projection? Let’s put Kamchatka at the center of our map, set a zoom level, and use a Lambert azimuthal equal-area projection (LAEA) centered at our zoom point:

zoom_to <- c(159.21, 56.41)  # ~ center of Kamchatka
zoom_level <- 2.5
# Lambert azimuthal equal-area projection around center of interest
target_crs <- sprintf('+proj=laea +lon_0=%f +lat_0=%f', zoom_to[1], zoom_to[2])

When we worked with WGS84 coordinates before, we calculated the displayed range of longitude and latitude at a given zoom level z in degrees by dividing the full circle for the longitude (360°) by 2^z, and the half circle (180°) for the latitude in the same way. A LAEA projection uses meters as its unit instead of degrees, so instead of dividing a circle of 360°, we divide the full circumference of the Earth in meters, which is about C = 40,075,016m, by 2^z to get the display range for the x-axis (corresponding to the longitude range before). As before, we use the half circumference (assuming that the Earth is a perfect sphere, which it is not) for the displayed range on the y-axis, so we get C / 2^(z+1).

C <- 40075016.686  # ~ circumference of Earth in meters
x_span <- C / 2^zoom_level
y_span <- C / 2^(zoom_level+1)

We can now calculate the lower left and upper right corners of our display window by subtracting/adding half of x_span and y_span from/to the zoom center coordinates as before. But of course, the zoom center has to be converted to the target CRS, too. We put the zoom point at the center of the projection, so in this special case we already know that it is at (0m, 0m) in the target CRS. Anyway, I will show the general calculation, because when you use a standardized CRS with an EPSG code, your zoom point probably won’t be the center of projection:

zoom_to_xy <- st_transform(st_sfc(st_point(zoom_to), crs = 4326), crs = target_crs)
zoom_to_xy

Geometry set for 1 feature
geometry type:  POINT
dimension:      XY
bbox:           xmin: 1.570701e-09 ymin: 7.070809e-10 xmax: 1.570701e-09 ymax: 7.070809e-10
epsg (SRID):    NA
proj4string:    +proj=laea +lat_0=56.41 +lon_0=159.21 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs
POINT (1.570701e-09 7.070809e-10)

Rounding errors aside, this confirms that the zoom point is at the center of projection. We can now calculate the display window coordinates as stated before¹:

disp_window <- st_sfc( st_point(st_coordinates(zoom_to_xy - c(x_span / 2, y_span / 2))), st_point(st_coordinates(zoom_to_xy + c(x_span / 2, y_span / 2))), crs = target_crs )

And now passing the display window coordinates to coord_sf, we can plot the result:

ggplot() + geom_sf(data = worldmap) + geom_sf(data = zoom_to_xy, color = 'red') + coord_sf(xlim = st_coordinates(disp_window)[,'X'], ylim = st_coordinates(disp_window)[,'Y'], crs = target_crs, datum = target_crs) + theme_bw()

You can try this out with different zoom levels, zoom center points and projections. Note that some combinations of zoom levels and projections will cause projection errors, since some projections are only defined within certain bounds and not for the whole world. I put the full R script in this gist.

¹ Again, we could simplify this in the special case where the zoom point is the center of projection. Also note that I had to wrap each coordinate in st_point(st_coordinates(...)), which seems redundant but was the only way to get it working. It looks like the sf objects lose the CRS information when applying vector operations to them.


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – WZB Data Science Blog.