
Bayesian Statistics: Analysis of Health Data

Mon, 03/11/2019 - 02:16

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    Categories

    1. Regression Models

    Tags

    1. Bayesian Analysis
    2. Linear Regression
    3. R Programming
    4. t-test

    The premise of Bayesian statistics is that distributions are based on personal beliefs about their shape, in contrast to the classical approach, which does not take such subjectivity into account.

    In this regard, Bayesian statistics defines distributions in the following way:

    • Prior: Beliefs about a distribution prior to observing any data.
    • Likelihood: Distribution based on the observed data.
    • Posterior: Distribution that combines observed data with prior beliefs.
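    As a minimal numerical illustration of how these three pieces fit together (this sketch is not part of the original analysis, and the prior counts and data below are made up), a conjugate Beta-Binomial update can be written in a few lines of base R:

    # Illustrative only: Beta prior + binomial data gives a Beta posterior analytically
    prior_a <- 2; prior_b <- 2          # prior belief: proportion likely near 0.5
    successes <- 30; failures <- 10     # observed data (the likelihood)
    post_a <- prior_a + successes       # conjugate update
    post_b <- prior_b + failures

    curve(dbeta(x, prior_a, prior_b), from = 0, to = 1, ylim = c(0, 8),
          xlab = "proportion", ylab = "density", lty = 2)   # prior
    curve(dbeta(x, post_a, post_b), add = TRUE)             # posterior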

    Classical statistics relies largely on the t-test to determine significance of a particular variable, and does not take subjective predictions of the data into account. However, the issue with such an approach is that no constraint is placed on the data, and as Richard Morey explains in his blog, an alternative hypothesis becomes “completely unfalsifiable”.

    In this regard, the Bayes factor penalizes hypotheses that are not specific enough, allowing for a more concrete assessment of how accurate a specific prediction is.

    Problem and hypothesis

    As an example, let us consider the hypothesis that BMI increases with age. The null hypothesis, conversely, states that there is no positive correlation between BMI and age.

    In this regard, even if we did find a positive correlation between BMI and age, such a loosely stated hypothesis is virtually unfalsifiable, since the existence of no relationship whatsoever between these two variables is highly unlikely.

    Therefore, rather than relying on a simple t-test, a Bayes factor analysis requires specific predictions whose accuracy can then be assessed.

    t-test vs. Bayes Factor t-test

    For this problem, the diabetes dataset from the UCI Machine Learning Repository is used. The variables Age and BMI are extracted, the data is ordered by age, and a one-sample t-test is run on the BMI differences between the younger and older halves of the sample.

    library(BayesFactor)

    # Data Processing
    health <- read.csv("diabetes.csv")
    attach(health)
    newdata <- data.frame(Age, BMI)
    newdata <- newdata[order(Age),]

    ## Compute difference scores
    diffScores = newdata$BMI[1:384] - newdata$BMI[385:768]

    ## Traditional two-tailed t test
    t.test(diffScores)

    data:  diffScores
    t = -2.2622, df = 383, p-value = 0.02425
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
     -2.4371984 -0.1706141
    sample estimates:
    mean of x
    -1.303906

    With a p-value below 0.05, the t-test shows significance at the 5% level, indicating that the null hypothesis that the mean difference equals 0 is rejected. However, the issue remains that the degree of evidence in favor of H1 over H0 cannot be quantified in detail.

    In this regard, a Bayes Factor t-test is run across the different scores.

    bf = ttestBF(x = diffScores)
    bf

    Bayes factor analysis
    --------------
    [1] Alt., r=0.707 : 0.7139178 ±0.01%

    Against denominator:
      Null, mu = 0
    ---
    Bayes factor type: BFoneSample, JZS

    A Bayes factor of 0.7139 is yielded. A value below 1 favours H0 over H1, and a value between 1/3 and 1, as here, is typically interpreted as only anecdotal evidence for H0. The exact thresholds are defined by Wagenmakers et al. (2011), and a copy of the table can be found in the linked The 20% Statistician post.
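    Since the reported value is below 1, it can be convenient to express the evidence the other way around. The BayesFactor package lets you invert a Bayes factor object directly (a small aside, not shown in the original post):

    1 / bf   # Bayes factor for the null relative to the alternative
    # roughly 1 / 0.714, i.e. about 1.4: the data are ~1.4 times more likely
    # under H0 than under H1, which is only anecdotal evidence for the null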

    Let’s come back to the issue of the posterior distribution. When the posterior cannot be calculated analytically, it can be estimated using Markov chain Monte Carlo (MCMC) methods.

    The chains are estimated and the distributions are plotted:

    chains = posterior(bf, iterations = 1000)
    summary(chains)
    plot(chains[,1:2])

    Iterations = 1:1000
    Thinning interval = 1
    Number of chains = 1
    Sample size per chain = 1000

    1. Empirical mean and standard deviation for each variable,
       plus standard error of the mean:

              Mean       SD  Naive SE Time-series SE
    mu     -1.2667  0.56619 0.017904       0.017904
    sig2  128.3996  9.42371 0.298004       0.298004
    delta  -0.1121  0.05032 0.001591       0.001591
    g       1.8037 10.59165 0.334937       0.334937

    2. Quantiles for each variable:

               2.5%      25%      50%       75%     97.5%
    mu     -2.38323  -1.6611  -1.2626  -0.87649  -0.17530
    sig2  111.81072 121.7486 128.2914 134.16920 147.16271
    delta  -0.21117  -0.1458  -0.1123  -0.07679  -0.01459
    g       0.06183   0.1763   0.3868   0.94495   8.79583

    The plot() call above displays the trace and density plots of the sampled chains for mu and sig2.

    regressionBF

    Bayesian analysis can also be used to compare probabilities across several regression models.

    As an example, let’s take the following regression model:

    reg1 = lm(BMI ~ Insulin + Age + BloodPressure + Glucose + Pregnancies, data = health)
    summary(reg1)

    Call:
    lm(formula = BMI ~ Insulin + Age + BloodPressure + Glucose + Pregnancies, data = health)

    Residuals:
        Min      1Q  Median      3Q     Max
    -33.502  -4.559  -0.220   4.400  29.738

    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)   20.669900   1.375338  15.029  < 2e-16 ***
    Insulin        0.008192   0.002486   3.295  0.00103 **
    Age           -0.044086   0.028317  -1.557  0.11992
    BloodPressure  0.106763   0.014281   7.476 2.11e-13 ***
    Glucose        0.038988   0.009258   4.211 2.84e-05 ***
    Pregnancies    0.011194   0.094427   0.119  0.90567
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 7.377 on 762 degrees of freedom
    Multiple R-squared: 0.1302, Adjusted R-squared: 0.1245
    F-statistic: 22.81 on 5 and 762 DF, p-value: < 2.2e-16

    Again, we see that certain variables are indicated as statistically significant while others are not. However, the same problem arises in that p-values are not regarded as direct measures of evidence against the null hypothesis under the Bayesian school of thought, and thus more concrete probability measures should be used.

    In this regard, a Bayesian regression is run, with a Bayes factor analysis indicating the highest and lowest probability regressions.

    bf = regressionBF(BMI ~ Insulin + Age + BloodPressure + Glucose + Pregnancies, data = health)
    length(bf)
    [1] 31

    head(bf, 3)
    Bayes factor analysis
    --------------
    [1] Insulin + BloodPressure + Glucose               : 1.352293e+19 ±0%
    [2] Insulin + Age + BloodPressure + Glucose         : 7.763347e+18 ±0%
    [3] Insulin + BloodPressure + Glucose + Pregnancies : 2.389623e+18 ±0.01%

    Against denominator:
      Intercept only
    ---
    Bayes factor type: BFlinearModel, JZS

    tail(bf, 3)
    Bayes factor analysis
    --------------
    [1] Age               : 0.1321221  ±0%
    [2] Pregnancies       : 0.09066687 ±0%
    [3] Age + Pregnancies : 0.01652438 ±0.01%

    Against denominator:
      Intercept only
    ---
    Bayes factor type: BFlinearModel, JZS

    which.max(bf)
    Insulin + BloodPressure + Glucose
                                   19

    From here, we see that the following three combinations of variables have the highest reported Bayes factors (against the intercept-only model) for explaining BMI:

    • Insulin + BloodPressure + Glucose
    • Insulin + Age + BloodPressure + Glucose
    • Insulin + BloodPressure + Glucose + Pregnancies

    Conversely, these three combinations of variables have the lowest reported Bayes factors:

    • Age
    • Pregnancies
    • Age + Pregnancies

    As indicated by length(bf), 31 candidate regressions were generated in total, and which.max(bf) reports that model 19 (Insulin + BloodPressure + Glucose) has the highest Bayes factor.

    In order to generate a comparative plot, the Bayes factors of the top models can be divided by that of the best model, so everything is expressed relative to the strongest combination:

    bf2 = head(bf) / max(bf)
    bf2
    Bayes factor analysis
    --------------
    [1] Insulin + BloodPressure + Glucose                     : 1          ±0%
    [2] Insulin + Age + BloodPressure + Glucose               : 0.5740878  ±0%
    [3] Insulin + BloodPressure + Glucose + Pregnancies       : 0.176709   ±0.01%
    [4] Insulin + Age + BloodPressure + Glucose + Pregnancies : 0.08419853 ±0%
    [5] Age + BloodPressure + Glucose                         : 0.02216035 ±0%
    [6] BloodPressure + Glucose                               : 0.01570813 ±0.01%

    Against denominator:
      BMI ~ Insulin + BloodPressure + Glucose
    ---
    Bayes factor type: BFlinearModel, JZS

    plot(bf2)

    Additionally, “top-down” and “bottom-up” analyses can be generated: in the former, each independent variable is omitted from the full model one at a time, whereas in the latter each independent variable is added to the intercept-only model one at a time.
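    The call that creates bf_top is not shown in the extract above; presumably it is the same regressionBF() call with the whichModels argument set to "top", along these lines:

    # Top-down: compare the full model against models omitting one term at a time
    bf_top = regressionBF(BMI ~ Insulin + Age + BloodPressure + Glucose + Pregnancies,
                          data = health, whichModels = "top")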

    bf_top
    Bayes factor top-down analysis
    --------------
    When effect is omitted from Insulin + Age + BloodPressure + Glucose + Pregnancies , BF is...
    [1] Omit Pregnancies   : 6.818265     ±0%
    [2] Omit Glucose       : 0.001291526  ±0%
    [3] Omit BloodPressure : 2.526265e-11 ±0%
    [4] Omit Age           : 2.098719     ±0.01%
    [5] Omit Insulin       : 0.0350438    ±0%

    Against denominator:
      BMI ~ Insulin + Age + BloodPressure + Glucose + Pregnancies
    ---
    Bayes factor type: BFlinearModel, JZS

    plot(bf_top)

    # Bottom-up
    bf_btm = regressionBF(BMI ~ Insulin + Age + BloodPressure + Glucose + Pregnancies,
                          data = health, whichModels = "bottom")
    bf_btm
      |=================================================================| 100%
    Bayes factor analysis
    --------------
    [1] Insulin       : 275178       ±0%
    [2] Age           : 0.1321221    ±0%
    [3] BloodPressure : 2.926062e+12 ±0%
    [4] Glucose       : 12805251     ±0%
    [5] Pregnancies   : 0.09066687   ±0%

    Against denominator:
      Intercept only
    ---
    Bayes factor type: BFlinearModel, JZS

    plot(bf_btm)

    According to these two analyses, BloodPressure is the most influential variable: omitting it reduces the Bayes factor most drastically in the top-down analysis, and on its own it yields the largest Bayes factor in favour of H1 over H0 in the bottom-up analysis.

    Conclusion

    Ultimately, the area of Bayesian statistics is very large and the examples above cover just the tip of the iceberg.

    However, in this particular example we have looked at:

    • The comparison between a t-test and the Bayes Factor t-test
    • How to estimate posterior distributions using Markov chain Monte Carlo methods (MCMC)
    • Use of regressionBF to compare probabilities across regression models

    Many thanks for your time.



    reciproceed : recipe, proceed, reciprocally

    Mon, 03/11/2019 - 02:05

    (This article was first published on Guillaume Pressiat, and kindly contributed to R-bloggers)

    use R, YAML and bookdown to automate generation of procedure books

    Context: how do you do the thing that only your colleague does, when he’s not here?

    I recently put a project on GitHub that contains a little framework for listing procedures in a structured way that can be shared with all your colleagues or buddies.

    You may know situations where someone in your team has built a project that only he masters. How do you make it work when he is not here? Maybe he has written a doc/procedure that explains his project, but you don’t know (or don’t remember) where it is on the network (or on the web)?

    As a solution, an index of all procedures can be useful, one that lists:

    • summary of all procedures (as simple steps)
    • links to git repositories
    • links to more detailed procedures (docx, pdf, html)
    • links to data sources used by projects
    • etc.

    reciproceed sounds like “reciprocally”, meaning that once all procedures are listed this way, the whole team will be involved reciprocally.

    Idea

    To do this, the idea is to use a YAML file and bookdown to build a shareable procedure listing in the form of a book.
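    As a rough sketch of how this could hang together (this is not the actual reciproceed code, and the file name procedures.yaml and the name/steps fields are only illustrative), one might read the structured YAML file and turn each procedure into a bookdown chapter:

    # Hypothetical sketch: turn a YAML procedure list into Rmd chapters for bookdown
    library(yaml)

    procedures <- read_yaml("procedures.yaml")   # assumed: a list of name/steps entries

    for (i in seq_along(procedures)) {
      proc <- procedures[[i]]
      rmd  <- c(sprintf("# %s", proc$name),
                "",
                sprintf("%d. %s", seq_along(proc$steps), proc$steps))
      writeLines(rmd, sprintf("%02d-%s.Rmd", i, gsub("\\W+", "-", tolower(proc$name))))
    }

    bookdown::render_book("index.Rmd")           # then build the book as usual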

    Example

    As an example, I collected some soup recipes and fake procedures, see the source YAML file.

    Soup recipes are used as fake data (meaning that they are structured and have chronological steps, like procedures), but they are not fake soups. By the way, they come from bbcfoods.

    And the result book is here:

    Soup recipe example

    Maybe

    Maybe someone else will be interested in this and will even participate in improving it.

    It’s here, https://github.com/GuillaumePressiat/reciproceed



    Data Manipulation Corner Cases

    Mon, 03/11/2019 - 01:31

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    Let’s try some "ugly corner cases" for data manipulation in R. Corner cases are examples where the user might be running up against the edge of what the package developer intended their package to support, and thus often where things can go wrong.

    Let’s see what happens when we try to stick a fork in the power-outlet.

    For our example let’s set up some data.frames.

    # create some example tables
    d1 <- data.frame(v1 = c("a", "b"),
                     stringsAsFactors = FALSE)

    d2 <- data.frame(v2 = c("x", "y"),
                     stringsAsFactors = FALSE)

    d3 <- data.frame(x = 1, y = 2, z = 3)

    And we will also copy this data into a database.

    # copy the example tables to a database
    db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

    DBI::dbWriteTable(db, "d1d", d1)
    DBI::dbWriteTable(db, "d2d", d2)
    DBI::dbWriteTable(db, "d3d", d3)

    Now let’s try a couple of basic tasks using dplyr.

    # try to use dplyr to work with the original tables
    # in memory
    suppressPackageStartupMessages(library("dplyr"))

    ## Warning: package 'dplyr' was built under R version 3.5.2

    packageVersion("dplyr")

    ## [1] '0.8.0.1'

    First we try a cross-join.

    # try to do a cross-join
    left_join(d1, d2, by = character(0))

    ## Error: `by` must specify variables to join by

    # join is rejected

    Obviously that is something deliberately prohibited by the package authors.

    Let’s try dividing all columns by a value.

    # try to divide all columns by y
    mutate_all(d3, ~(./y))

    ##     x y z
    ## 1 0.5 1 3

    # z is divided by 1, not by the original
    # y-value

    Something went wrong, notice z was not altered. We will return to this later.

    Now let’s try the same operations using dplyr on a database.

    # try the same calculations in database
    # using dplyr
    packageVersion("dbplyr")

    ## [1] '1.3.0'

    # get references to the database tables
    d1d <- dplyr::tbl(db, "d1d")
    d2d <- dplyr::tbl(db, "d2d")
    d3d <- dplyr::tbl(db, "d3d")

    # try to do a cross-join
    left_join(d1d, d2d, by = character(0))

    ## # Source:   lazy query [?? x 2]
    ## # Database: sqlite 3.22.0 [:memory:]
    ##   v1    v2
    ##
    ## 1 a     x
    ## 2 a     y
    ## 3 b     x
    ## 4 b     y

    # this time it works

    In this case the cross join works (as databases expect to support this operation).

    Okay, let’s try the divide by columns example again.

    # try to divide all columns by y
    mutate_all(d3d, ~(./y))

    ## Error in (structure(function (..., .x = ..1, .y = ..2, . = ..1) : object 'x' not found

    # this time it errors-out

    Boom (ouch).

    Now let’s try that again with the rquery package.

    # try the join task using rquery
    library("rquery")

    # get references to the database tables
    d1r <- rquery::db_td(db, "d1d")
    d2r <- rquery::db_td(db, "d2d")
    d3r <- rquery::db_td(db, "d3d")

    # try the cross-join
    ops <- natural_join(d1r, d2r,
                        by = character(0),
                        jointype = "LEFT")

    execute(db, ops)

    ##   v1 v2
    ## 1  a  x
    ## 2  a  y
    ## 3  b  x
    ## 4  b  y

    The cross-join worked in the database.

    Now we start on the column division. We have to specify the set of operations explicitly, but that really isn’t too big of a burden (as rquery supplies tools for this).

    # try to get column alterations
    vars <- colnames(d3)
    expressions <- paste(vars, "%:=%", vars, "/", "y")
    print(expressions)

    ## [1] "x %:=% x / y" "y %:=% y / y" "z %:=% z / y"

    ops2 <- extend_se(d3r, expressions)
    cat(format(ops2))

    ## table(`d3d`;
    ##   x,
    ##   y,
    ##   z) %.>%
    ##  extend(.,
    ##   x %:=% x / y) %.>%
    ##  extend(.,
    ##   y %:=% y / y) %.>%
    ##  extend(.,
    ##   z %:=% z / y)

    # Oh! This isn't what I want, see
    # how y gets updated before it is used on column z.

    We took the extra step of examining the operations before we tried them. Notice the rquery::extend() operation got factored into explicit steps. This introduces inspectable guarantees as to what value each column has at each step. This is enough to show us that the y column will be all ones before it is used to try and adjust the z column. This is not what we wanted, though this query is guaranteed to have similar semantics both on-database and in-memory.

    Now let’s build, examine and execute a better version of the query.

    # try to get column alterations again
    vars <- setdiff(colnames(d3), "y")
    expressions <- paste(vars, "%:=%", vars, "/", "y")
    print(expressions)

    ## [1] "x %:=% x / y" "z %:=% z / y"

    ops2b <- d3r %.>%
      extend_se(., expressions) %.>%
      extend(., y = 1)

    cat(format(ops2b))

    ## table(`d3d`;
    ##   x,
    ##   y,
    ##   z) %.>%
    ##  extend(.,
    ##   x %:=% x / y,
    ##   z %:=% z / y) %.>%
    ##  extend(.,
    ##   y := 1)

    execute(db, ops2b)

    ##     x   z y
    ## 1 0.5 1.5 1

    With the rqdatatable package we can use data.table to process in-R data.

    # try the join task using rqdatatable
    library("rqdatatable")

    # immediate notation
    natural_join(d1, d2, by = character(0))

    ##    v1 v2
    ## 1:  a  x
    ## 2:  a  y
    ## 3:  b  x
    ## 4:  b  y

    # join task in operators notation
    ex_data_table(ops, tables = list(d1d = d1, d2d = d2))

    ##    v1 v2
    ## 1:  a  x
    ## 2:  a  y
    ## 3:  b  x
    ## 4:  b  y

    # column alterations in operators notation
    ex_data_table(ops2b, tables = list(d3d = d3))

    ##      x y   z
    ## 1: 0.5 1 1.5

    And there we are. Frankly we didn’t really have to stretch things very far to break things (including building up non-signalling data mistakes; inspect your intermediate results!!).

    We have been trying to keep the number of unexpected behaviors in rquery down by keeping the rquery implementation very simple (and very thin) and then relying on either the database or data.table for the actual implementation and semantics. There are, of course, going to be cases where rquery needs a fix, but we have been able to find and apply such fixes quite quickly. We have also found fixing rquery is much faster than coding around bugs.

    # clean up
    DBI::dbDisconnect(db)


    A Summary of My Home-Brew Binning Algorithms for Scorecard Development

    Sun, 03/10/2019 - 21:27

    (This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

    Thus far, I have published four different monotonic binning algorithms for scorecard development and think that it might be the right time to do a quick summary. R functions for these binning algorithms are also available on https://github.com/statcompute/MonotonicBinning.

    The first one was posted back in 2017 (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package) based on my SAS macro (https://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) that has been widely used by sasORs. This R function mono_bin() is designed to generate monotonic bins with roughly equal densities, i.e. a similar number of records in each bin. There are two potential issues with this binning algorithm. Albeit robust, the binning outcome is too coarse and therefore might not be granular enough to capture the nature of the data. In addition, although the algorithm is fully automatic and able to converge globally, it requires iterations that might be computationally expensive for big datasets.

    In light of the aforementioned shortcomings, I developed the second one based on isotonic regression (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression and https://statcompute.wordpress.com/2018/11/23/more-robust-monotonic-binning-based-on-isotonic-regression), which successfully addresses both the coarse binning and the need for iterations.
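    For readers curious about the mechanics, here is a minimal sketch (not the author's implementation on GitHub) of how isotonic regression can induce monotonic bins for a binary bad indicator y against a predictor x, using base R's isoreg():

    # Minimal illustration: monotonic bins from an isotonic fit of the bad rate on x
    iso_bin_sketch <- function(x, y) {
      fit  <- isoreg(x, y)              # monotone non-decreasing fit of y on x
      xs   <- sort(x)                   # isoreg() works on the ordered x values
      runs <- rle(fit$yf)               # consecutive equal fitted values form one bin
      data.frame(upper_bound = xs[cumsum(runs$lengths)],
                 bad_rate    = runs$values)
    }
    # For a decreasing relationship, fit on -x (or flip the sign of x) first.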

    The third one was developed last year just out of my own curiosity (https://statcompute.wordpress.com/2018/10/14/monotonic-binning-with-equal-sized-bads-for-scorecard-development) for the purpose of generating monotonic bins with a roughly equal number of bads (i.e. Y = 1) in each bin. From the performance standpoint, this one is not materially different from the first one. It is more like a brainteaser for myself.

    The last one (https://statcompute.wordpress.com/2018/11/25/improving-binning-by-bootstrap-bumping) was mainly motivated by the idea of Bootstrap Bumping from Tibshirani and Knight (1997) and implements the bumping on top of the second one above based on the isotonic regression. The outcome of this one is satisfactory in two respects. First of all, since the bumping is based on bootstrap samples drawn from the original dataset, the concern about over-fitting due to sample bias can be somewhat addressed. Secondly, through the bumping search across all bootstrap samples, chances are that a closer-to-optimal solution can be achieved.

    R functions for all four binning algorithms on GitHub are built on top of a utility function manual_bin() (https://github.com/statcompute/MonotonicBinning/blob/master/code/manual_bin.R). In the R code, I tried my best to make it as generic as possible without importing additional packages. The only exception is that the parallel package is used in the bump_bin() function to speed up the computation. My next task might be writing a scoring function to make these binning algorithms useful in production.



    Wrangling Content Security Policies in R

    Sun, 03/10/2019 - 18:10

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

    (header image via MDN Web Docs)

    The past two posts have used R to look at the safety/security (just assume those terms are in scare quotes from now on in every post) of web-y thing-ys. We’ll continue that theme with this post where we focus on a [sadly unused] HTTP server header: Content Security Policy (referred to as CSP in the remainder of the post).

    To oversimplify things, the CSP header instructs a browser on what you’ve authorized to be part of your site. You supply directives for different types of content which tell browsers what sites can load content into the current page in an effort to prevent things like cross-site scripting attacks, malicious iframes, clickjacking, cryptojacking, and more. If you think this doesn’t happen, think again.

    The web is awash in resources (some aforelinked) on how to configure CSPs and doesn’t need another guide (and this post is about working with CSPs as a data source in R). However, if you aren’t currently using CSPs on any sites you own and want a fairly easy way to develop a policy, grab a copy of the latest Firefox, install Mozilla’s own CSP Toolkit extension and just browse your site with that plugin active. It will generate as minimal a policy as possible as it looks at all the resource requests your site makes. Another method is to use the Content-Security-Policy-Report-Only header vs the Content-Security-Policy header, since the former will just report potential resource loading issues instead of blocking the resource loading.
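    If you just want to peek at what a site currently sends before writing your own policy, the raw headers are easy to pull with httr (a quick sketch, separate from the cspy package introduced below; example.com is a placeholder):

    library(httr)

    resp <- HEAD("https://example.com/")
    headers(resp)[["content-security-policy"]]              # enforcing header, if present
    headers(resp)[["content-security-policy-report-only"]]  # report-only variant, if present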

    How About Some R Code Now, Mmmkay?

    Even if you do not have CSPs of your own they’re easy to find since they come along for the ride (when they exist) in HTTP server response headers. To make it easier to see if a site has a CSP and also load a CSP I’ve put together a small package (installation instructions & source locations are on that link) which can do this for us. (If you visit that page you’ll see a failed image load for the CRAN Status Badge. Open up Developer Tools or the javascript console in your browser and you’ll see the CSP rules hard at work.)

    The cspy package is based on Shape Security’s spiffy Java library which implements a bona fide parser for CSP rules (as opposed to hacky-string-slicing which we’re also going to do in this post) and has a set of safety checker rules written in R but based on ones defined by Google. I violated the “supporting JARs in one pkg; actual functionality in another pkg” “rule” since Shape Security only does ~1.5 releases a year and it’s a super small JAR with no dependencies.

    Let’s find an R-focused site with a CSP we can work with (assume the tidyverse and stringi are loaded into the R session). We’ll start with a super spiffy R org: RStudio!:

    cspy::has_csp("https://www.rstudio.com/") ## [1] FALSE

    Oh, um…well, what about another super spiffy R org: Mango Solutions!:

    cspy::has_csp("http://mango-solutions.com/") ## [1] FALSE

    Heh. Well…ah&hellip. Hrm. What about Data Camp! They have tons of great R things:

    cspy::has_csp("https://www.datacamp.com/") ## [1] TRUE

    W00t!

    To be fair, the main brand sites for both RStudio and Mango use WordPress and WordPress makes it super hard to write a decent CSP. However, (IMO) that’s kinda no excuse for not having one (since I run WordPress and have one). But, Data Camp does! So, let’s fetch it and take a look:

    (dc_csp <- cspy::fetch_csp("https://www.datacamp.com/")) ## default-src 'self' *.datacamp.com *.datacamp-staging.com ## base-uri 'self' ## connect-src 'self' *.datacamp.com *.datacamp-staging.com https://com-datacamp.mini.snplow.net https://www2.profitwell.com https://www.facebook.com https://www.optimalworkshop.com https://in.hotjar.com https://stats.g.doubleclick.net https://*.google-analytics.com https://frstre.com https://onesignal.com https://*.algolia.net https://*.algolianet.com https://wootric-eligibility.herokuapp.com https://d8myem934l1zi.cloudfront.net https://production.wootric.com ## font-src 'self' *.datacamp.com *.datacamp-staging.com https://fonts.gstatic.com data: ## form-action 'self' *.datacamp.com *.datacamp-staging.com https://*.paypal.com https://www.facebook.com ## frame-ancestors 'self' *.datacamp.com *.datacamp-staging.com ## frame-src 'self' *.datacamp.com *.datacamp-staging.com *.zuora.com https://www.google.com https://vars.hotjar.com https://beacon.tapfiliate.com https://b.frstre.com https://onesignal.com https://tpc.googlesyndication.com https://*.facebook.com https://*.youtube.com ## img-src 'self' *.datacamp.com *.datacamp-staging.com https://s3.amazonaws.com https://checkout.paypal.com https://cdnjs.cloudflare.com https://www.gstatic.com https://maps.googleapis.com https://maps.gstatic.com https://www.facebook.com https://q.quora.com https://bat.bing.com https://*.ads.linkedin.com https://www.google.com https://*.g.doubleclick.net https://*.google-analytics.com https://www.googletagmanager.com blob data: ## script-src 'self' 'unsafe-inline' 'unsafe-eval' *.datacamp.com *.datacamp-staging.com https://*.googletagmanager.com https://*.google-analytics.com https://browser.sentry-cdn.com https://js.braintreegateway.com https://*.zuora.com https://*.paypal.com https://js-agent.newrelic.com https://bam.nr-data.net https://maps.googleapis.com https://assets.calendly.com http://com-datacamp.mini.snplow.net https://cdn.jsdelivr.net https://*.algolia.net https://www.gstatic.com https://www.google.com https://cdnjs.cloudflare.com https://static.tapfiliate.com https://cdn.onesignal.com https://onesignal.com https://static.hotjar.com https://script.hotjar.com https://*.linkedin.com https://sjs.bizographics.com/insight.min.js https://bat.bing.com/bat.js https://a.quora.com/qevents.js https://connect.facebook.net https://www.googleadservices.com https://www.googletagmanager.com https://*.googlesyndication.com https://dna8twue3dlxq.cloudfront.net/js/profitwell.js https://disutgh7q0ncc.cloudfront.net/beacon.js data: ## style-src 'self' 'unsafe-inline' *.datacamp.com *.datacamp-staging.com https://cdnjs.cloudflare.com https://fonts.googleapis.com https://assets.calendly.com ## worker-src 'self' ## report-uri https://infra-reporter.datacamp.com/csp-report

    Wow. Literally: wow. We’ll let their recent breach and lack of 2FA for logins slide for a bit b/c this is a robust CSP, so it’s obvious they do care about you (regardless of whatever CSP helper tools are out there, these CSPs are kind of a pain to set up, so the fact that this one is so thorough is an indicator they “get it”).

    R folks are likely looking at that untidy mess and cringing, though. We’ll tidy it right up, soon, but let’s make sure it’s a valid CSP:

    glimpse(validate_csp(dc_csp))

    ## Observations: 1
    ## Variables: 8
    ## $ message      "A draft of the next version of CSP deprecates report-uri in favour …
    ## $ type         "INFO"
    ## $ start_line   1
    ## $ start_column 2680
    ## $ start_offset 2679
    ## $ end_line     1
    ## $ end_column   2690
    ## $ end_offset   2689

    The validator only noted that a soon-to-be deprecated directive is being used (but that’s just informational and perfectly fine until CSP v3 is out of Draft and browsers support it) and the validation output has positional values for where the violations occur (which shows the benefits of using a true parser vs string splitter).

    The custom print method for these csp objects puts each directive on a separate line and has left the resource values intact. As noted, it’s ugly and untidy so let’s take care of that:

    (csp_df <- as.data.frame(dc_csp)) ## # A tibble: 111 x 3 ## directive value origin ## ## 1 default-src 'self' https://www.datacamp.com/ ## 2 default-src *.datacamp.com https://www.datacamp.com/ ## 3 default-src *.datacamp-staging.com https://www.datacamp.com/ ## 4 base-uri 'self' https://www.datacamp.com/ ## 5 connect-src 'self' https://www.datacamp.com/ ## 6 connect-src *.datacamp.com https://www.datacamp.com/ ## 7 connect-src *.datacamp-staging.com https://www.datacamp.com/ ## 8 connect-src https://com-datacamp.mini.snplow.net https://www.datacamp.com/ ## 9 connect-src https://www2.profitwell.com https://www.datacamp.com/ ## 10 connect-src https://www.facebook.com https://www.datacamp.com/ ## # … with 101 more rows

    Now we have one row for each directive/value pair. The origin (where the policy came from) also comes along for the ride in the event you want to bind a bunch of these together.
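    For example, tidied policies from several origins could be stacked into one frame (a quick sketch; the second site here is only illustrative and assumed to serve a CSP header):

    sites <- c("https://www.datacamp.com/", "https://github.com/")

    purrr::map(sites, cspy::fetch_csp) %>%    # fetch each policy
      purrr::map_df(as.data.frame) -> many_csp  # tidy and row-bind them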

    Let’s see how many third-party sites they had to catch and add to this policy:

    filter(csp_df, stri_detect_fixed(value, ".")) %>%
      pull(value) %>%
      stri_replace_all_fixed("*.", "") %>%
      urltools::url_parse() %>%
      distinct(domain) %>%
      filter(!stri_detect_regex(domain, "(datacamp\\.com|datacamp-staging\\.com)$")) %>%
      arrange(domain) %>%
      pull(domain)

    ##  [1] "a.quora.com"                       "ads.linkedin.com"             "algolia.net"
    ##  [4] "algolianet.com"                    "assets.calendly.com"          "b.frstre.com"
    ##  [7] "bam.nr-data.net"                   "bat.bing.com"                 "beacon.tapfiliate.com"
    ## [10] "browser.sentry-cdn.com"            "cdn.jsdelivr.net"             "cdn.onesignal.com"
    ## [13] "cdnjs.cloudflare.com"              "checkout.paypal.com"          "com-datacamp.mini.snplow.net"
    ## [16] "connect.facebook.net"              "d8myem934l1zi.cloudfront.net" "disutgh7q0ncc.cloudfront.net"
    ## [19] "dna8twue3dlxq.cloudfront.net"      "facebook.com"                 "fonts.googleapis.com"
    ## [22] "fonts.gstatic.com"                 "frstre.com"                   "g.doubleclick.net"
    ## [25] "google-analytics.com"              "googlesyndication.com"        "googletagmanager.com"
    ## [28] "in.hotjar.com"                     "js-agent.newrelic.com"        "js.braintreegateway.com"
    ## [31] "linkedin.com"                      "maps.googleapis.com"          "maps.gstatic.com"
    ## [34] "onesignal.com"                     "paypal.com"                   "production.wootric.com"
    ## [37] "q.quora.com"                       "s3.amazonaws.com"             "script.hotjar.com"
    ## [40] "sjs.bizographics.com"              "static.hotjar.com"            "static.tapfiliate.com"
    ## [43] "stats.g.doubleclick.net"           "tpc.googlesyndication.com"    "vars.hotjar.com"
    ## [46] "wootric-eligibility.herokuapp.com" "www.facebook.com"             "www.google.com"
    ## [49] "www.googleadservices.com"          "www.googletagmanager.com"     "www.gstatic.com"
    ## [52] "www.optimalworkshop.com"           "www2.profitwell.com"          "youtube.com"
    ## [55] "zuora.com"

    That list is one big reason these CSPs can be gnarly to configure and also why some argue there’s potentially little benefit in them. Why little benefit? Let’s see what sites they’ve allowed browsers to load and execute javascript content from:

    filter(csp_df, directive == "script-src") %>%
      select(value) %>%
      filter(!stri_detect_regex(value, "(datacamp\\.com$|datacamp-staging\\.com$|'|data:)")) %>%
      pull(value)

    ##  [1] "https://*.googletagmanager.com"               "https://*.google-analytics.com"
    ##  [3] "https://browser.sentry-cdn.com"               "https://js.braintreegateway.com"
    ##  [5] "https://*.zuora.com"                          "https://*.paypal.com"
    ##  [7] "https://js-agent.newrelic.com"                "https://bam.nr-data.net"
    ##  [9] "https://maps.googleapis.com"                  "https://assets.calendly.com"
    ## [11] "http://com-datacamp.mini.snplow.net"          "https://cdn.jsdelivr.net"
    ## [13] "https://*.algolia.net"                        "https://www.gstatic.com"
    ## [15] "https://www.google.com"                       "https://cdnjs.cloudflare.com"
    ## [17] "https://static.tapfiliate.com"                "https://cdn.onesignal.com"
    ## [19] "https://onesignal.com"                        "https://static.hotjar.com"
    ## [21] "https://script.hotjar.com"                    "https://*.linkedin.com"
    ## [23] "https://sjs.bizographics.com/insight.min.js"  "https://bat.bing.com/bat.js"
    ## [25] "https://a.quora.com/qevents.js"               "https://connect.facebook.net"
    ## [27] "https://www.googleadservices.com"             "https://www.googletagmanager.com"
    ## [29] "https://*.googlesyndication.com"              "https://dna8twue3dlxq.cloudfront.net/js/profitwell.js"
    ## [31] "https://disutgh7q0ncc.cloudfront.net/beacon.js"

    Some of those are specified all the way down to the .js file but others work across entire CDN networks. Granted, there’s nothing super bad like Amazon’s S3 apex domains (wildcarded or not) but there’s a lot of trust going on in that list. The good thing about CSPs is that if Data Camp learns about an issue with any of those sites it can quickly zap them from the header since they’ve already done the hard enumeration work. Plus, if one of those sites is compromised and loads a resource that tries to execute code from a domain not in that list, a policy error will be generated and Data Camp’s excellent SOC (they responded super well to the breach) can take active measures to inform their downstream provider that there’s something wrong.

    We can also do some safety checks:

    check_script_unsafe_inline(csp_df)
    ## Category: script-unsafe-inline
    ## Severity: HIGH
    ## Message: 'unsafe-inline' allows the execution of unsafe in-page scripts and event handlers.
    ##
    ##     directive           value                    origin
    ## 67 script-src 'unsafe-inline' https://www.datacamp.com/

    check_script_unsafe_eval(csp_df)
    ## Category: script-unsafe-eval
    ## Severity: HIGH
    ## Message: 'unsafe-eval' allows the execution of code injected into DOM APIs such as eval().
    ##
    ##     directive         value                    origin
    ## 68 script-src 'unsafe-eval' https://www.datacamp.com/

    check_plain_url_schemes(csp_df)
    ## Category: unsafe-execution
    ## Severity: HIGH
    ## Message: URI(s) found that allow the exeution of unsafe scripts.
    ##
    ##      directive value                    origin
    ## 102 script-src data: https://www.datacamp.com/

    check_wildcards(csp_df)

    check_missing_directives(csp_df)
    ## Category: missing-directive
    ## Severity: HIGH
    ## Message: Missing object-src allows the injection of plugins which can execute JavaScript. Can you set it to 'none'?
    ##
    ##    directive value
    ## 1 object-src

    check_ip_source(csp_df)

    check_deprecated(csp_df)
    ## Category: deprecated-directive
    ## Severity: INFO
    ## Message: Found deprecated directive(s). See https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy for more information.
    ##
    ##      directive                                          value
    ## 111 report-uri https://infra-reporter.datacamp.com/csp-report
    ##                        origin
    ## 111 https://www.datacamp.com/

    check_nonce_length(csp_df)

    check_src_http(csp_df)
    ## Category: http-src
    ## Severity: MEDIUM
    ## Message: Use HTTPS vs HTTP.
    ##
    ##     directive                               value
    ## 81 script-src http://com-datacamp.mini.snplow.net
    ##                       origin
    ## 81 https://www.datacamp.com/

    Along with pretty printing they return data frames of potentially problematic rules.

    For some pages, Data Camp does what you likely do when you knit an R markdown document: save it as a self-contained page. Because of that it is absolutely necessary to use 'unsafe-inline', 'unsafe-eval', and data: directive values for a few directives. It just can’t be helped. We already knew about the report-uri deprecation. While object-src isn’t in the CSP-proper, there is a default-src and that will take effect. But, the checker knows it’s not set to 'none' which means plugins can still load from Data Camp’s own domains (which may be perfectly fine depending on the type of embeds they use).

    The one finding I’ll have to side with the checker on is the one at the end (the output of check_src_http()). They seem to allow one of the many analytics providers for the site (I think it’s this org) to use http:// vs https://. If they really do load javascript content from that site via plain http:// then it would be possible for an attacker to hijack the content and insert malicious javascript. Yet another good reason to use DNS-based blocking (e.g. pi-hole) or uBlock Origin to prevent any trackers or analytics scripts from loading at all.

    One further thing to note is that Data Camp uses Amazon AWS for their site hosting and you kinda have to jump through hoops to get this header added there so they have seriously done a decent job thinking about your safety.

    At-Scale Wrangling

    The cspy package is fine for casual/light inspection, but it uses a Java shim, which means marshaling data between R and the JVM, and that gets expensive at scale. Depending on what you’re trying to do, some basic string wrangling may suffice.

    At $DAYJOB we perform many HTTP scans. One scan we have internally (I don’t think it’s part of Open Data but I’ll check Monday) is a regular grab of the Alexa Top 1m named hosts. In the most recent scan we found ~65K sites with a CSP header. You can grab a CSV containing the virtual host name and the found CSP policy string here.

    Once you download that we can just use string ops to get to a tidy data frame:

    read_csv("top1m-csp.csv.gz", col_types = "cc") %>% # use the location where you stored it mutate(csp = stri_trans_tolower(csp)) -> top1m_csp stri_split_regex(top1m_csp$csp, ";[[:space:]]*", simplify = FALSE) %>% # split directives for each policy map2_df(top1m_csp$vhost, ~{ # we want to keep the origin stri_match_first_regex(.x, "([[:alpha:]\\-]+)[[:space:]]+(.*)$|([[:alpha:]\\-]+)") %>% # get the directive as.data.frame(stringsAsFactors = FALSE) %>% mutate(origin = sprintf("https://%s/", .y)) %>% mutate(raw = .x) # save the raw CSP }) %>% mutate(V2 = ifelse(is.na(V2), V4, V2)) %>% # figure out directive mutate(V3 = ifelse(is.na(V3), "", V3)) %>% # and value (keeping in mind no-value directives) select(directive = V2, value = V3, origin, raw) %>% as_tibble() %>% separate_rows(value, sep="[[:space:]]+") -> top1m_csp # tidy it up with each row being directive|value|origin like the above

    That took just about 2 minutes for me and that’s two minutes too long to spread around to intrepid readers who may be trying this at home so an RDS of the result is also available.

    So, only ~65K out of the top 1m sites have a CSP header (~6.5%) but that doesn’t mean they’re good. Let’s take a look at the invalid directives they’ve used:

    top1m_csp %>% filter(!is.na(directive)) %>% count(directive, sort = TRUE) %>% mutate(is_valid = directive %in% valid_csp_directives) %>% filter(!is_valid) %>% select(-is_valid) %>% print(n=65) ## # A tibble: 65 x 2 ## directive n ## ## 1 referrer 69 ## 2 reflected-xss 58 ## 3 t 50 ## 4 self 34 ## 5 tscript-src 22 ## 6 tframe-src 17 ## 7 timg-src 16 ## 8 content-security-policy 14 ## 9 default 10 ## 10 policy-definition 9 ## 11 allow 8 ## 12 defalt-src 8 ## 13 tstyle-src 8 ## 14 https 7 ## 15 nosniff 7 ## 16 null 7 ## 17 tfont-src 6 ## 18 unsafe-inline 6 ## 19 http 5 ## 20 allow-same-origin 4 ## 21 content 4 ## 22 defaukt-src 4 ## 23 default-scr 4 ## 24 defaut-src 4 ## 25 data 3 ## 26 frame-ancestor 3 ## 27 policy 3 ## 28 self-ancestors 3 ## 29 defualt-src 2 ## 30 jivosite 2 ## 31 mg-src 2 ## 32 none 2 ## 33 object 2 ## 34 options 2 ## 35 policy-uri 2 ## 36 referrer-policy 2 ## 37 tconnect-src 2 ## 38 tframe-ancestors 2 ## 39 xhr-src 2 ## 40 - 1 ## 41 -report-only 1 ## 42 ahliunited 1 ## 43 anyimage 1 ## 44 blok-all-mixed-content 1 ## 45 content-security-policy-report-only 1 ## 46 default-sec 1 ## 47 disown-opener 1 ## 48 dstyle-src 1 ## 49 frame-ancenstors 1 ## 50 frame-ancestorss 1 ## 51 frame-ancestros 1 ## 52 mode 1 ## 53 navigate-to 1 ## 54 origin 1 ## 55 plugin-type 1 ## 56 same-origin 1 ## 57 strict-dynamic 1 ## 58 tform-action 1 ## 59 tni 1 ## 60 upgrade-insecure-request 1 ## 61 v 1 ## 62 value 1 ## 63 worker-sc 1 ## 64 x-frame-options 1 ## 65 yandex 1

    Spelling errors are more common than one might have hoped (another reason CSPs are hard since humans tend to have to make them).

    We can also see the directives that are most used:

    top1m_csp %>% filter(!is.na(directive)) %>% count(directive, sort = TRUE) %>% mutate(is_valid = directive %in% valid_csp_directives) %>% filter(is_valid) %>% select(-is_valid) %>% print(n=25) ## # A tibble: 25 x 2 ## directive n ## ## 1 script-src 57636 ## 2 frame-ancestors 47685 ## 3 upgrade-insecure-requests 47515 ## 4 block-all-mixed-content 38038 ## 5 report-uri 37483 ## 6 default-src 36369 ## 7 img-src 35093 ## 8 style-src 23005 ## 9 connect-src 22415 ## 10 frame-src 15827 ## 11 font-src 12878 ## 12 media-src 6718 ## 13 child-src 5043 ## 14 object-src 4235 ## 15 worker-src 3076 ## 16 form-action 1522 ## 17 base-uri 861 ## 18 manifest-src 374 ## 19 sandbox 79 ## 20 script-src-elem 54 ## 21 plugin-types 47 ## 22 report-to 32 ## 23 prefetch-src 22 ## 24 require-sri-for 14 ## 25 style-src-elem 6

    Curious readers should load up urltools and see what domains those script-src directives are allowing. Remember, every value entry that points to an third-party site means it’s OK for that site to mess with your browser or spy on you.
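    A hedged sketch of that exercise, reusing the tidied top1m_csp frame from above (the filtering heuristics here are mine and deliberately rough):

    top1m_csp %>%
      filter(directive == "script-src",
             stri_detect_fixed(value, "."),          # keep host-like values only
             !stri_detect_regex(value, "^'")) %>%    # drop keywords such as 'self'
      mutate(value = stri_replace_all_fixed(value, "*.", "")) %>%
      pull(value) %>%
      urltools::url_parse() %>%
      count(domain, sort = TRUE)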

    And, take a look at where folks are doing their CSP error reporting:

    filter(top1m_csp, directive %in% c("report-uri", "report-to")) %>%
      mutate(report_host = case_when(
        !grepl("^http[s]*://", value) ~ origin,
        TRUE ~ value
      )) %>%
      select(report_host) %>%
      mutate(report_host = sub('"$', "", report_host)) %>%
      pull(report_host) %>%
      urltools::url_parse() %>%
      as_tibble() %>%
      count(domain, sort=TRUE) %>%
      print(n=20)

    ## # A tibble: 18,565 x 2
    ##    domain                                              n
    ##
    ##  1 csp.medium.com                                    113
    ##  2 sentry.io                                          43
    ##  3 de.forumhome.com                                   40
    ##  4 report-uri.highwebmedia.com                        33
    ##  5 csp.yandex.net                                     23
    ##  6 99designs.report-uri.io                            20
    ##  7 endpoint1.collection.us2.sumologic.com             20
    ##  8 capture.condenastdigital.com                       17
    ##  9 secure.booked.net                                  14
    ## 10 bimbobakeriesusa.com                               10
    ## 11 oroweat.com                                        10
    ## 12 ponyfoo.com                                         8
    ## 13 monzo.report-uri.com                                6
    ## 14 19jrymqg65.execute-api.us-east-1.amazonaws.com      5
    ## 15 app.getsentry.com                                   5
    ## 16 csp-report.airfrance.fr                             5
    ## 17 csp.nic.cz                                          5
    ## 18 myklad.org                                          5
    ## 19 reports-api.sqreen.io                               5
    ## 20 tracker.report-uri.com                              5
    ## # … with 1.854e+04 more rows

    The list (18K!) is far more diverse than I had anticipated.

    You can do tons more (especially since the wrangling has been done for you :-).

    FIN

    A future post will show how to process CSP error reports in R and there will likely be more security/safety-themed posts as well.

    The cspy CINC repo has Windows and macOS binary versions and you can get the source in all the usual places, so kick the tyres, file issues/PRs and start exploring the content security policies of the websites you frequent (and blog about your results!).

    One other fun exercise is to grab the R Weekly or R-bloggers blog rolls and see how many of those blogs have CSPs.



    Exploring swings in Australian federal elections by @ellis2013nz

    Sun, 03/10/2019 - 14:00

    (This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

    Swings are far from uniform

    Last week I introduced a Bayesian state space model of two-party-preferred voting intention for Australian federal elections. It treats the surveys from various polling firms as imperfect (potentially systematically imperfect) measurements of an unobserved latent variable of “true” voting intention, which manifests itself only every few years in the form of election results. The model estimates that latent value for every day that measurements exist, and going forward with increasing uncertainty past when there are measurements.

    The most likely day for the next federal election seems to be 18 May 2019. In theory the House of Representatives election could be as late as November, but as half the Senate needs to go to the polls by the end of May, and most people prefer fewer elections rather than more, the government is highly likely to opt for a simultaneous election of the House and Senate in May.

    To turn my model into an actual election forecast I need a way of converting the distribution of probable two-party-preferred results on a given day into seats in the House of Representatives and hence probabilities for who will form a new government. In doing this we come up against the issue of the moderately high degree of spatial variation in voting behaviour in Australia (as elsewhere), in combination with the spatially-based conversion of vote into seats (ie regional divisions).

    A key tool used in election analysis is to think in terms of the “swing” in vote in each division of the electorate, comparing vote in the last election to that in the next. This is just a practical application of a standard method in longitudinal statistical analysis. Analysis based on differential “swings” is basically the same idea as “difference in differences” analysis in econometrics and other social sciences.
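    To make the "swing" idea concrete, here is a hedged sketch of the calculation (the results data frame and the alp_2pp column are hypothetical names, not the actual ozfedelect schema):

    library(dplyr)

    swings <- results %>%                         # one row per division per election
      arrange(division, election_year) %>%
      group_by(division) %>%
      mutate(swing = alp_2pp - lag(alp_2pp)) %>%  # change in two-party-preferred share
      ungroup()                                   #   versus the previous election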

    It’s common as we come towards an election for pundits to project how a given swing would impact on seats if the swing were “uniform” ie the same across all divisions. But of course everyone knows swings are far from uniform. Here we can look at all the swings since the 1993 election (Australian Labor Party (ALP) victory – Paul Keating’s “the sweetest victory of all” moment) to 2016 (Liberal-National Coalition under Malcolm Turnbull hanging on to government by the skin of their teeth with help from the cross bench):

    It’s interesting to see in this way how John Howard lost nearly all of his substantial 1996 gains in the 1998 election, the biggest average anti-government swing in this data to not result in a change of government. He then regained a lot of ground in the 2001 election (those of us who lived through it will need little reminder that this was the election of the September 11 attacks in the USA, and the ‘Children Overboard’ and Tampa incidents). Whether one thinks it good or bad, we can agree that if Beazley had snuck in in 1998 and Howard had been a one-term Prime Minister, Australian history would have gone down quite a different path.

    This highlights one of the obvious issues with swings – big swings in one direction are predictive of a swing back to the centre. But the main aim of the chart above is to show how the individual divisions’ swings vary more than the average (of course).

    After controlling for the average swing, the standard deviation of swings is still 3.2 percentage points. This adds a big chunk of uncertainty to how an average swing will translate into seats. A sitting member could reasonably expect their own swing (for or against) to be within plus or minus six percentage points of the overall swing, but in any given election we’d expect several division-level swings to be even more divergent than that.

    Here’s the R code that produces that chart and that conclusion of 3.2 percentage points. It uses my new ozfedelect R package which is the workhorse I intend to use for re-usable functionality and data for Australian federal election analysis. So far it includes tidied polling data from 2007 to present (updated 11 March 2019), division-level two-party-preferred election results back to 1993, and “simple features” versions of electoral boundaries back to 2010 (simplified to 10% of their original size). Thanks to Wikipedia (and the polling firms of course), the Australian Electoral Commission, and the Australian Bureau of Statistics respectively for making the original data available.

    Post continues after code extract

    # Install the ozfedelect R package
    devtools::install_github("ellisp/ozfedelect/pkg")
    library(ozfedelect) # with data
    library(tidyverse)
    library(scales)
    library(grid)
    library(Cairo)
    library(sf)

    # Some checks.
    # -7.5 for Aston in 2013 meant a swing against the Labor govt
    # - 10.1 for Bass in 2016 meant a swing against the Lib/Nat govt
    results_2pp_div %>%
      select(division_nm, swing_to_govt, election_year) %>%
      spread(election_year, swing_to_govt)

    #---------------------------explore distribution of swings---------
    # Let's try to understand those historical swings
    d <- results_2pp_div %>%
      group_by(election_year, incumbent) %>%
      mutate(avg_swing = sum(swing_to_govt * total_votes) / sum(total_votes)) %>%
      ungroup() %>%
      filter(abs(swing_to_govt) < 20) %>%
      mutate(year = fct_reorder(as.ordered(election_year), avg_swing))

    # We are interested for future forecasting models in the variance of division-level swings
    # that are not explained by the nation-wide swing
    model <- lm(swing_to_govt ~ avg_swing, data = d)
    confint(model)
    coef(model)
    # of course the slope is 1 and the intercept is 0, basically - by design
    # the interesting thing is actually the residual standard error:
    summary(model)

    # what's the residual standard deviation of division-level swings?
    residual_sd <- summary(model)$sigma

    avgs <- distinct(d, avg_swing, year, incumbent)

    annotation_col <- "grey50"
    update_geom_defaults("label",
                         list(family = main_font, fill = "white",
                              colour = annotation_col, alpha = 0.8))

    # Main chart:
    d %>%
      ggplot(aes(x = avg_swing / 100, y = swing_to_govt / 100)) +
      geom_smooth(se = FALSE, method = "lm") +
      geom_vline(xintercept = 0, colour = annotation_col) +
      geom_hline(yintercept = 0, colour = annotation_col) +
      geom_point(aes(colour = incumbent)) +
      geom_text(data = avgs, y = -0.18, aes(label = year, colour = incumbent)) +
      labs(caption = "Source: Australian Electoral Commission data, analysed by freerangestats.info.") +
      scale_x_continuous("Overall swing towards incumbent government", label = percent_format(accuracy = 1)) +
      scale_y_continuous("Division level swings to government", label = percent_format(accuracy = 1)) +
      annotate("label", x = -0.053, y = 0.06, label = "Strongly change\ngovernment") +
      annotate("label", x = -0.045, y = -0.065, label = "Narrow escape for\nHoward in 1998") +
      annotate("label", x = 0.0055, y = 0.04, label = "Strongly retain\ngovernment") +
      scale_colour_manual("Incumbent government:", values = c("ALP" = "#e53440", "Lib/Nat" = "#1c4f9c")) +
      ggtitle("Individual seats face more voting uncertainty than Australia as a whole",
              "Each point represents the swing in a single seat; residual standard deviation of about 3.2 percentage points around the nation-wide swing.")

    Swings vary spatially as well as over time

    Election analysts love our choropleth maps, but they pose a challenge because of the uneven distribution of people over land. This is a particularly acute problem for Australia.

    In trying to draw a meaningful electorate choropleth map for Australia I first tried a cartogram (to distort borders so each electorate is represented as visually equal while retaining some of its basic shape) but this was too big a job for the algorithm in the cartogram R package. Fair enough. Some of those divisions are pretty large, including Durack in northern Western Australia, at 1.6 million square kilometres for its 100,000 electors the second largest (by area) single-seat electoral division in the world.

    So instead of a cartogram, I’ve opted instead for a bunch of call-outs of the five largest cities. Here we can see the two-party-preferred votes in the 2016 election:

    Clicking on the image will open a scalable vector graphics version in a new tab.

    This call-out method seems effective in counteracting the big sea of blue that one gets from looking at the simple map of Australia, with Liberal-National Coalition support dominant in rural areas outside the Northern Territory.

    Note for those more used to American politics. In Australia, the centre-left party (ALP) has red as its colour, and the right use blue, unlike in the USA where since the 1990s the colours have opposite connotations, to the puzzlement of the rest of the world. Unlike in many other countries the Democrats in the USA did not grow out of a nineteenth and twentieth century socialist movement and hence never self-identified with the colour red, and when TV stations got to allocate colours they seemingly did it arbitrarily. Also, while we’re talking about such things, the Liberal Party in Australia is a party of the right or centre-right, and historically referred to economic, individual/family liberalism in contrast to the collective action advocated by the labor movement and their party. End of digression.

    So, here’s how this mapping approach looks with today’s issue of interest, division-level variation in swings.

    In 2010, we see the ALP under Julia Gillard taking a thumping and retaining government only as a minority in the House, with support from three independents and an Australian Green. Counter-swings to the ALP in the Melbourne and Adelaide areas (the two call-outs at the bottom of the map) saved the ALP from an outright loss:

    Then we see ALP under Kevin Rudd comprehensively losing their remaining wafer-thin hold of government nearly across the board in 2013. The average swing here was less than that against Howard in 1998, but the government had no buffer at all so the result was a clear win for the Liberal-National Coalition under Tony Abbott:

Then in 2016 we saw most areas swing back, with Malcolm Turnbull’s narrow retention of government already mentioned:

    The ozfedelect package includes a function that does all the grunt work of these maps, so the end-user code that produces them is extremely simple:

    Post continues after code extract

ozpol_infographic(2016, variable = "swing_to_govt")
ozpol_infographic(2013, variable = "swing_to_govt")
ozpol_infographic(2010, variable = "swing_to_govt")
ozpol_infographic(2016)
ozpol_infographic(2013)
ozpol_infographic(2010)

Swing and vote each have a relationship with division-level census variables

    A common thing to do with either swing or vote variables is to compare them to division-level indicators of social, economic and demographic variables, typically available from the Census. Hugh Parsonage’s invaluable Census2016.DataPack contains tidied versions of the ABS’ Census Data Packs, and I’ve further summarised a small subset of the variables available into a one-row-per-division collation in ozfedelect. Here’s what we see when we compare those variables to the most recent swing in 2016:

The clearest thing to see here is that divisions with a high proportion of young persons (aged 19 and under) tended to swing away from the Liberal-National Coalition. If we put these variables, plus the state or territory (as a factor), into a simple model, we can see that there is a distinct state-based flavour to the swings, that divisions with a high proportion of people born in Australia tended to swing towards the government, and that those with many young people (which might be a proxy for either or both of students and young families) swung against it.

    Don’t take these results too seriously; to understand the trends properly here we really need individual data from the Australian Election Study which we could put in a multi-level model like this one by David Charnock way back in the 1990s.

In addition to addressing the individual/regional problem, or ecological fallacy, ideally I would be comparing “swing” (which is effectively the first difference of the “vote” time series) not against absolute levels of these variables but against changes in them – so that we were looking at first differences for both variables, and could see whether a change in the number of young people leads to a change in vote. But this would need more data tidying than I’ve done yet.

    For the record, here is how these same seven variables compare to the more natural comparator, the simple two-party-preferred vote:

As might be expected from a casual observation of Australian politics, where the class origins of the main parties remain stubbornly persistent, areas with these characteristics are more likely to vote for the Liberal-National Coalition:

    • Higher proportion of adults have year 10 or more education
    • Higher proportion of people were born in Australia
    • Higher proportion of people speak only English at home
    • Higher average personal income

    Again, don’t mistake these area-based findings for individual-level conclusions. That may come later if I have time (this is just a hobby project after all; but I note the AES data is freely downloadable and will be hard to keep my hands off!).

    Code for all these comparisons to Census data. Again, the hard yards have been done by Parsonage’s Census2016.DataPack and my ozfedelect.

    Post continues after code extract

#----------------------- explore relationship to census variables------------
# a tiny bit of munging data together, into wide form for modelling:
d1 <- results_2pp_div %>%
  filter(election_year == 2016) %>%
  left_join(div_census_2016, by = "division_nm") %>%
  select(division_nm, state_ab, lib_nat_percentage, swing_to_govt,
         young_persons:only_english_spoken_home) %>%
  mutate(state_ab = fct_relevel(state_ab, "NSW"))

# and narrow form for plotting:
d2 <- d1 %>%
  gather(variable, value,
         -division_nm, -lib_nat_percentage, -swing_to_govt, -state_ab)

#------------charts--------------
# chart of vote:
d2 %>%
  ggplot(aes(x = value, y = lib_nat_percentage / 100)) +
  facet_wrap(~variable, scale = "free_x") +
  scale_y_continuous("Two-party-preferred vote for Liberal/National Coalition") +
  geom_smooth(method = "gam") +
  geom_point(aes(colour = state_ab)) +
  ggtitle("Vote compared to census variables by electoral division",
          "2016 federal election") +
  labs(colour = "", x = "")

# chart of swing:
d2 %>%
  ggplot(aes(x = value, y = swing_to_govt / 100)) +
  facet_wrap(~variable, scale = "free_x") +
  scale_y_continuous("Two-party-preferred swing towards Liberal/National Coalition\n(incumbent government)") +
  geom_smooth(method = "gam") +
  geom_point(aes(colour = state_ab)) +
  ggtitle("Swing compared to census variables by electoral division",
          "2016 federal election") +
  labs(colour = "", x = "")

#----------Modelling -----------
d3 <- d1 %>%
  select(-division_nm, -lib_nat_percentage) %>%
  mutate_at(vars(young_persons:only_english_spoken_home), scale)

mod1 <- lm(I(swing_to_govt / 100) ~ ., data = d3)

# chart of results:
confint(mod1)[-1, ] %>%
  as.data.frame() %>%
  mutate(var = rownames(.),
         var = gsub("state_ab", "", var),
         var = gsub("_", " ", var)) %>%
  rename(lower = `2.5 %`, upper = `97.5 %`) %>%
  mutate(mid = (lower + upper) / 2,
         var = fct_reorder(var, mid)) %>%
  ggplot(aes(x = lower, xend = upper, y = var, yend = var)) +
  geom_vline(xintercept = 0, colour = "steelblue") +
  geom_segment(size = 3, colour = "grey") +
  scale_x_continuous("Impact of change in one standard deviation in census variable on swing",
                     label = percent) +
  labs(y = "",
       caption = "Source: ABS Census data, AES election results, analysis by http://freerangestats.info") +
  ggtitle("Division-level variables related to a swing to the Liberal-National Coalition",
          "Comparing the 2016 results to 2013 by electoral division (or 'seat'). Conclusions about individual characteristics relating to vote should be drawn only with great caution.")

And the answer is…

    Finally, enough exploring, and on to the answer. What’s the probability of an ALP government after the election, if it is in late May? The latest version of the model, with one more poll than my last blog post, and now extending only to late May not to November, looks like this:

    The mean predicted two-party preferred vote for ALP at end of May is 52.48 percent and the standard deviation is 2.08 percentage points. We can combine the expected distribution of votes at the time, with the historical distribution of individual divisions’ swings on top of that, to get an estimate of the chance of a change of government.
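As a quick back-of-envelope check of those numbers (my own sketch, not part of the original model), the probability that the national ALP two-party-preferred vote itself exceeds 50% under a Normal distribution with that mean and standard deviation can be computed directly. Note that this is a statement about the national vote, not about seats, which is what the simulation further below addresses.

pnorm(50, mean = 52.48, sd = 2.08, lower.tail = FALSE)
## roughly 0.88 - the chance the ALP wins the national two-party-preferred vote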

    So here’s my current best effort at converting modelled two-party-preferred vote into seats:

    This gives the ALP a healthy 83% chance of winning; better than my stab in the dark a week ago on social media of 71%, when I had been thinking in terms of a November election which I now realise is unlikely.

Beware that this is a very simple model of seat allocation, and one I don’t yet fully trust. For example, it doesn’t consider redistributions of electoral boundaries since 2016 at all. Nor have I taken into account the six current independents or members of minor parties, which surely has an impact; never mind other seats where high-profile independents might have a chance of knocking out a major-party member in a way that straight two-party-preferred dominance would not predict.

So treat this carefully for now; this is just a first step towards a proper model. Over the next few weeks I hope to robustify it and productionise it into something that can easily update itself as we move towards election day.

    Final snippet of R code that did this simulation, based on a swing from the 2016 division results.

    Post continues after code extract

#==============crude simulation===============
# so plausible to
# a) model the overall swing to the govt
# b) simulate for individual seats a randomness of N(0, 3.2) on top of that overall swing
# c) add those swings to the 2016 results
# Important - this misses out on two things that are needed:
# a) 2 new electorates this time around
# b) seats held by other parties

last_result <- results_2pp_div %>%
  filter(election_year == 2016) %>%
  summarise(alp_2pp_2016 = sum(alp_votes) / sum(total_votes)) %>%
  pull(alp_2pp_2016)

# last election the ALP got 49.6% of the 2pp. Currently on track to about 52.48 with sd of about 2.08.
# So an average swing against the govt of N(2.88, 2.08), and division-level randomness on top of that.
baseline <- results_2pp_div %>%
  filter(election_year == 2016) %>%
  select(division_nm, alp_percentage) %>%
  mutate(link = 1)

nsims <- 1e5
set.seed(321)

sims <- tibble(sim = 1:nsims, link = 1) %>%
  mutate(avg_swing = rnorm(n(), -2.88, 2.08)) %>%
  full_join(baseline, by = "link") %>%
  select(-link) %>%
  mutate(extra_swing = rnorm(n(), 0, residual_sd)) %>%
  group_by(sim) %>%
  # we scale the extra_swing to be mean zero so we aren't accidentally changing the average swing:
  mutate(extra_swing = extra_swing - mean(extra_swing),
         total_swing = avg_swing + extra_swing,
         alp_percentage_2019 = alp_percentage - total_swing) %>%
  ungroup()

sims_by_div <- sims %>%
  group_by(sim) %>%
  summarise(avg_swing = unique(avg_swing),
            number_seats_alp = sum(alp_percentage_2019 > 50),
            prop_seats_alp = mean(alp_percentage_2019 > 50)) %>%
  ungroup()

m <- sims_by_div %>%
  summarise(m = round(mean(prop_seats_alp > 0.5) * 100, 1)) %>%
  pull(m)

# draw histogram of results
sims_by_div %>%
  ggplot(aes(x = number_seats_alp)) +
  geom_histogram(alpha = 0.5, binwidth = 1, colour = "grey") +
  geom_vline(xintercept = 75.5, colour = "steelblue") +
  scale_x_continuous("Number of House of Representative seats won by ALP") +
  scale_y_continuous("Number of simulations\n(out of 100,000)", label = comma) +
  ggtitle(paste0(m, "% probability of ALP win in the 2019 Federal Election"))

That’s all for now. Comments are very welcome, including spotting any mistakes; I’ve been working through this at some pace and there’s no doubt the checking and testing is inadequate, so I would be surprised if something isn’t wrong somewhere. I’ll be pleased, not upset, if someone tells me what!

    As usual, here are the contributors to R that made all this possible:

thankr::shoulders() %>%
  mutate(maintainer = str_squish(gsub("<.+>", "", maintainer))) %>%
  knitr::kable() %>%
  clipr::write_clip()

maintainer            no_packages   packages
Hadley Wickham        18            assertthat, dplyr, ellipsis, forcats, ggplot2, gtable, haven, httr, lazyeval, modelr, plyr, rvest, scales, stringr, testthat, tidyr, tidyverse, usethis
R Core Team           11            base, compiler, datasets, graphics, grDevices, grid, methods, splines, stats, tools, utils
Gábor Csárdi          9             callr, cli, crayon, desc, pkgconfig, processx, ps, remotes, sessioninfo
Kirill Müller         5             DBI, hms, pillar, rprojroot, tibble
Jim Hester            4             devtools, pkgbuild, pkgload, readr
Winston Chang         4             extrafont, extrafontdb, R6, Rttf2pt1
Jim Hester            3             fs, glue, withr
Lionel Henry          3             purrr, rlang, tidyselect
Yixuan Qiu            3             showtext, showtextdb, sysfonts
Yihui Xie             3             highr, knitr, xfun
Edzer Pebesma         2             sf, units
Jeroen Ooms           2             curl, jsonlite
Dirk Eddelbuettel     2             digest, Rcpp
R-core                1             nlme
Vitalie Spinu         1             lubridate
Michel Lang           1             backports
Simon Wood            1             mgcv
Achim Zeileis         1             colorspace
Gabor Csardi          1             prettyunits
Simon Urbanek         1             Cairo
James Hester          1             xml2
Justin Talbot         1             labeling
Roger Bivand          1             classInt
Jennifer Bryan        1             readxl
Kevin Ushey           1             rstudioapi
Max Kuhn              1             generics
Stefan Milton Bache   1             magrittr
Martin Maechler       1             Matrix
Charlotte Wickham     1             munsell
Matthew Lincoln       1             clipr
Marek Gagolewski      1             stringi
Jeremy Stephens       1             yaml
Deepayan Sarkar       1             lattice
Claus O. Wilke        1             cowplot
Rasmus Bååth          1             beepr
Jennifer Bryan        1             cellranger
Hugh Parsonage        1             Census2016.DataPack
Alex Hayes            1             broom
Peter Ellis           1             ozfedelect
David Meyer           1             e1071
Brian Ripley          1             class
Simon Urbanek         1             audio
Jim Hester            1             memoise

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.

    (2/2) Book promotion (paperback edition) – “Processing and Analyzing Financial Data with R”

    Sun, 03/10/2019 - 01:00

    (This article was first published on R on msperlin, and kindly contributed to R-bloggers)

I received many messages regarding my book promotion (see previous post). I’ll use this post to answer the most frequent questions:

     

    Does the paperback edition have a discount?

No. The price drop is only valid for the ebook edition, but not by choice. Unfortunately, Amazon does not let me do countdown promotions for the paperback edition.

So, for those who, like myself, like the smell of a fresh book page, I manually dropped the price of the paperback to 17.99 USD (it was 24.99 USD). Printing costs are heavy, which is why I can’t go all the way to a 50% discount. The system tells me that the new price should be live within the next 72 hours. I’ll keep the new price until the end of this week.

     

    When will the second edition be released?

My schedule is to start work on the new edition of Processing and Analyzing.. in June 2019. Hopefully I can publish it before January 2020. Sorry, but I can’t give much detail about the new content yet. But be sure I’ll keep you guys posted.

     

    Best.


To leave a comment for the author, please follow the link and comment on their blog: R on msperlin.

    Fantasy Tips From the Fantasy King

    Sun, 03/10/2019 - 01:00

    (This article was first published on Analysis of AFL, and kindly contributed to R-bloggers)

    Selby Lee Steere has won AFL fantasy two years in a row. He is the coach of Moreiras Magic and through selling his fantasy guide donates a lot of money to the Starlight foundation. I was lucky enough to do an interview with him talking about his strategies in winning fantasy.

AFL fantasy scores, as opposed to supercoach scores, can be fully derived from the box-score statistics. This is good because it means that, if we wanted to (and for the purposes of this post we do), we can do a few things:

1. Let's get the fantasy scores for all players as far back as we can go.
2. Now that we have fantasy scores going back years and years, let's compare players. In the podcast it was pointed out that Brayshaw might see an increase in usage because Neale has left. Are their stats similar in their first years? Could they be similar in their second?
3. Maybe, like Kane Cornes, you believe that Powell-Pepper can be the next Dusty Martin, so let's compare the pair.
4. Another of Selby's fantastic tips was to think about points per minute; we can actually calculate this too!
Step one Get the data!

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.3.0
## ✔ tibble  2.0.1     ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

dataframe <- fitzRoy::get_afltables_stats(start_date = "1897-01-01", end_date = "2018-10-10")
## Returning data from 1897-01-01 to 2018-10-10
## Downloading data
##
## Finished downloading data. Processing XMLs
## Warning: Detecting old grouped_df format, replacing `vars` attribute by
## `groups`
## Finished getting afltables data

Step two data checks

One of the things about AFL data is that different statistics have been collected from different points in time; no one was recording metres gained back in the 1960s.

A quick way to check when a statistic was first collected would be to do a search, or hopefully there is a data dictionary.

But in this case, let's use R and the tidyverse to do it.

library(tidyverse)

dataframe %>%
  group_by(Season) %>%
  summarise(meantime = mean(Time.on.Ground..)) %>%
  filter(meantime > 0)
## # A tibble: 16 x 2
##    Season meantime
##
##  1   2003     81.8
##  2   2004     81.8
##  3   2005     81.8
##  4   2006     81.8
##  5   2007     81.8
##  6   2008     81.8
##  7   2009     81.8
##  8   2010     81.8
##  9   2011     81.8
## 10   2012     81.8
## 11   2013     81.8
## 12   2014     81.8
## 13   2015     81.8
## 14   2016     81.8
## 15   2017     81.8
## 16   2018     81.8

So what we can see from this is that time on ground was first collected in 2003. If we want to work out points per minute, we can only start from 2003.

    Come up with a column of fantasy scores

    We know that AFL fantasy scores are worked out as follows:

df <- data.frame(stringsAsFactors = FALSE,
                 Match.Stat = c("Kick", "Handball", "Mark", "Tackle", "Free Kick For",
                                "Free Kick Against", "Hitout", "Goal", "Behind"),
                 Fantasy.Points = c("3 Points", "2 Points", "3 Points", "4 Points", "1 Point",
                                    "-3 Points", "1 Point", "6 Points", "1 Point"))
df
##          Match.Stat Fantasy.Points
## 1              Kick       3 Points
## 2          Handball       2 Points
## 3              Mark       3 Points
## 4            Tackle       4 Points
## 5     Free Kick For        1 Point
## 6 Free Kick Against      -3 Points
## 7            Hitout        1 Point
## 8              Goal       6 Points
## 9            Behind        1 Point

Now all we have to do is create a new column, using the above as a formula, so we can work out fantasy scores going back through the years; we can use mutate to do this.

dataframe %>%
  mutate(fantasy_score = 3 * Kicks + 2 * Handballs + 3 * Marks + 4 * Tackles +
           Frees.For - 3 * Frees.Against + Hit.Outs + 6 * Goals + Behinds) %>%
  group_by(Season) %>%
  summarise(meanfantasy = mean(fantasy_score)) %>%
  ggplot(aes(x = Season, y = meanfantasy)) + geom_line()

    Looking at the graph, what we can see here is that roughly in the 60s there was a sudden jump in the statistics being collected that went on to be used to derive the AFL fantasy scores.

To find out when the statistics were first collected we might want to do this graphically: what if we just had a heap of lines, with each line representing one of the statistics? When it comes to plotting multiple lines on a graph, it's a good idea to practise the use of gather from the tidyverse.

dataframe %>%
  mutate(fantasy_score = 3 * Kicks + 2 * Handballs + 3 * Marks + 4 * Tackles +
           Frees.For - 3 * Frees.Against + Hit.Outs + 6 * Goals + Behinds) %>%
  group_by(Season) %>%
  summarise(meanfantasy = mean(fantasy_score),
            meankicks = mean(Kicks),
            meanhandballs = mean(Handballs),
            meanmarks = mean(Marks),
            meantackles = mean(Tackles),
            meanfreesfor = mean(Frees.For),
            meansfreesagainst = mean(Frees.Against),
            meanhitouts = mean(Hit.Outs),
            meangoals = mean(Goals),
            meanbehinds = mean(Behinds)) %>%
  gather("variable", "value", -Season) %>%
  ggplot(aes(x = Season, y = value, group = variable, colour = variable)) + geom_line()

Looking at this graph, it seems that using 2000 as a cut-off would be fine, as all our data is being collected by then. What we don't want is a statistic like metres gained that only started to be tracked in 2007.

Next, now that we have the fantasy scores and a rough timeline over which all our data is being collected, let's see the average fantasy scores by player and year.

dataframe %>%
  mutate(fantasy_score = 3 * Kicks + 2 * Handballs + 3 * Marks + 4 * Tackles +
           Frees.For - 3 * Frees.Against + Hit.Outs + 6 * Goals + Behinds) %>%
  group_by(Season, First.name, Surname, ID) %>%
  summarise(meanFS = mean(fantasy_score)) %>%
  filter(Season > 2000)
## # A tibble: 10,933 x 5
## # Groups: Season, First.name, Surname [10,879]
##    Season First.name Surname     ID meanFS
##
##  1   2001 Aaron      Fiora      911   34.6
##  2   2001 Aaron      Hamill     171   71.4
##  3   2001 Aaron      Henneman   362   32.9
##  4   2001 Aaron      Lord       574   48.7
##  5   2001 Aaron      Shattock  1093   31.3
##  6   2001 Adam       Contessa   457   49.9
##  7   2001 Adam       Goodes    1012   71.5
##  8   2001 Adam       Houlihan   593   54.7
##  9   2001 Adam       Hunter    1072   23.3
## 10   2001 Adam       Kingsley   825   66.9
## # … with 10,923 more rows

But! With AFL fantasy we don't want to include finals, because finals aren't part of the competition, so let's exclude them.

dataframe %>%
  mutate(fantasy_score = 3 * Kicks + 2 * Handballs + 3 * Marks + 4 * Tackles +
           Frees.For - 3 * Frees.Against + Hit.Outs + 6 * Goals + Behinds) %>%
  filter(!(Round %in% c("EF", "PF", "SF", "GF", "QF"))) %>%
  group_by(Season, First.name, Surname, ID) %>%
  summarise(meanFS = mean(fantasy_score)) %>%
  filter(Season > 2000) %>%
  arrange(desc(meanFS))
## # A tibble: 10,926 x 5
## # Groups: Season, First.name, Surname [10,872]
##    Season First.name Surname         ID meanFS
##
##  1   2014 Tom        Rockliff     11787   135.
##  2   2012 Dane       Swan          1460   134.
##  3   2018 Tom        Mitchell     12196   129.
##  4   2017 Tom        Mitchell     12196   127.
##  5   2012 Gary       Ablett        1105   125.
##  6   2010 Dane       Swan          1460   123.
##  7   2018 Jack       Macrae       12166   123.
##  8   2011 Dane       Swan          1460   121.
##  9   2017 Patrick    Dangerfield  11700   121.
## 10   2018 Brodie     Grundy       12217   120
## # … with 10,916 more rows

So, what did we originally want to do? We wanted to see how Neale did in his second year to get a sense of his improvement, and whether that is comparable to Brayshaw.

To do this we want to find the player IDs (this makes it easier to filter the players), and secondly we want to add the count of games played in the season. The second point is important because we don't want to be misled by a small number of games played.

dataframe %>%
  mutate(fantasy_score = 3 * Kicks + 2 * Handballs + 3 * Marks + 4 * Tackles +
           Frees.For - 3 * Frees.Against + Hit.Outs + 6 * Goals + Behinds) %>%
  filter(!(Round %in% c("EF", "PF", "SF", "GF", "QF"))) %>%
  group_by(Season, First.name, Surname, ID) %>%
  summarise(meanFS = mean(fantasy_score), count = n()) %>%
  filter(Season > 2000) %>%
  arrange(desc(Season)) %>%
  filter(ID %in% c("12055", "12589"))
## # A tibble: 8 x 6
## # Groups: Season, First.name, Surname [8]
##   Season First.name Surname     ID meanFS count
##
## 1   2018 Andrew     Brayshaw 12589   66.8    17
## 2   2018 Lachie     Neale    12055  100.     22
## 3   2017 Lachie     Neale    12055  100.     21
## 4   2016 Lachie     Neale    12055  111.     22
## 5   2015 Lachie     Neale    12055  102.     22
## 6   2014 Lachie     Neale    12055   82.3    21
## 7   2013 Lachie     Neale    12055   77.1     9
## 8   2012 Lachie     Neale    12055   42.1    11

So, looking at the Neale vs Brayshaw comparison, Neale actually averaged fewer fantasy points in his first AFL season, but he picked it up to 77.1 in his second year, with a slight improvement to 82.3 in his third.

But as Selby pointed out, he's looking at points per minute as his metric. So let's add that in.

dataframe %>%
  mutate(fantasy_score = 3 * Kicks + 2 * Handballs + 3 * Marks + 4 * Tackles +
           Frees.For - 3 * Frees.Against + Hit.Outs + 6 * Goals + Behinds) %>%
  filter(!(Round %in% c("EF", "PF", "SF", "GF", "QF"))) %>%
  group_by(Season, First.name, Surname, ID) %>%
  summarise(meanFS = mean(fantasy_score),
            meantimeonground = mean(Time.on.Ground..),
            count = n()) %>%
  filter(Season > 2000) %>%
  arrange(desc(Season)) %>%
  filter(ID %in% c("12055", "12589"))
## # A tibble: 8 x 7
## # Groups: Season, First.name, Surname [8]
##   Season First.name Surname     ID meanFS meantimeonground count
##
## 1   2018 Andrew     Brayshaw 12589   66.8             66.7    17
## 2   2018 Lachie     Neale    12055  100.              80.2    22
## 3   2017 Lachie     Neale    12055  100.              77.6    21
## 4   2016 Lachie     Neale    12055  111.              81.6    22
## 5   2015 Lachie     Neale    12055  102.              77.0    22
## 6   2014 Lachie     Neale    12055   82.3             68.8    21
## 7   2013 Lachie     Neale    12055   77.1             73.9     9
## 8   2012 Lachie     Neale    12055   42.1             59.2    11

What we can see here is that Brayshaw's average time on ground was 66.7, while Neale, when he was getting just a little more game time (68.8 and 73.9), was averaging an extra 10+ fantasy points. We can also see that his mean fantasy score and his mean time on ground are virtually one to one.

Let's see Brayshaw on a scatter plot for his debut season.

dataframe %>%
  mutate(fantasy_score = 3 * Kicks + 2 * Handballs + 3 * Marks + 4 * Tackles +
           Frees.For - 3 * Frees.Against + Hit.Outs + 6 * Goals + Behinds) %>%
  filter(!(Round %in% c("EF", "PF", "SF", "GF", "QF"))) %>%
  filter(ID == "12589") %>%
  ggplot(aes(x = Time.on.Ground.., y = fantasy_score)) +
  geom_point() +
  geom_abline(intercept = 0) +
  ylim(0, 100) + xlim(0, 100)

Let's now look at the Sam Powell-Pepper vs Dustin Martin comparison.

First, let's find their player IDs.

dataframe %>%
  mutate(fantasy_score = 3 * Kicks + 2 * Handballs + 3 * Marks + 4 * Tackles +
           Frees.For - 3 * Frees.Against + Hit.Outs + 6 * Goals + Behinds) %>%
  filter(!(Round %in% c("EF", "PF", "SF", "GF", "QF"))) %>%
  filter(Surname == "Powell-Pepper") %>%
  select(ID)
## # A tibble: 37 x 1
##    ID
##
##  1 12494
##  2 12494
##  3 12494
##  4 12494
##  5 12494
##  6 12494
##  7 12494
##  8 12494
##  9 12494
## 10 12494
## # … with 27 more rows

dataframe %>%
  mutate(fantasy_score = 3 * Kicks + 2 * Handballs + 3 * Marks + 4 * Tackles +
           Frees.For - 3 * Frees.Against + Hit.Outs + 6 * Goals + Behinds) %>%
  filter(!(Round %in% c("EF", "PF", "SF", "GF", "QF"))) %>%
  filter(Surname == "Martin", First.name == "Dustin") %>%
  select(ID)
## # A tibble: 193 x 1
##    ID
##
##  1 11794
##  2 11794
##  3 11794
##  4 11794
##  5 11794
##  6 11794
##  7 11794
##  8 11794
##  9 11794
## 10 11794
## # … with 183 more rows

Now let's do the same comparisons as we did for Brayshaw and Neale.

dataframe %>%
  mutate(fantasy_score = 3 * Kicks + 2 * Handballs + 3 * Marks + 4 * Tackles +
           Frees.For - 3 * Frees.Against + Hit.Outs + 6 * Goals + Behinds) %>%
  filter(!(Round %in% c("EF", "PF", "SF", "GF", "QF"))) %>%
  group_by(Season, First.name, Surname, ID) %>%
  summarise(meanFS = mean(fantasy_score),
            meantimeonground = mean(Time.on.Ground..),
            count = n()) %>%
  filter(Season > 2000) %>%
  arrange(desc(Season)) %>%
  filter(ID %in% c("11794", "12494"))
## # A tibble: 11 x 7
## # Groups: Season, First.name, Surname [11]
##    Season First.name Surname           ID meanFS meantimeonground count
##
##  1   2018 Dustin     Martin         11794   92.4             82.9    21
##  2   2018 Sam        Powell-Pepper  12494   75               74.6    16
##  3   2017 Dustin     Martin         11794  114.              85      22
##  4   2017 Sam        Powell-Pepper  12494   69.9             68      21
##  5   2016 Dustin     Martin         11794  107.              83.1    22
##  6   2015 Dustin     Martin         11794  104.              79.8    22
##  7   2014 Dustin     Martin         11794   97.6             82.9    21
##  8   2013 Dustin     Martin         11794   97.0             81.8    22
##  9   2012 Dustin     Martin         11794   84.8             80.4    20
## 10   2011 Dustin     Martin         11794   89.5             84.4    22
## 11   2010 Dustin     Martin         11794   71.5             78.2    21

Let's now look at a similar scatter plot.

dataframe %>%
  mutate(fantasy_score = 3 * Kicks + 2 * Handballs + 3 * Marks + 4 * Tackles +
           Frees.For - 3 * Frees.Against + Hit.Outs + 6 * Goals + Behinds) %>%
  filter(!(Round %in% c("EF", "PF", "SF", "GF", "QF"))) %>%
  group_by(Season, First.name, Surname, ID) %>%
  summarise(meanFS = mean(fantasy_score),
            meantimeonground = mean(Time.on.Ground..),
            count = n()) %>%
  filter(Season > 2000) %>%
  arrange(desc(Season)) %>%
  filter(ID %in% c("11794", "12494")) %>%
  ggplot(aes(x = meantimeonground, y = meanFS)) +
  geom_point(aes(colour = as.factor(ID))) +
  geom_abline(intercept = 0) +
  ylim(0, 140) + xlim(0, 140)

So hopefully you now have a rough framework in place so you can easily filter players you think are similar, track their careers, and use the two-time AFL fantasy winner's tips by looking at time on ground as well!


To leave a comment for the author, please follow the link and comment on their blog: Analysis of AFL.

    Wrapper of knitr::include_graphics to Handle URLs & PDF Outputs

    Sat, 03/09/2019 - 17:00

    (This article was first published on Yongfu's Blog, and kindly contributed to R-bloggers)

    Those who use knitr::include_graphics() frequently in their R Markdown files may discover some inconsistencies (from the user point of view) if the same Rmd is used for multiple output formats, especially when PDF (LaTeX) is involved.

    The following code works fine for HTML outputs but fails when the outputs are PDFs:

    1. knitr::include_graphics('local.gif')
    2. knitr::include_graphics('https://commonmark.org/images/markdown-mark.png')

    The first case is obvious since it’s impossible to include a GIF in a PDF document. The second case may cause some users to scratch their heads: “Why can’t I insert a PNG image in the PDF document?”. The answer is that, for PDFs, knitr::include_graphics() only works for local images because knitr::include_graphics won’t download the images for you, whereas for HTMLs, the images don’t have to be downloaded (images can be sourced using HTML img tags).

To fix the problem that arises in the first case, one can use images that are compatible with both output formats, such as PNG or JPEG. Another solution is to include different figures for different output formats:

if (knitr::is_html_output()) {
  knitr::include_graphics('local.gif')
} else {
  knitr::include_graphics('local.png')
}

To fix the problem that arises in the second case, one has to remember not to pass a URL as the path to knitr::include_graphics() if the Rmd is to be compiled to PDF.
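One way around the second problem, sketched below under the assumption that the image URL is publicly downloadable, is to fetch the remote file to a temporary location before handing it to knitr::include_graphics(). The helper name include_graphics_url() is illustrative only; it is not part of knitr or of the package introduced below.

# Hedged sketch: download remote images before including them in non-HTML output.
include_graphics_url <- function(path) {
  if (!knitr::is_html_output() && grepl("^https?://", path)) {
    local_path <- file.path(tempdir(), basename(path))
    download.file(path, local_path, mode = "wb", quiet = TRUE)
    path <- local_path
  }
  knitr::include_graphics(path)
}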

    include_graphics2()

    I can never be sure when I would want to make my Rmd, originally targeted for HTML outputs, available in PDFs. Also, I don’t want to write if (knitr::is_html_output()) every time I want to include a GIF. It is also a headache to download every figure from the web just for the PDF output.
    So I decided to write a wrapper of knitr::include_graphics() for myself, dealing with all the problems mentioned above.

    • For including different figures for HTML and PDF outputs, use:

      include_graphics2('local.gif', 'local.png')

      The first argument local.gif is used when the output format is HTML, and the second, local.png, is used when it’s PDF.

    • include_graphics2() also works fine with URLs. It is totally fine to use URLs instead of local file paths in the example above:

  png_url <- 'https://commonmark.org/images/markdown-mark.png'
  gif_url <- 'https://media.giphy.com/media/k3dcUPvxuNpK/giphy.gif'
  include_graphics2(gif_url, png_url)
    Source Code

I made include_graphics2() available through a package I maintain (linguisticsdown). You can read the documentation of include_graphics2() and a vignette of its usage on the package web page.

For those who want to use include_graphics2() but don’t want to depend on linguisticsdown, feel free to copy the source of include_graphics2() to your personal package. If you don’t have one, you can still use the source script as a regular R script, but remember to add the following lines to the top of the script:

library(knitr)
library(tools)

To leave a comment for the author, please follow the link and comment on their blog: Yongfu's Blog.

    RcppArmadillo 0.9.200.7.1

    Sat, 03/09/2019 - 03:48

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A minor RcppArmadillo bugfix release arrived on CRAN today. This version 0.9.200.7.1 has two local changes. R 3.6.0 will bring a change in sample() (to correct a subtle bug for large samples) meaning many tests will fail, so in one unit test file we reset the generator to the old behaviour to ensure we match the (old) test expectation. We also backported a prompt upstream fix for an issue with drawing Wishart-distributed random numbers via Armadillo which was uncovered this week. I also just uploaded the Debian version.
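For readers who have not seen this pattern before, resetting the generator looks roughly like the sketch below (a generic illustration, not the exact contents of the RcppArmadillo test file); RNGversion("3.5.0") warns under R >= 3.6.0 that the old 'Rounding' sampler is in use, which is why it is wrapped in suppressWarnings().

# Reproduce draws recorded under the pre-3.6.0 sample() behaviour
suppressWarnings(RNGversion("3.5.0"))
set.seed(123)
sample(10, 3)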

Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language–and is widely used by (currently) 559 other packages on CRAN.

Changes are listed below:

    Changes in RcppArmadillo version 0.9.200.7.1 (2019-03-08)
• Explicit setting of RNGversion("3.5.0") in one unit test to accommodate the change in sample() in R 3.6.0

    • Back-ported a fix to the Wishart RNG from upstream (Dirk in #248 fixing #247)

    Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

    Book promotion – “Processing and Analyzing Financial Data with R”

    Sat, 03/09/2019 - 01:00

    (This article was first published on R on msperlin, and kindly contributed to R-bloggers)

    Cover

I recently did a book promotion for my R book in Portuguese and it was a big success!

My English book is now being sold with the same promotion. You can purchase it with a 50% discount if you buy it on the 10th of March. See it here. The discount will be valid throughout the week, with daily price increases.

    If you want to learn more about R and its use in Finance and Economics, this book is a great opportunity.

    You can find more details about the book, including datasets, R code and all that good stuff here. An online and public version with the first 7 chapters is available in this link.

If you liked the work (or not), please provide an honest review at Amazon.com. As an author, I certainly appreciate it and will use the feedback for the next edition of the book.

    Best,


To leave a comment for the author, please follow the link and comment on their blog: R on msperlin.

    Building tidy tools workshop

    Fri, 03/08/2019 - 01:00

    (This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Join RStudio Chief Data Scientist Hadley Wickham for his popular “Building tidy tools” workshop in Sydney, Australia! If you missed the sold-out course at rstudio::conf 2019, now is your chance.

    Register here: https://www.rstudio.com/workshops/building-tidy-tools/

    You should take this class if you have some experience programming in R and you want to learn how to tackle larger-scale problems. You’ll get the most if you’re already familiar with the basics of functions (i.e., you’ve written a few) and are comfortable with R’s basic data structures (vectors, matrices, arrays, lists, and data frames). There is approximately a 30% overlap in the material with Hadley’s previous “R Masterclass”. However, the material has been substantially reorganised, so if you’ve taken the R Masterclass in the past, you’ll still learn a lot in this class.

    What will I learn?

    This course has three primary goals. You will:

    • Learn efficient workflows for developing high-quality R functions, using the set of codified conventions of good package design. You’ll also learn workflows for unit testing, which helps ensure that your functions do exactly what you think they do.
    • Master the art of writing functions that do one thing well and can be fluently combined together to solve more complex problems. We’ll cover common function-writing pitfalls and how to avoid them.
    • Learn best practices for API design, programming tools, object design in S3, and the tidy eval system for NSE.

    When – WEDNESDAY, MAY 1, 2019, 8:00AM – THURSDAY, MAY 2, 2019, 5:00PM

    Where – The Westin Sydney, 1 Martin Pl, Sydney NSW 2000, Australia

    Who – Hadley Wickham, Chief Scientist at RStudio

    Build your skills and learn from the best at this rare in-person workshop – the only Australia workshop from Hadley in 2019.

    Register here: https://www.rstudio.com/workshops/building-tidy-tools/

    Discounts are available for 5 or more attendees from any organization, and for students.

    Please email training@rstudio.com if you have any questions about the workshop that aren’t answered on the registration page.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

    SatRday Paris: Build interactive waffle plots

    Fri, 03/08/2019 - 00:25

    (This article was first published on (en) The R Task Force, and kindly contributed to R-bloggers)

Can mapping tools be diverted to other uses? Of course! See how we play with leaflet and leafgl to quickly render a giant waffle made of millions of polygons.

    Spatial tools are not only for spatial data

    At SatRday in Paris, I presented a talk entitled “Everything but maps with spatial tools”.
The aim is to show that there exist powerful tools in some fields that can be applied to other fields. You just have to open your mind a little bit. Personally, I like playing with maps and spatial data. In fact, I see rasters everywhere…

In this blog post, I present a small part of my talk. It seems that what has left its mark on people’s minds is the waffle story… So this is the first one I will share with you!

    Waffles are electronic components

    At ThinkR, we develop and deploy Shiny applications. In this case, it is for a client working with electronic components. I can not say more about this work. This is why I used this analogy with waffles.

Open your mind, I said! Use your imagination.

    Imagine the world largest waffle. A waffle with millions of cells.

    Brooklyn, Martha Friedman Waffle

    We’re going to add some chocolate on it. However, we are not really good at it and it is not possible to put the same quantity of chocolate in each cell.

    For each cell, we know the coordinates of its center. We know the quantity of chocolate. In this job, we want to be able to:

    • Explore each cell if needed
• Zoom in/zoom out over these millions of cells
    • Show this waffle in a Shiny application
    • Switch from one waffle to another as smoothly and quickly as possible
A raster is a multipolygon structure

A package that recently appeared among spatial tools within R is {leafgl}. This is a GitHub-only R package available here: https://github.com/r-spatial/leafgl. It allows quick rendering of millions of points using leaflet.
Waffles were the perfect opportunity to test this package!

    A waffle is a regular grid. This may be seen as a raster. However, {leafgl} only works with vector objects. Of course, {leaflet} can print rasters, but this was not quick enough for us. Not always, but sometimes speed matters…
    So again, open your mind and try to see a spatial vector when looking at a raster.

library(raster)
library(rasterVis)
library(waffler)
library(cowplot)

# Build a raster
mat <- matrix(c(1, 0, 0, 1), ncol = 2)
r <- raster(mat)
plot(r)

# Get centers
r_df <- data.frame(coordinates(r), values = values(r))

# Transform as polygons
r_waffle <- wafflerize(r_df)

g1 <- gplot(r) +
  geom_raster(aes(fill = value)) +
  guides(fill = FALSE) +
  ggtitle("Original raster")

g2 <- ggplot(r_waffle) +
  geom_sf(aes(colour = LETTERS[1:nrow(r_waffle)]), size = 2) +
  guides(colour = FALSE) +
  ggtitle("Raster as polygon")

plot_grid(plotlist = list(g1, g2))

Do you see rasters as polygons now?
So we’re going to play with a bigger waffle! We can, so why would we deprive ourselves of it?

Plot a heart-shaped waffle

A rectangle is boring, so let’s build a heart-shaped waffle with {sf}. The heart is a polygon from which we can extract a regular set of points using st_sample. This set of points symbolizes the center of each waffle cell. Then we can generate more or less chocolate in each cell.
The trick is to use spatial tools on non-spatial data. The heart we draw is not to be placed at a real position on Earth. Indeed, I am currently writing this blog post on the train (#OnTheTrainAgain, as Colin says), thus the heart drawn has no fixed position. Even if I am on Earth, my train is not following a heart-shaped path. Well, you understand the idea…

Hence, to be able to play with this polygon, it is better to assign it a projection. Let’s add a Lambert93 projection (EPSG: 2154), because I am in France!

library(sf)

# Construct heart (code from @dmarcelinobr)
xhrt <- function(t) 16 * sin(t)^3
yhrt <- function(t) 13 * cos(t) - 5 * cos(2 * t) - 2 * cos(3 * t) - cos(4 * t)

# create heart as polygon
heart_sf <- tibble(t = seq(0, 2 * pi, by = .1)) %>%
  mutate(y = yhrt(t),
         x = xhrt(t)) %>%
  bind_rows(., head(., 1)) %>%
  dplyr::select(x, y) %>%
  as.matrix() %>%
  list() %>%
  st_polygon() %>%
  st_sfc(crs = 2154)

g1 <- ggplot(heart_sf) +
  geom_sf(fill = "#cb181d") +
  coord_sf(crs = 2154, datum = 2154) +
  ggtitle("Heart sf polygon")

# create grid
heart_grid <- st_sample(heart_sf, size = 500, type = "regular") %>%
  cbind(as.data.frame(st_coordinates(.))) %>%
  rename(x = X, y = Y) %>%
  st_sf() %>%
  mutate(z = cos(2*x) - cos(x) + sin(y),
         z_text = paste("Info: ", round(z)))

g2 <- ggplot(heart_grid) +
  geom_sf(colour = "#cb181d") +
  coord_sf(crs = 2154, datum = 2154) +
  ggtitle("Heart as regular point grid")

g3 <- ggplot(heart_grid) +
  geom_sf(aes(colour = z), size = 2) +
  scale_colour_distiller(palette = "YlOrBr", type = "seq", direction = 1) +
  theme(panel.background = element_rect(fill = "#000000")) +
  coord_sf(crs = 2154, datum = 2154) +
  guides(colour = FALSE) +
  ggtitle("Chocolate quantity for each point")

cowplot::plot_grid(g1, g2, g3, ncol = 3)

The projection chosen is however a problem. {leaflet} requires spatial objects with non-projected geographic coordinates (EPSG: 4326). Why didn’t I directly assign coordinates in degrees, then? Because {leaflet} produces a map in the Mercator projection (EPSG: 3857), and I do not want my heart to be deformed by the change in projection. Why didn’t I assign a Mercator projection then? Because I like maps and I dislike Mercator, as it gives a really bad representation of the distances and surfaces of our World. But this is another problem… In our case, I will indeed have to create my polygon grid in a Mercator projection, then transform it to geographic coordinates, so that my heart is not in a bad mood on the final map.
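A minimal illustration of that projection round trip, using a single point rather than the full waffle grid (the coordinates here are arbitrary):

library(sf)
pt_l93   <- st_sfc(st_point(c(700000, 6600000)), crs = 2154)  # Lambert 93, the CRS used above
pt_wgs84 <- st_transform(pt_l93, 4326)                        # what leaflet expects as input
pt_merc  <- st_transform(pt_l93, 3857)                        # how leaflet actually draws it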
    To transform a regular set of points into a polygon grid ready for leaflet while integrating deformation, I created package {waffler}, available on Github only (https://github.com/ThinkR-open/waffler). Use devtools::install_github("ThinkR-open/waffler") to try it.
    We’ll use function wafflerize. The fact parameter is here to magnify the size of the heart. If we keep a resolution of 1 for a 500x500 square waffle, this will be too small to appear with distinguishable geographical coordinates.

# Generate a grid polygon from points
heart_polygon <- wafflerize(heart_grid, fact = 1000000)

g1 <- ggplot(heart_polygon) +
  geom_sf(aes(fill = z), colour = "blue", size = 0.5) +
  scale_fill_distiller(palette = "YlOrBr", type = "seq", direction = 1) +
  theme(panel.background = element_rect(fill = "#000000")) +
  coord_sf(crs = 4326, datum = 4326) +
  guides(fill = FALSE) +
  ggtitle("Chocolate quantity (geographical coordinates = data)")

g2 <- ggplot(heart_polygon) +
  geom_sf(aes(fill = z), colour = "blue", size = 0.5) +
  scale_fill_distiller(palette = "YlOrBr", type = "seq", direction = 1) +
  theme(panel.background = element_rect(fill = "#000000")) +
  coord_sf(crs = 3857, datum = 3857) +
  guides(fill = FALSE) +
  ggtitle("Chocolate quantity (Mercator projection = leaflet)")

plot_grid(plotlist = list(g1, g2), ncol = 2)

    Plot a big interactive waffle

    We’re almost there! Remember that our objective was to build an interactive waffle.
    For the demonstration, we build a bigger heart with 50000 points. Let’s plot it with {leafgl}.

library(leaflet)
library(leafgl)

# Bigger heart
heart_grid <- st_sample(heart_sf, size = 50000, type = "regular") %>%
  cbind(as.data.frame(st_coordinates(.))) %>%
  rename(x = X, y = Y) %>%
  st_sf() %>%
  mutate(z = cos(2*x) - cos(x) + sin(y),
         z_text = as.character(round(z, digits = 1)))

# Generate a grid polygon from points
heart_polygon2 <- wafflerize(heart_grid, fact = 100)

# Define colors for `addGlPolygons`
cols <- brewer_pal(palette = "YlOrBr", type = "seq")(7)
colours <- gradient_n_pal(cols)(rescale(heart_polygon2$z))
colours_rgb <- (t(col2rgb(colours, alpha = FALSE)) / 255) %>% as.data.frame()

# Render as leaflet
m <- leaflet() %>%
  # The popup is currently not working anymore with {leafgl}.
  # I cheat a little and take the previous version waiting for a patch...
  leaflet.glify::addGlifyPolygons(data = heart_polygon2,
  # addGlPolygons(
                                  color = colours_rgb,
                                  popup = "z_text",
                                  opacity = 1) %>%
  setView(lng = mean(st_bbox(heart_polygon2)[c(1, 3)]),
          lat = mean(st_bbox(heart_polygon2)[c(2, 4)]),
          zoom = 15)

    You can now navigate inside this polygon-raster object. Click on the cells to know the quantity of chocolate…

    Next time you’ll see waffles in the wild, you may think about me…

    The post SatRday Paris: Build interactive waffle plots appeared first on (en) The R Task Force.


To leave a comment for the author, please follow the link and comment on their blog: (en) The R Task Force.

    Where to find the worst weather in the US

    Thu, 03/07/2019 - 21:20

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    Which US city has the worst weather? To answer that question, data analyst Taras Kaduk counted the number of pleasant days in each city and ranked them accordingly. For this analysis, a "pleasant" day is one where the average temperature was in the range 55°F-75°F, the maximum was in the range 60°-90°, the minimum was in the range 40°-70°, and there was no significant rain or snow. (The weather data was provided by the NOAA Global Surface Summary of the Day dataset and downloaded to R with the rnoaa package.)
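As a rough sketch of that "pleasant day" filter (the column names avg_temp, max_temp, min_temp, prcp and snow are placeholders, not the exact names returned by rnoaa):

library(dplyr)
# toy data standing in for two days of station summaries
daily <- tibble::tibble(avg_temp = c(60, 80), max_temp = c(72, 95),
                        min_temp = c(50, 68), prcp = c(0, 0.3), snow = c(0, 0))
daily %>%
  mutate(pleasant = between(avg_temp, 55, 75) & between(max_temp, 60, 90) &
                    between(min_temp, 40, 70) & prcp == 0 & snow == 0) %>%
  summarise(n_pleasant = sum(pleasant))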

    With those criteria and focusing just on the metro regions with more than 1 million people, the cities with the fewest pleasant days are:

    1. San Juan / Carolina / Caguas, Puerto Rico (hot year-round)
    2. Rochester, NY (cold in the winter, rain in the summer)
    3. Detroit / Warren / Dearborn, MI (cold in the winter, rain in the summer)

You can see the top (bottom?) 25 cities in this list in the elegant chart below (also by Taras Kaduk), which shows each city as a polar bar chart, with one ring for each of the 6 years of data analyzed.

    And if you're wondering which cities have the best weather, here's the corresponding chart for the 25 cities with the most pleasant days. San Diego / Carlsbad (CA) tops that list.

    You can find the R code behind the analysis and charts in this Github repository. (The polar charts above required a surprisingly small amount of code: it's a simple transformation of a regular bar chart with the ggplot2 coord_polar transformation — quite appropriate given the annual cycle of weather data.) And for the full description of the analysis including some other nice graphical representations of the data, check out the blog post linked below.

    Taras Kaduk: Where are the places with the best (and the worst) weather in the United States


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    SatRday Paris 2019 Slides on Futures

    Thu, 03/07/2019 - 21:00

    (This article was first published on JottR on R, and kindly contributed to R-bloggers)

    Below are links to my slides from my talk on Future: Friendly Parallel Processing in R for Everyone that I presented last month at the satRday Paris 2019 conference in Paris, France (February 23, 2019).

    My talk (32 slides; ~40 minutes):

    • Title: Future: Friendly Parallel Processing in R for Everyone
    • HTML (incremental slides; requires online access)
    • PDF (flat slides)

    A big shout out to the organizers, all the volunteers, and everyone else for making it a great satRday.


To leave a comment for the author, please follow the link and comment on their blog: JottR on R.

    gower 0.2.0 is on CRAN

    Thu, 03/07/2019 - 15:27

    (This article was first published on R – Mark van der Loo, and kindly contributed to R-bloggers)

    A new version of R package gower has just been released on CRAN.

    Thanks to our new contributor David Turner who was kind enough to provide a pull request, gower now also computes weighted gower distances.

    From the NEWS file:

    • gower_dist and gower_topn gain weight argument for weighted matching (thanks to David Turner)
    • gower_dist and gower_topn gain ignore_case argument for automatic column matching.
    • gower_dist now returns numeric(0) invisibly when there are no columns to compare.
    • gower_topn now returns a list with empty matrices when there are no columns to compare.
    • gower_topn now warns when n>nrow(y) and sets n=nrow(y)
• bugfix: comparing factors with characters would cause a crash (thanks to Max Kuhn)

    Compute Gower’s distance (or similarity) coefficient between records. Compute the top-n matches between records. Core algorithms are executed in parallel on systems supporting OpenMP.
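A minimal usage sketch of the new weighting feature is below; the data are made up, and the argument name weights follows the NEWS entry above (check ?gower_dist if your installed version spells it differently).

library(gower)
x <- data.frame(height = c(170, 180), eyes = c("blue", "brown"))
y <- data.frame(height = c(175, 165), eyes = c("brown", "blue"))
gower_dist(x, y)                     # unweighted distance, record by record
gower_dist(x, y, weights = c(2, 1))  # height counts twice as much as eye colour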


To leave a comment for the author, please follow the link and comment on their blog: R – Mark van der Loo.

    Forecasting From a Regression with a Square Root Dependent Variable

    Thu, 03/07/2019 - 03:21

    (This article was first published on Econometrics Beat: Dave Giles' Blog, and kindly contributed to R-bloggers)

    Back in 2013 I wrote a post that was titled, “Forecasting From Log-Linear Regressions“. The basis for that post was the well-known result that if you estimate a linear regression model with the (natural) logarithm of y as the dependent variable, but you’re actually interested in forecasting y itself, you don’t just report the exponentials of the original forecasts. You need to add an adjustment that takes account of the connection between a Normal random variable and a log-Normal random variable, and the relationship between their means. Today, I received a query from a blog-reader who asked how the results in that post would change if the dependent variable was the square root of y, but we wanted to forecast the y itself. I’m not sure why this particular transformation was of interest, but let’s take a look at the question. In this case we can exploit the relationship between a (standard) Normal distribution and a Chi-Square distribution in order to answer the question.

    I’m going to “borrow” heavily from my earlier post.

    Suppose that we’re working with a regression model of the form
                  √(yt) = β1 + β2x2t + β3x3t + ……+ βkxkt + εt    ;   t = 1, 2, …., n.
The explanatory variables are assumed to be non-random, so that rules out models with lagged values of the dependent variable appearing as regressors. Let’s also assume that the errors have a zero mean, and are serially independent and homoskedastic (with a variance of σ2). Crucially, we’ll also assume that they are Normally distributed.
    Once we estimate the model, we have the “fitted” values for the dependent variable. That is, we have values of [√(yt)]* = b1 + b2x2t + ….. + bkxkt, where bi is the estimated value of βi (i = 1, 2, ……, k). These estimates might be obtained by OLS, but that’s not crucial to the following argument.
    Now, we’re likely to be more interested in fitted values expressed in terms of the original data – that is, in terms of y itself, rather than √(y) .
    You might think that all we have to do is to apply the inverse of the square root transformation, and the fitted values of interest will be yt* = {[√(yt)]*}2. Unfortunately, just squaring the original forecasts will result in an unnecessary distortion.  Let’s see why.

If εₜ is Normally distributed, then √(yₜ) is also Normally distributed, with a mean of (β₁ + β₂x₂ₜ + β₃x₃ₜ + …… + βₖxₖₜ) and a variance of σ². Another way of describing √(yₜ) is that it is σZ + (β₁ + β₂x₂ₜ + β₃x₃ₜ + …… + βₖxₖₜ), where Z is a Standard Normal variate.

Now consider yₜ itself. We get this by squaring the √(yₜ) random variable, so we can write:

    yₜ = σ²Z² + (β₁ + β₂x₂ₜ + β₃x₃ₜ + …… + βₖxₖₜ)² + 2σZ(β₁ + β₂x₂ₜ + β₃x₃ₜ + …… + βₖxₖₜ).

Taking expectations,

    E[yₜ] = σ²E[Z²] + (β₁ + β₂x₂ₜ + β₃x₃ₜ + …… + βₖxₖₜ)²,

because the mean of Z is zero. Recalling that Z² is a Chi-Square variate with one degree of freedom, and that the mean of a Chi-Square variate is its degrees of freedom, we immediately see that

    E[yₜ] = (β₁ + β₂x₂ₜ + β₃x₃ₜ + …… + βₖxₖₜ)² + σ².

So, if we want to obtain forecasts for y, we should square the forecasts of √(y), but then we need to add σ². Of course, σ² is unobservable, but we can replace it with its unbiased estimator, s². The latter is the sum of squared residuals (from the original regression, with √(yₜ) as the dependent variable), divided by the degrees of freedom, (n – k). Failure to add this extra term will result in point forecasts that are distorted in a downwards direction. Of course, the magnitude of this distortion will depend on both the scale of measurement for our y data, and the signal-to-noise ratio in our regression model.
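As a quick numerical sanity check on that last expectation (my own sketch, not part of the original post), one can simulate data in which √(y) is linear in x with Normal errors and compare y with the square of its conditional mean; the average gap should be close to σ².

set.seed(123)
n      <- 1e5
x      <- runif(n)
sigma  <- 0.6
sqrt_y <- 2 + 0.5 * x + rnorm(n, sd = sigma)   # sqrt(y) is Normal with mean 2 + 0.5*x
y      <- sqrt_y^2
mean(y - (2 + 0.5 * x)^2)                      # close to sigma^2 = 0.36, the term the naive forecast omits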

In my earlier post relating to this issue in the context of log-linear regressions, I illustrated the results using both EViews and R. Specifically, I looked at a crude model for the “Airplane Passengers” (AP) time-series, based on the analysis of Cowpertwait and Metcalfe (2009, pp. 109-118). Here, I’ll just use R to illustrate the results of the current post. The R script is available on this blog’s code page, and it can be opened with any text editor. The square root of AP is regressed against a quadratic time trend and various Sine and Cosine terms of the form SIN(2πit) and COS(2πit); i = 1, 2, …, 5:
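The original script is not reproduced in this feed, so here is a minimal sketch of the kind of model and adjustment being described, using base R’s AirPassengers series. It is a reconstruction under stated assumptions, not Dave Giles’ own script, so the exact specification (and hence the numbers) may differ slightly from those quoted below.

AP  <- AirPassengers                 # monthly airline passenger totals, base R dataset
tt  <- as.numeric(time(AP))          # time index in years
SIN <- COS <- matrix(0, nrow = length(tt), ncol = 5)
for (i in 1:5) {
  SIN[, i] <- sin(2 * pi * i * tt)
  COS[, i] <- cos(2 * pi * i * tt)
}
fit <- lm(sqrt(AP) ~ tt + I(tt^2) + SIN + COS)    # sqrt(AP) on a quadratic trend plus harmonics

s2    <- sum(residuals(fit)^2) / fit$df.residual  # unbiased estimate of sigma^2
APF   <- fitted(fit)^2                            # naive: just square the fitted values
APFAD <- fitted(fit)^2 + s2                       # adjusted, since E[y] = (mean of sqrt(y))^2 + sigma^2

sum((AP - APF)^2)    # within-sample squared prediction error, naive forecasts
sum((AP - APFAD)^2)  # within-sample squared prediction error, adjusted forecasts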

The time series “APF” is the series of naive within-sample predictions, obtained by simply squaring the fitted values for √(AP). The time series “APFAD” incorporates the adjustment term discussed above. In this particular case, s = 0.5098, so there’s not a huge difference between APF and APFAD.
    However, the sum of the squared (within-sample) prediction errors, based on AP and APF is 42318.42, while that based on AP and APFAD is 42310.45. So, there’s a bit of improvement in overall forecast performance when we take the adjustment term into account.
A final comment is in order. Although we’ve been able to see how to “back out” appropriate forecasts of y when the original regression has a dependent variable that is either log(y) or √(y), we were able to do this only because, when the inverses of these particular transformations are applied to a Normal random variable, the resulting new random variable has a known distribution. This little “trick” is not going to work – at least, not easily – for arbitrary non-linear transformations of y.

    Reference

Cowpertwait, P. S. P. and A. V. Metcalfe, 2009. Introductory Time Series with R. Springer, Dordrecht.
© 2019, David E. Giles

To leave a comment for the author, please follow the link and comment on their blog: Econometrics Beat: Dave Giles' Blog.

    RInside 0.2.15

    Thu, 03/07/2019 - 01:40

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A new release 0.2.15 of RInside arrived on CRAN and in Debian today. This marks the first release in almost two years, and it brings some build enhancements. RInside provides a set of convenience classes which facilitate embedding of R inside of C++ applications and programs, using the classes and functions provided by Rcpp.

    RInside is stressing the CRAN system a little in that it triggers a number of NOTE and WARNING messages. Some of these are par for the course as we get close to R internals not all of which are “officially” in the API. My continued thanks to the CRAN team for supporting the package.

    It has (once again!) been nearly two years since the last release, and a number of nice extensions, build robustifications (mostly for Windows) and fixes had been submitted over this period—see below for the three key pull requests. There are no new user-facing changes.

The most recent change, and the one triggering this release, was based on an rchk report: the one time we call Rf_eval() could conceivably have a memory allocation race, so two additional PROTECT calls make it more watertight. The joys of programming with the C API …

    But thanks so much to Tomas for patient help, and to Gábor for maintaining the ubuntu-rchk Docker container. While made for rhub, it is also available pre-made here at Docker Cloud which allowed me to run the rchk.sh script in a local instance.

    Changes since the last release were:

    Changes in RInside version 0.2.15 (2019-03-06)
• Improved Windows build support by copying getenv("R_HOME") result and improving backslash handling in environment variable setting (Jonathon Love in #27 and #28)

    • Improved Windows build support by quote-protecting Rscript path in Makevars.win (François-David Collin in #33)

    • A URL was corrected in README.md (Zé Vinícius in #34).

    • Temporary SEXP objects are handled more carefully at initialization to satisfy rchk (Dirk in #36)

    CRANberries also provides a short report with changes from the previous release. More information is on the RInside page. Questions, comments etc should go to the rcpp-devel mailing list off the Rcpp R-Forge page, or to issues tickets at the GitHub repo.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

    Paid in Books: An Interview with Christian Westergaard

    Thu, 03/07/2019 - 01:00

    (This article was first published on R Views, and kindly contributed to R-bloggers)

R is greatly benefiting from new users coming from disciplines that traditionally did not provoke much serious computation. Journalists[1] and humanist scholars[2], for example, are embracing R. But, does the avenue from the Humanities go both ways? In a recent conversation with Christian Westergaard, proprietor of Sophia Rare Books in Copenhagen, I was delighted to learn that it does.

JBR: I was very pleased to learn when I spoke with you recently at the California Antiquarian Book Fair that you were an S and S+ user in graduate school. What were you studying, and how was S and S+ helpful?

    CW: I did a Master’s in mathematics and a Bachelor’s in statistics at the University of Copenhagen in Denmark, graduating in 2005. During the first year of my courses in statistics, we were quickly introduced to S+ in order to do monthly assignments. In these assignments, we were to apply the theory we had learned in the lectures on some concrete data. I still remember how difficult I initially found applying the right statistical tools to real-world problems, rather than just understanding the math in theoretical statistics. I developed a deep respect for applied statistics. Our minds can easily be deceived and we need proper statistics to make the right decisions.

    JBR: How did you move from technical studies to dealing in rare scientific books and manuscripts? On the surface it seems that these might be two completely unrelated activities. How did you find a path between these two worlds?

CW: It was a gradual shift. When I first began studying, I went into an old antiquarian book shop to acquire some second-hand math and statistics books to supplement my ordinary course texts. One of them was Statistical Methods by Anders Hald. Hald was no longer working at the university, but his text had become a classic. I was fascinated by this book shop. The owner was an old, grey-haired man sitting behind a huge stack of books, smoking a pipe, and writing his sales catalogues on an old IBM typing machine. He allowed me to go down into his cellar, where there were books everywhere from floor to ceiling. There were many books which I wanted to acquire down there, but I hardly had any money. It was a mess in the cellar and I offered to tidy up if he could pay me in the books I wanted, and he agreed. I loved coming to work there, and I continued to do so throughout my studies. My boss gave me more and more responsibility and put me in charge of the mathematics, physics, statistics and science books in general. When I finished my Master’s, I was considering doing a PhD. I loved mathematics and still do to this day. But I also found that when I woke up in the morning, I was thinking of antiquarian books, and in the evening I couldn’t get to bed because I was thinking of books. It gave me energy and happiness. So I thought, why not try and be a rare book dealer for a year or two and see how it works out? It’s been 14 years since I made that decision, and I have really enjoyed it. In 2009, I decided to start my own company and specialize in important books and manuscripts in science.

JBR: What is it like to be immersed in these rare artifacts that were so important for the transmission of scientific knowledge? What kinds of scholars do you consult to establish the authenticity of works like Euler’s Opuscula Varii Argumenti or Cauchy’s Leçons sur le calcul différentiel?

    CW: I feel privileged to handle some of these objects on a daily basis. One day I am sitting with an original autograph manuscript by Einstein doing research on relativity, and the next day I have a presentation copy of Darwin’s Origin of Species in my hands. These are objects which have changed the world and the way we think about ourselves. In addition to the books and manuscripts, I find the people who I meet extremely interesting. A few years before Anders Hald (whose book had originally brought me into my old boss’ shop) passed away, I went to buy his books. He was 92 and completely fresh in his mind. We spoke about the history of statistics – a subject about which he authored several books.

    JBR: I have noticed that collectors seem to be very interested in the works of twentieth-century mathematicians and physicists. You have works by Alonzo Church, Kurt Gödel, Richard Feynman, and others in your catalogue. But your roster of statisticians seems to focus on the old masters such as Laplace and de Moivre. Are collectors also interested in Karl Pearson, Udny Yule, and R. A. Fisher?

    CW: Certainly. Maybe so much so that every time I get one of Pearson or Fisher’s main papers they sell immediately. That’s why you don’t see them in my stock at the moment.

    JBR: I noticed that you had two works by Sofya Vasilyevna Kovalevskaya on display in California. Do you see a renewed interest in the works of women scientists and mathematicians, or is this remarkable and brilliant woman an exception?

CW: There has definitely been a renewed interest in exceptional women scientists. A few years ago the New York-based Grolier Club hosted an exhibition called ‘Extraordinary Women in Science and Medicine’, and several institutions are focusing on the subject. These women who broke through the social constraints against them are exceptional and fascinating people.

    JBR: Although there are notable exceptions (Donald Knuth’s typesetting comes to mind), I think most data scientists, computer scientists, and statisticians work in a digital world of ebooks and poorly printed texts. Do you think that the technical book as a collectable artifact will survive the twenty-first century? What advice would you give to working data scientists and statisticians who are interested in collecting?

CW: Good question. Many important papers nowadays are not even printed, and the only physical material a researcher might have left from some landmark work might be some scribbles he or she did on a piece of paper. There are examples of people who collect digital art. They use various ways of signing or otherwise authenticating the artist’s work, even if it’s on a USB stick. Maybe that’s how some research papers might be collected in the future?

My advice for anyone wanting to start collecting would be to first focus on some of the classics in their field or some other field that fascinates them. The classics will have been collected by many others in the past and there will be good descriptions, bibliographies, and catalogues describing them and why they are collectible. That way one will gradually get a feeling about which mechanisms are important when collecting and what to focus on, e.g., condition, provenance, etc. And then I’d say it’s important to build a good relationship with at least one dealer with a good reputation in the trade. Any great collection is built on a collaboration where collectors and dealers work together.

JBR: Excellent advice! Thank you, Christian.

[1] For example, have a look at some of the R training at this year’s IRE-CAR Conference.

[2] See, for example, these University of Washington resources for the digital humanities.

Sophia Rare Books (Copenhagen) specializes in rare and important books and manuscripts in the History of Science and Medicine fields.



To leave a comment for the author, please follow the link and comment on their blog: R Views.

    Quick post – detect and fix this ggplot2 antipattern

    Thu, 03/07/2019 - 01:00

    (This article was first published on Category R on Roel's R-tefacts, and kindly contributed to R-bloggers)

Recently one of my coworkers showed me a ggplot and, although it is not wrong, it is also not ideal. Here is the TL;DR:

    Whenever you find yourself adding multiple geom_* to show different groups, reshape your data

    In software engineering there are things called antipatterns, ways of programming
    that lead you into potential trouble. This is one of them.

    I’m not saying it is incorrect, but it might lead you into trouble.

    example: we have some data, some different calculations and we want to plot that.

    I load tidyverse and create a modified mtcars set in this hidden part,
    but if you don’t care you can leave it unopened

Cool how this folds away, right? It even works on GitHub markdown; if you want to know how I did this, I explain it here.

library(tidyverse) # I started loading magrittr, ggplot2 and tidyr, and realised
## ── Attaching packages ─────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.3.0
## ✔ tibble  2.0.1     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# I needed dplyr too, at some point loading tidyverse is simply easiest.
very_serious_data <- mtcars %>%
  as_tibble(rownames = "carname") %>%
  group_by(cyl) %>%
  mutate(
    mpg_hp = mpg / hp,
    first_letter = str_extract(carname, "^[A-z]"),
    mpg_hp_c = mpg_hp / mean(mpg_hp), # grouped mean
    mpg_hp_am = mpg_hp + am
  )

Now the data (mtcars) and calculations don’t really make sense, but they are here to show you the
antipattern. I created 3 variants of dividing mpg (miles per gallon) by hp (horse power).

    The antipattern

    We have a dataset with multiple variables (columns) and want to plot
    one against the other, so far so good.

    What is the effect of mpg_hp for every first letter of the cars?

very_serious_data %>%
  ggplot(aes(first_letter, mpg_hp)) +
  geom_point() +
  labs(caption = "So far so good")

    But you might wonder what the other transformations of that variable do?
    You can just add a new geom_point, but maybe with a different color?
    And to see the dots that overlap you might make them a little opaque.

very_serious_data %>%
  ggplot(aes(first_letter, mpg_hp)) +
  geom_point(alpha = 2/3) +
  geom_point(aes(y = mpg_hp_c), color = "red", alpha = 2/3) +
  labs(caption = "adding equivalent information")

    And maybe the third one too?

very_serious_data %>%
  ggplot(aes(first_letter, mpg_hp)) +
  geom_point(alpha = 2/3) +
  geom_point(aes(y = mpg_hp_c), color = "red", alpha = 2/3) +
  geom_point(aes(y = mpg_hp_am), color = "blue", alpha = 2/3) +
  labs(caption = "soo much duplication in every geom_point call!")

    This results in lots of code duplication for specifying what is essentially
    the same for every geom_point() call. It’s also really hard to add a legend
    now.

    What is the alternative?

    Whenever you find yourself adding multiple geom_* to show different groups, reshape your data

    Gather the columns that are essentially representing the group and reshape
    the data into a format more suitable for plotting. Bonus: automatic correct labeling.

very_serious_data %>%
  gather(key = "ratio", value = "score", mpg_hp, mpg_hp_c, mpg_hp_am) %>%
  ggplot(aes(first_letter, score, color = ratio)) +
  geom_point(alpha = 2/3) +
  labs(caption = "fixing the antipattern")

    And that’s it.

    Mari also tells you it will work
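One aside that is not in the original post: on tidyr 1.0.0 or later, the same reshape can also be written with pivot_longer(), which has since superseded gather(). The sketch below assumes the very_serious_data object created above and the tidyverse already being loaded.

very_serious_data %>%
  pivot_longer(c(mpg_hp, mpg_hp_c, mpg_hp_am),
               names_to = "ratio", values_to = "score") %>%  # long format: one row per ratio per car
  ggplot(aes(first_letter, score, color = ratio)) +
  geom_point(alpha = 2/3) +
  labs(caption = "same fix, written with pivot_longer()")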

    State of the machine

At the moment of creation (when I knitted this document) this was the state of my machine:

sessioninfo::session_info()
## ─ Session info ──────────────────────────────────────────────────────────
## setting  value
## version  R version 3.5.2 (2018-12-20)
## os       Ubuntu 16.04.5 LTS
## system   x86_64, linux-gnu
## ui       X11
## language en_US
## collate  en_US.UTF-8
## ctype    en_US.UTF-8
## tz       Europe/Amsterdam
## date     2019-03-07
##
## ─ Packages ──────────────────────────────────────────────────────────────
## package     * version date       lib source
## assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.0)
## backports     1.1.3   2018-12-14 [1] CRAN (R 3.5.2)
## bindr         0.1.1   2018-03-13 [1] CRAN (R 3.5.0)
## bindrcpp    * 0.2.2   2018-03-29 [1] CRAN (R 3.5.0)
## blogdown      0.9     2018-10-23 [1] CRAN (R 3.5.2)
## bookdown      0.9     2018-12-21 [1] CRAN (R 3.5.2)
## broom         0.5.1   2018-12-05 [1] CRAN (R 3.5.2)
## cellranger    1.1.0   2016-07-27 [1] CRAN (R 3.5.0)
## cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.1)
## colorspace    1.4-0   2019-01-13 [1] CRAN (R 3.5.2)
## crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)
## digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.2)
## dplyr       * 0.7.8   2018-11-10 [1] CRAN (R 3.5.1)
## evaluate      0.13    2019-02-12 [1] CRAN (R 3.5.2)
## forcats     * 0.3.0   2018-02-19 [1] CRAN (R 3.5.0)
## generics      0.0.2   2018-11-29 [1] CRAN (R 3.5.2)
## ggplot2     * 3.1.0   2018-10-25 [1] CRAN (R 3.5.2)
## glue          1.3.0   2018-07-17 [1] CRAN (R 3.5.1)
## gtable        0.2.0   2016-02-26 [1] CRAN (R 3.5.0)
## haven         2.0.0   2018-11-22 [1] CRAN (R 3.5.2)
## hms           0.4.2   2018-03-10 [1] CRAN (R 3.5.0)
## htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.0)
## httr          1.4.0   2018-12-11 [1] CRAN (R 3.5.1)
## jsonlite      1.6     2018-12-07 [1] CRAN (R 3.5.1)
## knitr         1.21    2018-12-10 [1] CRAN (R 3.5.2)
## labeling      0.3     2014-08-23 [1] CRAN (R 3.5.0)
## lattice       0.20-38 2018-11-04 [4] CRAN (R 3.5.1)
## lazyeval      0.2.1   2017-10-29 [1] CRAN (R 3.5.0)
## lubridate     1.7.4   2018-04-11 [1] CRAN (R 3.5.0)
## magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.0)
## modelr        0.1.2   2018-05-11 [1] CRAN (R 3.5.0)
## munsell       0.5.0   2018-06-12 [1] CRAN (R 3.5.0)
## nlme          3.1-137 2018-04-07 [4] CRAN (R 3.5.0)
## pillar        1.3.1   2018-12-15 [1] CRAN (R 3.5.2)
## pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.1)
## plyr          1.8.4   2016-06-08 [1] CRAN (R 3.5.0)
## purrr       * 0.3.0   2019-01-27 [1] CRAN (R 3.5.2)
## R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.2)
## Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.1)
## readr       * 1.3.1   2018-12-21 [1] CRAN (R 3.5.2)
## readxl        1.2.0   2018-12-19 [1] CRAN (R 3.5.2)
## rlang         0.3.1   2019-01-08 [1] CRAN (R 3.5.2)
## rmarkdown     1.11    2018-12-08 [1] CRAN (R 3.5.2)
## rstudioapi    0.8     2018-10-02 [1] CRAN (R 3.5.1)
## rvest         0.3.2   2016-06-17 [1] CRAN (R 3.5.0)
## scales        1.0.0   2018-08-09 [1] CRAN (R 3.5.1)
## sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.2)
## stringi       1.3.1   2019-02-13 [1] CRAN (R 3.5.2)
## stringr     * 1.4.0   2019-02-10 [1] CRAN (R 3.5.2)
## tibble      * 2.0.1   2019-01-12 [1] CRAN (R 3.5.2)
## tidyr       * 0.8.2   2018-10-28 [1] CRAN (R 3.5.1)
## tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.1)
## tidyverse   * 1.2.1   2017-11-14 [1] CRAN (R 3.5.0)
## withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)
## xfun          0.4     2018-10-23 [1] CRAN (R 3.5.2)
## xml2          1.2.0   2018-01-24 [1] CRAN (R 3.5.0)
## yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)
##
## [1] /home/roel/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

To leave a comment for the author, please follow the link and comment on their blog: Category R on Roel's R-tefacts.
