
Bootstrap Testing with MCHT

Mon, 10/29/2018 - 16:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

Introduction

Now that we’ve seen MCHT basics, how to make MCHTest() objects self-contained, and maximized Monte Carlo (MMC) testing with MCHT, let’s talk about bootstrap testing. Not much is different when we’re doing bootstrap testing; the main difference is that the replicates used to generate test statistics depend on the data we feed to the test, and thus are not completely independent of it. You can read more about bootstrap testing in [1].

Bootstrap Hypothesis Testing

Let $S$ represent our test statistic. For bootstrap hypothesis testing, we will construct test statistics from data generated using our sample. Call these test statistics $S^{*}_1, \ldots, S^{*}_N$. These statistics are generated in such a way that we know that the null hypothesis holds for them. Suppose for the sake of demonstration that large values of $S$ constitute evidence against the null hypothesis. Then the $p$-value for the bootstrap hypothesis test is

S\}}" title="\hat{p} = \frac{1}{N} \sum_{j = 1}^{N} I_{\{S^{*}_j > S\}}" class="latex" />

Here, $I_{\{\cdot\}}$ is the indicator function.
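As a concrete illustration of this formula, the $p$-value is simply the proportion of simulated statistics that exceed the observed one. The following sketch uses hypothetical stand-in values; S_star and S below are not objects from the original post:

## Hypothetical stand-ins: S_star plays the role of S*_1, ..., S*_N, S of the observed statistic
N <- 1000
S_star <- rnorm(N)
S <- 1.7
p_hat <- mean(S_star > S)   # proportion of simulated statistics exceeding S
p_hat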

There are many ways to generate the data used to compute the $S^{*}_j$. There’s the parametric bootstrap, where the data is used to estimate the parameters of a distribution, and new samples are then drawn from the fitted distribution. There’s also the nonparametric bootstrap, which doesn’t make such strong assumptions about the data, perhaps sampling from the data itself to generate new samples. Either of these methods can be used in bootstrap testing, and MCHTest() supports both.
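To make the distinction concrete, here is a minimal sketch (not from the original post) of one replicate under each scheme, assuming for the parametric case that a normal distribution was fitted to the data:

## Example data (hypothetical)
x <- rnorm(20, mean = 2, sd = 3)
## Parametric bootstrap: plug the estimated parameters into the assumed distribution
x_param <- rnorm(length(x), mean = mean(x), sd = sd(x))
## Nonparametric bootstrap: resample the observed data with replacement
x_nonparam <- sample(x, size = length(x), replace = TRUE)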

Unlike Monte Carlo tests, bootstrap tests cannot claim to be exact tests for any sample size; they perform better for larger sample sizes. That said, they often work well even in small samples and thus are still a good alternative to inference based on asymptotic results. They could also serve as an alternative approach to the nuisance parameter problem, as MMC often has weak power.

Bootstrap Testing in MCHT

In MCHT, there is little difference between bootstrap testing and Monte Carlo testing. Bootstrap tests need the original dataset to generate replicates; Monte Carlo tests do not. So the difference here is that the function passed to rand_gen needs to accept a parameter x rather than n, with x representing the original dataset, like that passed to test_stat.

That’s the only difference. All else is the same.
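To make the contrast concrete, here is a minimal sketch (assuming a standard normal model for the Monte Carlo case; neither function is taken from the original post) of how the two kinds of rand_gen functions differ in their arguments:

## Monte Carlo test: rand_gen receives the sample size n
rg_mc <- function(n) {
  rnorm(n)
}

## Bootstrap test: rand_gen receives the original data x and resamples it
rg_boot <- function(x) {
  sample(x, replace = TRUE, size = length(x))
}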

Nonparametric Bootstrap Example

Suppose we wish to test for the location of the mean. Our nonparametric bootstrap procedure is as follows:

  1. Generate samples of data by resampling (with replacement) from the demeaned dataset.
  2. Suppose our mean under the null hypothesis is $\mu_0$. Add this mean to each generated dataset and compute the statistic for each of those datasets; these will be the simulated test statistics $S^{*}_j$.
  3. Compute the test statistic on the main data and use the empirical distribution function of the simulated test statistics to compute a $p$-value.

The code below implements this procedure.

library(MCHT)
library(doParallel)

registerDoParallel(detectCores())

ts <- function(x, mu = 0) {
  sqrt(length(x)) * (mean(x) - mu)/sd(x)
}

rg <- function(x) {
  x_demeaned <- x - mean(x)
  sample(x_demeaned, replace = TRUE, size = length(x))
}

sg <- function(x, mu = 0) {
  x <- x + mu
  test_stat(x, mu = mu) # Will be localizing
}

b.t.test <- MCHTest(ts, sg, rg, seed = 123, N = 1000,
                    lock_alternative = FALSE, test_params = "mu",
                    localize_functions = TRUE)

dat <- c(2.3, 1.1, 8.1, -0.2, -0.8, 4.7, -1.9)

b.t.test(dat, alternative = "two.sided", mu = 1)

##
##  Monte Carlo Test
##
## data:  dat
## S = 0.68164, p-value = 0.432
## alternative hypothesis: true mu is not equal to 1

b.t.test(dat, alternative = "less", mu = 7)

##
##  Monte Carlo Test
##
## data:  dat
## S = -3.8626, p-value = 0.025
## alternative hypothesis: true mu is less than 7

Parametric Bootstrap Test

The parametric bootstrap test assumes that the observed data was generated using a specific distribution, such as the Gaussian distribution. All that’s missing, in essence, are the parameters of that distribution. The procedure thus starts by estimating all nuisance parameters of the assumed distribution from the data. Then the first step of the process mentioned above (which admittedly was specific to a test for the mean but still strongly resembles the general process) is replaced with simulating data from the assumed distribution using any parameters assumed under the null hypothesis and the estimated values of any nuisance parameters. The other two steps of the above process are unchanged.

We can use the parametric bootstrap to test for goodness of fit with the Kolmogorov-Smirnov test. Without going into much detail, suppose that $F_{\theta}$ represents a distribution that is known except maybe for the values of its parameters $\theta$. Assume that $X_1, \ldots, X_n$ is an independently and identically distributed dataset, and we have observed values $x_1, \ldots, x_n$. We wish to use the dataset to decide between the hypotheses

\[
H_0: X_i \sim F_{\theta} \qquad \text{vs.} \qquad H_A: X_i \not\sim F_{\theta}
\]

That is, we want to test whether our data was drawn from the distribution $F_{\theta}$ or whether it was drawn from a different distribution. This is what the Kolmogorov-Smirnov test checks.

R implements this test in ks.test(), but that function does not allow for any nuisance parameters; it will only check for an exact match between distributions. This is often not what we want; we want to check whether our data was drawn from any member of the family of distributions $\{F_{\theta}\}$, not a particular member with a particular combination of parameters. It’s tempting to plug in the estimated values of these parameters, but then the $p$-value needs to be computed differently, not in the way that is prescribed by ks.test(). Thus we will need to approach the test differently.

Since the distribution of the data is known under the null hypothesis, this is a good situation to use a bootstrap test. We’ll use maximum likelihood estimation to estimate the values of the missing parameters, as implemented by fitdistrplus (see [2]). Then we generate samples from this distribution using the estimated parameter values and use those samples to generate simulated test statistic values that follow the distribution prescribed by the null hypothesis.

Suppose we wish to test whether the data was drawn from a Weibull distribution. The result is the following test.

library(fitdistrplus)

ts <- function(x) {
  param <- coef(fitdist(x, "weibull"))
  shape <- param[['shape']]; scale <- param[['scale']]
  ks.test(x, pweibull, shape = shape, scale = scale,
          alternative = "two.sided")$statistic[[1]]
}

rg <- function(x) {
  n <- length(x)
  param <- coef(fitdist(x, "weibull"))
  shape <- param[['shape']]; scale <- param[['scale']]
  rweibull(n, shape = shape, scale = scale)
}

b.ks.test <- MCHTest(test_stat = ts, stat_gen = ts, rand_gen = rg,
                     seed = 123, N = 1000)

b.ks.test(rweibull(1000, 2, 2))

##
##  Monte Carlo Test
##
## data:  rweibull(1000, 2, 2)
## S = 0.021907, p-value = 0.275

b.ks.test(rbeta(1000, 2, 2))

##
##  Monte Carlo Test
##
## data:  rbeta(1000, 2, 2)
## S = 0.047165, p-value < 2.2e-16

Conclusion

Given the choice between an MMC test and a bootstrap test, which should you prefer? If you’re concerned about speed and power, go with the bootstrap test. If you’re concerned about precision and getting an “exact” test that’s at least conservative, then go with an MMC test. I think most of the time, though, the bootstrap test will be good enough, even with small samples, but that’s mostly a hunch.

Next week we will see how we can go beyond one-sample or univariate tests to multi-sample or multivariate tests. See the next blog post.

References
  1. J. G. MacKinnon, Bootstrap hypothesis testing, in Handbook of Computational Econometrics (2009), pp. 183–213
  2. M. L. Delignette-Muller and C. Dutang, fitdistrplus: An R Package for Fitting Distributions, Journal of Statistical Software, vol. 64, no. 4 (2015)

Packt Publishing published a book for me entitled Hands-On Data Analysis with NumPy and Pandas, a book based on my video course Unpacking NumPy and Pandas. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)!



Data Science With R Course Series – Week 7

Mon, 10/29/2018 - 10:15

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

After week 7, you will be able to communicate confidently which model features are the most important.

Interpretability is a very important topic in machine learning. The automated machine learning tool, H2O, makes a data scientist’s life easier; however, it doesn’t remove the need to understand your model. As the data scientist, you need to be able to explain why the selected model is the best.

In this week’s curriculum, you learn how to explain “black-box” machine learning models with LIME. LIME stands for Local Interpretable Model-Agnostic Explanations and is used to understand which model features have the most predictive impact.

Here is a recap of our trajectory and the course overview:

Recap: Data Science With R Course Series

You’re in Week 7: Machine Learning Interpretability with LIME. Here’s our game plan over the 10 articles in this series. We’ll cover how to apply data science for business with R following our systematic process.

Week 7: Machine Learning Interpretability with LIME

Student Feedback

Week 7: Machine Learning Interpretability with LIME Overview & Setup

The Overview & Setup will walk through the setup to support LIME within the project workflow, and prepare the machine learning model for interpretation.

After understanding the features that make up your machine learning model, you will be able to answer the critical business question: why is employee churn happening?

Feature Explanation With LIME

Jump right into learning about the LIME package and how it works to interpret machine learning models. Here you will make predictions using your model and investigate employee turnover model results. You will then use LIME to produce an explanation of why certain employees were selected.

Challenge #4

In this 2 part challenge, you will recreate a single explanation plot and a full explanations plot to visualize important features.

After you complete the challenge, walk through the Solution videos to compare and review your working solution.

You Need To Learn R For Business



To be efficient as a data scientist, you need to learn R. Take the course that has cut data science projects in half (see this testimonial from a leading data science consultant) and has progressed data scientists more than anything they have tried before. Over 10 weeks you learn what it has taken data scientists 10 years to learn:

  • Our systematic data science for business framework
  • R and H2O for Machine Learning
  • How to produce Return-On-Investment from data science
  • And much more.

Start Learning Today!

Next Up

The next article in the Data Science With R Series covers Evaluation: Calculating The Expected ROI (Savings) Of A Policy Change.

Learn how to communicate the cost savings of using your model. Inform the business to make decisions around time and resources based on the value of your findings.

Use the Expected Value Framework after your model is complete to explain which features are most important. The Expected Value Framework is a method to calculate savings from implementing business changes based on the model’s results.

Week 8: Evaluation: Calculating The Expected ROI (Savings) Of A Policy Change

New Course Coming Soon: Build A Shiny Web App!

You’re experiencing the magic of creating a high performance employee turnover risk prediction algorithm in DS4B 201-R. Why not put it to good use in an Interactive Web Dashboard?

In our new course, Build A Shiny Web App (DS4B 301-R), you’ll learn how to integrate the H2O model, LIME results, and recommendation algorithm built in the 201 course into an ML-powered R + Shiny web app!

Shiny Apps Course Coming in October 2018!!! Sign up for Business Science University Now!



Building an R + Shiny Web App, DS4B 301-R

Get Started Today!



About a Curious Feature and Interpretation of Linear Regressions

Mon, 10/29/2018 - 09:00

(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)


A small blog post with a riddle, simulation, theory and a concluding rhyme.

Consider a fictitious example in which we have collected a sample of somewhat overweight persons for which we measured weight in kg as $y$ and height in cm as $x$. We estimate the following simple linear regression:

\[
y_i = \hat \beta_0 + \hat \beta_1 \cdot x_i + \hat \varepsilon_i
\]

and assume that we obtain the estimates

\[
y_i = 0 + 1 \cdot x_i + \hat \varepsilon_i
\]

One early message in our Economics 101 course is that for such a regression with non-experimental data, one should not interpret the estimated coefficient $\hat \beta_1$ in a causal way, by saying that one more cm height leads to one more kg weight. One should rather interpret $\hat \beta_1$ as a quantitative measure of the linear relationship between $x$ and $y$, e.g. using a formulation like:

We estimate that 1 cm higher height corresponds on average with $\hat \beta_1 = 1$ kg higher weight.

A simulation study with an interesting finding

Let us move towards the curious feature that I promised in the title. Consider the following simple R code

set.seed(1)
n = 10000
x = rnorm(n)
eps = rnorm(n)
y = x + eps

that simulates data for a simple linear regression model

\[
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
\]

with $\beta_0=0$ and $\beta_1=1$.

If we estimate that model, we indeed find a slope $\hat \beta_1$ close to 1 in our sample:

coef(lm(y~x))

## (Intercept)           x
## -0.004159068 1.004741804

If for a given data set we want to assume nothing about the causal structure, we may as well estimate a simple linear regression with $x$ as the dependent variable and $y$ as the explanatory variable:

lm(x~y)

To make this blog entry a bit more interactive, I have added a quiz powered by Google Forms, where you can make a guess about the rough slope of the regression above.


… scroll down to continue…

Here is the result of the regression:

coef(lm(x~y))

## (Intercept)           y
## -0.001058499 0.510719332

Interestingly, the slope is now close to $1/2$ instead of $1$!
(And yes, the standard errors are very small.)

Not being a statistician by training, I must admit that I was quite surprised by this result. After all, if we ignored the disturbances and just had a simple line $y=x$ with slope $1$, the slope would stay $1$ if we simply swapped the roles of $x$ and $y$ to get the line $x=y$.

Of course, the observed result is fully consistent with the mathematics of the simple least squares estimator. The estimated slope of a simple linear regression of $y$ on $x$ is given by

\[
\hat \beta_1 = \frac {Cov(x,y)} {Var(x)}
\]

Let $\hat \alpha_1$ denote the estimated slope of the regression of $x$ on $y$. We have

\[
\hat \alpha_1 = \frac {Cov(y,x)} {Var(y)}
\]

Since the covariance is symmetric, i.e. $Cov(x,y) = Cov(y,x)$, we thus find

\[
\frac{\hat \alpha_1}{\hat \beta_1} = \frac{Var(x)}{Var(y)}
\]

The ratio of the slopes of the two regressions is equal to the ratio of the sample variances of $x$ and $y$.
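A quick numerical check of this identity with the simulated data from above (my own snippet, not part of the original post):

## The ratio of the two estimated slopes equals var(x)/var(y)
beta1_hat  <- cov(x, y) / var(x)   # slope of y on x
alpha1_hat <- cov(x, y) / var(y)   # slope of x on y
c(alpha1_hat / beta1_hat, var(x) / var(y))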

In our data generating process, $y$, as the sum of $x$ and $\varepsilon$, has twice the variance of $x$, which also holds approximately for the sample variances:

c(var(x),var(y))

## [1] 1.024866 2.016225

To get more intuition, let us look at a scatter plot with $y$ on the vertical and $x$ on the horizontal axis. We have so far two candidate regression lines to account for the relationship between $x$ and $y$. First, the red line with slope $\hat \beta_1 \approx 1$ from the regression of $y$ on $x$. Second, the blue line with slope $\frac{1}{\hat \alpha_1} \approx 2$, where $\hat \alpha_1$ is the slope from the linear regression of $x$ on $y$.

library(ggplot2)

ggplot(mapping=aes(x=x,y=y)) +
  geom_point(alpha=0.2) +
  geom_abline(slope=1, intercept=0, color="red", size=1.2) +
  geom_abline(slope=2, intercept=0, color="blue", size=1.2) +
  theme_bw()

Judging by eye, I could not tell which of the two lines provides a better linear approximation of the shape of the point cloud.

While the red line minimizes the sum of squared vertical distance from the points to the line, the blue line minimizes the sum of squared horizontal distances.

So what about our interpretation of the regression slope?

So, should I present in my introductory course something like the following pair of simplified interpretations of estimated regression slopes?

We estimate that 1 cm higher height corresponds on average with $\hat \beta_1 = 1$ kg higher weight.

and

We also estimate that 1 kg higher weight corresponds on average with $\hat \alpha_1 = 0.5$ cm higher height.

Well, this seems like a good method to generate headaches, to get dozens of emails claiming that there must be a typo in my script, and to cause a significant drop in my course evaluations…

Orthogonal Regression

Instead of minimizing the vertical or horizontal residuals, one could minimize the Euclidean distance of each observation to the regression line. This is done by a so-called orthogonal regression.

Looking up Wikipedia, we find the following formula for the slope of an orthogonal regression of $y$ on $x$:

\[
\tilde \beta_1 = \frac{s_{yy}-s_{xx} + \sqrt{ (s_{yy}-s_{xx})^2 + 4 s_{xy}^2}} {2 s_{xy}}
\]

where $s_{xx}$ and $s_{yy}$ are the sample variances of $x$ and $y$, respectively, and $s_{xy}$ is the sample covariance.

Let $\tilde \alpha_1$ be the slope of the orthogonal regression of $x$ on $y$. One can show that both slopes indeed satisfy the relationship that we get when swapping $y$ and $x$ for a deterministic linear curve, i.e.

\[
\tilde \alpha_1 = \frac{1} {\tilde \beta_1}
\]

We can also verify this numerically with R:

slope.oreg = function(y,x) {
  s_yy = var(y)
  s_xx = var(x)
  s_xy = cov(x,y)
  (s_yy-s_xx + sqrt( (s_yy-s_xx)^2 + 4* s_xy^2 )) / (2* s_xy)
}

beta1.oreg = slope.oreg(y,x)
alpha1.oreg = slope.oreg(x,y)
c(beta1.oreg, alpha1.oreg, 1/ beta1.oreg)

## [1] 1.5911990 0.6284569 0.6284569

The following plot shows the resulting orthogonal regression line through our point cloud in dark green.

ggplot(mapping=aes(x=x,y=y)) +
  geom_point(alpha=0.2) +
  geom_abline(slope=1, intercept=0, color="red", size=1.2) +
  geom_abline(slope=2, intercept=0, color="blue", size=1.2) +
  geom_abline(slope=beta1.oreg, intercept=0, color="darkgreen", size=1.2) +
  theme_bw()

By eyesight, the green orthogonal regression line indeed seems to describe the linear relationship of the point cloud better.

Conclusion?

If we just want a simple quantitative measure for the linear relationship between two variables, there indeed seems to be some merit in running an orthogonal regression instead of a simple linear regression.

Yet, there are many reasons to focus just on simple linear regressions. For example, it more closely relates to the estimation of causal effects and the estimation of parameters of structural models.

So maybe one should always introduce the linear regression model with all relevant assumptions and then stick to a more precise non-causal interpretation of the slope of a simple linear regression, like: “If we observe a 1 cm higher height, our best linear unbiased prediction for the weight increases by $\hat \beta_1 = 1$ kg.” But I don’t see how that would be a good strategy for my Econ 101 course.

In the end, statistics is subtle and some simplifications in introductory classes just seem reasonable. I guess I will just stick with both in my course: simple least squares regression and the simple interpretation.

I just will follow this advice:

Don’t make your course a mess,

but just be sly,

and never simultaneously regress,

$y$ on $x$ and $x$ on $y$.



crfsuite for natural language processing

Mon, 10/29/2018 - 08:43

(This article was first published on bnosac :: open analytical helpers - bnosac :: open analytical helpers, and kindly contributed to R-bloggers)

A new R package called crfsuite supported by BNOSAC landed safely on CRAN last week. The crfsuite package (https://github.com/bnosac/crfsuite) is an R package specific to Natural Language Processing and allows you to easily build and apply models for

  • named entity recognition
  • text chunking
  • part of speech tagging
  • intent recognition or
  • classification of any category you have in mind

The focus of the implementation is on allowing R users to build such models on their own data, with their own categories. The R package is an Rcpp interface to the popular crfsuite C++ library, which is used a lot in all kinds of chatbots.

In order to facilitate creating training data, a Shiny app is made available in this R package which allows you to easily tag your own chunks of text, with your own categories, which can next be used to build a crfsuite model. The package also plays nicely together with the udpipe R package (https://CRAN.R-project.org/package=udpipe), which you need in order to extract predictive features (e.g. part-of-speech tags) for your words to be used in the crfsuite model.
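As a small illustration (my own sketch, not taken from the crfsuite documentation), part-of-speech tags can be obtained with udpipe and later used as features in a crfsuite model; the model file below is simply whatever udpipe_download_model() fetches for English:

## Annotate text with udpipe to obtain part-of-speech features
library(udpipe)
m  <- udpipe_download_model(language = "english")
ud <- udpipe_load_model(file = m$file_model)
x  <- udpipe_annotate(ud, x = "The patient reported fever and a sore throat.")
x  <- as.data.frame(x)
head(x[, c("token", "upos")])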

On a side note: if you are working in the area of NLP, you might also be interested in the upcoming ruimtehol R package, which is a wrapper around the excellent StarSpace C++ code providing word/sentence/document embeddings, text-based classification, content-based recommendation and similarities, as well as entity relationship completion.

You can get going with the crfsuite package as follows. Have a look at the package vignette, it shows you how to construct and apply your own crfsuite model.

## Install the packages
install.packages("crfsuite")
install.packages("udpipe")

## Look at the vignette
library(crfsuite)
library(udpipe)
vignette("crfsuite-nlp", package = "crfsuite")

More details at the development repository https://github.com/bnosac/crfsuite where you can also provide feedback.

Training on Text Mining 

If you are interested in how text mining techniques work, you might be interested in the following data science courses that are held in the coming months.

  • 20-21/11/2018: Text mining with R. Leuven (Belgium). Subscribe here
  • 19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here
  • 13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  •      15/03/2019: Image Recognition with R and Python: Subscribe here
  • 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here


RStudio IDE Custom Theme Support

Mon, 10/29/2018 - 01:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re excited to announce that RStudio v1.2 has added support for custom editor themes. Custom editor themes will allow you to adjust the background color of the editor and syntax highlighting of code in RStudio to better suit your own personal style.

New editor themes can be added to RStudio by importing a tmTheme or sharing an existing rstheme file. The tmTheme file format was first introduced for the TextMate editor, and has since become a standard theme format. The rstheme format is specific to RStudio.

Importing a Custom Theme

Before you can add a theme to RStudio, you’ll have to find a theme in the right format. This online tmTheme editor will allow you to create your own tmThemes or download an existing theme from a large collection of themes. If you are interested in writing your own theme be sure to read this RStudio Extensions article about writing themes.

Once you have a tmTheme or rstheme file for your favorite theme or themes, you can import it to RStudio. Follow the instructions below to import a theme.

  1. In the menu bar, open the “Tools” menu.

  2. From the drop down menu, choose “Global Options”.

  3. In the pane on the left hand side of the options window, click “Appearance”.

  4. To import a theme, click on the “Add…” button.

  5. In the file browser, navigate to the location where you’ve saved your theme file.

  6. If prompted to install R packages, select “Yes”.

  7. You should now see your newly added theme in the list of editor themes. Simply click the “Apply” button to apply your theme to RStudio.

The theme pictured in these examples is called Night Owlish, and was adapted from the Night Owl theme by RStudio’s own Mara Averick. It can be found on her github page.

Removing a Custom Theme

If you accidentally added a theme, or you want to add an updated version, you can remove the theme from RStudio. To do so, follow the instructions below.

  1. As above, navigate to the Appearance Preferences Pane in the Global Options.

  2. If the theme you wish to remove is the active theme, be sure to switch to a different theme first.

  3. Select the theme you wish to remove from the list of themes and click the “Remove” button.

  4. Select “Yes” when prompted for confirmation.

Sharing Themes

If you’ve found (or made) a really cool theme that you want to share, you can do so just by sharing the tmTheme or rstheme file. Then the recipient can import it as per the instructions in the Importing a Custom Theme section. There is no difference between sharing the tmTheme file, or the rstheme file that is generated after the theme gets imported to RStudio, unless you or someone else has made changes to the rstheme file itself.

rstheme files can be found in the .R directory under your home directory. On Windows, the path is C:\Users\<username>\Documents\.R\rstudio\themes. On all other operating systems, the path is ~/.R/rstudio/themes.

Some of Our Favorite Themes

To find out more about themes in RStudio, check out this support article about themes. In the meantime, here is RStudio styled using some of our favorite themes:

Ayu Dark, Light, and Mirage by dempfi:

Ayu Dark


Ayu Mirage


Ayu Light

Candy Brights:

Wombat, by randy3k:

This theme is an example of a theme where the rstheme file was modified directly. Without editing the rstheme file, it wouldn’t have been possible to change the style of non-editor elements of RStudio, like the tabs above the different panes. To learn more about creating new custom themes for RStudio, take a look at this RStudio Extensions article about writing themes.

We look forward to seeing what great new themes the RStudio community comes up with!

You can download the RStudio 1.2 Preview Release at https://www.rstudio.com/products/rstudio/download/preview/. If you have any questions or comments, please get in touch with us on the community forums.



Reproducible Finance, the book! And a discount for our readers

Mon, 10/29/2018 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

I’m thrilled to announce the release of my new book Reproducible Finance with R: Code Flows and Shiny Apps for Portfolio Analysis, which originated as a series of R Views posts in this space. The first post was written way back in November of 2016 – thanks to all the readers who have supported us along the way!

If you are familiar with the R Views posts, then you probably have a pretty good sense for the book’s style, prose, and code approach, but I’d like to add a few quick words of background.

The book’s practical motivations are: (1) to introduce R to finance professionals, or aspiring finance professionals, who wish to move beyond Excel for their quantitative work, and (2) to introduce various finance coding paradigms to R coders.

The softer motivation is to demonstrate and emphasize readable, reusable, and reproducible R code, data visualizations, and Shiny dashboards. It will be very helpful to have some background in the R programming language or in finance, but the most important thing is a desire to learn about the landscape of R code and finance packages.

An overarching goal of the book is to introduce the three major R paradigms for portfolio analysis: xts, the tidyverse, and tidyquant. As a result, we will frequently run the same analysis using three different code flows.

If that ‘three-universe’ structure seems a bit unclear, have a quick look back at this post on skewness and this post on kurtosis, and you’ll notice that we solve the same task and get the same result with different code paths.

For example, if we had portfolio returns saved in a tibble object called portfolio_returns_tq_rebalanced_monthly, and an xts object called portfolio_returns_xts_rebalanced_monthly, and our goal was to find the Sharpe Ratio of portfolio returns, we would start in the xts world with SharpeRatio() from the PerformanceAnalytics package.

# define risk free rate variable
rfr <- .0003

sharpe_xts <-
  SharpeRatio(portfolio_returns_xts_rebalanced_monthly,
              Rf = rfr,
              FUN = "StdDev") %>%
  `colnames<-`("sharpe_xts")

sharpe_xts

##                               sharpe_xts
## StdDev Sharpe (Rf=0%, p=95%):  0.2748752

We next would use the tidyverse and run our calculations in a piped flow:

sharpe_tidyverse_byhand <-
  portfolio_returns_tq_rebalanced_monthly %>%
  summarise(sharpe_dplyr = mean(returns - rfr)/
              sd(returns - rfr))

sharpe_tidyverse_byhand

## # A tibble: 1 x 1
##   sharpe_dplyr
##
## 1        0.275

And then head to the tidyquant paradigm where we can apply the SharpeRatio() function to a tidy tibble.

sharpe_tq <-
  portfolio_returns_tq_rebalanced_monthly %>%
  tq_performance(Ra = returns,
                 performance_fun = SharpeRatio,
                 Rf = rfr,
                 FUN = "StdDev") %>%
  `colnames<-`("sharpe_tq")

We can compare our three Sharpe objects and confirm consistent results.

sharpe_tq %>%
  mutate(tidy_sharpe = sharpe_tidyverse_byhand$sharpe_dplyr,
         xts_sharpe = sharpe_xts)

## # A tibble: 1 x 3
##   sharpe_tq tidy_sharpe xts_sharpe
##
## 1     0.275       0.275      0.275

We might be curious how the Sharpe Ratio and standard deviation of our portfolio compare to those of the component ETFs, and a ggplot scatter plot is a nice way to visualize that.

asset_returns_long %>%
  na.omit() %>%
  summarise(stand_dev = sd(returns),
            sharpe = mean(returns - rfr)/
              sd(returns - rfr)) %>%
  add_row(asset = "Portfolio",
          stand_dev = portfolio_sd_xts_builtin[1],
          sharpe = sharpe_tq$sharpe_tq) %>%
  ggplot(aes(x = stand_dev, y = sharpe, color = asset)) +
  geom_point(size = 2) +
  geom_text(
    aes(x = sd(portfolio_returns_tq_rebalanced_monthly$returns),
        y = sharpe_tq$sharpe_tq + .02,
        label = "Portfolio")) +
  ylab("Sharpe Ratio") +
  xlab("standard deviation") +
  ggtitle("Sharpe Ratio versus Standard Deviation") +
  # The next line centers the title
  theme_update(plot.title = element_text(hjust = 0.5))

Figure 1: Sharpe versus Standard Deviation

Finally, we are ready to calculate and visualize the Sharpe Ratio of a custom portfolio with Shiny and a flexdashboard, like the one found here.

As in the above example, most tasks in the book end with data visualization and then Shiny (a few early readers have commented with happy surprise that all the charts and code are in full color in the book – thanks to CRC press for making that happen!). Data visualization and Shiny are heavily emphasized – more so than in other finance books – and that might seem unusual. After all, every day we hear about how the financial world is becoming more quantitatively driven as firms race towards faster, more powerful algorithms. Why emphasize good ol’ data visualization? I believe, and have observed first-hand, that the ability to communicate and tell the story of data in a compelling way is only going to become more crucial as the financial world becomes more complex. Investors, limited partners, portfolio managers, clients, risk managers – they might not want to read our code or see our data, but we still need to communicate to them the value of our work. Data visualization and Shiny dashboards are a great way to do that. By book’s end, a reader will have built a collection of live, functioning flexdashboards that can be the foundation for more complex apps in the future.

If you’ve read this far, good news! Between now and December 31, 2018, there’s a 20% discount on the book being run at CRC, and if you don’t see it applied, readers can use discount code SS120 on the CRC website. The book is also available on Amazon as Kindle or paperback (but there are fewer than 10 copies left as of right now).

Thanks so much for reading, and happy coding!




Is the answer to everything Gaussian?

Mon, 10/29/2018 - 00:00

(This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)

Abstract:

As an applied statistician you get in touch with many challenging problems in need of a statistical solution. Often, your client/colleague already has a working solution and just wants to clarify a small statistical detail with you. Equally often, your intuition suggests that the working solution is not statistically adequate, but how do you substantiate this? As a motivating example we use the statistical process control methodology used in Sarma et al. (2018) for monitoring a binomial proportion as part of a syndromic surveillance kit.



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.

Introduction

A few weeks ago I became aware of the publication by Sarma et al. (2018), who as part of a syndromic surveillance system monitor the time series of a proportion with the aim of timely detecting changes. What initially caught my attention was their figure 1A:




Figure 1: Figure 1A from Sarma et al. (2018) showing the time series of proportions that reports of acute respiratory infection make up of all syndrome reports that day.

Reading the details of the paper reveals that the number of daily counts on 14 syndromes is collected and for each syndrome the proportion of the particular syndrome out of all syndrome reports is calculated. In other words: given that the counts for a particular day \(t\) are \(y_{it}, i=1, \ldots, 14\), the monitored proportion is \(p_{it} = y_{it}/\sum_{j=1}^{14} y_{jt}\). It is thus clear that it’s impossible to get beyond 100%. All the more surprising was that the upper level in the figure goes beyond it – a sign of an inadequate statistical method. What the authors do is to compute an upper threshold for a particular day \(t_{0}\) as follows:

\[
U_{t_0} = \overline{p}_{t_0}(d) + k \cdot s_{t_0}(d), \quad \text{where}\\
\quad \overline{p}_{t_0}(d) = \frac{1}{d}\sum_{t=t_0-d}^{t_0-1} p_{t}
\quad\text{and}\quad
s_{t_0}(d) = \sqrt{\frac{1}{d-1} \sum_{t=t_0-d}^{t_0-1}
(p_{t} - \overline{p}_{t_0}(d))^2}
\]

is the mean and standard deviation of the \(d\) baseline observations1, respectively, and \(k\) is a tuning parameter – for 12 out of the 14 syndromes \(k=2\) is used, but for the two syndromes with highest proportions, including acute respiratory infections, \(k=4\) is used. As the method looks like an adaptation of the simple EARS method (Fricker, Hegler, and Dunfee 2008) to proportions this caused me to tweet the following critical remark (no harm intended besides the scientific criticism of using a somewhat inadequate statistical method):



To which one of the authors, Alexander Ullrich, replied



Initially, I replied by twitter, but realized that twitter is not a good format for a well-balanced and thorough scientific discourse. Also, my answers lacked exemplification and supporting numbers, so I deleted the answers and shall instead use this blog post to provide a thorough answer. Please note that my discussion focuses on the paper’s statistical methodology approach – I do think the published application is very important and I’m happy to see that the resulting Excel tool is made available to a greater audience under a creative common license!

As much I can understand the motives, working in applied statistics is always a balance between mathematical exactness and pragmatism. A famous quote says that things should be as simple as possible, but not simpler. But if mathematical rigour always is second, something is amiss. In this light, I’d like to comment on the four reasons given in Alexander’s reply.

Reason 1 & 2:

In principle I agree, and taking a non-parametric and distribution-free approach is healthy if you fear your assumptions are more quickly violated than you can formulate them. Using the mean plus two times the standard deviation of the \(d=15\) baseline counts does, however, imply certain assumptions. It means that you believe the distribution is sufficiently stable that the expectation and standard deviation estimated from the baseline values are indicative of the next observation’s expectation and standard deviation. In other words: no trends, no day-of-the-week effects and no previous outbreaks are allowed in the baseline. Looking at the jump of the upper-bound line after the single peak in June 2016 in Fig 1A, one might conclude that this might be a problematic assumption. Furthermore, one assumes the distribution is sufficiently symmetric such that its quantiles can be described as a number of standard deviations away from the mean. Finally, by using the usual formula to estimate the standard deviation one assumes that the observations are independent. They are likely not and, hence, the estimated standard deviation might be too small. All these limitations need to be mentioned, and probably are the biggest problem with the method, but could be addressed by semi-simple modelling approaches as done, e.g., in Farrington et al. (1996) for counts.

For the remainder of this post, I shall instead focus on the mean plus two times standard deviation (sd) rule, as I have seen it used too many times – also in other surveillance contexts. The problem we are solving is a statistical one, so writing that the k-times-sd rule is not meant to have a probabilistic interpretation leaves the user alone with the choice of threshold. In particular, many users will know from their Statistics 101 class that the mean plus/minus two times the sd is a way to get approximately 95% of the mass for anything which has a Gaussian shape. Due to the central limit theorem this shape will apply to a certain degree.

So what we do is to compare an out-of-sample observation with a threshold. In this case the prediction interval for the observation is the statistically correct object for the comparison. Because the standard deviation is estimated from data, the prediction interval should be based on quantiles of a t-distribution with \(d-1\) degrees of freedom. In this case the appropriate upper limit of a two-sided \((1-\alpha)\cdot 100%\) prediction interval is given as

\[
\begin{align} \label{eq:predict-ul-gauss}
U_{t_0} = \overline{p}_{t_0}(d) +
t_{1-\alpha/2}(d-1) \cdot \sqrt{1+\frac{1}{d}} \cdot s_{t_0}(d),
\end{align}
\] where \(t_{1-\alpha/2}(d-1)\) denotes the \(1-\alpha/2\) quantile of the t-distribution with \(d-1\) degrees of freedom. In our case \(\alpha=0.05\) so we need the 97.5% quantile. See for example Chapter 10 of Young and Smith (2005) or the Wikipedia page on prediction intervals for details.

With \(d=15\)2 the upper limit of a one-sided 97.5% prediction interval would have the multiplication factor 2.22 on the estimated standard deviation. Thus, using a factor of 2 instead means your procedure is slightly more sensitive than the possibly anticipated 2.5% false alarm probability would suggest. Calculations show that the false alarm probability under the null hypothesis is 3.7% (assuming the Gaussian assumption holds). Altogether, one can say that in the \(d=15\) case, for the sake of simplicity, the difference appears ignorable. Had \(d\) been smaller, the difference would become more relevant, though.
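These numbers are easy to verify in R (a small check of my own, under the stated Gaussian/t assumptions):

## Multiplication factor for the upper limit of a one-sided 97.5% prediction interval, d = 15
d <- 15
qt(0.975, df = d - 1) * sqrt(1 + 1/d)   # approx. 2.22

## Implied false alarm probability when the factor k = 2 is used instead
1 - pt(2 / sqrt(1 + 1/d), df = d - 1)   # approx. 0.037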

Reason 3

I’m not sure I got the point, but one problem is that if your baseline consists of the observations \(y\), \(y\), \(\ldots\), \(y\), then the upper limit of your method will also be \(y\), as the estimated standard deviation will be zero. Another problem is if the denominator \(n_t\) is zero, but because this appears to have a regular pattern (no reports on weekends), such days can just be left out of the modelling?

Reason 4

This reason I find particularly troublesome, because it is an argument statisticians hear often. A wider audience expects a valid statistical method, not more complicated than necessary, but sound and available within the tools at hand. I argued above that two times the standard deviation for proportions might work, but you implicitly leave a lot of problems for the user to solve due to insufficient statistical modelling. I agree that too complicated might not work, but if – for the sake of pragmatism – we neither inform about nor quantify the potential problems of a too simplistic solution, nor provide something workable which is better, we’ll be out of a job quickly.

Can we do better?

Initially it appeared natural to try either a data or a parameter transformation in order to ensure that the computed upper limit respects the \([0,1]\) bound. However, all suggestions I tried proved problematic one way or the other, e.g., due to small sample sizes. Instead, a simple Bayesian and a simple non-parametric variant are considered.

Beta-Binomial

A simple approach is to use a conjugate prior-posterior Bayesian updating scheme. Letting \(\pi\) be the true underlying proportion, we assume a \(\operatorname{Be}(0.5, 0.5)\) prior for it initially. Observing \(y_{t} \sim \operatorname{Bin}(n_t, \pi)\) for \(t=t_0-d, \ldots, t_0\), the posterior for \(\pi\) given the baseline observations will be \[
\pi | y_{t_0-d},
\ldots, y_{t_0-1} \sim
\operatorname{Be}\left(0.5 + \sum_{t=t_0-d}^{t_0-1} y_t, 0.5 + \sum_{t=t_0-d}^{t_0-1} (n_t - y_t)\right)
\] One can then show that the posterior predictive distribution for the next observation, i.e. \(y_{t_0}\), is \[
y_{t_0} | y_{t_0-d}, \ldots, y_{t_0-1} \sim
\operatorname{BeBin}\left(n_{t_0}, 0.5 + \sum_{t=t_0-d}^{t_0-1} y_t, 0.5 + \sum_{t=t_0-d}^{t_0-1} (n_t - y_t)\right),
\] where \(\operatorname{BeBin}(n, a, b)\) denotes the beta-binomial distribution with size parameter \(n\) and the two shape parameters \(a\) and \(b\), implemented in, e.g., the VGAM package. We then use the upper 97.5% quantile of this distribution to define the threshold \(U_{t_0}\) for \(p_{t_0}\) and sound an alarm if \(p_{t_0} > U_{t_0}\).

A simple variant of this procedure is to use a plug-in type prediction interval by obtaining the upper limit as the 97.5% quantile of the binomial distribution with size parameter \(n_{t_0}\) and probability \(\overline{p}_{t_0}(d)\). However, this approach ignores all uncertainty originating from the estimation of the proportion and, hence, is likely to result in somewhat narrower prediction intervals than the beta-binomial approach.

Non-parametric

A one-sided 97.5% prediction interval \([0, U]\) based on the continuous values \(p_{t_0-39},\ldots, p_{t_0-1}\) without ties is given as (see e.g. Arts, Coolen, and van der Laan (2004) or the Wikipedia entry on non-parametric prediction intervals): \[
U_{t_0} = \max(p_{t_0-39},\ldots, p_{t_0-1}).
\] Hence, an alarm is flagged if \(p_{t_0}> U_{t_0}\). This means we simply compare the current value with the maximum of the baseline values. If we only have \(d=19\) values, then the interval from zero to the maximum of these values would constitute a one-sided 95% prediction interval.

Simulation Study

We consider the false alarm proportion of the suggested method (two times and four times the standard deviation), as well as the prediction interval method, the beta-binomial approach, the binomial plug-in and the non-parametric approach, by simulating from a null model where \(d+1\) observations are i.i.d. from a binomial distribution \(\operatorname{Bin}(25, \pi)\). The first \(d\) observations are used for estimation, and the upper limit computed by the algorithm is then compared to the last observation. Note: the non-parametric method requires the simulation of 39+1 values. For all methods: if the last observation exceeds the upper limit, an alarm is sounded. We will be interested in the false alarm probability, i.e. the probability that an alarm is sounded even though we know that the last observation originates from the same model as the baseline parameters. For the methods using a 97.5% one-sided prediction interval, we expect this false alarm probability to be 2.5%.

The function implementing the six algorithms to compare looks as follows:

library(VGAM)  ## provides qbetabinom.ab for the beta-binomial quantile

algo_sysu_all <- function(y, n, t0, d) {
  stopifnot(t0-d > 0)
  p <- y/n

  baseline_idx <- (t0-1):(t0-d)
  baseline <- p[baseline_idx]
  m <- mean(baseline)
  sd <- sd(baseline)

  U_twosd  <- m + 2*sd
  U_pred   <- m + sqrt(1+1/d)*qt(0.975, df=d-1)*sd
  U_foursd <- m + 4*sd

  ## Beta binomial
  astar <- 0.5 + sum(y[baseline_idx])
  bstar <- 0.5 + sum((n[baseline_idx] - y[baseline_idx]))
  U_betabinom <- qbetabinom.ab(q=0.975, size=n[t0], shape1=astar, shape2=bstar) / n[t0]

  ## Non-parametric with a 97.5% one-sided PI (needs 39 obs)
  U_nonpar <- max( p[(t0-1):(t0-39)])

  ## Prediction interval based on binomial variance based on Fisher info
  U_binom <- qbinom(p=0.975, size=n[t0], prob=m) / n[t0]

  return(data.frame(t=t0, U_twosd=U_twosd, U_foursd=U_foursd, U_pred=U_pred,
                    U_betabinom=U_betabinom, U_nonpar=U_nonpar, U_binom=U_binom))
}

This can be wrapped into a function performing a single simulation:

## Simulate one iid binomial time series
simone <- function(pi0, d=21, n=25) {
  length_ts <- max(39+1, d+1)
  y <- rbinom(length_ts, size=n, prob=pi0)
  n <- rep(n, length_ts)
  p <- y/n
  res <- algo_sysu_all(y=y, n=n, t0=length_ts, d=d)
  return(c(twosd=res$U_twosd, foursd=res$U_foursd, pred=res$U_pred,
           betabinom=res$U_betabinom, nonpar=res$U_nonpar,
           binom=res$U_binom) < p[length_ts])
}

We then perform the simulation study on several cores using the future and future.apply packages:




Figure 2: False positive probability for different \(\pi\) values each estimated by 10,000 Monte Carlo simulations.

In the figure we see that both the two and four times standard deviation approaches (twosd, foursd) as well as the approach based on the 97.5% predictive distribution in a Gaussian setting (pred) have a varying false alarm probability (FAP) over the range of true proportions: the smaller \(\pi\), the higher the FAP. The FAP can be as high as 7% instead of the nominal 2.5%. When monitoring 145 time points this means that we will on average see \(145\cdot 0.07 \approx 10\) false alarms if the process does not change. This is somewhat problematic, because it means that the behaviour of the detection procedure depends on the underlying \(\pi\): the user will get many more false alarms at low \(\pi\)’s than possibly expected. Altogether, it appears better to use a slightly higher threshold than 2. However, \(k=4\) looks awfully conservative!

All considered procedures dip down to a FAP of 0% for \(\pi\) near 100%, which means no alarms are sounded there. This is related to the fact that if \(U_{t_0}=n_{t_0}\), then, because \(p_{t_0} > U_{t_0}\) is required before an alarm will be sounded, no alarm is possible. Furthermore, the beta-binomial, the binomial variant and the non-parametric procedure all have FAPs slightly lower than the nominal 2.5%. This is again related to the discreteness of the problem: it might not be possible to find an integer limit \(U\) such that the CDF at \(U\) is exactly 97.5%, i.e. usually \(F(q_{0.975})>0.975\). Because we only sound alarms if \(y_{t_0} > U\), the probability for this to occur is even smaller, namely \(1-F(q_{0.975}+1)\).
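The discreteness effect is easy to illustrate in R (my own snippet, using a \(\operatorname{Bin}(25, 0.215)\) distribution as an example):

## The CDF at the 97.5% quantile of a discrete distribution typically exceeds 0.975,
## and requiring y > U pushes the achieved alarm probability below 2.5%
q975 <- qbinom(0.975, size = 25, prob = 0.215)
pbinom(q975, size = 25, prob = 0.215)       # F(q_0.975) > 0.975
1 - pbinom(q975, size = 25, prob = 0.215)   # P(Y > q_0.975), i.e. the achieved alarm probability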

Note that in the above simulation study the binomial and beta-binomial models are at an advantage, because the model used to simulate data is identical, or nearly identical, to the model underlying these procedures. In order to make the simulation more comprehensive we investigate an additional scenario where the marginal distributions are binomial \(\operatorname{Bin}(25, 0.215)\), but the observations are correlated3. We simulate latent variables \(y_t^*\) using an \(AR(1)\) process with parameter \(\rho\), \(|\rho| < 1\), i.e. \[
y_t^* | y_{t-1}^* \sim \rho \cdot y_{t-1}^* + \epsilon_t, \quad t=2,3,\ldots,
\] where \(y_1^* \sim N(0,1)\) and \(\epsilon_t \stackrel{\text{iid}}{\sim} N(0,1)\). These latent variables are then marginally back-transformed to standard uniforms \(u_t \sim U(0,1)\) which are then transformed using the quantile function of the \(\operatorname{Bin}(25, 0.215)\) distribution, i.e.


y[t] <- qbinom(pnorm(ystar[t]), size = 25, prob = 0.215)

Altogether, this corresponds to a Gaussian copula approach to generating correlated random variables with a given marginal distribution. The correlation between the \(y_t\) will not be exactly \(\rho\) due to the discrete nature of the binomial, but will approach \(\rho\) as \(n\) in the binomial becomes large. Figure 3 shows the results for the false alarm probability based on 10,000 Monte Carlo simulations for a marginal \(\operatorname{Bin}(25, 0.215)\) distribution and latent \(AR(1)\) one-off-the-diagonal correlation parameter \(\rho\).
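Putting the pieces together, a minimal sketch of such a copula-based simulation could look as follows (my own implementation, using arima.sim for the latent stationary AR(1); the code used for the actual figures may differ):

## Correlated counts with Bin(25, 0.215) marginals via a Gaussian copula
simone_correlated <- function(n_obs, rho, size = 25, prob = 0.215) {
  ## stationary latent AR(1) with innovation sd 1; its marginal sd is 1/sqrt(1 - rho^2)
  ystar <- as.numeric(arima.sim(model = list(ar = rho), n = n_obs))
  u <- pnorm(ystar, sd = 1 / sqrt(1 - rho^2))   # back-transform to U(0,1)
  qbinom(u, size = size, prob = prob)           # impose the binomial marginal
}

y <- simone_correlated(n_obs = 1000, rho = 0.4)
cor(y[-1], y[-length(y)])   # empirical lag-1 correlation; somewhat smaller than rho due to discreteness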



Figure 3: Equivalent of a false alarm probability by 10,000 Monte Carlo simulation for the algorithms when there is a correlation \(\rho\) on the latent scale, but the marginal mean of all observations is \(\pi=0.215\).

We see that the binomial and beta-binomial approaches sound too many alarms as the correlation increases. The same goes for the two times standard deviation and the predictive approach. The non-parametric approach appears to behave slightly better.

Application

We use the synthetic acute respiratory infection data made available as part of the paper’s SySu Excel tool, which is available for download under a Creative Commons license. In what follows we focus on the time series for the symptom acute respiratory infection. Figure 4 shows the daily proportions from 2017-01-01 until 2017-07-20 for all weekdays as vertical bars. Also shown is the upper threshold \(U_t\) for the six methods discussed above.




Figure 4: Upper bound curves for all detection procedures.

The correlation between \(p_{t}\) and \(p_{t-1}\) in the time series is 0.04, which could be a sign that the synthetic data were generated under an independence assumption.

For all algorithms we see the effect on the upper threshold as spikes enter the baseline values. In particular, the non-parametric method, which uses \(d=39\) baseline values, will only sound an alarm during the 39 days after time 60 if the proportion is larger than the \(p_{60} = 69.6\%\) spike.

Discussion

This post discussed how to use statistical process control to monitor a proportion in a syndromic surveillance context. The suitability and performance of Gaussian approximations were discussed: it was shown that the false alarm probability of this approach depends on the level of the considered proportion, and that autocorrelation also has a substantial impact on the chart. The investigations in this post were done in order to provide the user of such charts with guidance on how to choose \(k\).

Altogether, a full scientific analysis would need a more comprehensive simulation study and likely access to the real data, but the point of this post was to substantiate that statistical problems need a statistical investigation. From the simulation results in this post it appears more prudent to use \(k>2\): the upper limit of a 97.5% one-sided prediction interval corresponds to \(k\)=2.22, and the upper limit of a 99.5% one-sided prediction interval to \(k\)=3.07. Choosing \(k>2\) is also supported by the fact that none of the 204 signals generated by the system were interpreted as an outbreak. A simple fix to avoid confusion could be to chop the upper threshold at 100% in the graphics, i.e. to report \(U_t^* = \min(1, U_t)\) for the Gaussian-based procedures. Better would be to use the predictive approach and let the user choose \(\alpha\), thus giving the parameter choice a probabilistic interpretation. However, the binomial and beta-binomial based approaches provide more stable results over the full range of \(\pi\) and are guaranteed to respect the \([0,1]\) support. In particular the non-parametric method looks promising despite being even simpler than the proposed k-sd system. All in all, addressing trends or other types of auto-correlation as well as previous outbreaks in the baseline appears to be important in order to get a more specific syndromic surveillance system – see Sect. 3.4 of Salmon, Schumacher, and Höhle (2016) for how this could look. I invite you to read the Sarma et al. (2018) paper to form your own impressions.

Acknowledgments

The contents of this post were discussed as part of the Statistical Consultancy M.Sc. course at the Department of Mathematics, Stockholm University. I thank Jan-Olov Persson, Rolf Sundberg and the students of the course for their comments, remarks and questions.

Conflicts of Interest

I have previously worked for the Robert Koch Institute. Some of the co-authors of the Sarma et al. (2018) paper are former colleagues with whom I have published.

Literature

Arts, G. R. J., F. P. A. Coolen, and P. van der Laan. 2004. “Nonparametric Predictive Inference in Statistical Process Control.” Quality Technology & Quantitative Management 1 (2): 201–16.

Farrington, C.P, N.J. Andrews, A.D. Beale, and M.A. Catchpole. 1996. “A Statistical Algorithm for the Early Detection of Outbreaks of Infectious Disease.” Journal of the Royal Statistical Society, Series A 159: 547–63.

Fricker, R. D., B. L. Hegler, and D. A. Dunfee. 2008. “Comparing syndromic surveillance detection methods: EARS’ versus a CUSUM-based methodology.” Stat Med 27 (17): 3407–29.

Salmon, M., D. Schumacher, and M. Höhle. 2016. “Monitoring Count Time Series in R: Aberration Detection in Public Health Surveillance.” Journal of Statistical Software 70 (10). doi:10.18637/jss.v070.i10.

Sarma, N., A. Ullrich, H. Wilking, S. Ghozzi, A. K. Lindner, C. Weber, A. Holzer, A. Jansen, K. Stark, and S. Vygen-Bonnet. 2018. “Surveillance on speed: Being aware of infectious diseases in migrants mass accommodations – an easy and flexible toolkit for field application of syndromic surveillance, Germany, 2016 to 2017.” Euro Surveill. 23 (40).

Young, G. A., and R. L. Smith. 2005. Essentials of Statistical Inference. Cambridge University Press.

  1. The method contained two additional parameters: one being the minimum number of cases needed on a particular day to sound an alarm (low-count protection) and a fixed threshold for the proportion beyond which a signal was always created. For the sake of statistical investigation we shall disregard these two features in the analysis of this post.

  2. In the paper d=21 was used, but due to many missing values, e.g., due to weekends, the actual number of observations used was on average 15. We therefore use \(d=15\) in the blog post.

  3. The 21.5% is taken from Table 2 of Sarma et al. (2018) for acute respiratory infections.

To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice....

Conway’s Game of Life in R: Or On the Importance of Vectorizing Your R Code

Sun, 10/28/2018 - 18:07

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

R is an interpreted programming language with vectorized data structures. This means a single R command can ask for very many arithmetic operations to be performed. This also means R computation can be fast. We will show an example of this using Conway’s Game of Life.

Conway’s Game of Life is one of the most interesting examples of cellular automata. It is traditionally simulated on a rectangular grid (like a chessboard) and each cell is considered either live or dead. The rules of evolution are simple: the next life grid is computed as follows:

  • To compute the state of a cell on the next grid sum the number of live cells in the eight neighboring cells on the current grid.
  • If this sum is 3 or if the current cell is live and the sum is 2 or 3, then the cell in the next grid will be live.

This rule can be implemented as scalar code in R:

# d is a matrix of logical values
life_step_scalar <- function(d) {
  nrow <- dim(d)[[1]]
  ncol <- dim(d)[[2]]
  dnext <- matrix(data = FALSE, nrow = nrow, ncol = ncol)
  for(i in seq_len(nrow)) {
    for(j in seq_len(ncol)) {
      pop <- 0
      if(i>1) {
        if(j>1) { pop <- pop + d[i-1, j-1] }
        pop <- pop + d[i-1, j]
        if(j<ncol) { pop <- pop + d[i-1, j+1] }
      }
      if(j>1) { pop <- pop + d[i, j-1] }
      if(j<ncol) { pop <- pop + d[i, j+1] }
      if(i<nrow) {
        if(j>1) { pop <- pop + d[i+1, j-1] }
        pop <- pop + d[i+1, j]
        if(j<ncol) { pop <- pop + d[i+1, j+1] }
      }
      dnext[i, j] <- (pop==3) || (d[i, j] && (pop>=2) && (pop<=3))
    }
  }
  dnext
}

We could probably speed the above code up by a factor of 2 to 4 by eliminating the if-statements, which requires writing 9 versions of the for-loops (depending on whether each index is at the right boundary, in the interior, or at the left boundary of its range). However, as we are about to see, this is not worth the effort.

A much faster implementation is the vector implementation found here.

life_step <- function(d) {
  # form the neighboring sums
  nrow <- dim(d)[[1]]
  ncol <- dim(d)[[2]]
  d_eu <- rbind(d[-1, , drop = FALSE], 0)
  d_ed <- rbind(0, d[-nrow, , drop = FALSE])
  d_le <- cbind(d[ , -1, drop = FALSE], 0)
  d_re <- cbind(0, d[ , -ncol, drop = FALSE])
  d_lu <- cbind(d_eu[ , -1, drop = FALSE], 0)
  d_ru <- cbind(0, d_eu[ , -ncol, drop = FALSE])
  d_ld <- cbind(d_ed[ , -1, drop = FALSE], 0)
  d_rd <- cbind(0, d_ed[ , -ncol, drop = FALSE])
  pop <- d_eu + d_ed + d_le + d_re + d_lu + d_ru + d_ld + d_rd
  d <- (pop==3) | (d & (pop>=2) & (pop<=3))
  d
}

The way this code works is: it builds 8 copies of the life-table, each copy shifting one of the neighboring cells into the current cell position. With these 8 new matrices the entire life forward-evolution rule is computed, vectorized over all cells, using the expression “(pop==3) | (d & (pop>=2) & (pop<=3))“. Notice the vectorized code is actually shorter: we handle the edge cases by zero-padding.

The performance difference is substantial:

The vectorized code is about 10 times faster on average (details can be found here).
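If you want to reproduce a comparison of this kind yourself, here is a quick timing sketch of my own (the post links to its own timing details); it assumes the microbenchmark package is installed, and the exact speed-up will vary with machine and grid size.

library(microbenchmark)

set.seed(2018)
# a random 100 x 100 starting grid of live/dead cells
d <- matrix(runif(100 * 100) > 0.5, nrow = 100, ncol = 100)

# both implementations should agree on the next generation
stopifnot(all(life_step_scalar(d) == life_step(d)))

microbenchmark(
  scalar     = life_step_scalar(d),
  vectorized = life_step(d),
  times = 10
)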

A simulation of this type produces figures such as the following:

Of course if you are serious about Conway’s Game of Life you would use specialized software (even in-browser JavaScript), and specialized algorithms (such as HashLife).

One objection is: the vectorized code uses more memory. To that I give the following famous quote:

The biggest difference between time and space is that you can’t reuse time.

-Merrick Furst

And that is our (toy) example of vectorizing code. Techniques such as these are why very fast and powerful code can in fact be written in R.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

How quickly do stock market valuations revert back to their means?

Sun, 10/28/2018 - 13:20

(This article was first published on Data based investing, and kindly contributed to R-bloggers)

Mean reversion is the assumption that things tend to revert back to their means in the long run. This is especially true for valuations and certain macroeconomic variables, but not so much for stock prices themselves. In this post we’ll look at the mean reversion of different valuation measures by forming equal sized baskets from each valuation decile and letting the valuations change as time goes on. This study (pdf) shows an interesting graph on page 23 about the mean reversion of the 10-year price-to-earnings ratio, also known as CAPE. In this post the study will be replicated using international CAPE, P/E and P/B as well. I’ll replicate the results using a longer time frame of twenty years.

Let’s start with CAPE using Shiller data of the US stock market from the years 1926 to 2008. Using a longer time frame, over-reversion becomes visible, i.e. high valuations tend to eventually lead to low valuations and vice versa. The only exception is the decile with the highest valuation, which is explained by the housing bubble after the tech bubble. The valuations seem to revert back to their means in 11-12 years.

Let’s look at the mean reversion of the same metric using Barclays data from the years 1982 to 2008 from 26 different countries or continents. The mean reversion happens again in about 12 years, but the over-reversion seems to disappear. This might be caused by the US having different kinds of bubbles and busts than the rest of the world, or because of the shorter time period. The dataset is many times larger and should give a clearer picture of the mean reversion than using only US data.

Next, we’ll look at price-to-book. It seems to take longer for the P/B to revert back to its mean, which is logical since CAPE uses historical 10-year earnings. There is however still some noticeable over-reversion.

Let’s look at the price-to-earnings ratio next. The P/E ratio seems to revert back to its mean a little bit quicker than the rest, in about 9-10 years. There is still some over-reversion.
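A rough sketch of the basket-forming idea (my own illustration, not the code linked at the end of this post) could look as follows; it assumes a data frame cape_df with monthly date and cape columns.

library(dplyr)

# assign each starting month to a valuation decile
baskets <- cape_df %>%
  arrange(date) %>%
  mutate(decile = ntile(cape, 10))

# mean CAPE of each starting decile k years later (monthly data, so 12*k rows ahead)
mean_after <- function(df, k) {
  df %>%
    mutate(cape_future = lead(cape, 12 * k)) %>%
    group_by(decile) %>%
    summarise(years_ahead = k, mean_cape = mean(cape_future, na.rm = TRUE))
}

reversion <- bind_rows(lapply(1:20, function(k) mean_after(baskets, k)))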

In summary, different valuation measures tend to revert back to their means in about ten years, and to over-revert after that.

Hope you enjoyed this short post. Be sure to follow me on Twitter for updates about new blog posts!

The R code used in the analysis can be found here.

To leave a comment for the author, please follow the link and comment on their blog: Data based investing.

Introducing cricpy:A python package to analyze performances of cricketers

Sun, 10/28/2018 - 07:32

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

Full many a gem of purest ray serene,
The dark unfathomed caves of ocean bear;
Full many a flower is born to blush unseen,
And waste its sweetness on the desert air.

Thomas Gray, An Elegy Written In A Country Churchyard Introduction

It is finally here! cricpy, the python avatar of my R package cricketr, is now ready to rock-n-roll! My R package cricketr had its genesis some three years ago and went through a couple of enhancements. During this time I have always thought about creating an equivalent python package. Now I have finally done it.

So here it is. My python package ‘cricpy!!!’

This package uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package supports only Test cricket.

You should be able to install the package using pip install cricpy and use the many functions available in it. Please be mindful of the ESPN Cricinfo Terms of Use.

This post is also hosted on Rpubs at Introducing cricpy. You can also download the pdf version of this post at cricpy.pdf

Do check out my post on R package cricketr at Re-introducing cricketr! : An R package to analyze performances of cricketers

If you are passionate about cricket, and love analyzing cricket performances, then check out my 2 racy books on cricket! In my books, I perform detailed yet compact analysis of the performances of both batsmen and bowlers, besides evaluating team & match performances in Tests, ODIs, T20s & IPL. You can buy my books on cricket from Amazon at $12.99 for the paperback and $4.99/$6.99 respectively for the kindle versions. The books can be accessed at Cricket analytics with cricketr and Beaten by sheer pace-Cricket analytics with yorkr. A must read for any cricket lover! Check it out!!

 


The cricpy package

The cricpy package has several functions that perform different analyses on both batsmen and bowlers. There are functions that plot the percentage frequency of runs or wickets, the runs likelihood for a batsman, relative run/strike rates of batsmen, and relative performance/economy rates for bowlers.

Other interesting functions include batting performance moving average, forecasting, performance of a player against different oppositions, contribution to wins and losses etc.

The data for a particular player can be obtained with the getPlayerData() function. To do this you will need to go to ESPN CricInfo Player and type in the name of the player, e.g. Rahul Dravid, Virat Kohli, Alastair Cook etc. This will bring up a page which has the profile number for the player; e.g. for Rahul Dravid this would be http://www.espncricinfo.com/india/content/player/28114.html. Hence, Dravid’s profile is 28114. This can be used to get the data for Rahul Dravid as shown below.

The cricpy package is almost a clone of my R package cricketr. The signatures of all the python functions are identical with those of its R avatar, namely ‘cricketr’, with only the necessary variations between Python and R. It may be useful to look at my post R vs Python: Different similarities and similar differences. In fact if you are familiar with one of the languages you can look up the package in the other and you will notice the parallel constructs. You can fork/clone the package at Github cricpy

The following 2 examples show the similarity between cricketr and cricpy packages

1a.Importing cricketr – R

Importing cricketr in R

#install.packages("cricketr") library(cricketr) 2a. Importing cricpy – Python # Install the package # Do a pip install cricpy # Import cricpy import cricpy # You could either do #1. import cricpy.analytics as ca #ca.batsman4s("../dravid.csv","Rahul Dravid") # Or #2. from cricpy.analytics import * #batsman4s("../dravid.csv","Rahul Dravid")

I would recommend using option 1 namely ca.batsman4s() as I may add an advanced analytics module in the future to cricpy.

2 Invoking functions

You can seen how the 2 calls are identical for both the R package cricketr and the Python package cricpy

2a. Invoking functions with R package ‘cricketr’ library(cricketr) batsman4s("../dravid.csv","Rahul Dravid")

2b. Invoking functions with Python package ‘cricpy’ import cricpy.analytics as ca ca.batsman4s("../dravid.csv","Rahul Dravid")

 

3a. Getting help from cricketr – R #help("batsman4s") 3b. Getting help from cricpy – Python help(ca.batsman4s)

The details below will introduce the different functions that are available in cricpy.

3. Get the player data for a player using the function getPlayerData()

Important Note: This needs to be done only once for a player. This function stores the player’s data in the specified CSV file (e.g. dravid.csv as above), which can then be reused for all other functions. Once we have the data for the players, many analyses can be done. This post will use the stored CSV files obtained with a prior getPlayerData() for all subsequent analyses.

import cricpy.analytics as ca #dravid =ca.getPlayerData(28114,dir="..",file="dravid.csv",type="batting",homeOrAway=[1,2], result=[1,2,4]) #acook =ca.getPlayerData(11728,dir="..",file="acook.csv",type="batting",homeOrAway=[1,2], result=[1,2,4]) import cricpy.analytics as ca #lara =ca.getPlayerData(52337,dir="..",file="lara.csv",type="batting",homeOrAway=[1,2], result=[1,2,4])253802 #kohli =ca.getPlayerData(253802,dir="..",file="kohli.csv",type="batting",homeOrAway=[1,2], result=[1,2,4]) 4 Rahul Dravid’s performance – Basic Analyses

The 3 plots below provide the following for Rahul Dravid

  1. Frequency percentage of runs in each run range over the whole career
  2. Mean Strike Rate for runs scored in the given range
  3. A histogram of runs frequency percentages in runs ranges
import cricpy.analytics as ca import matplotlib.pyplot as plt ca.batsmanRunsFreqPerf("../dravid.csv","Rahul Dravid")

ca.batsmanMeanStrikeRate("../dravid.csv","Rahul Dravid")

ca.batsmanRunsRanges("../dravid.csv","Rahul Dravid")

5. More analyses import cricpy.analytics as ca ca.batsman4s("../dravid.csv","Rahul Dravid")

ca.batsman6s("../dravid.csv","Rahul Dravid")

ca.batsmanDismissals("../dravid.csv","Rahul Dravid")

6. 3D scatter plot and prediction plane

The plots below show the 3D scatter plot of Dravid Runs versus Balls Faced and Minutes at crease. A linear regression plane is then fitted between Runs and Balls Faced + Minutes at crease

import cricpy.analytics as ca ca.battingPerf3d("../dravid.csv","Rahul Dravid")

7. Average runs at different venues

The plot below gives the average runs scored by Dravid at different grounds. The plot also shows the number of innings at each ground as a label on the x-axis. It can be seen that Dravid did great at Rawalpindi, Leeds and Georgetown overseas, and at Mohali and Bangalore at home.

import cricpy.analytics as ca ca.batsmanAvgRunsGround("../dravid.csv","Rahul Dravid")

 

8. Average runs against different opposing teams

This plot computes the average runs scored by Dravid against different countries. Dravid has an average of 50+ in England, New Zealand, West Indies and Zimbabwe.

import cricpy.analytics as ca ca.batsmanAvgRunsOpposition("../dravid.csv","Rahul Dravid") 9 . Highest Runs Likelihood

The plot below shows the Runs Likelihood for a batsman. For this the batsman’s performance is plotted as a 3D scatter plot with Runs versus Balls Faced + Minutes at crease, and the centroids of 3 K-Means clusters are computed and plotted. In this plot Dravid’s highest-scoring tendencies are computed and plotted using K-Means

import cricpy.analytics as ca ca.batsmanRunsLikelihood("../dravid.csv","Rahul Dravid") 10. A look at the Top 4 batsman – Rahul Dravid, Alastair Cook, Brian Lara and Virat Kohli

The following batsmen have been very prolific in test cricket and will be used for the analyses

  1. Rahul Dravid :Average:52.31,100’s – 36, 50’s – 63
  2. Alastair Cook : Average: 45.35, 100’s – 33, 50’s – 57
  3. Brian Lara : Average: 52.88, 100’s – 34 , 50’s – 48
  4. Virat Kohli: Average: 54.57 ,100’s – 24 , 50’s – 19

The following plots take a closer look at their performances. The box plots show the median and the 1st and 3rd quartiles of the runs

11. Box Histogram Plot

This plot shows a combined boxplot of the Runs ranges and a histogram of the Runs Frequency

import cricpy.analytics as ca ca.batsmanPerfBoxHist("../dravid.csv","Rahul Dravid")

ca.batsmanPerfBoxHist("../acook.csv","Alastair Cook")

ca.batsmanPerfBoxHist("../lara.csv","Brian Lara")


ca.batsmanPerfBoxHist("../kohli.csv","Virat Kohli")


12. Contribution to won and lost matches

The plot below shows the contribution of Dravid, Cook, Lara and Kohli in matches won and lost. It can be seen that in matches that India has won, Dravid and Kohli have scored more and must have been instrumental in the wins.

For the 2 functions below you will have to use the getPlayerDataSp() function as shown below. I have commented this as I already have these files

import cricpy.analytics as ca #dravidsp = ca.getPlayerDataSp(28114,tdir=".",tfile="dravidsp.csv",ttype="batting") #acooksp = ca.getPlayerDataSp(11728,tdir=".",tfile="acooksp.csv",ttype="batting") #larasp = ca.getPlayerDataSp(52337,tdir=".",tfile="larasp.csv",ttype="batting") #kohlisp = ca.getPlayerDataSp(253802,tdir=".",tfile="kohlisp.csv",ttype="batting") import cricpy.analytics as ca ca.batsmanContributionWonLost("../dravidsp.csv","Rahul Dravid")

ca.batsmanContributionWonLost("../acooksp.csv","Alastair Cook")

ca.batsmanContributionWonLost("../larasp.csv","Brian Lara")

ca.batsmanContributionWonLost("../kohlisp.csv","Virat Kohli")


13. Performance at home and overseas

From the plot below it can be seen

Dravid has a higher median overseas than at home. Cook, Lara and Kohli have a lower median of runs overseas than at home.

This function also requires the use of getPlayerDataSp() as shown above

import cricpy.analytics as ca ca.batsmanPerfHomeAway("../dravidsp.csv","Rahul Dravid")

ca.batsmanPerfHomeAway("../acooksp.csv","Alastair Cook")

ca.batsmanPerfHomeAway("../larasp.csv","Brian Lara")

ca.batsmanPerfHomeAway("../kohlisp.csv","Virat Kohli")

14 Moving Average of runs in career

Take a look at the Moving Average across the career of the Top 4 (ignore the dip at the end of all plots; need to check why this is so!). Lara’s performance seems to have been quite good before his retirement (wonder why he retired so early!). Kohli’s performance has been steadily improving over the years.

import cricpy.analytics as ca ca.batsmanMovingAverage("../dravid.csv","Rahul Dravid")

ca.batsmanMovingAverage("../acook.csv","Alastair Cook")

ca.batsmanMovingAverage("../lara.csv","Brian Lara")

ca.batsmanMovingAverage("../kohli.csv","Virat Kohli")

15 Cumulative Average runs of batsman in career

This function provides the cumulative average runs of the batsman over the career. Dravid averages around 48, Cook around 44, Lara around 50 and Kohli shows a steady improvement in his cumulative average. Kohli seems to be getting better with time.

import cricpy.analytics as ca ca.batsmanCumulativeAverageRuns("../dravid.csv","Rahul Dravid")

ca.batsmanCumulativeAverageRuns("../acook.csv","Alastair Cook")

ca.batsmanCumulativeAverageRuns("../lara.csv","Brian Lara")

ca.batsmanCumulativeAverageRuns("../kohli.csv","Virat Kohli")

16 Cumulative Average strike rate of batsman in career

Lara has a terrific strike rate of 52+. Cook has a better strike rate over Dravid. Kohli’s strike rate has improved over the years.

import cricpy.analytics as ca ca.batsmanCumulativeStrikeRate("../dravid.csv","Rahul Dravid")

ca.batsmanCumulativeStrikeRate("../acook.csv","Alastair Cook")

ca.batsmanCumulativeStrikeRate("../lara.csv","Brian Lara")

ca.batsmanCumulativeStrikeRate("../kohli.csv","Virat Kohli")
17 Future Runs forecast

Here are plots that forecast how the batsman will perform in future. Currently ARIMA has been used for the forecast. (To do:  Perform Holt-Winters forecast!)

import cricpy.analytics as ca ca.batsmanPerfForecast("../dravid.csv","Rahul Dravid") ## ARIMA Model Results ## ============================================================================== ## Dep. Variable: D.runs No. Observations: 284 ## Model: ARIMA(5, 1, 0) Log Likelihood -1522.837 ## Method: css-mle S.D. of innovations 51.488 ## Date: Sun, 28 Oct 2018 AIC 3059.673 ## Time: 09:47:39 BIC 3085.216 ## Sample: 07-04-1996 HQIC 3069.914 ## - 01-24-2012 ## ================================================================================ ## coef std err z P>|z| [0.025 0.975] ## -------------------------------------------------------------------------------- ## const -0.1336 0.884 -0.151 0.880 -1.867 1.599 ## ar.L1.D.runs -0.7729 0.058 -13.322 0.000 -0.887 -0.659 ## ar.L2.D.runs -0.6234 0.071 -8.753 0.000 -0.763 -0.484 ## ar.L3.D.runs -0.5199 0.074 -7.038 0.000 -0.665 -0.375 ## ar.L4.D.runs -0.3490 0.071 -4.927 0.000 -0.488 -0.210 ## ar.L5.D.runs -0.2116 0.058 -3.665 0.000 -0.325 -0.098 ## Roots ## ============================================================================= ## Real Imaginary Modulus Frequency ## ----------------------------------------------------------------------------- ## AR.1 0.5789 -1.1743j 1.3093 -0.1771 ## AR.2 0.5789 +1.1743j 1.3093 0.1771 ## AR.3 -1.3617 -0.0000j 1.3617 -0.5000 ## AR.4 -0.7227 -1.2257j 1.4230 -0.3348 ## AR.5 -0.7227 +1.2257j 1.4230 0.3348 ## ----------------------------------------------------------------------------- ## 0 ## count 284.000000 ## mean -0.306769 ## std 51.632947 ## min -106.653589 ## 25% -33.835148 ## 50% -8.954253 ## 75% 21.024763 ## max 223.152901 ## ## C:\Users\Ganesh\ANACON~1\lib\site-packages\statsmodels\tsa\kalmanf\kalmanfilter.py:646: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`. ## if issubdtype(paramsdtype, float): ## C:\Users\Ganesh\ANACON~1\lib\site-packages\statsmodels\tsa\kalmanf\kalmanfilter.py:650: FutureWarning: Conversion of the second argument of issubdtype from `complex` to `np.complexfloating` is deprecated. In future, it will be treated as `np.complex128 == np.dtype(complex).type`. ## elif issubdtype(paramsdtype, complex): ## C:\Users\Ganesh\ANACON~1\lib\site-packages\statsmodels\tsa\kalmanf\kalmanfilter.py:577: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`. ## if issubdtype(paramsdtype, float):

18 Relative Batsman Cumulative Average Runs

The plot below compares the relative cumulative average runs of the batsmen over their careers. The plot indicates the following: in the 30–100 innings range Lara leads, followed by Dravid; beyond 100 innings Kohli races ahead of the rest.

import cricpy.analytics as ca frames = ["../dravid.csv","../acook.csv","../lara.csv","../kohli.csv"] names = ["Dravid","A Cook","Brian Lara","V Kohli"] ca.relativeBatsmanCumulativeAvgRuns(frames,names)

19. Relative Batsman Strike Rate

The plot below compares the relative cumulative strike rates of the batsmen. The plot shows that

Brian Lara towers over Dravid, Cook and Kohli. However, you will notice that Kohli’s strike rate is going up.

import cricpy.analytics as ca frames = ["../dravid.csv","../acook.csv","../lara.csv","../kohli.csv"] names = ["Dravid","A Cook","Brian Lara","V Kohli"] ca.relativeBatsmanCumulativeStrikeRate(frames,names) 20. 3D plot of Runs vs Balls Faced and Minutes at Crease

The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A prediction plane is fitted

import cricpy.analytics as ca ca.battingPerf3d("../dravid.csv","Rahul Dravid")

ca.battingPerf3d("../acook.csv","Alastair Cook")

ca.battingPerf3d("../lara.csv","Brian Lara")

ca.battingPerf3d("../kohli.csv","Virat Kohli")

21. Predicting Runs given Balls Faced and Minutes at Crease

A multi-variate regression plane is fitted between Runs and Balls faced +Minutes at crease.

import cricpy.analytics as ca import numpy as np import pandas as pd BF = np.linspace( 10, 400,15) Mins = np.linspace( 30,600,15) newDF= pd.DataFrame({'BF':BF,'Mins':Mins}) dravid = ca.batsmanRunsPredict("../dravid.csv",newDF,"Dravid") print(dravid) ## BF Mins Runs ## 0 10.000000 30.000000 0.519667 ## 1 37.857143 70.714286 13.821794 ## 2 65.714286 111.428571 27.123920 ## 3 93.571429 152.142857 40.426046 ## 4 121.428571 192.857143 53.728173 ## 5 149.285714 233.571429 67.030299 ## 6 177.142857 274.285714 80.332425 ## 7 205.000000 315.000000 93.634552 ## 8 232.857143 355.714286 106.936678 ## 9 260.714286 396.428571 120.238805 ## 10 288.571429 437.142857 133.540931 ## 11 316.428571 477.857143 146.843057 ## 12 344.285714 518.571429 160.145184 ## 13 372.142857 559.285714 173.447310 ## 14 400.000000 600.000000 186.749436

The fitted model is then used to predict the runs that the batsmen will score for a given Balls faced and Minutes at crease.

22 Analysis of Top 3 wicket takers

The following 3 bowlers have had an excellent career and will be used for the analysis

  1. Glenn McGrath:Wickets: 563, Average = 21.64, Economy Rate – 2.49
  2. Kapil Dev : Wickets: 434, Average = 29.64, Economy Rate – 2.78
  3. James Anderson: Wickets: 564, Average = 28.64, Economy Rate – 2.88

How do Glenn McGrath, Kapil Dev and James Anderson compare with one another with respect to wickets taken and the Economy Rate. The next set of plots compute and plot precisely these analyses.

23. Get the bowler’s data

The plot below computes the percentage frequency of the number of wickets taken, e.g. 1 wicket x%, 2 wickets y%, etc., and plots them as a continuous line

import cricpy.analytics as ca #mcgrath =ca.getPlayerData(6565,dir=".",file="mcgrath.csv",type="bowling",homeOrAway=[1,2], result=[1,2,4]) #kapil =ca.getPlayerData(30028,dir=".",file="kapil.csv",type="bowling",homeOrAway=[1,2], result=[1,2,4]) #anderson =ca.getPlayerData(8608,dir=".",file="anderson.csv",type="bowling",homeOrAway=[1,2], result=[1,2,4]) 24. Wicket Frequency Plot

This plot below plots the frequency of wickets taken for each of the bowlers

import cricpy.analytics as ca ca.bowlerWktsFreqPercent("../mcgrath.csv","Glenn McGrath")

ca.bowlerWktsFreqPercent("../kapil.csv","Kapil Dev")

ca.bowlerWktsFreqPercent("../anderson.csv","James Anderson")

25. Wickets Runs plot

The plot below creates a box plot showing the 1st and 3rd quartiles of runs conceded versus the number of wickets taken

import cricpy.analytics as ca ca.bowlerWktsRunsPlot("../mcgrath.csv","Glenn McGrath")

ca.bowlerWktsRunsPlot("../kapil.csv","Kapil Dev")

ca.bowlerWktsRunsPlot("../anderson.csv","James Anderson")

26 Average wickets at different venues

The plots give the average wickets taken by each bowler at different venues. McGrath’s best performances are at Centurion, Lord’s and Port of Spain, averaging about 4 wickets. Kapil Dev does well at Kingston and Wellington. Anderson averages 4 wickets at Dunedin and Nagpur

import cricpy.analytics as ca ca.bowlerAvgWktsGround("../mcgrath.csv","Glenn McGrath")

ca.bowlerAvgWktsGround("../kapil.csv","Kapil Dev")

ca.bowlerAvgWktsGround("../anderson.csv","James Anderson")

27 Average wickets against different opposition

The plots give the average wickets taken by each bowler against different countries. The x-axis also includes the number of innings against each team

import cricpy.analytics as ca ca.bowlerAvgWktsOpposition("../mcgrath.csv","Glenn McGrath")

ca.bowlerAvgWktsOpposition("../kapil.csv","Kapil Dev")

ca.bowlerAvgWktsOpposition("../anderson.csv","James Anderson")

28 Wickets taken moving average

From the plot below it can be seen that James Anderson has had a solid performance over the years, with a steady moving average of wickets taken

import cricpy.analytics as ca ca.bowlerMovingAverage("../mcgrath.csv","Glenn McGrath")

ca.bowlerMovingAverage("../kapil.csv","Kapil Dev")

ca.bowlerMovingAverage("../anderson.csv","James Anderson")

29 Cumulative average wickets taken

The plots below give the cumulative average wickets taken by the bowlers. McGrath plateaus around 2.4 wickets, Kapil Dev’s performance deteriorates over the years, and Anderson holds rock steady around 2 wickets

import cricpy.analytics as ca ca.bowlerCumulativeAvgWickets("../mcgrath.csv","Glenn McGrath")

ca.bowlerCumulativeAvgWickets("../kapil.csv","Kapil Dev")

ca.bowlerCumulativeAvgWickets("../anderson.csv","James Anderson")

30 Cumulative average economy rate

The plots below give the cumulative average economy rate of the bowlers. McGrath was expensive early in his career, conceding about 2.8 runs per over, which drops to around 2.5 runs towards the end. Kapil Dev’s economy rate drops from 3.6 to 2.8. Anderson is probably more expensive than the other 2.

import cricpy.analytics as ca ca.bowlerCumulativeAvgEconRate("../mcgrath.csv","Glenn McGrath")

ca.bowlerCumulativeAvgEconRate("../kapil.csv","Kapil Dev")

ca.bowlerCumulativeAvgEconRate("../anderson.csv","James Anderson")

31 Future Wickets forecast import cricpy.analytics as ca ca.bowlerPerfForecast("../mcgrath.csv","Glenn McGrath") ## ARIMA Model Results ## ============================================================================== ## Dep. Variable: D.Wickets No. Observations: 236 ## Model: ARIMA(5, 1, 0) Log Likelihood -480.815 ## Method: css-mle S.D. of innovations 1.851 ## Date: Sun, 28 Oct 2018 AIC 975.630 ## Time: 09:28:32 BIC 999.877 ## Sample: 11-12-1993 HQIC 985.404 ## - 01-02-2007 ## =================================================================================== ## coef std err z P>|z| [0.025 0.975] ## ----------------------------------------------------------------------------------- ## const 0.0037 0.033 0.113 0.910 -0.061 0.068 ## ar.L1.D.Wickets -0.9432 0.064 -14.708 0.000 -1.069 -0.818 ## ar.L2.D.Wickets -0.7254 0.086 -8.469 0.000 -0.893 -0.558 ## ar.L3.D.Wickets -0.4827 0.093 -5.217 0.000 -0.664 -0.301 ## ar.L4.D.Wickets -0.3690 0.085 -4.324 0.000 -0.536 -0.202 ## ar.L5.D.Wickets -0.1709 0.064 -2.678 0.008 -0.296 -0.046 ## Roots ## ============================================================================= ## Real Imaginary Modulus Frequency ## ----------------------------------------------------------------------------- ## AR.1 0.5630 -1.2761j 1.3948 -0.1839 ## AR.2 0.5630 +1.2761j 1.3948 0.1839 ## AR.3 -0.8433 -1.0820j 1.3718 -0.3554 ## AR.4 -0.8433 +1.0820j 1.3718 0.3554 ## AR.5 -1.5981 -0.0000j 1.5981 -0.5000 ## ----------------------------------------------------------------------------- ## 0 ## count 236.000000 ## mean -0.005142 ## std 1.856961 ## min -3.457002 ## 25% -1.433391 ## 50% -0.080237 ## 75% 1.446149 ## max 5.840050 32 Get player data special

As discussed above the next 2 charts require the use of getPlayerDataSp()

import cricpy.analytics as ca #mcgrathsp =ca.getPlayerDataSp(6565,tdir=".",tfile="mcgrathsp.csv",ttype="bowling") #kapilsp =ca.getPlayerDataSp(30028,tdir=".",tfile="kapilsp.csv",ttype="bowling") #andersonsp =ca.getPlayerDataSp(8608,tdir=".",tfile="andersonsp.csv",ttype="bowling") 33 Contribution to matches won and lost

The plot below is extremely interesting: Glenn McGrath has been more instrumental in Australia’s wins than Kapil and Anderson have been for their teams, as he seems to have taken more wickets when Australia won.

import cricpy.analytics as ca ca.bowlerContributionWonLost("../mcgrathsp.csv","Glenn McGrath")

ca.bowlerContributionWonLost("../kapilsp.csv","Kapil Dev")

ca.bowlerContributionWonLost("../andersonsp.csv","James Anderson")

34 Performance home and overseas

McGrath and Kapil Dev have performed better overseas than at home. Anderson has performed about the same at home and overseas

import cricpy.analytics as ca ca.bowlerPerfHomeAway("../mcgrathsp.csv","Glenn McGrath")

ca.bowlerPerfHomeAway("../kapilsp.csv","Kapil Dev")

ca.bowlerPerfHomeAway("../andersonsp.csv","James Anderson")

35 Relative cumulative average economy rate of bowlers

The Relative cumulative economy rate shows that McGrath has the best economy rate followed by Kapil Dev and then Anderson.

import cricpy.analytics as ca frames = ["../mcgrath.csv","../kapil.csv","../anderson.csv"] names = ["Glenn McGrath","Kapil Dev","James Anderson"] ca.relativeBowlerCumulativeAvgEconRate(frames,names) 36 Relative Economy Rate against wickets taken

McGrath has been economical regardless of the number of wickets taken. Kapil Dev has been slightly more expensive when he takes more wickets

import cricpy.analytics as ca frames = ["../mcgrath.csv","../kapil.csv","../anderson.csv"] names = ["Glenn McGrath","Kapil Dev","James Anderson"] ca.relativeBowlingER(frames,names) 37 Relative cumulative average wickets of bowlers in career

The plot below shows that McGrath has the best overall cumulative average wickets. Kapil leads Anderson till about 150 innings, after which Anderson takes over

import cricpy.analytics as ca frames = ["../mcgrath.csv","../kapil.csv","../anderson.csv"] names = ["Glenn McGrath","Kapil Dev","James Anderson"] ca.relativeBowlerCumulativeAvgWickets(frames,names) Key Findings

The plots above capture some of the capabilities and features of my cricpy package. Feel free to install the package and try it out. Please do keep in mind ESPN Cricinfo’s Terms of Use.

Here are the main findings from the analysis above

Key insights

1. Brian Lara is head and shoulders above the rest in the overall strike rate.
2. Kohli’s performance has been steadily improving over the years, and with the way he is going he will shatter all records.
3. Kohli and Dravid have scored more in matches where India has won than the other two.
4. Dravid has performed very well overseas.
5. The cumulative average runs has Kohli just edging out the other 3. Kohli is probably midway in his career, but considering that his moving average is improving strongly, we can expect great things of him with the way he is going.
6. McGrath has had some great performances overseas.
7. McGrath has the best economy rate and has contributed significantly to Australia’s wins.
8. In the cumulative average wickets race McGrath leads the pack. Kapil leads Anderson till about 150 matches, after which Anderson takes over.

The code for cricpy can be accessed at Github at cricpy

Do let me know if you run into issues.

Conclusion

I have long wanted to make a python equivalent of cricketr and I have finally been able to do it. cricpy is still a work in progress; I still have to add the necessary functions for ODI and Twenty20. Go ahead, give ‘cricpy’ a spin!!

Stay tuned!

You may also like
1. My book “Deep Learning from first principles” now on Amazon
2. My book ‘Practical Machine Learning in R and Python: Second edition’ on Amazon
3. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
4. De-blurring revisited with Wiener filter using OpenCV
5. Spicing up a IBM Bluemix cloud app with MongoDB and NodeExpress
6. Sixer – R package cricketr’s new Shiny avatar
7. Simulating an Edge Shape in Android

To see all posts click Index of Posts

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….

Scatterplot matrices (pair plots) with cdata and ggplot2

Sun, 10/28/2018 - 03:01

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

In my previous post, I showed how to use cdata package along with ggplot2‘s faceting facility to compactly plot two related graphs from the same data. This got me thinking: can I use cdata to produce a ggplot2 version of a scatterplot matrix, or pairs plot?

A pairs plot compactly plots every (numeric) variable in a dataset against every other one. In base plot, you would use the pairs() function. Here is the base version of the pairs plot of the iris dataset:

pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21, bg = c("#1b9e77", "#d95f02", "#7570b3")[unclass(iris$Species)])

There are other ways to do this, too:

# not run library(ggplot2) library(GGally) ggpairs(iris, columns=1:4, aes(color=Species)) + ggtitle("Anderson's Iris Data -- 3 species") library(lattice) splom(iris[1:4], groups=iris$Species, main="Anderson's Iris Data -- 3 species")

But I wanted to see if cdata was up to the task. So here we go….

First, load the packages:

library(ggplot2) library(cdata)

To create the pairs plot in ggplot2, I need to reshape the data appropriately. For cdata, I need to specify what shape I want the data to be in, using a control table. See the last post for how the control table works. For this task, creating the control table is slightly more involved.

Here, I specify the variables I want to plot.

meas_vars <- colnames(iris)[1:4]

The function expand.grid() returns a data frame of all combinations of its arguments; in this case, I want all pairs of variables.

# the data.frame() call strips the attributes from # the frame returned by expand.grid() controlTable <- data.frame(expand.grid(meas_vars, meas_vars, stringsAsFactors = FALSE)) # rename the columns colnames(controlTable) <- c("x", "y") # add the key column controlTable <- cbind( data.frame(pair_key = paste(controlTable[[1]], controlTable[[2]]), stringsAsFactors = FALSE), controlTable) controlTable ## pair_key x y ## 1 Sepal.Length Sepal.Length Sepal.Length Sepal.Length ## 2 Sepal.Width Sepal.Length Sepal.Width Sepal.Length ## 3 Petal.Length Sepal.Length Petal.Length Sepal.Length ## 4 Petal.Width Sepal.Length Petal.Width Sepal.Length ## 5 Sepal.Length Sepal.Width Sepal.Length Sepal.Width ## 6 Sepal.Width Sepal.Width Sepal.Width Sepal.Width ## 7 Petal.Length Sepal.Width Petal.Length Sepal.Width ## 8 Petal.Width Sepal.Width Petal.Width Sepal.Width ## 9 Sepal.Length Petal.Length Sepal.Length Petal.Length ## 10 Sepal.Width Petal.Length Sepal.Width Petal.Length ## 11 Petal.Length Petal.Length Petal.Length Petal.Length ## 12 Petal.Width Petal.Length Petal.Width Petal.Length ## 13 Sepal.Length Petal.Width Sepal.Length Petal.Width ## 14 Sepal.Width Petal.Width Sepal.Width Petal.Width ## 15 Petal.Length Petal.Width Petal.Length Petal.Width ## 16 Petal.Width Petal.Width Petal.Width Petal.Width

The control table specifies that for every row of iris, sixteen new rows get produced, one for each possible pair of variables. The column pair_key will be the key column in the new data frame; there’s one key level for every possible pair of variables. The columns x and y will be the value columns in the new data frame — these will be the columns that we plot.

Now I can create the new data frame, using rowrecs_to_blocks(). I’ll also carry along the Species column to color the points in the plot.

iris_aug = rowrecs_to_blocks( iris, controlTable, columnsToCopy = "Species") head(iris_aug) ## Species pair_key x y ## 1 setosa Sepal.Length Sepal.Length 5.1 5.1 ## 2 setosa Sepal.Width Sepal.Length 3.5 5.1 ## 3 setosa Petal.Length Sepal.Length 1.4 5.1 ## 4 setosa Petal.Width Sepal.Length 0.2 5.1 ## 5 setosa Sepal.Length Sepal.Width 5.1 3.5 ## 6 setosa Sepal.Width Sepal.Width 3.5 3.5

Note that the data is now sixteen times larger, which I admit is perverse.

If I didn’t care about how the individual subplots were arranged, I’d be done: I’d plot y vs x, and facet_wrap on pair_key. But I want the subplots arranged in a grid. To do this I use facet_grid, which will require two key columns:

splt <- strsplit(iris_aug$pair_key, split = " ", fixed = TRUE) iris_aug$xv <- vapply(splt, function(si) si[[1]], character(1)) iris_aug$yv <- vapply(splt, function(si) si[[2]], character(1)) head(iris_aug) ## Species pair_key x y xv yv ## 1 setosa Sepal.Length Sepal.Length 5.1 5.1 Sepal.Length Sepal.Length ## 2 setosa Sepal.Width Sepal.Length 3.5 5.1 Sepal.Width Sepal.Length ## 3 setosa Petal.Length Sepal.Length 1.4 5.1 Petal.Length Sepal.Length ## 4 setosa Petal.Width Sepal.Length 0.2 5.1 Petal.Width Sepal.Length ## 5 setosa Sepal.Length Sepal.Width 5.1 3.5 Sepal.Length Sepal.Width ## 6 setosa Sepal.Width Sepal.Width 3.5 3.5 Sepal.Width Sepal.Width

And now I can produce the graph, using facet_grid.

# reorder the key columns to be the same order # as the base version above iris_aug$xv <- factor(as.character(iris_aug$xv), meas_vars) iris_aug$yv <- factor(as.character(iris_aug$yv), meas_vars) ggplot(iris_aug, aes(x=x, y=y)) + geom_point(aes(color=Species, shape=Species)) + facet_grid(yv~xv, labeller = label_both, scale = "free") + ggtitle("Anderson's Iris Data -- 3 species") + scale_color_brewer(palette = "Dark2") + ylab(NULL) + xlab(NULL)

This pair plot has x = y plots on the diagonals instead of the names of the variables, but you can confirm that it is otherwise the same as the pair plot produced by pairs().

Of course, calling pairs() (or ggpairs(), or splom()) is a lot easier than all this, but now I’ve proven to myself that cdata with ggplot2 can do the job. This version does have a few advantages. It comes with a legend by default, which is nice. And it’s not obvious how to change the color palette in ggpairs() — I prefer the Brewer Dark2 palette, myself.

Luckily, this code is straightforward to wrap as a function, so if you like the cdata version, I’ve now added the PairPlot() function to WVPlots. Now it’s a one-liner, too.

library(WVPlots) PairPlot(iris, colnames(iris)[1:4], "Anderson's Iris Data -- 3 species", group_var = "Species")

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Celebrate Halloween with Creepy Computer Games in R

Sun, 10/28/2018 - 02:00

(This article was first published on The Devil is in the Data – The Lucid Manager, and kindly contributed to R-bloggers)

In the 1980s I spent my time writing code on my 8-bit ZX81 and Atari computers. I learnt everything I know about programming from copying and modifying printed code listings from books with computer games. The games in these books are mostly simple text-based games, but the authors gave them enticing names, often imaginatively illustrated to visualise the virtual world they represent. A line and a dot become a game of tennis, and a computer that was able to play Tic Tac Toe made it seem as if your machine had come alive.

This article translates some games from the 1983 Creepy Computer Games book to celebrate Halloween.  This article is part of my series on gaming with the R language.

Creepy Computer Games in R

The old books by Usborne Publishing are unique because they contain several versions of each program to ensure compatibility with some of the many dialects of the BASIC language. I first entered the code into the atari800 emulator to test what it does, after which I converted it to the R language.

Let’s step into the creepy world of computer games as imagined by Usborne Publishing.

Reynold, Colin and McCaig, Rob, Creepy Computer Games (Usborne, London). Gravedigger

Gravedigger by Alan Ramsey is a typical example of the games listed in the books of the early days of home computing. You can download the original book for free from the publisher’s Google drive. The Gravedigger listing starts on page 10. The lyrical description of the game provides the instructions:

It’s dark and windy—not the kind of night to be lost in a graveyard, but that’s where you are. You have until midnight to find your way out. Skeletons lurk in the shadows waiting to scare you to death should you come too close. You can dig holes to help keep them away but digging is tiring work and you cannot manage more than five in one game. You have to be careful not to fall down the holes you have dug. Grave stones (marked +) and the walls of the graveyard (marked :) block your way. The holes you dig are marked O, you are * and the skeletons are X. See if you can escape.

Partial page of the Gravedigger game in BASIC.

I translated the BASIC code as closely to the original as possible. This game is not pretty code, but it works. Some of the variable names have been changed because, in BASIC, single variables and vectors can have the same name and names of character vectors end in $. A future version of this game could use graphics as I did in the Tic Tac Toe game.

The game is quite tricky, and I have only managed to escape the graveyard once. It looks like the likelihood of success very much depends on the random distribution of the graves. Perhaps we need some machine learning to optimise strategy.

You can view the code below, or download it from GitHub. I leave it up to you to deconstruct the program and safely work your way through the graveyard.

Happy Halloween!

## Creepy Computer Games ## Reynold, Colin and McCaig, Rob, Creepy Computer Games (Usborne, London). ## https://archive.org/details/Creepy_Computer_Games_1983_Usborne_Publishing/ ## Gravedigger by Alan Ramsey ## Initiate board A <- matrix(ncol = 20, nrow = 10) A[,] <- " " ## Starting variables W <- 0 # Move number X <- 5 # Remaining holes death <- 0 ## Initiate pieces Y <- "*" B <- "+" C <- "O" D <- ":" E <- "X" Z <- " " ## Draw board ## Add borders A[c(1, 10), ] <- D A[10, ] <- D A[, 1] <- D A[1:8, 20] <- D ## Add graves for (i in 1:20){ A[floor(runif(1) * 7 + 2), floor(runif(1) * 15 + 3)] <- B } ## Starting positions ## Player M <- 2 N <- 2 ## Skeletons S <- c(4, 19, 3, 19, 2, 19) ## Game play repeat{ ## Position player A[N, M] <- Y ## Position skeletons for (J in seq(1, 5, by = 2)) { A[S[J], S[J + 1]] <- J } W <- W + 1 ## Move counter if (W > 60) { print("The clock's struck midnight") print("Aghhhhh!!!!") break } ## Clear screen #cat(rep("\n", 50)) ## Print board v <- paste(as.vector(t(A)), collapse = "") for (i in 1:10) print(substr(v, (i - 1) * 20 + 1, (i - 1) * 20 + 20)) ## Enter move A1 <- toupper(readline(paste0("Enter move ", W, " (You can go N, S, E or W): "))) ## Move player T <- N U <- M if (A1 == "N") { T <- N - 1 } if (A1 == "E") { U <- M + 1 } if (A1 == "S") { T <- N + 1 } if (A1 == "W") { U <- M - 1 } ## Collission detection if (A[T, U] == D | A[T, U] == B) { # Edge or grave print("That way's blocked") } if (A[T, U] == C) { # Hole print("You've fallen into one of your own holes") break } if (A[T, U] == E) { # Skeleton death <- 1 } if (T == 9 & U == 20) { # Escaped print("You're free!") print(paste0("Your performance rating is ", floor((60 - W) / 60 * (96 + X)), "%")) break } ## Move player and dig hole if (A[T, U] == Z) { # Move into empty space A [N, M] <- Z if (X != 0) { B1 <- toupper(readline("Would you like to dig a hole (Y or N): ")) if (B1 == "Y") { X <- X - 1 A[N, M] <- C } } N <- T M <- U } ## Move skeletons for (J in seq(1, 5, by = 2)) { T <- S[J] U <- S[J + 1] ## Collision detection if (any(c(A[T, U + 1], A[T, U - 1], A[T - 1, U], A[T + 1, U]) %in% Y)) { death <- 1 } if (A1 == "S" & A[T + 1, U] == Z){ S[J] <- S[J] + 1 # Follow player } if (A1 == "N" & A[T - 1, U] == Z){ S[J] <- S[J] - 1 # Follow player } if (A1 == "E" & A[T, U - 1] == Z & M < U){ S[J + 1] <- S[J + 1] - 1 # Move towards player } if (A1 == "E" & A[T, U + 1] == Z & M > U) { S[J + 1] <- S[J + 1] + 1 # Reverse direction } if (A1 %in% c("S", "N", "E")) { A[T, U] <- Z } } if (death == 1) { print("Urk! You've been scared to death by a skeleton.") break } }

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data – The Lucid Manager.

Phyllotaxis Sprial and Prime Numbers – Experiment

Sun, 10/28/2018 - 02:00

(This article was first published on R on Chi's Impe[r]fect Blog, and kindly contributed to R-bloggers)

I recently tried out DataCamp’s project called “Phyllotaxis: Draw flowers using mathematics”. Now I’m hooked on drawing spirals using the golden angle. Also, the mathematical art on the instructor’s blog, Fronkonstin, is just amazing!

Separately, on Twitter I’ve gotten a message about art with prime numbers, and that got me thinking about experimenting with prime numbers & phyllotaxis flowers. I couldn’t figure out how I’d generate prime numbers in R, but I came across a site where you can download prime numbers.
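For completeness, a small sieve of Eratosthenes is one way to generate primes directly in R (a sketch of an alternative; the post itself sticks with the downloaded list below).

# return all primes up to n (assumes n >= 4 for this simple sketch)
sieve_primes <- function(n) {
  is_prime <- rep(TRUE, n)
  is_prime[1] <- FALSE
  for (i in 2:floor(sqrt(n))) {
    if (is_prime[i]) is_prime[seq(i * i, n, by = i)] <- FALSE
  }
  which(is_prime)
}
head(sieve_primes(10000))  # 2 3 5 7 11 13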

library(tidyverse) library(patchwork) ## Read first 10000 digits of prime number! prime <- read_csv(file="http://www.naturalnumbers.org/P-10000.txt", col_names=F) names(prime) <- c("nth_prime","prime","int") ## int = interval from previous prime number ## Function to Draw Frlower my_flower <- function(points=5000,num_colour=9,col_option="magma",angle=pi*(3-sqrt(5)),...){ flower <- tibble( n = c(1:points), ## change number here to use different # of points r = sqrt(n), is_prime = n %in% prime$prime, #logical colour = n%%num_colour, ## 2,3,6,12,18, seems to bring out the sprial pattern x = r * cos(angle*n), y = r * sin(angle*n) ) prime.cnt <- flower %>% filter(is_prime) %>% count() angle_deg <- if(angle==pi*(3-sqrt(5))) {"golden angle!(137.51 degree | 2.4 radian)"} else {paste(round(angle*180/pi,2),"degree | ",round(angle,2),"radian")} ## Drawing Flower (but not using Prime Number) flower_plot <-flower %>% filter(!is_prime) %>% ggplot(aes(x=x, y=y, colour=colour)) + geom_point() + geom_path(size=0.01) + scale_colour_viridis_c(end=0.8, guide="none", option=col_option) + coord_fixed() + theme_void(base_family="Roboto Condensed") + labs(caption=paste(num_colour, "colours used to plot", points-prime.cnt,"dots.\nAngle Used: ", angle_deg), subtitle="Flower Nibbled by Prime Number Bug") ## Drawing Flower (only using Prime Number) flower_prime <-flower %>% filter(is_prime) %>% ggplot(aes(x=x, y=y, colour=colour)) + geom_point() + scale_colour_viridis_c(end=0.8, guide="none", option=col_option) + coord_fixed() + theme_void(base_family="Roboto Condensed") + labs(caption=paste("Numbers between 1 and ",points, "have", prime.cnt," Prime Numbers\n"), subtitle="Flower made up by Prime Numbers Only") #You need to Print flower_plot + flower_prime } Experimenting with Different Variables

I’ve wrote function to draw flower as above, so I can now experiment by changing below.

  • points = Number of points, e.g. up to what number should we use to draw the flower? (up to 104729)
  • num_colour = Number of colours to use. When the golden angle is used, multiples of 6 seem to make the colours line up.
  • col_option = I can use magma, viridis, plasma, inferno or cividis here
  • angle = Angle to use for drawing spirals. Default is set to the golden angle pi*(3-sqrt(5))
my_flower() ## drawing with all default values e.g. 5000 points, 9 colours, angle= golden angle

my_flower(angle = 2*pi-(pi*(3-sqrt(5)))) ## changing angle 2pi - golden angle so it looks like it's the reverse of the above.

my_flower(angle=pi/sqrt(7)) ## Just testing out different angles

my_flower(angle=pi/sqrt(pi), num_col=6*3) ## Using 18 different colours, but hard to distinguish the colour...

my_flower(angle=exp(1), num_col=10) ## exp(1) = Euler's Number as angle.

my_flower(angle = 0.3, num_col=3) ## All the black arm goes away when you only use prime number

my_flower(num_col=6*3, col_option="plasma") ## plasma colour palette!

my_flower(points=10000,num_col=18, col_option="viridis") ## little over crowded...

I could play with different angles all day long! So intriguing…

To leave a comment for the author, please follow the link and comment on their blog: R on Chi's Impe[r]fect Blog.

RcppRedis 0.1.9

Sat, 10/27/2018 - 23:58

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new minor release of RcppRedis arrived on CRAN earlier today. RcppRedis is one of several packages to connect R to the fabulous Redis in-memory datastructure store (and much more). RcppRedis does not pretend to be feature complete, but it may do some things faster than the other interfaces, and also offers an optional coupling with MessagePack binary (de)serialization via RcppMsgPack. The package has carried production loads for several years now.

This release adds a few functions for the hash data structure thanks to Whit. I also relented and now embed the small hiredis C library as I got tired of seeing builds fail on macOS where the CRAN maintainer was either unwilling or unable to install an external hiredis library. Some packaging details were also brushed up. Fuller details below.

Changes in version 0.1.9 (2018-10-27)
  • The configure setup is more robust with respect to the C++ setup [CRAN request].

  • The Travis builds were updated to R 3.5 along with all others (#34).

  • A new Redis function hexists was added (Whit Armstrong in #35).

  • The hiredis library source is now included, and built on all systems unwilling or unable to provide it (#36).

  • Added hash functions HDEL, HLEN, HKEYS, and HGETALL (Whit Armstrong in #38).

Courtesy of CRANberries, there is also a diffstat report for this release. More information is on the RcppRedis page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


How to perform merges (joins) on two or more data frames with base R, tidyverse and data.table

Sat, 10/27/2018 - 14:00

(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)

Introduction

In this post in the R:case4base series we will look at one of the most common operations on multiple data frames – merge, also known as JOIN in SQL terms.

We will learn how to do the 4 basic types of join – inner, left, right and full join – with base R and show how to perform the same with tidyverse’s dplyr and data.table’s methods. A quick benchmark will also be included.

Joins

Contents
  1. Merging (joining) two data frames with base R
  2. The arguments of merge
  3. Merging multiple data frames
  4. Alternatives to base R
  5. Quick benchmarking
  6. TL;DR – Just want the code
  7. References
Merging (joining) two data frames with base R

To showcase the merging, we will use a very slightly modified dataset provided by Hadley Wickham’s nycflights13 package, mainly the flights and weather data frames. Let’s get right into it and simply show how to perform the different types of joins with base R.

First, we prepare the data and store the columns we will merge by (join on) into mergeCols:

dataurl <- "https://jozefhajnala.gitlab.io/r/post/data/" weather <- readRDS(url(paste0(dataurl, "r006/weather.rds"))) flights <- readRDS(url(paste0(dataurl, "r006/flights.rds"))) mergeCols <- c("time_hour", "origin") head(flights) ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## arr_delay carrier flight tailnum origin dest air_time distance hour ## 1 11 UA 1545 N14228 EWR IAH 227 1400 5 ## 2 20 UA 1714 N24211 LGA IAH 227 1416 5 ## 3 33 AA 1141 N619AA JFK MIA 160 1089 5 ## 4 -18 B6 725 N804JB JFK BQN 183 1576 5 ## 5 -25 DL 461 N668DN LGA ATL 116 762 6 ## 6 12 UA 1696 N39463 EWR ORD 150 719 5 ## minute time_hour ## 1 15 2013-01-01 05:00:00 ## 2 29 2013-01-01 05:00:00 ## 3 40 2013-01-01 05:00:00 ## 4 45 2013-01-01 05:00:00 ## 5 0 2013-01-01 06:00:00 ## 6 58 2013-01-01 05:00:00 head(weather) ## origin year month day hour temp dewp humid wind_dir wind_speed ## 1 EWR 2013 1 1 1 39.02 26.06 59.37 270 10.35702 ## 2 EWR 2013 1 1 2 39.02 26.96 61.63 250 8.05546 ## 3 EWR 2013 1 1 3 39.02 28.04 64.43 240 11.50780 ## 4 EWR 2013 1 1 4 39.92 28.04 62.21 250 12.65858 ## 5 EWR 2013 1 1 5 39.02 28.04 64.43 260 12.65858 ## 6 EWR 2013 1 1 6 37.94 28.04 67.21 240 11.50780 ## wind_gust precip pressure visib time_hour ## 1 NA 0 1012.0 10 2013-01-01 01:00:00 ## 2 NA 0 1012.3 10 2013-01-01 02:00:00 ## 3 NA 0 1012.5 10 2013-01-01 03:00:00 ## 4 NA 0 1012.2 10 2013-01-01 04:00:00 ## 5 NA 0 1011.9 10 2013-01-01 05:00:00 ## 6 NA 0 1012.4 10 2013-01-01 06:00:00

Now, we show how to perform the 4 merges (joins):

Inner join inner <- merge(flights, weather, by = mergeCols) Left (outer) join left <- merge(flights, weather, by = mergeCols, all.x = TRUE) Right (outer) join right <- merge(flights, weather, by = mergeCols, all.y = TRUE) Full (outer) join full <- merge(flights, weather, by = mergeCols, all = TRUE) Other join types # Cross Join (Cartesian product) cross <- merge(flights, weather, by = NULL) # Natural Join natural <- merge(flights, weather) The arguments of merge

The key arguments of base merge data.frame method are:

  • x, y – the 2 data frames to be merged
  • by – names of the columns to merge on. If the column names differ between the two data frames, we can specify by.x and by.y with the names of the columns in the respective data frames (a short sketch using by.x, by.y and suffixes follows this list). The by argument can also be specified by number, logical vector or left unspecified, in which case it defaults to the intersection of the names of the two data frames. From a best-practice perspective it is advisable to always specify the argument explicitly, ideally by column names.
  • all, all.x, all.y – default to FALSE and can be used to specify the type of join we want to perform:
    • all = FALSE (the default) – gives an inner join – combines the rows in the two data frames that match on the by columns
    • all.x = TRUE – gives a left (outer) join – adds to the all = FALSE result the rows of x that have no matching row in y
    • all.y = TRUE – gives a right (outer) join – adds to the all = FALSE result the rows of y that have no matching row in x
    • all = TRUE – gives a full (outer) join. This is a shorthand for all.x = TRUE and all.y = TRUE

Other arguments include

  • sort – if TRUE (default), results are sorted on the by columns
  • suffixes – a length-2 character vector specifying the suffixes used to make unique the names of the result columns that are not used for merging
  • incomparables – for single-column merging only, a vector of values that cannot be matched. Any value in x matching a value in this vector is assigned the nomatch value (which can be passed using ...)
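As a quick illustration of by.x, by.y and suffixes, here is a minimal sketch with two made-up data frames (they are not part of the nycflights13 data used in this post):

employees <- data.frame(emp_id = 1:3, dept = c("A", "B", "A"),
                        name = c("Ann", "Bob", "Cat"))
departments <- data.frame(id = c("A", "B"),
                          name = c("Analytics", "Billing"))

# Join on differently named key columns; the clashing "name" column is
# disambiguated by the suffixes, giving name.emp and name.dept in the result
merge(employees, departments,
      by.x = "dept", by.y = "id",
      suffixes = c(".emp", ".dept"))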
Merging multiple data frames

For this example, let us have a list of all the data frames included in the nycflights13 package, slightly updated such that they can be merged with the default value for by, purely for this exercise, and store them in a list called flightsList:

flightsList <- readRDS(url(paste0(dataurl, "r006/nycflights13-list.rds"))) lapply(flightsList, function(x) c(toString(dim(x)), toString(names(x)))) ## $flights ## [1] "336776, 19" ## [2] "year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr_delay, carrier, flight, tailnum, origin, dest, air_time, distance, hour, minute, time_hour" ## ## $weather ## [1] "26115, 15" ## [2] "origin, year, month, day, hour, temp, dewp, humid, wind_dir, wind_speed, wind_gust, precip, pressure, visib, time_hour" ## ## $airlines ## [1] "16, 2" "carrier, name" ## ## $airports ## [1] "1458, 8" ## [2] "origin, airportname, lat, lon, alt, tz, dst, tzone" ## ## $planes ## [1] "3322, 9" ## [2] "tailnum, yearmanufactured, type, manufacturer, model, engines, seats, speed, engine"

Since merge is designed to work with 2 data frames, merging multiple data frames can of course be achieved by nesting the calls to merge:

multiFull <- merge(merge(merge(merge( flightsList[[1L]], flightsList[[2L]], all = TRUE), flightsList[[3L]], all = TRUE), flightsList[[4L]], all = TRUE), flightsList[[5L]], all = TRUE)

We can however achieve this same goal much more elegantly, taking advantage of base R’s Reduce function:

# For Inner Join multi_inner <- Reduce( function(x, y, ...) merge(x, y, ...), flightsList ) # For Full (Outer) Join multi_full <- Reduce( function(x, y, ...) merge(x, y, all = TRUE, ...), flightsList )

Note that this example is oversimplified and the data was updated such that the default values for by give meaningful joins. For example, in the original planes data frame the column year would have been matched onto the year column of the flights data frame, which is nonsensical as the years have different meanings in the two data frames. This is why we renamed the year column in the planes data frame to yearmanufactured for the above example.
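As an aside, that rename is a one-liner. The sketch below is purely illustrative (the provided .rds files already have it applied):

# Rename the year column of the original nycflights13 planes data
planes <- nycflights13::planes
names(planes)[names(planes) == "year"] <- "yearmanufactured"
# or, with dplyr: planes <- dplyr::rename(planes, yearmanufactured = year)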

Alternatives to base R Using the tidyverse

The dplyr package comes with a set of very user-friendly functions that seem quite self-explanatory:

library(dplyr) inner_dplyr <- inner_join(flights, weather, by = mergeCols) left_dplyr <- left_join(flights, weather, by = mergeCols) right_dplyr <- right_join(flights, weather, by = mergeCols) full_dplyr <- full_join(flights, weather, by = mergeCols)

We can also use the “forward pipe” operator %>% that becomes very convenient when merging multiple data frames:

inner_dplyr <- flights %>% inner_join(weather, by = mergeCols) left_dplyr <- flights %>% left_join(weather, by = mergeCols) right_dplyr <- flights %>% right_join(weather, by = mergeCols) full_dplyr <- flights %>% full_join(weather, by = mergeCols) Using data.table

The data.table package provides an S3 method for the merge generic that has a very similar structure to the base method for data frames, meaning its use is very convenient for those familiar with that method. In fact the code is exactly the same as the base one for our example use.

One important difference worth noting is that data.table constructs the default value of the by argument differently.

We provide it explicitly, however, so this difference does not directly affect our example:

library(data.table) weather <- as.data.table(weather) flights <- as.data.table(flights) setkeyv(weather, mergeCols) setkeyv(flights, mergeCols) # Note that this is identical to the code for base # The data.table method is called automatically for objects of class data.table inner_dt <- merge(flights, weather, by = mergeCols) left_dt <- merge(flights, weather, by = mergeCols, all.x = TRUE) right_dt <- merge(flights, weather, by = mergeCols, all.y = TRUE) full_dt <- merge(flights, weather, by = mergeCols, all = TRUE)

Alternatively, we can write data.table joins as subsets:

inner_dt <- flights[weather, on = mergeCols, nomatch = 0] left_dt <- weather[flights, on = mergeCols] right_dt <- flights[weather, on = mergeCols] Quick benchmarking

For a quick overview, let’s look at a basic benchmark without package loading overhead for each of the mentioned packages:



Inner join bench_inner <- microbenchmark::microbenchmark(times = 100, base = base::merge.data.frame(flights, weather, by = mergeCols), base_nosort = base::merge.data.frame(flights, weather, by = mergeCols, sort = FALSE), dt_merge = merge(flights, weather, by = mergeCols), dt_subset = flights[weather, on = mergeCols, nomatch = 0], dplyr = inner_join(flights, weather, by = mergeCols), dplyr_pipe = flights %>% inner_join(weather, by = mergeCols) )

[Interactive benchmark boxplot omitted. Recovered median times for the inner join, in milliseconds: base 3639, base_nosort 2413.5, dt_merge 93.5, dt_subset 71, dplyr 102.5, dplyr_pipe 105.]

Full (outer) join bench_outer <- microbenchmark::microbenchmark(times = 100, base = base::merge.data.frame(flights, weather, by = mergeCols, all = TRUE), base_nosort = base::merge.data.frame(flights, weather, by = mergeCols, all = TRUE, sort = FALSE), dt_merge = merge(flights, weather, by = mergeCols, all = TRUE), dplyr = full_join(flights, weather, by = mergeCols), dplyr_pipe = flights %>% full_join(weather, by = mergeCols) )

[Interactive benchmark boxplot omitted. Recovered median times for the full (outer) join, in milliseconds: base 3543, base_nosort 3060, dt_merge 277, dplyr 164, dplyr_pipe 177.]

Visualizing the results in this case shows base R comes way behind the two alternatives, even with sort = FALSE.

Note: The benchmarks were run on a standard DigitalOcean droplet with 2GB of memory and 2 vCPUs.

TL;DR – Just want the code

No time for reading? Click here to get just the code with commentary

References


Exactly 100 years ago tomorrow, October 28th, 1918 the independence of Czechoslovakia was proclaimed by the Czechoslovak National Council, resulting in the creation of the first democratic state of Czechs and Slovaks in history.


Visualizing The Catholic Lectionary – Part 1

Sat, 10/27/2018 - 06:00

(This article was first published on Stefan Avey - r, and kindly contributed to R-bloggers)

What’s a Lectionary?

A lectionary, according to Wikipedia, is a listing of scripture readings for Christian or Judaic worship on a given day. The Roman Catholic Lectionary will contain a list of readings for a specific day that are on a 3-year cycle. Here is an example:

Thirtieth Sunday in Ordinary Time
Lectionary: 149
Reading 1: Jer 31:7-9
Psalm: Ps 126:1-2, 2-3, 4-5, 6
Reading 2: Heb 5:1-6
Gospel: Mk 10:46-52

As a Catholic, I’ve often wondered how much of the Bible is included in the lectionary. Fortunately, Felix Just, S.J., Ph.D. compiled a website with statistics on many aspects of the lectionary which help to answer this question.

His analysis shows that a Catholic who attends Mass on Sundays and Feasts (but not weekdays) would hear ~ 4% of the Old Testament (excluding Psalms) and ~ 41% of the New Testament. Wow – this was a shock to me! I bet if you ask around you’ll hear that many Catholics think the lectionary includes the whole Bible. To be fair, these numbers do get a bit higher if you include the complete lectionary (with weekdays). The total coverage of the Old Testament is ~14% (again excluding Psalms) and for the New Testament it is ~72%.

Since I’m a Data Scientist, I naturally wanted to dig into these numbers and ask more questions. The summary tables nicely show which books are covered and which are not. But what verses are actually covered? Is there something unique about those verses vs. those not in the Lectionary (e.g. different sorts of themes)? In the future I plan to write another post on understanding the content via Natural Language Processing. For this post, I’ll focus on creating a single visualization of the lectionary coverage for Sundays and Major Feasts to help me understand what parts are covered at a glance.

Ok Cool, But Isn’t the Lectionary a Book?

Yes, the lectionary is traditionally a book. But it’s the 21st century and luckily someone else has transcribed this into an electronic listing so I didn’t have to do it. The data necessary includes the Sunday Lectionary and a pair of reference guides that tell us how many chapters/verses are in each book of the Bible (Old Testament Reference, New Testament Reference). I could have played with web scraping but there were only a few tables so I just copied the tables in Felix’s website into Excel files (link to data files).

Code ############## ## Packages ## ############## library(readxl) library(scales) library(knitr) library(tidyverse) library(aveytoolkit) ############### ## Variables ## ############### data_dir <- "plotBible" OTRefFile <- file.path(data_dir, "OldTestamentReference.xlsx") NTRefFile <- file.path(data_dir, "NewTestamentReference.xlsx") abbrevFile <- file.path(data_dir, "BibleAbbreviations.xlsx") LectSundaysFile <- file.path(data_dir, "Lectionary-SundaysAndFeasts.xlsx") ############### ## Functions ## ############### source(file.path(data_dir, "helperFunctions.R")) Code ######################## ## Read in references ## ######################## testaments <- c("OT", "NT") abbrev <- map(testaments, ~read_excel(abbrevFile, sheet = .x)) names(abbrev) <- testaments ## Old Testament OTsections <- c("Torah", "Historical", "Wisdom", "MajorProphets", "MinorProphets") OTref <- map_df(OTsections, function(OTsection) { read_excel(OTRefFile, sheet = OTsection, na = ".") %>% mutate(Section = OTsection) }) %>% rename(Book = `Book Name`) %>% gather(matches("[0-9]+"), key = "Chapter", value = "Verses") %>% filter(!is.na(Verses)) %>% select(Section, Book, Chapter, Verses) %>% mutate(Book = factor(Book, # Order books in factor levels levels = abbrev$OT$Name), Abbrv = abbrev$OT$Abbreviation[match(Book, abbrev$OT$Name)], Chapter = as.numeric(Chapter)) %>% arrange(Book) %>% mutate(Abbrv = factor(Abbrv, levels = unique(Abbrv))) ## Munge so that reference contains 1 row for every verse in the order they ## appear in the Old Testament with a `Pos` column to denote the position of the ## verse. OTref2Pos <- OTref %>% arrange(Book, Chapter) %>% mutate(Verse = map(Verses, function(x) 1:x)) %>% unnest() %>% mutate(Chapter_Verse = paste(Chapter, Verse, sep = ':')) %>% mutate(Pos = 1:n()) %>% mutate(Testament = "Old") ## New Testament NTsections <- c("Gospels", "NT") NTref <- map_df(NTsections, function(NTsection) { read_excel(NTRefFile, sheet = NTsection, na = ".") %>% mutate(Section = NTsection) %>% mutate_at(vars(matches("[0-9]+")), funs(as.integer(str_replace(., fixed("*"), "")))) }) %>% rename(Abbrv = `Book  Name`) %>% gather(matches("[0-9]+"), key = "Chapter", value = "Verses") %>% filter(!is.na(Verses)) %>% select(Section, Abbrv, Chapter, Verses) %>% mutate(Abbrv = factor(Abbrv, levels = abbrev$NT$Abbreviation), Book = abbrev$NT$Name[match(Abbrv, abbrev$NT$Abbreviation)], Chapter = as.numeric(Chapter)) %>% arrange(Abbrv) %>% mutate(Book = factor(Book, levels = unique(Book))) ## Munge so that reference contains 1 row for every verse in the order they ## appear in the New Testament with a `Pos` column to denote the position of the ## verse. NTref2Pos <- NTref %>% arrange(Book, Chapter) %>% mutate(Verse = map(Verses, function(x) 1:x)) %>% unnest() %>% mutate(Chapter_Verse = paste(Chapter, Verse, sep = ':')) %>% mutate(Pos = 1:n()) %>% mutate(Testament = "New") ref2Pos <- bind_rows(OTref2Pos, NTref2Pos) %>% mutate(Pos = 1:n()) Reference Data

There is a lot of data transformation necessary but in the end, we have a data frame with 35,524 rows which contains 1 row for every verse in the Catholic Bible. This will be used as the reference to compare the lectionary to. Below I show part of the table containing the first and last 6 verses.

Code ref2Pos %>% select(Testament, Section, Book, Chapter, Verse, Pos) %>% head() %>% knitr::kable() Testament Section Book Chapter Verse Pos Old Torah Genesis 1 1 1 Old Torah Genesis 1 2 2 Old Torah Genesis 1 3 3 Old Torah Genesis 1 4 4 Old Torah Genesis 1 5 5 Old Torah Genesis 1 6 6 Code ref2Pos %>% select(Testament, Section, Book, Chapter, Verse, Pos) %>% tail() %>% knitr::kable() Testament Section Book Chapter Verse Pos New NT Revelation 22 16 35519 New NT Revelation 22 17 35520 New NT Revelation 22 18 35521 New NT Revelation 22 19 35522 New NT Revelation 22 20 35523 New NT Revelation 22 21 35524 Lectionary Data Code ################################################### ## Read in the Lectionary for Sundays and Feasts ## ################################################### readings <- c("OT", "Psalm", "NT", "Gospel") LectSundays <- map(readings, ~read_excel(LectSundaysFile, sheet = .x)) names(LectSundays) <- readings SunLect <- bind_rows(LectSundays) %>% tbl_df() %>% separate(LectNum_Year, c("LectNum", "Year"), sep = '-') %>% mutate(YearA = grepl("A", Year), YearB = grepl("B", Year), YearC = grepl("C", Year))

Below are some entries in the lectionary including the Reading as well as the Year in the 3-year cycle (ABC means read in years A, B, and C). For this analysis, I’ll ignore the Year and just consider the coverage over the whole 3-year cycle.

Code SunLect %>% select(Reading, LectNum, Year, Day) %>% head() %>% kable() Reading LectNum Year Day Gen 1:1-2:2 or 1:1, 26-31a 41 ABC Easter Vigil (1st Reading) Gen 2:7-9; 3:1-7 22 A 1st Sunday of Lent Gen 2:18-24 140 B 27th Sunday in Ordinary Time Gen 3:9-15 89 B 10th Sunday in Ordinary Time Gen 9:8-15 23 B 1st Sunday of Lent Gen 11:1-9 62 ABC Pentecost Sunday: Vigil Mass (Option 1) Parsing Verses

The crucial part is matching the Reading to the reference. This seems straightforward but can be very tricky in some cases. I wrote a basic parser (see code) to handle the most common cases. It ignores some aspects such as the “or” option as in the first row of the table above. Usually an “or” option gives a longer form followed by a shorter form (subset of the first option) so I’ll only consider the longer forms.
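ParseFull() itself lives in helperFunctions.R and isn’t shown in the post, so here is my own minimal sketch of such a parser for illustration; it is not the author’s implementation. It keeps only the longer form before any “or”, drops partial-verse letters like “16b”, and splits the reference on “;” and “,”:

parse_reading <- function(reading) {
  reading <- strsplit(reading, " or ")[[1]][1]              # keep the longer form only
  abbrv <- sub("^([1-3]? ?[A-Za-z]+) .*$", "\\1", reading)  # book abbreviation, e.g. "Gen" or "1 Cor"
  body  <- sub("^[1-3]? ?[A-Za-z]+ ", "", reading)
  body  <- gsub("[a-z]", "", body)                          # drop partial-verse suffixes like "16b"
  segments <- trimws(unlist(strsplit(body, "[;,]")))
  last_chapter <- NA
  out <- lapply(segments, function(seg) {
    parts <- trimws(strsplit(seg, "-")[[1]])
    # carry the chapter forward for segments like "19" following "63:16-17"
    parts <- unname(vapply(parts, function(p) {
      if (!grepl(":", p)) p <- paste(last_chapter, p, sep = ":")
      last_chapter <<- sub(":.*$", "", p)
      p
    }, character(1)))
    data.frame(start = parts[1], end = parts[length(parts)],
               Abbrv = abbrv, stringsAsFactors = FALSE)
  })
  do.call(rbind, out)
}

parse_reading("Isa 63:16b-17, 19b; 64:2-7")
##   start   end Abbrv
## 1 63:16 63:17   Isa
## 2 63:19 63:19   Isa
## 3  64:2  64:7   Isa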

Each set of contiguous verses gets stored in a new data frame by the parser. For example, Gen 1:1-2:2 or 1:1, 26-31a will get parsed to:

Code ParseFull(SunLect$Reading[1]) ## start end Abbrv ## 1 1:1 2:2 Gen

and Gen 2:7-9; 3:1-7 will get parsed to:

Code ParseFull(SunLect$Reading[2]) ## start end Abbrv ## 1 2:7 2:9 Gen ## 2 3:1 3:7 Gen

Then, once each reading from the lectionary is parsed, it gets matched back to the reference.

Code ## Apply parser on the full Sunday Lectionary res <- SunLect %>% mutate(Pos = map(Reading, function(x) { ## cat(x, sep = '\n') # debugging ParseFull(x) %>% left_join(ref2Pos, by = c(start = "Chapter_Verse", Abbrv = "Abbrv")) %>% left_join(ref2Pos, by = c(end = "Chapter_Verse", Abbrv = "Abbrv")) %>% rowwise() %>% do(data.frame(Pos = if (any(is.na(.))) { NA } else { .$Pos.x:.$Pos.y } )) %>% pull(Pos)})) ## Combine the results with the reference comb_dat <- res %>% unnest(Pos) %>% left_join(ref2Pos, by = "Pos")

So each reading is parsed into multiple segments and then mapped onto the position variable (Pos) from the Reference. The result for one reading looks like this:

Code comb_dat %>% select(Reading, Section, Book, Chapter, Verse, Pos) %>% filter(Reading == "Isa 63:16b-17, 19b; 64:2-7") %>% kable() Reading Section Book Chapter Verse Pos Isa 63:16b-17, 19b; 64:2-7 MajorProphets Isaiah 63 16 22923 Isa 63:16b-17, 19b; 64:2-7 MajorProphets Isaiah 63 17 22924 Isa 63:16b-17, 19b; 64:2-7 MajorProphets Isaiah 63 19 22926 Isa 63:16b-17, 19b; 64:2-7 MajorProphets Isaiah 64 2 22928 Isa 63:16b-17, 19b; 64:2-7 MajorProphets Isaiah 64 3 22929 Isa 63:16b-17, 19b; 64:2-7 MajorProphets Isaiah 64 4 22930 Isa 63:16b-17, 19b; 64:2-7 MajorProphets Isaiah 64 5 22931 Isa 63:16b-17, 19b; 64:2-7 MajorProphets Isaiah 64 6 22932 Isa 63:16b-17, 19b; 64:2-7 MajorProphets Isaiah 64 7 22933 Visualize the Lectionary Coverage

The inspiration behind this visualization is picturing the whole Bible as a bookshelf. There is one row for each major section (e.g., Torah, Minor Prophets, Gospels) and each section is made up of multiple books. In the visualization, lines are used to indicate coverage of a particular verse in the lectionary.

Code ## Get all section lengths sections <- ref2Pos %>% select(Section, Abbrv) %>% distinct() %>% group_by(Section) %>% mutate(Abbrv = list(Abbrv)) %>% distinct() %>% deframe() sectPos <- map_int(names(sections), function(sect) { abbrvs <- sections[[sect]] lastAbbrv <- abbrvs[length(abbrvs)] maxPos <- ref2Pos %>% filter(Abbrv == lastAbbrv) %>% pull(Pos) %>% max() return(maxPos) }) names(sectPos) <- names(sections) sectPos2 <- c(0, sectPos[-length(sectPos)]) names(sectPos2) <- names(sectPos) sectLength <- sectPos - sectPos2 ## Tweak Section Labels for Plot sectLabels <- names(sections) %>% paste0(., " (V=", comma(sectLength[.]), ")") %>% str_replace("Prophets", " Prophets") %>% str_replace("NT", "New Testament") %>% setNames(names(sections)) ## Generate plotting data dat <- comb_dat %>% filter(!is.na(Pos)) %>% select(Section, Abbrv, Chapter, Verse, Pos) %>% mutate(Pos = (Pos - sectPos2[Section]) / sectLength[Section]) %>% mutate(SectionLabel = factor(sectLabels[Section], levels = sectLabels)) %>% mutate(Label = paste0(Abbrv, " ", Chapter, ":", Verse)) %>% distinct() # ignore how many times something appeared (read 2 or 3 times) book_dat <- ref2Pos %>% mutate(Abbrv = factor(Abbrv, levels = unique(Abbrv))) %>% mutate(Pos = (Pos - sectPos2[Section]) / sectLength[Section]) %>% mutate(SectionLabel = factor(sectLabels[Section], levels = sectLabels)) %>% group_by(SectionLabel, Abbrv) %>% summarize(Pos = min(Pos)) %>% ungroup() %>% group_by(SectionLabel) %>% mutate(y = rep(c(0, 0.33, 0.66, 1), length.out = n())) %>% ungroup() ## Create the plot gg <- ggplot(dat, x = 0, y = 1) + geom_vline(aes(xintercept = Pos, color = Abbrv), show.legend = FALSE) + geom_label(data = book_dat, aes(x = Pos, y = y, label = Abbrv), hjust = 0, size = 3.5) + scale_x_continuous(labels = percent) + scale_y_continuous(labels = NULL, breaks = NULL) + xlab("") + ylab("") + ggtitle(paste("Bible Coverage of Catholic Lectionary for", "Sundays and Major Feasts", sep = '\n')) + labs(caption = "Three Year Cycle") + facet_grid(SectionLabel ~ ., scales = "fixed") + getBaseTheme() + theme(strip.text = element_text(size = 10, face = "bold")) plot(gg)

Code GetCoverage <- function(abbrv) { tot <- ref2Pos %>% filter(Abbrv == abbrv) %>% select(Chapter_Verse) %>% distinct() %>% nrow() cov <- dat %>% filter(Abbrv == abbrv) %>% select(Pos) %>% distinct() %>% nrow() return(cov/tot) }

The amount of white space clearly shows what parts of the Bible are not contained in the lectionary. For example, in the New Testament section (bottom row of the “bookshelf”), it is apparent that little of the book of Acts (coverage = 17%) or Revelation (coverage = 10%) is in the Sunday lectionary.

While most of the Old Testament is sparsely covered, the Psalms (abbreviated Ps) are one of the most covered books at 21% coverage. One of the drawbacks of this visualization is that each row contains a different number of verses so that the visual sense of “fullness” can be deceiving.

Acknowledgments

Huge kudos to Felix Just, S.J., Ph.D. for compiling the resources linked throughout this post.


Maps with pie charts on top of each administrative division: an example with Luxembourg’s elections data

Sat, 10/27/2018 - 02:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)


Abstract

You can find the data used in this blog post here: https://github.com/b-rodrigues/elections_lux

This is a follow up to a previous blog post
where I extracted the data of the 2018 Luxembourgish elections from Excel Workbooks.
Now that I have the data, I will create a map of Luxembourg by commune, with pie charts of the
results on top of each commune! To do this, I use good ol’ {ggplot2} and another package
called {scatterpie}. As a bonus, I have added the code to extract the data from the 2013
elections from Excel. You’ll find this code in the appendix at the end of the blog post.

Introduction

Before importing the data for the elections of 2018, let’s install some packages:

install.packages('rgeos', type='source') # Dependency of rgdal install.packages('rgdal', type='source') # To read in the shapefile

These packages might be very tricky to install on OSX and Linux, but they’re needed to import the
shapefile of the country, which is needed to draw a map. So to make things easier, I have
created an rds object, from the shapefile of Luxembourg, that you can import natively in R without
needing these two packages. But if you want to use them, here is how:

library("rgdal") communes <- readOGR("Limadmin_SHP/LIMADM_COMMUNES.shp")

By the way, you can download the shapefile for Luxembourg here.

I’ll use my shapefile though (that you can download from the same github repo as the data):

communes_df <- readRDS("commune_shapefile.rds")

Here’s what it looks like:

head(communes_df) ## long lat order hole piece group id ## 1 91057.65 101536.6 1 FALSE 1 Beaufort.1 Beaufort ## 2 91051.79 101487.3 2 FALSE 1 Beaufort.1 Beaufort ## 3 91043.43 101461.7 3 FALSE 1 Beaufort.1 Beaufort ## 4 91043.37 101449.8 4 FALSE 1 Beaufort.1 Beaufort ## 5 91040.42 101432.1 5 FALSE 1 Beaufort.1 Beaufort ## 6 91035.44 101405.6 6 FALSE 1 Beaufort.1 Beaufort

Now let’s load some packages:

library("tidyverse") library("tidyxl") library("ggplot2") library("scatterpie")

Ok, now, let’s import the elections results data, which is the output of
last week’s blog post:

elections <- read_csv("elections_2018.csv") ## Parsed with column specification: ## cols( ## Party = col_character(), ## Year = col_integer(), ## Variables = col_character(), ## Values = col_double(), ## locality = col_character(), ## division = col_character() ## )

I will only focus on the data at the commune level, and only use the share of votes for each party:

elections_map <- elections %>% filter(division == "Commune", Variables == "Pourcentage")

Now I need to make sure that the names of the communes are the same between the elections data
and the shapefile. Usual suspects are the “Haute-Sûre” and the “Redange-sur-Attert” communes,
but let’s take a look:

locality_elections <- unique(elections_map$locality) locality_shapefile <- unique(communes_df$id) setdiff(locality_elections, locality_shapefile) ## [1] "Lac de la Haute-Sûre" "Redange Attert"

Yep, exactly as expected. I’ve had problems with the names of these two communes in the past already.
Let’s rename these two communes in the elections data:

elections_map <- elections_map %>% mutate(commune = case_when(locality == "Lac de la Haute-Sûre" ~ "Lac de la Haute Sûre", locality == "Redange Attert" ~ "Redange", TRUE ~ locality))

Now, I can select the relevant columns from the shapefile:

communes_df <- communes_df %>% select(long, lat, commune = id)

and from the elections data:

elections_map <- elections_map %>% select(commune, Party, Variables, Values) Plotting the data on a map

Now, for the type of plot I want to make, using the {scatterpie} package, I need the data to be
in the wide format, not long. For this I will use tidyr::spread():

elections_map <- elections_map %>% spread(Party, Values)

This is how the data looks now:

glimpse(elections_map) ## Observations: 102 ## Variables: 10 ## $ commune "Beaufort", "Bech", "Beckerich", "Berdorf", "Bertr... ## $ Variables "Pourcentage", "Pourcentage", "Pourcentage", "Pour... ## $ ADR 0.12835106, 0.09848661, 0.08596748, 0.16339234, 0.... ## $ CSV 0.2426239, 0.2945285, 0.3004751, 0.2604552, 0.2902... ## $ `déi gréng` 0.15695672, 0.21699651, 0.24072721, 0.15619529, 0.... ## $ `déi Lénk` 0.04043732, 0.03934808, 0.05435776, 0.02295273, 0.... ## $ DP 0.15875393, 0.19394645, 0.12899689, 0.15444466, 0.... ## $ KPL 0.015875393, 0.006519208, 0.004385164, 0.011476366... ## $ LSAP 0.11771754, 0.11455180, 0.08852549, 0.16592103, 0.... ## $ PIRATEN 0.13928411, 0.03562282, 0.09656496, 0.06516242, 0....

For this to work, I need two datasets; one to draw the map (commune_df) and one to draw the
pie charts over each commune, with the data to draw the charts, but also the position of where I
want the pie charts. For this, I will compute the average of the longitude and latitude, which
should be good enough:

scatterpie_data <- communes_df %>% group_by(commune) %>% summarise(long = mean(long), lat = mean(lat))

Now, let’s join the two datasets:

final_data <- left_join(scatterpie_data, elections_map, by = "commune")

I have all the ingredients to finally plot the data:

ggplot() + geom_polygon(data = communes_df, aes(x = long, y = lat, group = commune), colour = "grey", fill = NA) + geom_scatterpie(data = final_data, aes(x=long, y=lat, group=commune), cols = c("ADR", "CSV", "déi gréng", "déi Lénk", "DP", "KPL", "LSAP", "PIRATEN")) + labs(title = "Share of total vote in each commune, 2018 elections") + theme_void() + theme(legend.position = "bottom", legend.title = element_blank(), legend.text = element_text(colour = "white"), plot.background = element_rect("#272b30"), plot.title = element_text(colour = "white")) + scale_fill_manual(values = c("ADR" = "#009dd1", "CSV" = "#ee7d00", "déi gréng" = "#45902c", "déi Lénk" = "#e94067", "DP" = "#002a54", "KPL" = "#ff0000", "LSAP" = "#ad3648", "PIRATEN" = "#ad5ea9"))

Not too bad, but we can’t really read anything from the pie charts. I will now make their size
proportional to the number of voters in each commune. For this, I need to go back to the Excel
sheets, and look for the right cell:

It will be easy to extract this info. It is located in cell “E5”:

elections_raw_2018 <- xlsx_cells("leg-2018-10-14-22-58-09-737.xlsx") electors_commune <- elections_raw_2018 %>% filter(!(sheet %in% c("Le Grand-Duché de Luxembourg", "Centre", "Est", "Nord", "Sud", "Sommaire"))) %>% filter(address == "E5") %>% select(sheet, numeric) %>% rename(commune = sheet, electors = numeric)

I can now add this to the data:

final_data <- final_data %>% full_join(electors_commune) %>% mutate(log_electors = log(electors) * 200) ## Joining, by = "commune"

In the last line, I create a new column called log_electors that I then multiply by 200. This
will be useful later.

Now I can add the r argument inside the aes() function on the third line, to make the pie chart
size proportional to the number of electors in that commune:

ggplot() + geom_polygon(data = communes_df, aes(x = long, y = lat, group = commune), colour = "grey", fill = NA) + geom_scatterpie(data = final_data, aes(x=long, y=lat, group = commune, r = electors), cols = c("ADR", "CSV", "déi gréng", "déi Lénk", "DP", "KPL", "LSAP", "PIRATEN")) + labs(title = "Share of total vote in each commune, 2018 elections") + theme_void() + theme(legend.position = "bottom", legend.title = element_blank(), legend.text = element_text(colour = "white"), plot.background = element_rect("#272b30"), plot.title = element_text(colour = "white")) + scale_fill_manual(values = c("ADR" = "#009dd1", "CSV" = "#ee7d00", "déi gréng" = "#45902c", "déi Lénk" = "#182024", "DP" = "#002a54", "KPL" = "#ff0000", "LSAP" = "#ad3648", "PIRATEN" = "#ad5ea9")) ## Warning: Removed 32 rows containing non-finite values (stat_pie).

Ok, that was not a good idea! Perhaps the best option would be to have one map per circonscription.
For this, I need the list of communes by circonscription. This is available on Wikipedia. Here are
the lists:

centre <- c("Bissen", "Colmar-Berg", "Fischbach", "Heffingen", "Larochette", "Lintgen", "Lorentzweiler", "Mersch", "Nommern", "Helperknapp", "Bertrange", "Contern", "Hesperange", "Luxembourg", "Niederanven", "Sandweiler", "Schuttrange", "Steinsel", "Strassen", "Walferdange", "Weiler-la-Tour") est <- c("Beaufort", "Bech", "Berdorf", "Consdorf", "Echternach", "Rosport-Mompach", "Waldbillig", "Betzdorf", "Biwer", "Flaxweiler", "Grevenmacher", "Junglinster", "Manternach", "Mertert", "Wormeldange","Bous", "Dalheim", "Lenningen", "Mondorf-les-Bains", "Remich", "Schengen", "Stadtbredimus", "Waldbredimus") nord <- c("Clervaux", "Parc Hosingen", "Troisvierges", "Weiswampach", "Wincrange", "Bettendorf", "Bourscheid", "Diekirch", "Erpeldange-sur-Sûre", "Ettelbruck", "Feulen", "Mertzig", "Reisdorf", "Schieren", "Vallée de l'Ernz", "Beckerich", "Ell", "Grosbous", "Préizerdaul", "Rambrouch", "Redange", "Saeul", "Useldange", "Vichten", "Wahl", "Putscheid", "Tandel", "Vianden", "Boulaide", "Esch-sur-Sûre", "Goesdorf", "Kiischpelt", "Lac de la Haute Sûre", "Wiltz", "Winseler") sud <- c("Dippach", "Garnich", "Käerjeng", "Kehlen", "Koerich", "Kopstal", "Mamer", "Habscht", "Steinfort", "Bettembourg", "Differdange", "Dudelange", "Esch-sur-Alzette", "Frisange", "Kayl", "Leudelange", "Mondercange", "Pétange", "Reckange-sur-Mess", "Roeser", "Rumelange", "Sanem", "Schifflange") circonscriptions <- list("centre" = centre, "est" = est, "nord" = nord, "sud" = sud)

Now, I can make one map per circonscription. First, let’s split the data sets by circonscription:

communes_df_by_circonscription <- circonscriptions %>% map(~filter(communes_df, commune %in% .)) final_data_by_circonscription <- circonscriptions %>% map(~filter(final_data, commune %in% .))

By using pmap(), I can reuse the code to generate the plot for each element of the two lists.
This is nice because I do not need to copy and paste the code 4 times:

pmap(list(x = communes_df_by_circonscription, y = final_data_by_circonscription, z = names(communes_df_by_circonscription)), function(x, y, z){ ggplot() + geom_polygon(data = x, aes(x = long, y = lat, group = commune), colour = "grey", fill = NA) + geom_scatterpie(data = y, aes(x=long, y=lat, group = commune), cols = c("ADR", "CSV", "déi gréng", "déi Lénk", "DP", "KPL", "LSAP", "PIRATEN")) + labs(title = paste0("Share of total vote in each commune, 2018 elections for circonscription ", z)) + theme_void() + theme(legend.position = "bottom", legend.title = element_blank(), legend.text = element_text(colour = "white"), plot.background = element_rect("#272b30"), plot.title = element_text(colour = "white")) + scale_fill_manual(values = c("ADR" = "#009dd1", "CSV" = "#ee7d00", "déi gréng" = "#45902c", "déi Lénk" = "#182024", "DP" = "#002a54", "KPL" = "#ff0000", "LSAP" = "#ad3648", "PIRATEN" = "#ad5ea9")) } ) ## $centre

## ## $est

## ## $nord

## ## $sud

I created an anonymous function of three arguments, x, y and z. If you are unfamiliar with pmap(), study the above code closely; there is also a minimal toy example below. If you have questions, do not hesitate to reach out!
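Here is a tiny self-contained sketch (toy values, nothing to do with the election data) of how pmap() walks through several lists in parallel, matching arguments by name:

library(purrr)

# The i-th element of each list is passed to the function as x, y and z
pmap(
  list(x = c(1, 2, 3), y = c(10, 20, 30), z = c("a", "b", "c")),
  function(x, y, z) paste0(z, ": ", x + y)
)
## [[1]]
## [1] "a: 11"
##
## [[2]]
## [1] "b: 22"
##
## [[3]]
## [1] "c: 33"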

The pie charts are still quite small, but if I try to change the size of the pie charts,
I’ll have the same problem as before: inside the same circonscription, some communes really have a
lot of electors, and some a very small number. Perhaps I can try with the log of the electors?

pmap(list(x = communes_df_by_circonscription, y = final_data_by_circonscription, z = names(communes_df_by_circonscription)), function(x, y, z){ ggplot() + geom_polygon(data = x, aes(x = long, y = lat, group = commune), colour = "grey", fill = NA) + geom_scatterpie(data = y, aes(x=long, y=lat, group = commune, r = log_electors), cols = c("ADR", "CSV", "déi gréng", "déi Lénk", "DP", "KPL", "LSAP", "PIRATEN")) + labs(title = paste0("Share of total vote in each commune, 2018 elections for circonscription ", z)) + theme_void() + theme(legend.position = "bottom", legend.title = element_blank(), legend.text = element_text(colour = "white"), plot.background = element_rect("#272b30"), plot.title = element_text(colour = "white")) + scale_fill_manual(values = c("ADR" = "#009dd1", "CSV" = "#ee7d00", "déi gréng" = "#45902c", "déi Lénk" = "#182024", "DP" = "#002a54", "KPL" = "#ff0000", "LSAP" = "#ad3648", "PIRATEN" = "#ad5ea9")) } ) ## $centre

## ## $est

## ## $nord ## Warning: Removed 16 rows containing non-finite values (stat_pie).

## ## $sud

This looks better now!

Conclusion

Having data in a machine readable format is really important. The amount of code I had to write
to go from the Excel Workbooks that contained the data to these plots is quite large, but if the
data was in a machine readable format to start with, I could have focused on the plots immediately.

The good thing is that I got to practice my skills and discovered {scatterpie}!

If you found this blog post useful, you might want to follow me on twitter
for blog post updates.

Appendix

The following lines of code extract the data (from the 2013 elections) from the Excel Workbooks
that can be found on the Luxembourgish Open Data Portal.

I will not comment on them, as they work in a similar way to the previous blog post where I extracted the data from the 2018 elections. The only difference is that the sheet with the national-level data was totally different, so I did not extract it. The first reason is that I don’t need it for this blog post; the second is that I was lazy. For me, those are two pretty good reasons not to do something. If you have a question concerning the code below, don’t hesitate to reach out though!

library("tidyverse") library("tidyxl") library("brotools") path <- Sys.glob("content/blog/2013*xlsx")[-5] elections_raw_2013 <- map(path, xlsx_cells) %>% map(~filter(., sheet != "Sommaire")) elections_sheets_2013 <- map(map(path, xlsx_sheet_names), ~`%-l%`(., "Sommaire")) list_targets <- list("Centre" = seq(9, 32), "Est" = seq(9, 18), "Nord" = seq(9, 20), "Sud" = seq(9, 34)) position_parties_national <- seq(1, 24, by = 3) extract_party <- function(dataset, starting_col, target_rows){ almost_clean <- dataset %>% filter(row %in% target_rows) %>% filter(col %in% c(starting_col, starting_col + 1)) %>% select(character, numeric) %>% fill(numeric, .direction = "up") %>% filter(!is.na(character)) party_name <- almost_clean$character[1] almost_clean$character[1] <- "Pourcentage" almost_clean$party <- party_name colnames(almost_clean) <- c("Variables", "Values", "Party") almost_clean %>% mutate(Year = 2013) %>% select(Party, Year, Variables, Values) } # Treat one district extract_district <- function(dataset, sheets, target_rows, position_parties_national){ list_data_districts <- map(sheets, ~filter(.data = dataset, sheet == .)) elections_districts_2013 <- map(.x = list_data_districts, ~map_df(position_parties_national, extract_party, dataset = .x, target_rows = target_rows)) map2(.y = elections_districts_2013, .x = sheets, ~mutate(.y, locality = .x, division = "Commune", Year = "2013")) %>% bind_rows() } elections_2013 <- pmap_dfr(list(x = elections_raw_2013, y = elections_sheets_2013, z = list_targets), function(x, y, z){ map_dfr(position_parties_national, ~extract_district(dataset = x, sheets = y, target_rows = z, position_parties_national = .)) }) # Correct districts elections_2013 <- elections_2013 %>% mutate(division = case_when(locality == "CENTRE" ~ "Electoral district", locality == "EST" ~ "Electoral district", locality == "NORD" ~ "Electoral district", locality == "SUD" ~ "Electoral district", TRUE ~ division)) var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Gold-Mining Week 8 (2018)

Fri, 10/26/2018 - 19:20

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Week 8 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

The post Gold-Mining Week 8 (2018) appeared first on Fantasy Football Analytics.


Spotlight on Julia Silge, Keynote Speaker EARL Seattle 7th November

Fri, 10/26/2018 - 16:27

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

Julia is a Data Scientist at Stack Overflow, has a PhD in astrophysics and an abiding love for Jane Austen (which we totally understand!). Before moving into Data Science and discovering R, Julia worked in academia and ed tech, and was a NASA Datanaut. She enjoys making beautiful charts, programming in R, text mining, and communicating about technical topics with diverse audiences. In fact, she loves R and text mining so much, she literally wrote the book on it: Text Mining with R: A Tidy Approach!

Lovely to speak to you Julia, could you give us a bit of a background around the work that you do? 

The open source work I do focuses on building a bridge between the tidyverse ecosystem of tools and the real world text data that so many of us need to use in our organizations, so we can use powerful, well-designed tidy tools with text data. In my day job, I work at Stack Overflow, using statistics and machine learning to make our site the best place for people who code to learn and share knowledge online, and to help our clients who want to engage with developers be successful.

What led to your career path?

My academic background is in physics and astronomy, where I was an observational astronomer who spent my time “in the trenches” with real-life data. Also, I’ve been heavily involved in education in various forms for a long time, whether speaking, teaching, writing, or otherwise. All of this together informs how I do data science, because a huge part of what I do is communicate with people about what a complex data analysis means. The fact that I analyze some dataset or train some machine learning model is great, but if I can’t explain it to my business partners, then we can’t make decisions.

Could you tell us what to expect from the content of your talk? And are there any key takeaway advice or tips that delegates will come away with?

Many R users working in fields from healthcare to finance to tech deal with messy text data (this includes me at Stack Overflow!); my talk focuses on a practical, flexible approach to use this text data to gain insight and make better decisions.

Can you give an example?

Folks at EARL can expect my talk to start with the fundamentals of exploratory data analysis for text. EDA is a fruitful and important part of the data science process, and in my own work, I know how much bang for the buck I get when I am deliberate about EDA strategies. We won’t stop there, though! We will also cover how to use tidy data principles for supervised and unsupervised machine learning for text.

What inspired you to write your book Text Mining with R – A Tidy Approach?

The book that my collaborator Dave and I wrote together grew organically out of the work we were doing in this space. We started by developing long-form documentation for our R package, invested more time in laying out best practices in workflows through blog posts, and eventually brought a book’s worth of content together in one cohesive, organized place.

Tell us about the type of work you get involved with on a day to day basis.

In my day job at Stack Overflow, I work on two main categories of questions. The first is centered on the ways that we directly generate revenue, through partnering with clients who want to hire, engage with, and enable the world’s developers. The second (which is of course connected to the first) is centered on the public Q&A community of Stack Overflow and the other Stack Exchange sites; I work on questions around how technologies are related to each other and changing, how to scaffold question askers to success, and how to make Stack Overflow more welcoming and inclusive.

What work do you do with the wider data science community and how do you see it evolving?

In my open source work, I maintain my own R packages, blog and speak about data analysis practices and share resources about data science and tech via social media. I have some ideas for new work I am excited about pursuing soon! I would love to evolve my data science work to more fully support best practices in machine learning for text. Another area that I want to continue to invest energy in, both in my day job and community work, is moving data science and tech toward more just and inclusive practices.

Come and see Julia and be inspired by her love for text mining and tidyverse applications at EARL Seattle on 7th November. We are really looking forward to the conference programme in Seattle, Houston and Boston.

Tickets can still be purchased here.


Simulating simple dice games by @ellis2013nz

Fri, 10/26/2018 - 15:00

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

A necessary but not sufficient condition for a game being one of skill rather than pure chance is that the player gets to make choices. If the game play is fully automatic, as in standard Snakes and Ladders, then there cannot possibly be any skill involved.

However, any game of pure chance can be converted to one of skill simply by adding a wager, or a similar tool that sets up decisions for players, such as the doubling cube in backgammon (backgammon of course is not a game of pure chance even just in the checker play, but it’s the doubling cube that brings in most of the skill). This is one of the reasons why so much effort over the centuries has gone into understanding the probabilities of elementary card and dice games; leveraging the probabilities into a gamble turns a game of low or zero skill into something much more interesting.

Note – if you’re reading this on R Bloggers, the graphics aren’t showing up for a reason I need time to troubleshoot. Read the original on Free Range Statistics to get the graphics and formulae.

A simple game

Imagine a simple dice game familiar from “intro to probability” classes, where players A and B take turns, with A starting, and the first to roll a six wins. No choices are involved.

Computing the probability of A winning is a classic exercise in probability pedagogy with an elegant derivation. At the beginning of the game, A obviously has a 1/6 probability of winning straight away, and a 5/6 probability of being disappointed for now and giving the dice to B. At that point, B has 1/6 probability of winning straight away, and a 5/6 probability of having to give the dice back to A… So if p is the probability of winning given you are holding the dice, simply solve:

$$p = \frac{1}{6}\cdot 1+\frac{5}{6}(1-p)$$

And some high school algebra gives the answer as $$p = \frac{6}{11}$$.

So if Player A can convince B to have an even odds bet (“let’s both put in a dollar and whoever wins gets it all”), and to keep playing the game all night (with A always being allowed to start), they’ll come out on average about 9c better for each round they’ve played.
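Spelling that out, Player A’s expected gain per one-dollar round is

$$E = 1 \cdot \frac{6}{11} + (-1) \cdot \frac{5}{11} = \frac{1}{11} \approx 0.09$$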

The converse of “no choices means no skill” doesn’t hold true; just having choices doesn’t mean there is skill involved. Consider if, perhaps as part of distracting B from the scam, A introduces a variant. Before each roll, the player has to decide whether to use a red or a blue die. We have choice, but no impact on the game; it remains a game of pure chance. Doubtless silly superstitions (“red always works for me after midnight!”) and rituals (“baby needs new shoes!”) will evolve, and possibly links to political preferences, but the maths of the game doesn’t change.
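A quick simulation sketch (my own check, not from the original post) confirms the point: whichever die is chosen, each turn is still an independent 1/6 chance of rolling a six, so Player A still wins about 6/11 of the time.

set.seed(42)
sim_game <- function() {
  roll <- 0
  repeat {
    roll <- roll + 1
    die_colour <- sample(c("red", "blue"), 1)  # the "choice" -- never affects the outcome
    if (sample(1:6, 1) == 6) return(roll)      # first six ends the game
  }
}
game_lengths <- replicate(50000, sim_game())
mean(game_lengths %% 2 == 1)                   # Player A's share of wins, close to 6/11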

My hunch is that for psychological reasons players will focus on the things they can control unless they have discipline of iron. I imagine this has been researched, but looking into that will be for another day.

Simulation

Games are rarely as simple as the example above. Game designers and players have a knack for introducing complications – I believe with good reason – which mean that direct calculations are often not possible. For instance, I am working on a post on Snakes and Ladders (that’s “Chutes and Ladders” for some, due to an amusing historical concern that snakes on the board were too frightening for American children), which shows that while it is all very elegant to model it with an absorptive Markov chain as per the research literature, this can only be practically done with simplified rules. Introduce the “3 sixes in a row and it’s back to square one” rule and things get much, much more complicated.

For even modestly complicated games, simulations will always trump analytical (ie pen and paper) solutions. Before we complicate our dice game, let’s simulate the simple version to see if we can replicate the analytical result.

There are plenty of ways to do this, but an efficient way with this super-simple game is to generate a large collection of dice rolls all at once, mark the wins and work out who won by the game length (odd number of rolls means A won). This gives us, in addition to the correct average result, a nice visualisation of the geometric distribution of game lengths:
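Concretely, the game length $$K$$ follows a geometric distribution,

$$P(K = k) = \left(\frac{5}{6}\right)^{k-1}\frac{1}{6}, \qquad k = 1, 2, \dots$$

and Player A wins exactly when $$K$$ is odd.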

Here’s the R code for that.

library(tidyverse) library(scales) n <- 1000000 dice <- sample(1:6, n, replace = TRUE) wins <- which(dice == 6) results <- data.frame(game_length = diff(c(0, wins))) %>% mutate(who_won = ifelse(game_length %% 2 == 1, "Player A", "Player B")) rs <- results %>% group_by(who_won) %>% summarise(freq = n()) %>% ungroup() %>% mutate(prop = freq / sum(freq)) ggplot(results, aes(x = game_length, fill = who_won)) + geom_histogram(binwidth = 1) + ggtitle("Results of an alternating dice roll game", paste0("First to roll a six wins; Starting player wins ", round(rs[1, "prop"], 2), " of the time")) + scale_y_continuous(label = comma) + labs(x = "Game length", fill = "Winner:", y = paste("Number of wins out of", format(n, big.mark = ",", scientific = FALSE))) A more complex game

Let’s complicate our game: you now win not only by throwing a six, but also by matching your opponent’s previous throw.

This will swing the odds towards Player B, because while A has only a 1/6 chance of winning on their first roll, when Player B gets their first turn they have a 1/3 chance (they win either by rolling a six or by rolling whatever A just rolled).

This situation is still simple enough to model analytically, but not (for me) in my head, so we’re getting towards territory where a simulation wins out. Actually, I did work this one out analytically to make sure I got the simulation right, so let’s think that through first. We can calculate the probability of the current holder of the dice winning, from the second roll onwards. This is the same method as for the simpler game, but with a 1/3 chance of winning on this roll rather than 1/6.

p_b = \frac{1}{3} \cdot 1 + \frac{2}{3}(1 - p_b)

The same maths as before gives us a 0.6 probability of winning if you hold the dice, from the second roll onwards. Then we go back to the situation at roll one for Player A and we have:

p_a = \frac{1}{6} \cdot 1 + \frac{5}{6}(1 - p_b)

which yields exactly 0.5 as Player A’s chance at the beginning of the game.
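As a sanity check before simulating, the two equations can be plugged straight into R (a sketch of mine, not from the post):

p_b <- 3 / 5                        # chance of winning when you hold the dice, from roll two onwards
p_a <- 1/6 * 1 + 5/6 * (1 - p_b)    # Player A's chance at the very start of the game
p_a                                 # exactly 0.5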

This game is markedly more complicated to simulate than the first one. I tried to use the same approach of generating a big vector of dice rolls, then using the tidyverse and extensive use of lag() to implement the game logic, but it was simply too awkward. The complications were issues such as runs where the dice roll is the same 3 or 4 times in a row: identifying, in a data frame with the dice roll as a column, exactly which rows marked a win and which ones needed to start the game again. I thought I’d cracked it several times only to discover bugs in rare situations (e.g. 5 identical non-six numbers in a row). That’s not good for such a simple game. So I ended up with a completely different and conceptually simpler approach, closer to how humans play the game, of writing a function to play a round of the game. Some people (not me) would regard this as less R-idiomatic because it doesn’t vectorize the whole thing, but it’s much more generalizable to more complex games, which is where I’ll be heading in future posts.

(Trust me, this is going somewhere – by the end of this series I have R successfully playing a 1970s text based computer game with the help of machine learning.)

That gives us this nice result:

#' Roll a dice and return the game 2 results for one round
#'
#' @param last_roll the previous roll of the dice. If NA, this means we are at the beginning of the game
#' @return a list with two elements: whether it was a win based on the rule of 6, or matching the last roll;
#'   and what the latest roll of the dice is
diceroll <- function(last_roll){
  this_roll <- sample(1:6, 1)

  win <- this_roll %in% c(6, last_roll)

  return(list(
    win = win,
    this_roll = this_roll)
  )
}

#' Main cycle for playing "Game 2"
#'
#' @return the number of rolls it took to win the game
dicegame <- function(){
  i <- 0
  rolls <- NA
  win <- FALSE
  while(!win){
    i <- i + 1
    dr <- diceroll(rolls[length(rolls)])
    win <- dr$win
    rolls <- c(rolls, dr$this_roll)
  }
  return(i)
}

# play the game n times and record how long each game lasted
game_length <- sapply(1:n, function(x){dicegame()})

# an odd game length means Player A (the starter) made the winning roll
results <- data.frame(game_length = game_length) %>%
  mutate(who_won = ifelse(game_length %% 2 == 1, "Player A", "Player B"))

rs <- results %>%
  group_by(who_won) %>%
  summarise(freq = n()) %>%
  ungroup() %>%
  mutate(prop = freq / sum(freq))

svg("../img/0137-game2-results.svg", 8, 4)
ggplot(results, aes(x = game_length, fill = who_won)) +
  geom_histogram(binwidth = 1) +
  ggtitle("Results of an alternating dice roll game",
          paste0("First to roll a six or to match the last roll wins; Starting player wins ",
                 round(rs[1, "prop"], 2), " of the time")) +
  scale_y_continuous(label = comma) +
  labs(x = "Game length", fill = "Winner:",
       y = paste("Number of wins out of", format(n, big.mark = ",", scientific = FALSE)))
dev.off()  # close the svg device so the file is written

Key points

Some basic points I’ll be drawing on again in future posts:

  • games with no choices are games of chance
  • games of pure chance do not imply equal chances of winning
  • any game of chance can be converted to a complex game of skill by adding gambling
  • simulation is useful for understanding gaming probabilities
  • simulation of complex games is likely to need to mimic the game process (e.g. turn by turn) rather than elegant mathematical simplifications – precisely because game designers introduce ad hoc complications to their games.

To leave a comment for the author, please follow the link and comment on their blog: free range statistics.
