R-bloggers: R news and tutorials contributed by 750 R bloggers

To eat or not to eat! That’s the question? Measuring the association between categorical variables

Thu, 06/01/2017 - 02:00

(This article was first published on Stories Data Speak, and kindly contributed to R-bloggers)

1. Introduction

I serve as a reviewer for several ISI- and Scopus-indexed journals in Information Technology. Recently, while reviewing an article, I noticed that the researchers had made a critical mistake in their data analysis: they converted the original categorical data to continuous data without any rigorous statistical treatment, and without justifying the resulting loss of information. My motivation for this study is borne out of their error.

We know the standard association measure between continuous variables is the product-moment correlation coefficient introduced by Karl Pearson. This measure quantifies the degree of linear association between continuous variables; it is normalized to lie between -1 and +1, and it is symmetric: the correlation between variables x and y is the same as that between y and x. The best-known association measure between two categorical variables is probably the chi-square measure, also introduced by Karl Pearson. Like the product-moment correlation coefficient, this association measure is symmetric, but it is not normalized. This lack of normalization provides one motivation for Cramér's V, defined as the square root of a normalized chi-square value; the resulting association measure varies between 0 and 1 and is conveniently available via the assocstats function in the vcd package. An interesting alternative to Cramér's V is Goodman and Kruskal's tau, which is not nearly as well known and is asymmetric. This asymmetry arises because the tau measure is based on the fraction of variability in the categorical variable y that can be explained by the categorical variable x.
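Both ideas are easy to try in a few lines of R. The sketch below computes Cramér's V from a base-R chi-squared statistic (the helper name is mine; vcd::assocstats reports the same quantity directly):

```r
# A minimal sketch of Cramér's V: the square root of the chi-squared
# statistic normalized by n * (min(rows, cols) - 1), so it lies in [0, 1].
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n    <- sum(tab)
  k    <- min(nrow(tab), ncol(tab))
  as.numeric(sqrt(chi2 / (n * (k - 1))))
}

# Perfectly associated variables give V = 1
x <- rep(c("a", "b"), each = 10)
cramers_v(x, x)  # 1
```

Like the chi-square measure itself, V is symmetric: cramers_v(x, y) equals cramers_v(y, x).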

The data for this study is sourced from the UCI Machine Learning repository. As its data information section states: “This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom.”

Furthermore, the research questions I want to explore are:

  • Is a significance test enough to justify a hypothesis?
  • How can we measure associations between categorical predictors?
2. Making data management decisions

As a first step, I imported the data into the R environment as follows:

# Import data from UCI ML repo
> theURL <- ""
# Explicitly adding the column headers from the data dictionary
> mushrooms <- read.csv(file = theURL, header = FALSE, sep = ",",
                        strip.white = TRUE, stringsAsFactors = TRUE,
                        col.names = c("class","cap-shape","cap-surface","cap-color","bruises",
                                      "odor","gill-attachment","gill-spacing","gill-size",
                                      "gill-color","stalk-shape","stalk-root","stalk-surface-above-ring",
                                      "stalk-surface-below-ring","stalk-color-above-ring","stalk-color-below-ring",
                                      "veil-type","veil-color","ring-number","ring-type","spore-print-color",
                                      "population","habitat"))

Next, I quickly summarize the dataset to get a brief glimpse. Readers should note that the data has no missing values.

# Calculate number of levels for each variable
> mushroom.levels <- cbind.data.frame(Variable = names(mushrooms),
                                      Total_Levels = sapply(mushrooms, function(x){ as.numeric(length(levels(x))) }))
> print(mushroom.levels)
                                         Variable Total_Levels
class                                       class            2
cap.shape                               cap.shape            5
cap.surface                           cap.surface            3
cap.color                               cap.color           10
bruises                                   bruises            2
odor                                         odor            8
gill.attachment                   gill.attachment            1
gill.spacing                         gill.spacing            2
gill.size                               gill.size            3
gill.color                             gill.color           10
stalk.shape                           stalk.shape            3
stalk.root                             stalk.root            6
stalk.surface.above.ring stalk.surface.above.ring            4
stalk.surface.below.ring stalk.surface.below.ring            5
stalk.color.above.ring     stalk.color.above.ring            7
stalk.color.below.ring     stalk.color.below.ring            8
veil.type                               veil.type            2
veil.color                             veil.color            2
ring.number                           ring.number            3
ring.type                               ring.type            5
spore.print.color               spore.print.color            7
population                             population            7
habitat                                   habitat            8

As we can see, the variable gill.attachment has just one level, so it makes no meaningful contribution to predicting the target class; I therefore drop it.

# dropping the variable with constant variance
> mushrooms$gill.attachment <- NULL

The different levels are uninterpretable in their current format. I will use the data dictionary and recode the levels into meaningful names.

> levels(mushrooms$class) <- c("edible","poisonous")
> levels(mushrooms$cap.shape) <- c("bell","conical","flat","knobbed","sunken","convex")
> levels(mushrooms$cap.surface) <- c("fibrous","grooves","smooth","scaly")
> levels(mushrooms$cap.color) <- c("buff","cinnamon","red","gray","brown","pink","green","purple","white","yellow")
> levels(mushrooms$bruises) <- c("bruisesno","bruisesyes")
> levels(mushrooms$odor) <- c("almond","creosote","foul","anise","musty","nosmell","pungent","spicy","fishy")
> levels(mushrooms$gill.spacing) <- c("close","crowded")
> levels(mushrooms$gill.size) <- c("broad","narrow")
> levels(mushrooms$gill.color) <- c("buff","red","gray","chocolate","black","brown","orange","pink","green","purple","white","yellow")
> levels(mushrooms$stalk.shape) <- c("enlarging","tapering")
> table(mushrooms$stalk.root) # has a missing level coded as ?

   ?    b    c    e    r 
2480 3776  556 1120  192 

> levels(mushrooms$stalk.root) <- c("missing","bulbous","club","equal","rooted")
> levels(mushrooms$stalk.surface.above.ring) <- c("fibrous","silky","smooth","scaly")
> levels(mushrooms$stalk.surface.below.ring) <- c("fibrous","silky","smooth","scaly")
> levels(mushrooms$stalk.color.above.ring) <- c("buff","cinnamon","red","gray","brown","orange","pink","white","yellow")
> levels(mushrooms$stalk.color.below.ring) <- c("buff","cinnamon","red","gray","brown","orange","pink","white","yellow")
> levels(mushrooms$veil.type) <- c("partial")
> levels(mushrooms$veil.color) <- c("brown","orange","white","yellow")
> levels(mushrooms$ring.number) <- c("none","one","two")
> levels(mushrooms$ring.type) <- c("evanescent","flaring","large","none","pendant")
> levels(mushrooms$spore.print.color) <- c("buff","chocolate","black","brown","orange","green","purple","white","yellow")
> levels(mushrooms$population) <- c("abundant","clustered","numerous","scattered","several","solitary")
> levels(mushrooms$habitat) <- c("woods","grasses","leaves","meadows","paths","urban","waste")

3. Initial data visualization

a. Is there a relationship between cap-surface and cap-shape of a mushroom?
> p <- ggplot(data = mushrooms, aes(x = cap.shape, y = cap.surface, color = class))
> p + geom_jitter(alpha = 0.3) +
    scale_color_manual(breaks = c('edible','poisonous'), values = c('darkgreen','red'))

Fig-1: Mushroom cap-shape and cap-surface

From Fig-1 we can see that mushrooms with a flat cap shape and a scaly, smooth, or fibrous cap surface tend to be poisonous, while mushrooms with a bell, knobbed, or sunken cap shape and a fibrous, smooth, or scaly cap surface tend to be edible. In particular, a majority of flat-capped mushrooms with a scaly or smooth cap surface are poisonous.

b. Are mushroom habitat and population related?

> p <- ggplot(data = mushrooms, aes(x = population, y = habitat, color = class))
> p + geom_jitter(alpha = 0.3) +
    scale_color_manual(breaks = c('edible','poisonous'), values = c('darkgreen','red'))

Fig-2: Mushroom population and habitat

From Fig-2, we see that mushrooms which are clustered or scattered in population and live in the woods are entirely poisonous, while those living in grasses, wasteland, meadows, leaves, paths, and urban areas are edible.

c. What’s the deal with living conditions and odor?

> p <- ggplot(data = mushrooms, aes(x = habitat, y = odor, color = class))
> p + geom_jitter(alpha = 0.3) +
    scale_color_manual(breaks = c('edible','poisonous'), values = c('darkgreen','red'))

Fig-3: Mushroom habitat and odor

From Fig-3, we notice that mushrooms with a fishy, spicy, pungent, foul, musty, or creosote odor are clearly marked poisonous, irrespective of their habitat, whereas the ones with an almond or anise odor are edible. We also notice that a minority of odorless mushrooms are poisonous, while the odorless ones living in meadows are entirely poisonous.

There could be many other interesting visualizations, but I will leave those as future work.

I will now focus on exploratory data analysis.

4. Exploratory data analysis

a. Correlation detection & treatment for categorical predictors

If we look at the structure of the dataset, we notice that each variable has several factor levels, and that these levels are unordered. Such unordered categorical variables are termed nominal variables; ordered categorical variables are called ordinal variables.
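The distinction is built into R's factor type; a quick sketch (level names here are illustrative):

```r
# Nominal: unordered factor; ordinal: factor with ordered = TRUE.
nominal <- factor(c("woods", "paths", "urban"))
ordinal <- factor(c("low", "high", "mid"),
                  levels = c("low", "mid", "high"), ordered = TRUE)

is.ordered(nominal)      # FALSE
is.ordered(ordinal)      # TRUE
ordinal[1] < ordinal[2]  # TRUE: order comparisons are defined for ordinals
```

Comparing nominal factor levels with `<` is not meaningful, which is exactly why methods for ordinal or interval data cannot be applied one level down the measurement hierarchy.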

“In the measurement hierarchy, interval variables are highest, ordinal variables are next, and nominal variables are lowest. Statistical methods for variables of one type can also be used with variables at higher levels but not at lower levels.”, see Agresti

I found this cheat-sheet that can aid in determining the right kind of test to perform on categorical predictors (independent/explanatory variables). Also, this SO post is very helpful. See the answer by user gung.

For categorical variables, the concept of correlation can be understood in terms of a significance test and an effect size (strength of association).

Pearson’s chi-squared test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data. It is a significance test: given two categorical random variables, X and Y, the chi-squared test of independence determines whether or not a statistical dependence exists between them. The test posits a null hypothesis and an alternative hypothesis; the general practice is that if the resulting p-value is below a pre-determined significance level, usually 0.05, we reject the null hypothesis.

H0: The two variables are independent

H1: The two variables are dependent

In other words, the null hypothesis of the chi-squared test is that the two variables are independent, and the alternative hypothesis is that they are related.

To establish that two categorical variables (or predictors) are dependent, the chi-squared statistic must have a certain cutoff. This cutoff increases as the number of classes within the variable (or predictor) increases.
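This growth of the cutoff is easy to see with base R's qchisq, which returns the critical value of the chi-squared statistic at a given significance level (the df values below match the two tables tested later in this post):

```r
# The 5% critical value of the chi-squared statistic grows with the
# degrees of freedom, df = (r - 1) * (c - 1) for an r x c table.
qchisq(0.95, df = 1)   # ~3.84  (a 2 x 2 table)
qchisq(0.95, df = 15)  # ~25.0  (e.g. cap.shape x cap.surface)
qchisq(0.95, df = 48)  # ~65.2  (e.g. habitat x odor)
```

So the same raw chi-squared value that is decisive for a small table may be unremarkable for a table with many categories.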

In sections 3a, 3b, and 3c, I detected possible indications of dependency between variables by visualizing the predictors of interest. In this section, I will test those dependencies formally. First, I will apply the chi-squared test of independence to determine whether each dependency is significant. Thereafter, I will apply Goodman and Kruskal’s tau to measure the effect size (strength of association).

i. Pearson’s chi-squared test of independence (significance test)

> chisq.test(mushrooms$cap.shape, mushrooms$cap.surface, correct = FALSE)

	Pearson's Chi-squared test

data:  mushrooms$cap.shape and mushrooms$cap.surface
X-squared = 1011.5, df = 15, p-value < 2.2e-16

Since the p-value (< 2.2e-16) is less than the cut-off value of 0.05, we reject the null hypothesis in favor of the alternative and conclude that the variables cap.shape and cap.surface are dependent on each other.

> chisq.test(mushrooms$habitat, mushrooms$odor, correct = FALSE)

	Pearson's Chi-squared test

data:  mushrooms$habitat and mushrooms$odor
X-squared = 6675.1, df = 48, p-value < 2.2e-16

Similarly, the variables habitat and odor are dependent on each other, as the p-value (< 2.2e-16) is less than the cut-off value of 0.05.

ii. Effect size (strength of association)

A measure of association does not indicate causality; it indicates whether one variable is associated with another, and how strong that relationship is, whether weak or strong.

Since I’m dealing with nominal categorical predictors, Goodman and Kruskal’s tau is the appropriate measure. Interested readers are invited to see pages 68 and 69 of the Agresti book. More information on this measure can be found here.
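For intuition, tau(x → y) can be sketched in a few lines of base R as the proportional reduction in the variation of y obtained by knowing x (the helper name is mine; the GoodmanKruskal package computes both directions properly):

```r
# Goodman and Kruskal's tau(x -> y), a sketch:
# V(y)       = 1 - sum_j p(+j)^2          (variation of y)
# E[V(y|x)]  = 1 - sum_ij p(ij)^2 / p(i+) (expected variation of y given x)
# tau        = (V(y) - E[V(y|x)]) / V(y)
gk_tau <- function(x, y) {
  p  <- prop.table(table(x, y))
  vy <- 1 - sum(colSums(p)^2)
  ev <- 1 - sum(p^2 / rowSums(p))
  (vy - ev) / vy
}

x <- rep(c("a", "b", "c"), times = c(5, 3, 2))
gk_tau(x, x)  # 1: y is perfectly predictable from x
```

Because the formula conditions on x, swapping the arguments generally gives a different value, which is exactly the asymmetry discussed above.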

> library(GoodmanKruskal)
> varset1 <- c("cap.shape","cap.surface","habitat","odor","class")
> mushroomFrame1 <- subset(mushrooms, select = varset1)
> GKmatrix1 <- GKtauDataframe(mushroomFrame1)
> plot(GKmatrix1, corrColors = "blue")

Fig-4 shows the association plot, which is based on the corrplot library. The diagonal element K refers to the number of unique levels of each variable, while the off-diagonal elements contain the forward and backward tau measures for each variable pair. Specifically, the numerical value in each cell is the association measure τ(x, y) from the variable x indicated in the row name to the variable y indicated in the column name.

The most obvious feature of this plot is that the variable class is almost perfectly predictable from odor (τ(x, y) = 0.94), a very strong forward association. It suggests that x = odor (which has levels “almond”, “creosote”, “foul”, “anise”, “musty”, “nosmell”, “pungent”, “spicy”, “fishy”) is highly predictive of y = class (which has levels “edible”, “poisonous”): if we know a mushroom’s odor, we can quite reliably predict whether it is edible or poisonous.

On the contrary, the reverse association, from y = class to x = odor, is much weaker (τ(y, x) = 0.34): knowing whether a mushroom is edible or poisonous tells us comparatively little about its odor.

Earlier, we found that cap.shape and cap.surface are dependent on each other (chi-squared significance test). Now, let’s see whether the association is also strong. Again from Fig-4, both the forward and reverse associations suggest that x = cap.shape is weakly associated with y = cap.surface (τ(x, y) = 0.03 and τ(y, x) = 0.01). Thus, although the dependence between these two variables is statistically significant, the association is weak; it would be difficult to predict one from the other.

Many more associations can be interpreted from Fig-4; I invite interested readers to explore it further.

5. Conclusion

The primary objective of this study was to drive home the message: do not tamper with the data without providing a credible justification. I chose categorical data for this study in order to provide an in-depth treatment of the various measures that can be applied to it. From my prior reading of statistical texts, I recalled that a significance test alone is not enough justification; there had to be something more. It was then that I found out about the different types of association measures, which cleared my doubts. In my next post, I will continue the current work with inferential and predictive analysis. For interested readers, I have uploaded the complete code to my GitHub repository here.

To leave a comment for the author, please follow the link and comment on their blog: Stories Data Speak. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

My new DataCamp course: Forecasting Using R

Thu, 06/01/2017 - 02:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

For the past few months I’ve been working on a new DataCamp course teaching Forecasting using R. I’m delighted that it is now available for anyone to do.
Course blurb

Forecasting involves making predictions about the future. It is required in many situations: deciding whether to build another power generation plant in the next ten years requires forecasts of future demand; scheduling staff in a call center next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements.

To leave a comment for the author, please follow the link and comment on their blog: R on Rob J Hyndman.

let there be progress

Thu, 06/01/2017 - 01:00

(This article was first published on HighlandR, and kindly contributed to R-bloggers)

The ‘wrapr’ package for use with dplyr programming


I’m the first to admit I’m not an R expert (even duffers can blog about it, though), but when I began thinking about writing some dplyr functions to help me create and analyse run charts, I had no idea I would struggle quite so much getting to grips with Non-Standard Evaluation and Lazy Evaluation (see my last post for links and more background).

To clarify, I wanted to create flexible dplyr functions, without hardcoded grouping parameters, so I could query different SQL Server tables and apply the same functions reliably to transform and plot the data.

Eventually I sussed things out and created the functions I needed, but it was yet another example of how R makes me feel really stupid on a regular basis.
Please tell me I’m not alone with that…

I’m aware that dplyr 0.6 is imminent, and that my code will need revising as the dplyr verbs with underscores (for programmatic use) will be deprecated (at least eventually).

While I could probably install the dev version, I thought I should finally take a look at the let function in the wrapr package, as I’d seen numerous tweets giving it positive feedback.

I had a bash at recreating one of my functions – got some help directly from win-vector and was then able to rewrite all my remaining functions quickly.

I’ve put the code gist here – it should hopefully run as intended, assuming you have the current dplyr 0.5 and wrapr packages installed.

Points to note
  • I commented out a line in the wrapr let example code for this demo so that all 3 return the same output.

  • There is some nice code to handle the lack of a grouping column, courtesy of Win-Vector. For my real-world use there will always be a grouping variable, but it’s nice to build this in for future requirements.

  • I used slice instead of filter in the NSE code – I could not get it to work at all otherwise. I’m pretty convinced I’ve done something wrong here, because the function only runs if I don’t wrap the variables in quotes when I call it. For the life of me I couldn’t figure out why a grouping call would work in one NSE function and fail in another when the same syntax was being used. Bearing in mind that the work I actually WANTED to do was going to be complex enough, I was just happy that it worked.

Code examples

Day to day, interactive dplyr code :

My attempt at the whole NSE/lazy eval thing :

Wrapr code :

With the let function you create the mapping block first, then use the mapped variables in your function – and you can use the regular dplyr verbs. Once I’d seen how to set up the code, it was easy to recreate the others.
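Since the gists above may not render in a feed, here is a minimal sketch of the pattern (the function and column names are mine, not from the original post):

```r
library(dplyr)
library(wrapr)

# A flexible summary function with no hardcoded grouping column:
# let() substitutes the mapped names into the expression before the
# dplyr code is evaluated, so the regular (non-underscore) verbs work.
median_by_group <- function(df, group_col, value_col) {
  let(
    c(GROUPCOL = group_col, VALUECOL = value_col),
    df %>%
      group_by(GROUPCOL) %>%
      summarise(median_value = median(VALUECOL))
  )
}

median_by_group(mtcars, "cyl", "mpg")
```

The mapping block (the named character vector) is the only place the real column names appear, which is what makes the function reusable across different tables.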

Outputs (in same order as above code examples):

Here’s another example of using the underscore versions compared to wrapr.

This time the NSE code works with quoted variables:

And here is the wrapr equivalent using let:

From a real-world deadlines / getting-things-done perspective: if you need to create flexible dplyr functions, and you need to do it soon, take a look at wrapr to see if it fits your needs.

To leave a comment for the author, please follow the link and comment on their blog: HighlandR.

Shiny: data presentation with an extra

Thu, 06/01/2017 - 00:40

(This article was first published on, and kindly contributed to R-bloggers)

A Shiny app with three tabs presenting different sections of the same data.

Shiny is an application based on R/RStudio which enables an interactive exploration of data through a dashboard with drop-down lists and checkboxes—programming-free. The apps can be useful for both the data analyst and the public.

Shiny apps are web-based: this allows for private consultation of the data in one’s own browser as well as for online publication. Free apps can handle quite a lot of data, which can be increased with a subscription.

The target user of Shiny is extremely broad. Let’s take science—open science. At a time when openly archiving all data in public repositories is becoming standard practice, Shiny can be used to walk the extra mile by letting people tour the data at once without programming. It’s the right tool for acknowledging all aspects of the data. Needless to say, these apps do not replace raw data archiving.

The apps simply add. For instance, the data in the lower app is a little noisy, right? Indeed it shows none of the succession of waves that characterizes word reading. The app helped me in identifying this issue. Instead of running along a host of participants and electrodes through a heavy code score, I found that the drop-down lists of the app let me seamlessly surf the data. By Participant 7, though, my wave dropped me…

Those data were very poor—systematically poorer than those of any other participant. I then checked the EEG preprocessing logs, and confirmed that those data had to be discarded. So much for the analysis utility of such an app. As for the open-science utility: on discovering the fault, I kept the discarded data in the app, with a note, so that any colleagues and reviewers could consider it too. Although this example concerns a rather salient trait in the data, some other Shiny app might help to spot patterns such as individual differences or third variables.

Building a Shiny app is not difficult. Apps basically draw on data presentation code (tables, plots) that you already have. Then just add a couple of scripts to the folder: one for the user interface (named ui.R), one for the process (named server.R), and perhaps another one compiling the commands for deploying the app and checking for errors.
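The two-script split can be sketched in a single self-contained file (assumes the shiny package; the input and output names here are illustrative, not from the app described above):

```r
library(shiny)

# ui.R part: a drop-down list and a plot placeholder
ui <- fluidPage(
  selectInput("participant", "Participant", choices = 1:10),
  plotOutput("wave")
)

# server.R part: reacts to the drop-down selection
server <- function(input, output) {
  output$wave <- renderPlot({
    plot(rnorm(100), type = "l",
         main = paste("Participant", input$participant))
  })
}

# Run locally; rsconnect::deployApp() would publish it online
shinyApp(ui, server)
```

Everything the user can click lives in the ui object; everything that recomputes when they click lives in the server function.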

The steps to start a Shiny app from scratch are:

1: Tutorials. Being open-source software, the best manuals are available through a Google search.

2: User token. Signing up and reading in your private key—just once.

3: GO. Plunge into the ui and server scripts, and deployApp().

4: Bugs and logs. They are not bugs in fact—rather fancies. For instance, some special characters have to get even more special (technically, UTF-8 encoding). For a character such as μ, Shiny prefers Âμ. Just cling to error logs by calling:

showLogs(appPath = getwd(), appFile = NULL, appName = NULL, account = NULL, entries = 50, streaming = FALSE)

The log output will mention any typos and unaccepted characters, pointing to specific lines in your code.

It may take a couple of intense days to get a first app running. As usual with programming, it’s easy to run into the traps which are there to spice up the way. The app’s been around for years, so tips and tricks abound on the Internet. For greater companionship, there are dedicated Google groups, and then there’s StackExchange etc., where you can post any needs/despair. Post your code, log, and explanation, and you’ll be rescued out in a couple of days. Long live those contributors.

It will often be enough to upload a bare app, but you might then think it can look better.

5 (optional): Pro up.
Use tabs to combine multiple apps in one, use different widgets, etc. Tutorials like this one on Youtube can take you there, especially those that provide the code, as in the description of that video. Use those scripts as templates. For example, see this script in which the function conditionalPanel() is used to modify the app’s sidebar based on which tab is selected. The utility of tabs is illustrated in the upper cover of this article and in the app shown in the text: When having multiple data sections, the tabs allow you to have all in one (cover screenshot), instead of linking to other apps in each (screenshot in text).

Time for logistics. You can include any text in your app’s webpage, such as explanations of any length, web links, and author information. Oh, also importantly: Shiny allows for the presentation of data in any of the forms available in R—notably, plots and tables. Costs: Shiny is free to use, in principle. You may only have to pay if you use your app(s) a lot (commonly from the first month), in which case you may pay 9 euros a month. There are different subscription plans; the free plan allows 25 hours of use per month, distributed among up to five apps.

There do exist alternatives to Shiny. One is fairly similar: It’s called Tableau. A nice comparison of the two apps is available here. Then, there’s a more advanced application called D3.js, which allows for lush graphics but proves more restrictive to newbies.

In sum, if you already program in R or even in Python, but have not tried online data presentation yet, consider it.

Feel free to share your ideas, experiences, or doubts in a comment on the original post.

To leave a comment for the author, please follow the link and comment on their blog.

Watch presentations from R/Finance 2017

Wed, 05/31/2017 - 23:23

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It was another great year for the R/Finance conference, held earlier this month in Chicago. This is normally a fairly private affair: with attendance capped at around 300 people every year, it's a somewhat exclusive gathering of the best and brightest minds from industry and academia in financial data analysis with R. But for the first time this year (and with thanks to sponsorship from Microsoft), videos of the presentations are available for viewing by everyone. I've included the complete list (copied from the R/Finance website) below, but here are a few of my favourites:

You can find an up-to-date version of the table below at the R/Finance website (click on the "Program" tab), and you can also browse the videos at Channel 9. Note that the lightning talk sessions (in orange) are bundled together in a single video, which you can find linked after the first talk in each session.

Friday, May 19th, 2017

09:30 – 09:35  Kickoff (video)
09:35 – 09:40  Sponsor Introduction
09:40 – 10:10  Marcelo Perlin: GetHFData: An R package for downloading and aggregating high frequency trading data from Bovespa (pdf) (video)
               Jeffrey Mazar: The obmodeling Package (html)
               Yuting Tan: Return Volatility, Market Microstructure Noise, and Institutional Investors: Evidence from High Frequency Market (pdf)
               Stephen Rush: Adverse Selection and Broker Execution (pdf)
               Jerzy Pawlowski: How Can Machines Learn to Trade? (html)
10:10 – 10:30  Michael Hirsch: Revealing High-Frequency Trading Provisions of Liquidity with Visualization in R (html) (video)
10:30 – 10:50  Eric Glass: Equity Factor Portfolio Case Study (html) (video)
10:50 – 11:10  Break
11:10 – 11:30  Seoyoung Kim: Zero-Revelation RegTech: Detecting Risk through Linguistic Analysis of Corporate Emails and News (pdf) (video)
11:30 – 12:10  Szilard Pafka: No-Bullshit Data Science (pdf) (video)
12:10 – 13:30  Lunch
13:30 – 14:00  Francesco Bianchi: Measuring Risk with Continuous Time Generalized Autoregressive Conditional Heteroscedasticity Models (pdf) (video)
               Eina Ooka: Bunched Random Forest in Monte Carlo Risk Simulation (pdf)
               Matteo Crimella: Operational Risk Stress Testing: An Empirical Comparison of Machine Learning Algorithms and Time Series Forecasting Methods (pdf)
               Thomas Zakrzewski: Using R for Regulatory Stress Testing Modeling (pdf)
               Andy Tang: How much structure is best? (pptx)
14:00 – 14:20  Robert McDonald: Ratings and Asset Allocation: An Experimental Analysis (pdf)
14:20 – 14:50  Break
14:50 – 15:10  Dries Cornilly: Nearest Comoment Estimation with Unobserved Factors and Linear Shrinkage (pdf) (video)
15:10 – 15:30  Bernhard Pfaff: R package: mcrp: Multiple criteria risk contribution optimization (pdf) (video)
15:30 – 16:00  Oliver Haynold: Practical Options Modeling with the sn Package, Fat Tails, and How to Avoid the Ultraviolet Catastrophe (pdf) (video)
               Shuang Zhou: A Nonparametric Estimate of the Risk-Neutral Density and Its Applications (pdf)
               Luis Damiano: A Quick Intro to Hidden Markov Models Applied to Stock Volatility
               Oleg Bondarenko: Rearrangement Algorithm and Maximum Entropy (pdf)
               Xin Chen: Risk and Performance Estimator Standard Errors for Serially Correlated Returns (pdf)
16:00 – 16:20  Qiang Kou: Text analysis using Apache MxNet (pdf) (video)
16:20 – 16:40  Robert Krzyzanowski: Syberia: A development framework for R (pdf) (video)
16:40 – 16:52  Matt Dancho: New Tools for Performing Financial Analysis Within the 'Tidy' Ecosystem (pptx) (video)
               Leonardo Silvestri: ztsdb, a time-series DBMS for R users (pdf)

Saturday, May 20th, 2017

09:05 – 09:35  Stephen Bronder: Integrating Forecasting and Machine Learning in the mlr Framework (pdf) (video)
               Leopoldo Catania: Generalized Autoregressive Score Models in R: The GAS Package (pdf)
               Guanhao Feng: Regularizing Bayesian Predictive Regressions (pdf)
               Jonas Rende: partialCI: An R package for the analysis of partially cointegrated time series (pdf)
               Carson Sievert: Interactive visualization for multiple time series (pdf)
09:35 – 09:55  Emanuele Guidotti: yuimaGUI: A graphical user interface for the yuima package (pptx) (video)
09:55 – 10:15  Daniel Kowal: A Bayesian Multivariate Functional Dynamic Linear Model (pdf) (video)
10:15 – 10:45  Break
10:45 – 11:05  Jason Foster: Scenario Analysis of Risk Parity using RcppParallel (pdf) (video)
11:05 – 11:35  Michael Weylandt: Convex Optimization for High-Dimensional Portfolio Construction (pdf) (video)
               Lukas Elmiger: Risk Parity Under Parameter Uncertainty (pdf)
               Ilya Kipnis: Global Adaptive Asset Allocation, and the Possible End of Momentum (pptx)
               Vyacheslav Arbuzov: Dividend strategy: towards the efficient market (pdf)
               Nabil Bouamara: The Alpha and Beta of Equity Hedge UCITS Funds – Implications for Momentum Investing (pdf)
11:35 – 12:15  Dave DeMers: Risk Fast and Slow (pdf) (video)
12:15 – 13:35  Lunch
13:35 – 13:55  Matthew Dixon: MLEMVD: A R Package for Maximum Likelihood Estimation of Multivariate Diffusion Models (pdf) (video)
13:55 – 14:15  Jonathan Regenstein: Reproducible Finance with R: A Global ETF Map (html) (video)
14:15 – 14:35  David Ardia: Markov-Switching GARCH Models in R: The MSGARCH Package (pdf) (video)
14:35 – 14:55  Keven Bluteau: Forecasting Performance of Markov-Switching GARCH Models: A Large-Scale Empirical Study (pdf) (video)
14:55 – 15:07  Riccardo Porreca: Efficient, Consistent and Flexible Credit Default Simulation (pdf) (video)
               Maisa Aniceto: Machine Learning and the Analysis of Consumer Lending (pdf)
15:07 – 15:27  David Smith: Detecting Fraud at 1 Million Transactions per Second (pptx) (video)
15:27 – 15:50  Break
15:50 – 16:10  Thomas Harte: The PE package: Modeling private equity in the 21st century (pdf) (video)
16:10 – 16:30  Guanhao Feng: The Market for English Premier League (EPL) Odds (pdf) (video)
16:30 – 16:50  Bryan Lewis: Project and conquer (html) (video)



To leave a comment for the author, please follow the link and comment on their blog: Revolutions. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

VIF and multicollinearity diagnostics

Wed, 05/31/2017 - 23:14

(This article was first published on R code – Serious Stats, and kindly contributed to R-bloggers)

In the book I use the car package to get VIF and other multicollinearity diagnostics. I’ve occasionally found this breaks down (usually through mixing different versions of R on different machines at work, home, or on the move). I recently saw the mctest package and thought it would be useful as a backup – and also because it offers a slightly different set of diagnostics.

If you’d like to try it out, I’ve written a helper function that makes it easier to apply directly to a linear model. You can find the function and a simple example here.
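For intuition about what these diagnostics report: the VIF of a predictor is 1/(1 − R²) from regressing that predictor on the remaining predictors. A minimal base-R sketch on the built-in mtcars data (illustrative only; car::vif() or the mctest package give this, and more, directly):

```r
# Manual VIF: regress each predictor on the others, then take 1 / (1 - R^2).
# Illustrative sketch only -- not the helper function from the post.
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- manual_vif(mtcars, c("disp", "hp", "wt"))
print(vifs)  # values well above 1 signal collinearity among the predictors
```

Values around 5–10 are the usual rule-of-thumb warning thresholds, though opinions differ.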




Bland-Altman/Tukey Mean-Difference Plots using ggplot2

Wed, 05/31/2017 - 23:04

(This article was first published on Environmental Science and Data Analytics, and kindly contributed to R-bloggers)

A very useful data visualisation tool in science, particularly in medical and sports settings, is the Bland-Altman/Tukey Mean-Difference plot. When comparing two sets of measurements for the same variable made by different instruments, it is often required to determine whether the instruments are in agreement or not.

Correlation and linear regression can tell us something about the bivariate relationship which exists between two sets of measurements. We can identify the strength, form and direction of a relationship but this approach is not recommended for comparative analyses.

The Bland-Altman plot was first used in 1983 by J. M. Bland and D. G. Altman, who applied it to medical statistics. The Tukey mean-difference plot was one of many exploratory data visualisation tools created by John Tukey who, interestingly, also created the beloved boxplot.

A simple implementation of the Bland-Altman/Tukey Mean-Difference plot in R is described below. The dataset used for this example is one of R’s built-in datasets, Loblolly, which contains data on the growth of Loblolly pine trees.

Prepare the workspace and have a look at the structure of the data.

library(dplyr)
library(ggplot2)

pine_df <- Loblolly
glimpse(pine_df)


Randomly split the data into two samples of equal length. These samples will be used as our two sets of pine tree height measurements. Arrange in order of descending height to mimic realistic dual measurements of single trees and combine into a dataframe. Note that these data are used simply for demonstration: in reality the measures are NOT of the same tree taken with different instruments.

sample_df <- data.frame(
  sample_n(pine_df, size = nrow(pine_df) * 0.5) %>%
    select(height) %>%
    arrange(desc(height)),
  sample_n(pine_df, size = nrow(pine_df) * 0.5) %>%
    select(height) %>%
    arrange(desc(height))
)


Assign sensible names to the two pine tree height samples in the dataframe.

names(sample_df) <- c("Sample_1", "Sample_2")

Each row in the dataframe consists of a pair of measurements. The Bland-Altman plot has the average of the two measures in a pair on the x-axis. The y-axis contains the difference between the two measures in each pair. Add the averages and differences data to the dataframe.

sample_df$Avg <- (sample_df$Sample_1 + sample_df$Sample_2) / 2
sample_df$Dif <- sample_df$Sample_1 - sample_df$Sample_2

Finally, code the plot and add the mean difference (blue line) and the 95% limits of agreement (red lines), computed as the mean difference ± 1.96 × SD of the differences.

ggplot(sample_df, aes(x = Avg, y = Dif)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = mean(sample_df$Dif), colour = "blue", size = 0.5) +
  geom_hline(yintercept = mean(sample_df$Dif) - (1.96 * sd(sample_df$Dif)), colour = "red", size = 0.5) +
  geom_hline(yintercept = mean(sample_df$Dif) + (1.96 * sd(sample_df$Dif)), colour = "red", size = 0.5) +
  ylab("Diff. Between Measures") +
  xlab("Average Measure")
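Numerically, the red lines are just the mean difference plus or minus 1.96 standard deviations of the differences. A base-R sketch with made-up measurements (the values are illustrative, not from the pine data):

```r
# Limits of agreement: mean difference +/- 1.96 * SD of the differences.
m1 <- c(10.2, 11.5, 9.8, 12.1, 10.9)   # instrument 1 (made-up values)
m2 <- c(10.0, 11.9, 9.5, 12.4, 10.6)   # instrument 2 (made-up values)

dif  <- m1 - m2
bias <- mean(dif)                        # the blue line
loa  <- bias + c(-1.96, 1.96) * sd(dif)  # the two red lines

round(c(bias = bias, lower = loa[1], upper = loa[2]), 3)
```

If roughly 95% of the points fall between the red lines and the bias is near zero, the two instruments can be considered in agreement for practical purposes.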



RStudio Server Pro is now available on AWS Marketplace

Wed, 05/31/2017 - 23:02

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

RStudio is excited to announce the availability of its flagship enterprise-ready integrated development environment for R in AWS Marketplace.

RStudio Server Pro AWS is identical to RStudio Server Pro, but with turnkey convenience. It comes pre-configured with multiple versions of R, common systems libraries, and the most popular R packages.

RStudio Server Pro AWS helps you adapt to your unique circumstances. It allows you to choose different AWS computing instances no matter how large, whenever a project requires it (flat hourly pricing). Or you can set up a persistent instance of RStudio Server Pro ready to be used anytime you need it (annual pricing), avoiding the sometimes complicated processes for procuring on-premises software.

If the enhanced security, elegant support for multiple R versions and multiple sessions, and commercially licensed and supported features of RStudio Server Pro appeal to you and your enterprise, consider RStudio Server Pro for AWS. It’s ready to go!

Read the FAQ         Try RStudio Server Pro AWS


Evaluate your model with R Exercises

Wed, 05/31/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

There was a time when statisticians had to crunch numbers by hand to fit their data to a model. Since this process was so long, they usually did a lot of preliminary work: researching models that had worked in the past, or looking for studies in other scientific fields, like psychology or sociology, that could inform their model, all with the goal of maximizing the chance of building a relevant one. Then they would create a model and an alternative model and choose the one that seemed more efficient.

Now that even an average computer gives us incredible computing power, it’s easy to fit multiple models and choose the one that best fits the data. Even though it is better to have good prior knowledge of the process you are trying to analyze and of models used in the past, coming to a conclusion using mostly the data helps you avoid bias and build better models.

In this set of exercises, we’ll see how to apply the most commonly used error metrics to your models in order to rate them and choose the one most appropriate for the situation.
Most of these error metrics are not part of any R package, so you will have to apply the equations given here to your data. Personally, I prefer to write a function that I can easily
use on every one of my models, but there are many ways to code these equations. If your code is different from the one in the solution, feel free to post your code in the comments.

Answers to the exercises are available here.

Exercise 1
We start by looking at error metrics for regression models. For linear regression problems, the most commonly used metrics are the coefficient of determination R2, which shows what percentage of variance is explained by the model, and the adjusted R2, which penalizes models that include variables without an effective contribution (see this page for more details). Load the attitude dataset from the package of the same name and fit three linear models explaining the rating variable. The first uses all the variables in the dataset, the second uses the variables complaints, privileges, learning and advance as independent variables, and the third uses only the complaints, learning and advance variables. Then use the summary() function to print R2 and the adjusted R2.

Exercise 2
Another way to measure how well your model fits your data is the Root Mean Squared Error (RMSE), defined as the square root of the average of the squared errors made by your model. You can find the mathematical definition of the RMSE on this page.
Calculate the RMSE of the predictions made by your three models.

Exercise 3
The mean absolute error (MAE) is a good alternative to the RMSE if you don’t want to penalise large estimation errors too heavily. The mathematical definition of the MAE can be found here.
Calculate the MAE of the predictions made by the three models.

Exercise 4
Sometimes some prediction errors hurt your model more than others. For example, if you are trying to predict the financial loss of a business over a period of time, underestimating the loss would
put the business at risk of bankruptcy, while overestimating it results in a conservative model. In those cases, the Root Mean Squared Logarithmic Error (RMSLE) is a useful error
metric, since it penalizes underestimation more heavily. The RMSLE is given by the equation on this page.
Calculate the RMSLE of the predictions made by the three models.

Exercise 5
Now that we’ve seen some examples of error metrics for the regression context, let’s look at five error metrics that can be used when you perform clustering analysis. But
first, we must create a clustering model to test those metrics on. Load the iris dataset and apply the kmeans algorithm. Since the iris dataset has three distinct
labels, use the kmeans algorithm with three centers. Also, set the maximum number of iterations to 50 and use the “Lloyd” algorithm. Once that’s done, take time to rearrange the labels of your
predictions so they are compatible with the factors in the iris dataset.
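A hedged sketch of that clustering step (one possible way to do it; note the Species column is dropped, since kmeans needs numeric input):

```r
set.seed(42)  # kmeans starts from random centers, so fix the seed
km <- kmeans(iris[, 1:4], centers = 3, iter.max = 50, algorithm = "Lloyd")

# Cross-tabulate clusters against species to see which label each
# cluster should be mapped to (the majority species in its row).
table(km$cluster, iris$Species)
```

The cross-tabulation makes the relabelling step concrete: each cluster number is renamed to the species that dominates its row.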

Exercise 6
Print the confusion matrix of your model.

Exercise 7
The easiest way to measure how well your model categorized the data is to calculate the accuracy, the recall and the precision of your results. Write three functions that return those individual values and calculate these metrics for your models.
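One possible shape for these helpers, computing everything from a confusion matrix (rows = predicted, columns = actual); the names and example labels here are mine, not the official solution’s:

```r
# Illustrative metric helpers; precision and recall are per class.
accuracy  <- function(cm) sum(diag(cm)) / sum(cm)
precision <- function(cm) diag(cm) / rowSums(cm)
recall    <- function(cm) diag(cm) / colSums(cm)

pred   <- factor(c("a", "a", "b", "b", "b", "a"))  # made-up predictions
actual <- factor(c("a", "b", "b", "b", "a", "a"))  # made-up truth
cm <- table(pred, actual)

accuracy(cm)   # 4 of 6 correct
precision(cm)
recall(cm)
```

For the iris clustering, `pred` would be your relabelled cluster assignments and `actual` the Species factor.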

Exercise 8
The F-measure summarizes the precision and recall of your model by calculating the harmonic mean of those two values.
Write a function that returns the F-measure of your model and compute this measure twice for your data: once with a parameter of 2 and then with a parameter of 0.5.
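A sketch of such a function, parameterized by beta (the “parameter” the exercise refers to); with beta = 1 it reduces to the familiar F1 score:

```r
# F-beta: beta > 1 weights recall more, beta < 1 weights precision more.
f_measure <- function(prec, rec, beta = 1) {
  (1 + beta^2) * prec * rec / (beta^2 * prec + rec)
}

f_measure(0.8, 0.6, beta = 2)    # recall-weighted  (0.8 and 0.6 are made-up)
f_measure(0.8, 0.6, beta = 0.5)  # precision-weighted
```

In practice you would feed in the precision and recall values computed from your confusion matrix in the previous exercise.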

Exercise 9
Purity is a measure of the homogeneity of your clusters: if every cluster groups only objects of the same class, you get a purity score of one, and if there is no majority class in any of the clusters, the purity score approaches 0. Write a function that returns the purity score of your model and test it on your predictions.
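One way such a function might look (a sketch; note that with k clusters the worst attainable score is 1/k rather than exactly 0):

```r
# Purity: fraction of points that fall in their cluster's majority class.
purity <- function(clusters, classes) {
  cm <- table(clusters, classes)
  sum(apply(cm, 1, max)) / length(clusters)
}

purity(c(1, 1, 2, 2), c("x", "x", "y", "y"))  # perfectly homogeneous -> 1
purity(c(1, 1, 2, 2), c("x", "y", "x", "y"))  # evenly mixed -> 0.5
```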

Exercise 10
The last error metric we’ll see today is the Dunn index, which indicates whether the clusters are compact and well separated. You can find the mathematical definition of the Dunn index here. Load the clValid package and use the dunn() function on your model to compute the Dunn index of your classification. Note that this function takes an integer vector representing the cluster partitioning as a parameter.
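For intuition, here is a naive base-R version of the index (clValid’s dunn() is the one to use in the exercise; this sketch assumes Euclidean distance and a simple min/max definition):

```r
# Dunn index sketch: smallest between-cluster distance divided by the
# largest within-cluster diameter. Higher is better.
dunn_index <- function(data, clusters) {
  d <- as.matrix(dist(data))
  ks <- unique(clusters)
  seps  <- c()  # between-cluster separations
  diams <- c()  # within-cluster diameters
  for (i in ks) {
    diams <- c(diams, max(d[clusters == i, clusters == i]))
    for (j in ks[ks != i]) seps <- c(seps, min(d[clusters == i, clusters == j]))
  }
  min(seps) / max(diams)
}

set.seed(1)
dat <- rbind(matrix(rnorm(20, 0), ncol = 2), matrix(rnorm(20, 10), ncol = 2))
dunn_index(dat, rep(1:2, each = 10))  # well-separated clusters -> large value
```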

Related exercise sets:
  1. Forecasting for small business Exercises (Part-4)
  2. Forecasting: ARIMAX Model Exercises (Part-5)
  3. Forecasting: Exponential Smoothing Exercises (Part-3)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory


New online datacamp course: Forecasting in R

Wed, 05/31/2017 - 17:23

(This article was first published on, and kindly contributed to R-bloggers)

Forecasting in R is taught by Rob J. Hyndman, author of the forecast package

Forecasting involves making predictions about the future. It is required in many situations: deciding whether to build another power generation plant in the next ten years requires forecasts of future demand; scheduling staff in a call center next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements. Forecasts can be required several years in advance (as for capital investments), or only a few minutes beforehand (as for telecommunication routing). Whatever the circumstances or time horizons involved, forecasting is an important aid to effective and efficient planning. This course provides an introduction to time series forecasting using R.

What You’ll Learn
Chapter 1: Exploring and Visualizing Time Series in R
The first thing to do in any data analysis task is to plot the data.
Chapter 2: Benchmark Methods and Forecast Accuracy
In this chapter, you will learn general tools that are useful for many different forecasting situations.
Chapter 3: Exponential Smoothing
This framework generates reliable forecasts quickly and for a wide range of time series.
Chapter 4: Forecasting with ARIMA Models
ARIMA models provide another approach to time series forecasting.
Chapter 5: Advanced Methods
In this chapter, you will look at some methods that handle more complicated seasonality.

You can start the first chapter of Forecasting in R for free.


Drilling Into CSVs — Teaser Trailer

Wed, 05/31/2017 - 17:12

(This article was first published on R –, and kindly contributed to R-bloggers)

I used reading a directory of CSVs as the foundational example in my recent post on idioms.

During my exchange with Matt, Hadley and a few others — in the crazy Twitter thread that spawned said post — I mentioned that I’d personally “just use Drill.”

I’ll use this post as a bit of a teaser trailer for the actual post (or, more likely, series of posts) that goes into detail on where to get Apache Drill, basic setup of Drill for standalone workstation use and then organizing data with it.

You can get ahead of those posts by doing two things:

  1. Download, install and test your Apache Drill setup (it’s literally 10 minutes on any platform)
  2. Review the U.S. EPA annual air quality data archive (they have individual, annual CSVs that are perfect for the example)

My goals for this post are really just to pique your interest enough in Drill and parquet files (yes, I’m ultimately trying to socially engineer you into using parquet files) to convince you to read the future post(s) and show that it’s worth your time to do Step #1 above.

Getting EPA Air Quality Data

The EPA has air quality data going back to 1990 (so, 27 files as of this post). They’re ~1-4MB ZIP compressed and ~10-30MB uncompressed.

You can use the following code to grab them all, with the caveat that the libcurl method of performing simultaneous downloads caused some pretty severe issues — like R crashing — for some of my students who use Windows. There are plenty of examples of sequential downloads of a list of URLs out there, so folks should be able to get all the files even if this succinct method does not work on your platform.

dir.create("airq")
urls <- sprintf("", 1990L:2016L)
fils <- sprintf("airq/%s", basename(urls))
download.file(urls, fils, method = "libcurl")

I normally shy away from this particular method since it really hammers the remote server, but this is a beefy U.S. government server, the files are relatively small in number and size and I’ve got a super-fast internet connection (no long-lived sockets) so it should be fine.

Putting all those files under the “control” of Drill is what the next post is for. For now, I’m going to show the basic code and benchmarks for reading in all those files and performing a basic query for all the distinct years. Yes, we know that information already, but it’s a nice, compact task that’s easy to walk through and illustrates the file reading and querying in all three idioms: Drill, tidyverse and data.table.

Data Setup

I’ve converted the EPA annual ZIP files into bzip2 format. ZIP is fine for storage and downloads but it’s not a great format for data analysis tasks. gzip would be slightly faster but it’s not easily splittable and — even though I’m not using the data in a Hadoop context — I think it’s wiser to not have to re-process data later on if I ever had to move raw CSV or JSON data into Hadoop. Uncompressed CSVs are the most portable, but there’s no need to waste space.

All the following files are in a regular filesystem directory accessible to both Drill and R:

> (epa_annual_fils <- dir("~/Data/csv/epa/annual", "*.csv.bz2"))
 [1] "annual_all_1990.csv.bz2" "annual_all_1991.csv.bz2" "annual_all_1992.csv.bz2"
 [4] "annual_all_1993.csv.bz2" "annual_all_1994.csv.bz2" "annual_all_1995.csv.bz2"
 [7] "annual_all_1996.csv.bz2" "annual_all_1997.csv.bz2" "annual_all_1998.csv.bz2"
[10] "annual_all_1999.csv.bz2" "annual_all_2000.csv.bz2" "annual_all_2001.csv.bz2"
[13] "annual_all_2002.csv.bz2" "annual_all_2003.csv.bz2" "annual_all_2004.csv.bz2"
[16] "annual_all_2005.csv.bz2" "annual_all_2006.csv.bz2" "annual_all_2007.csv.bz2"
[19] "annual_all_2008.csv.bz2" "annual_all_2009.csv.bz2" "annual_all_2010.csv.bz2"
[22] "annual_all_2011.csv.bz2" "annual_all_2012.csv.bz2" "annual_all_2013.csv.bz2"
[25] "annual_all_2014.csv.bz2" "annual_all_2015.csv.bz2" "annual_all_2016.csv.bz2"

Drill can directly read plain or compressed JSON, CSV and Apache web server log files plus can treat a directory tree of them as a single data source. It can also read parquet & avro files (both are used frequently in distributed “big data” setups) and access MySQL, MongoDB and other JDBC resources as well as query data stored in Amazon S3 and HDFS (I’ve already mentioned it works fine in plain ‘ol filesystems, too).

I’ve tweaked my Drill configuration to support reading column header info from .csv files (which I’ll show in the next post). In environments like Drill or even Spark, CSV columns are usually queried with some type of column index (e.g. COLUMN[0]) so having named columns makes for less verbose query code.

I turned those individual bzip2 files into parquet format with one Drill query:

CREATE TABLE dfs.pq.`/epa/annual.parquet` AS SELECT * FROM dfs.csv.`/epa/annual/*.csv.bz2`

Future posts will explain the dfs... component, but these are likely familiar path specifications for folks used to Spark and are pretty straightforward. The first bit (up to the back-tick) is an internal Drill shortcut to the actual storage path (which is a plain directory in this test), followed by the tail end path spec to the subdirectories and/or target files. That one statement says “take all the CSV files in that directory and make one big table out of them”.

The nice thing about parquet files is that they work much like R data frames in that they can be processed on the column level. We’ll see how that speeds up things in a bit.

Benchmark Setup

The tests were performed on a maxed out 2016 13″ MacBook Pro.

There are 55 columns of data in the EPA annual summary files.

To give both read_csv and fread some benchmark boosts, we’ll define the columns up-front and pass those in to each function on data ingestion and I’ll leave them out of this post for brevity (they’re just a cols() specification and colClasses vector). Drill gets no similar help for this at least when it comes to CSV processing.

I’m also disabling progress & verbose reporting in both fread and read_csv despite not stopping Drill from writing out log messages.

Now, we need some setup code to connect to drill and read in the list of files, plus we’ll setup the five benchmark functions to read in all the files and get the list of distinct years from each.

library(sergeant)
library(data.table)
library(tidyverse)

(epa_annual_fils <- dir("~/Data/csv/epa/annual", "*.csv.bz2", full.names = TRUE))

db <- src_drill("localhost")

# Remember, defining ct & ct_dt - the column type specifications - has been left out for brevity

mb_drill_csv <- function() {
  epa_annual <- tbl(db, "dfs.csv.`/epa/annual/*.csv.bz2`")
  select(epa_annual, Year) %>%
    distinct(Year) %>%
    collect()
}

mb_drill_parquet <- function() {
  epa_annual_pq <- tbl(db, "dfs.pq.`/epa/annual.parquet`")
  select(epa_annual_pq, Year) %>%
    distinct(Year) %>%
    collect()
}

mb_tidyverse <- function() {
  map_df(epa_annual_fils, read_csv, col_types = ct, progress = FALSE) -> tmp
  unique(tmp$Year)
}

mb_datatable <- function() {
  rbindlist(
    lapply(
      epa_annual_fils,
      function(x) {
        fread(sprintf("bzip2 -c -d %s", x),
              colClasses = ct_dt,
              showProgress = FALSE,
              verbose = FALSE)
      })) -> tmp
  unique(tmp$Year)
}

mb_rda <- function() {
  read_rds("~/Data/rds/epa/annual.rds") -> tmp
  unique(tmp$Year)
}

microbenchmark(
  csv = { mb_drill_csv() },
  pq  = { mb_drill_parquet() },
  df  = { mb_tidyverse() },
  dt  = { mb_datatable() },
  rda = { mb_rda() },
  times = 5
) -> mb

Yep, it’s really as simple as:

tbl(db, "dfs.csv.`/epa/annual/*.csv.bz2`")

to have Drill treat a directory tree as a single table. It’s also not necessary for all the columns to be in all the files (i.e. you get the bind_rows/map_df/rbindlist behaviour for “free”).

I’m only doing 5 evaluations here since I don’t want to make you wait if you’re going to try this at home now or after the Drill series. I’ve run it with a more robust benchmark configuration and the results are aligned with this one.

Unit: milliseconds
 expr        min         lq       mean     median         uq        max neval
  csv 15473.5576 16851.0985 18445.3905 19586.1893 20087.1620 20228.9450     5
   pq   493.7779   513.3704   616.2634   550.5374   732.6553   790.9759     5
   df 41666.1929 42361.1423 42701.2682 42661.9521 43110.3041 43706.7498     5
   dt 37500.9351 40286.2837 41509.0078 42600.9916 43105.3040 44051.5247     5
  rda  9466.6506  9551.7312 10012.8560  9562.9114  9881.8351 11601.1517     5

The R data route, which is the closest to the parquet route, is definitely better than slurping up CSVs all the time. Both parquet and R data files require pre-processing, so they’re not as flexible as having individual CSVs (that may get added hourly or daily to a directory).

Drill’s CSV slurping handily beats the other R methods even with some handicaps the others did not have.

This particular example is gamed a bit, which helped parquet to ultimately “win”. Since Drill can target the singular column (Year) that was asked for, it doesn’t need to read all the extra columns just to compute the final product (the distinct list of years).

IMO both the Drill CSV ingestion and Drill parquet access provide compelling enough use-cases to use them over the other three methods, especially since they are easily transferrable to remote Drill servers or clusters with virtually no code changes. A single node Drillbit (like R) is constrained by the memory on that individual system, so it’s not going to get you out of a memory jam, but it may make it easier to organize and streamline your core data operations before other analysis and visualization tasks.


I’m sure some member of some other tribe will come up with an example that proves the superiority of their particular tribal computations. I’m hoping one of those tribes is the R/Spark tribe so that it can get added into the mix (using Spark standalone is much like using Drill, but with more stats/ML functions directly available).

I’m hopeful that this post has showcased enough of Drill’s utility to general R users that you’ll give it a go and consider adding it to your R data analysis toolbox. It can be beneficial to have both precision tools and a Swiss Army knife — which is what Drill really is — handy.

You can find the sergeant package on GitHub.


GSoC 2017 : Parser for Biodiversity Checklists

Wed, 05/31/2017 - 11:24

(This article was first published on Vijay Barve, and kindly contributed to R-bloggers)

Guest post by Qingyue Xu

Compiling taxonomic checklists from varied sources of data is a common task that biodiversity informaticians encounter. In the GSoC 2017 project Parser for Biodiversity checklists, my overall goal is to extract taxonomic names from given text into a tabular format, so that biodiversity data can easily be aggregated in a structured form for further processing.

I mainly plan to build three major functions which serve different purposes and take various sources of text into account.

However, before building the functions, we need to identify and cover as many different formats of scientific names as possible, because the inconsistencies of scientific names complicate matters. Scientific names commonly follow the order:

genus, [species], [subspecies], [author, year], [location]

Many components are optional, and some components, like author and location, can occur once or more. Therefore, when parsing the text, we need to analyze its structure, match it against all possible patterns of scientific names, and identify the most likely one. To resolve the problem more accurately, we can also draw on NLTK (Natural Language Toolkit) to identify “PERSON” and “LOCATION” entities, so that we can analyze the components of scientific names more efficiently.
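As a toy illustration of this pattern-matching idea (a sketch only, not the project code; the function name and regex are hypothetical and cover just the simplest "Genus species Author, Year" shape):

```r
# Toy matcher for the simplest "Genus species Author, Year" pattern.
# Real checklists need many more patterns than this single regex.
parse_name <- function(x) {
  m <- regmatches(x, regexec(
    "^([A-Z][a-z]+) ([a-z]+)(?: ([A-Z][A-Za-z.]+))?(?:, (\\d{4}))?",
    x, perl = TRUE))[[1]]
  if (length(m) == 0) return(NULL)  # no scientific name recognized
  setNames(as.list(m[-1]), c("genus", "species", "author", "year"))
}

parse_name("Agaricus campestris Linnaeus, 1753")
```

Optional components that are absent simply come back as empty strings, which maps naturally onto the optional columns of the tabular output described below.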

Function: find_taxoname (url/input_file, output_file)

  • Objective: This is a function to search scientific names with supplied texts, especially applied to the situation when the text is not well structured.
  • Parameters: The first parameter is the URL of a web page (HTML based) or the file path of a PDF/TXT file, which is our source text to search for the biology taxonomic names. The second parameter is the file path of the output file and it will be in a tabular format including columns of the genus, species, subspecies, author, year.
  • Approach: Since this function is intended for the unstructured text, we can’t find a certain pattern to parse the taxonomic names. What we can do is utilizing existing standard dictionaries of scientific names to locate the genus, species, subspecies. By analyzing the surrounding structure and patterns, we can find corresponding genus, species, subspecies, author, year, if they exist, and output the findings in a tabular format.

Function: parse_taxolist(input_file, filetype, sep, output_file, location)

  • Objective: This is a function to parse and extract taxonomic names from a given structured text file and each row must contain exactly one entry of the scientific names. If the location information is given, the function can also list and return the exact location (latitude and longitude) of the species. The output is also in a tabular format including columns of genus, species, subspecies, author(s), year(s), location(s), latitude(s), longitude(s).
  • Parameters: The first parameter is the file path of the input file and the second parameter is the file type, which supports txt, PDF, CSV types of files. The third parameter ‘sep’ should indicate the separators used in the input file to separate every word in the same row. The fourth parameter is the intended file path of the output file. The last parameter is a Boolean, indicating whether the input file contains the location information. If ‘true’, then the output will contain detailed location information.
  • Approach: The function will parse the input file, based on rows and the given separators, into a well-organized tabular format. An extended function is to pinpoint the exact location of the species if the related information is given. With location info such as “Mirik, West Bengal, India”, the function will return the exact latitude and longitude of this location as 26°53’7.07″N and 88°10’58.06″E. It can be realized through crawling the web page of or utilizing the API of Google Maps. This is also a possible way to identify whether a piece of text represents a location: if we cannot get an exact latitude and longitude, then it’s not a location. If a scientific name doesn’t contain location information, the function will return a NULL value for the location part. If it contains multiple locations, the function will return multiple values as a list, as well as the latitudes and longitudes.

Function: recursive_crawler(url, htmlnode_taxo, htmlnode_next, num, output_file, location)

  • Objective: This function is intended to crawl the web pages containing information about taxonomic names recursively. The start URL must be given and the html_node of the scientific names should also be indicated. Also, if the text contains location info, the output will also include the detailed latitude and longitude.
  • Parameters: The first parameter is the start URL of the web page and the following web pages must follow the same structure as the first web page. The second parameter is the html_node of the taxonomic names, such as “.SP .SN > li”. (There’re a lot of tools for the users to identify the HTML nodes code for certain contexts). The third parameter is the html_node of the next page, which can lead us to the next page of another genus. The fourth parameter ‘num’ is the intended number of web pages the user indicates. If ‘num’ is not given, the function will automatically crawl and stop until the htmlnode_next cannot return a valid URL. The next two parameters are the same with the above two functions.
  • Approach: For the parsing part and getting the location parts, the approach is the same as the above functions. For the crawling part, for a series of structured web pages, we can parse and get valid scientific names based on the given HTML nodes. The HTML nodes for the next pages should also be given, and we can always get the URL of the next page by extracting it from the source code. For example, the following screenshot from the web page we used provides a link which leads us to the next page. By recursively fetching the info from the current page and jump to the following pages, we can output a well-organized tabular file including all the following web pages.
  • Other possible functionalities to be realized

Since inconsistencies might exist in the format of scientific names, I also need to construct a function to normalize the names. The complication usually lies in the author part, and there are two possible approaches to the problem. The first is to keep analyzing the structure of the scientific name and try to capture as many exceptions as possible, such as author names with multiple parts or names with two authors. The second is to draw on the NLTK package to identify possible PERSON names. However, when the name gets too complicated, the parsing result won’t always be accurate. Therefore, we can add a parameter indicating how reliable the result is and flag the need for further manual processing when the parser cannot work reliably.
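A minimal sketch of the normalization idea, assuming a “Genus species Author” layout: treat the first two whitespace-separated tokens as the binomial and everything after them as the author part, and flag the result as unreliable when the author part looks complicated (for example, it contains “&” or many tokens), signalling the need for manual review. The function name normalize_name() and the reliability heuristic are my own illustrative assumptions.

```r
# Sketch: split a scientific name into binomial and author parts, with a
# crude reliability flag for complicated author strings.
normalize_name <- function(x) {
  tokens <- strsplit(trimws(x), "\\s+")[[1]]
  binomial <- paste(tokens[1:2], collapse = " ")
  author <- if (length(tokens) > 2) paste(tokens[-(1:2)], collapse = " ") else ""
  # Heuristic only: long author parts or multiple authors ("&") are flagged
  # for manual review rather than trusted.
  reliable <- length(tokens) <= 4 && !grepl("&", author, fixed = TRUE)
  list(binomial = binomial, author = author, reliable = reliable)
}

normalize_name("Agaricus campestris L.")
normalize_name("Lepiota cristata (Bolton) P. Kumm. & Sacc.")
```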


To leave a comment for the author, please follow the link and comment on their blog: Vijay Barve. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Getting data for every Census tract in the US with purrr and tidycensus

Wed, 05/31/2017 - 10:00

(This article was first published on KYLE WALKER DATA, and kindly contributed to R-bloggers)

Interested in more tips on working with Census data? Click here to join my email list!

Last week, I published the development version of my new R package, tidycensus. You can read through the documentation and some examples on the package website. I’m working on getting the package CRAN-ready with better error handling; in the meantime, I’m sharing a few examples to demonstrate its functionality.

If you are working on a national project that includes demographic data as a component, you might be interested in acquiring Census tract data for the entire United States. However, Census tract data are commonly available by state (with the exception of NHGIS, which is a wonderful resource), meaning that an analyst would have to spend time piecing the data together.

tidycensus solves this problem directly within R with help from the purrr package, a member of the tidyverse. In tidycensus, there is a built-in data frame named fips_codes that includes US state and county IDs; tidycensus uses this data frame to handle translations between state/county names and FIPS codes. However, this data frame can also be used to generate a vector of state codes to be fed to the map_df function in purrr. As such, this is all it takes to get a tibble of total population estimates for all US Census tracts from the 2011-2015 ACS:

library(tidycensus)
library(purrr)

# Un-comment below and set your API key
# census_api_key("YOUR KEY GOES HERE")

us <- unique(fips_codes$state)[1:51]

totalpop <- map_df(us, function(x) {
  get_acs(geography = "tract", variables = "B01003_001", state = x)
})

str(totalpop)
## Classes 'tbl_df', 'tbl' and 'data.frame': 73056 obs. of 5 variables:
##  $ GEOID   : chr "01001020100" "01001020200" "01001020300" "01001020400" ...
##  $ NAME    : chr "Census Tract 201, Autauga County, Alabama" "Census Tract 202, Autauga County, Alabama" "Census Tract 203, Autauga County, Alabama" "Census Tract 204, Autauga County, Alabama" ...
##  $ variable: chr "B01003_001" "B01003_001" "B01003_001" "B01003_001" ...
##  $ estimate: num 1948 2156 2968 4423 10763 ...
##  $ moe     : num 203 268 404 493 624 478 436 281 1000 535 ...

Get any ACS or decennial Census data in this way.

However – what if you also want tract geometry for mapping? This only requires a few small modifications. map_df in purrr uses the bind_rows function under the hood, which doesn’t work with simple features objects (yet). However, sf does have an rbind method that works for sf objects and can be fed to purrr’s reduce function.

library(sf)
options(tigris_use_cache = TRUE)

totalpop_sf <- reduce(
  map(us, function(x) {
    get_acs(geography = "tract", variables = "B01003_001",
            state = x, geometry = TRUE)
  }),
  rbind
)

str(totalpop_sf)
## Classes 'sf' and 'data.frame': 72843 obs. of 6 variables:
##  $ GEOID   : chr "01003010500" "01003011501" "01009050500" "01015981901" ...
##  $ NAME    : chr "Census Tract 105, Baldwin County, Alabama" "Census Tract 115.01, Baldwin County, Alabama" "Census Tract 505, Blount County, Alabama" "Census Tract 9819.01, Calhoun County, Alabama" ...
##  $ variable: chr "B01003_001" "B01003_001" "B01003_001" "B01003_001" ...
##  $ estimate: num 5321 5771 7007 4 1607 ...
##  $ moe     : num 452 825 556 6 235 309 506 386 425 310 ...
##  $ geometry:sfc_GEOMETRY of length 72843; first list element: List of 1
##   ..$ :List of 1
##   .. ..$ : num [1:55, 1:2] -87.8 -87.8 -87.8 -87.8 -87.8 ...
##   ..- attr(*, "class")= chr "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA
##   ..- attr(*, "names")= chr "GEOID" "NAME" "variable" "estimate" ...
##  - attr(*, "sf_column")= chr "geometry"

By declaring geometry = TRUE, tidycensus fetches tract feature geometry using the tigris package and merges it to the ACS data automatically for you. I recommend using the caching feature in the tigris package if you plan to use this workflow multiple times. You might note the discrepancy in tracts between the geometry-enabled and regular data frames; this is due to the removal of some water-only tracts in the cartographic boundary shapefiles used by tidycensus.
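The reduce-with-rbind idiom can be seen in miniature with base R’s Reduce() on plain data frames, a toy stand-in for the sf objects above (the state codes and values here are made up for illustration):

```r
# Toy illustration of the reduce pattern: rbind a list of per-state data
# frames into one. sf's rbind method makes the same idiom work for spatial
# data frames.
pieces <- lapply(c("AL", "AK", "AZ"), function(st) {
  data.frame(state = st, estimate = c(100, 200), stringsAsFactors = FALSE)
})
combined <- Reduce(rbind, pieces)
nrow(combined)  # 6 rows: 2 per state
```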

To leave a comment for the author, please follow the link and comment on their blog: KYLE WALKER DATA.

Training dates for Marketing Analytics course announced

Wed, 05/31/2017 - 01:00

(This article was first published on CYBAEA on R, and kindly contributed to R-bloggers)

Classes 3-7 July 2017: We have updated our very popular training course to take advantage of the latest Microsoft Advanced Analytics technology and re-launched it as Marketing Analytics using Microsoft R and Azure. We have just announced public training dates and are looking forward to giving the course in the week 3-7 July 2017. This is a live, instructor-led course in an online classroom environment. There are class times to suit everyone; participants in the Middle East and Asia register here for live classes 08:00-12:00 London time, while participants in North and South America register here for classes 10:00-14:00 New York time. Please see the listings for additional time zones. This course has consistently achieved 100% positive promoter scores, both for recommending the course and for recommending the instructor.

This intensive course gives you hands-on experience using Microsoft R and Azure covering the key techniques for analysing customer data for Sales and Marketing. The focus is on getting to the business results and you will return to your organization with the skills you need to deliver and demonstrate commercial impact from your work.

For more information and detailed syllabus, please see the course page: Marketing Analytics using Microsoft R and Azure

We are offering the course in partnership with Microsoft (following their acquisition of Revolutions); you can see our listing on Microsoft’s Learn Analytics portal.

Hope to see you there! If you have any questions, just contact us.

To leave a comment for the author, please follow the link and comment on their blog: CYBAEA on R.

Data Science Podcasts

Wed, 05/31/2017 - 00:00

(This article was first published on Jon Calder's R Blog, and kindly contributed to R-bloggers)

Make the most of your commute!

Podcasts are awesome. Especially when you’re stuck in traffic on the way to work.

Below are some podcasts I listen to that relate to data science and statistics. Each of them has something slightly different to offer, so if this is an area of interest to you then I recommend you give these a try!

Not So Standard Deviations

Roger Peng and Hilary Parker talk about the latest in data science and data analysis in academia and industry.

Data Skeptic

Data Skeptic is your source for a perspective of scientific skepticism on topics in statistics, machine learning, big data, artificial intelligence, and data science.

More or Less: Behind the Stats

Tim Harford and the More or Less team from BBC Radio 4 try to make sense of the statistics that surround us.

The R-Podcast

Giving practical advice on how to use R for powerful and innovative data analyses. The host of the R-Podcast is Eric Nantz, a statistician working in the life sciences industry who has been using R since 2004.

Partially Derivative

Hosted by Jonathon, Vidya, and Chris, Partially Derivative is a podcast about data science in the world around us. Episodes are a mix of explorations into the techniques used in data science and discussions with the field’s leading experts.

Linear Digressions

Hosts Katie Malone and Ben Jaffe explore machine learning and data science through interesting (and often very unusual) applications.

Are there other data science podcasts missing from this list that you can recommend? Feel free to comment below and let me know!

To leave a comment for the author, please follow the link and comment on their blog: Jon Calder's R Blog.

Reflections on ROpenSci Unconference 2017

Tue, 05/30/2017 - 23:37

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Last week I attended the ROpenSci Unconference in Los Angeles, and it was fantastic. Now in its fourth year, the ROpenSci team brought together a talented and diverse group of about 70 R developers from around the world to work on R-related projects in an intense 2-day hackathon. Not only did everyone have a lot of fun, make new connections and learn from others, but the event also advanced the ROpenSci mission of creating packages, tools and resources to support scientific endeavours using R.

During the two-day workshop, the attendees self-organized into teams of 4-8 to work on projects. There were 20 projects started at the ROpenSci unconference, and all of them produced a working package. You can find details on all the projects on GitHub, but here are a few examples to give you a sense of what the teams accomplished:

@LucyStats @rdpeng Including an R-generated maze with an R-generated bot that can find his way out! @Inchio @revodavid @kwbroman @daroczig @alikzaidi

— Brooke Anderson (@gbwanderson) May 30, 2017

I was very fortunate to be part of that last team: we had a blast connecting R to Minecraft, like creating a procedurally-generated maze and an AI bot to navigate it. (I plan to write a full blog post about the project in due course.) But for more on #runconf17, take a look at Karthik Ram's storify which will give you a good sense of the conference as it unfolded in tweets. I also highly recommend checking out these post-conference wrap-ups by Jasmine Dumas, Karl Broman and Bob Rudis.

Thanks to the organizers for such a fantastic event, and also to the sponsors (including my own employer, Microsoft) for making it all possible. I'm looking forward to next year already! 

ROpenSci: Unconference 2017

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Hospital Infection Scores – R Shiny App

Tue, 05/30/2017 - 20:43

(This article was first published on R-Projects – Stoltzmaniac, and kindly contributed to R-bloggers)

Medicare Data – R Shiny App

About two weeks ago I created an extremely rough version of an R Shiny Application surrounding Medicare data. Right after publishing the blog post, I received a lot of input for improvement and help from others.

Here’s a look at the latest version of the Medicare Shiny App. It utilizes the data found here.

I was traveling for two weeks and had very little time to do any work on it. After creating a GitHub Repository for it, the user Ginberg played a huge role in cleaning it up and adding a lot more functionality. I found it incredible that a complete stranger to me would put in such effort to something like this. In fact, he isn’t even a resident of the USA – so Medicare probably isn’t on his radar as often as it is for some of us. Fantastic generosity!

Ultimately, I will be looking to keep this project alive and grow it to make fuller use of the Medicare data available. The infections data set was simple and easy to use, so I started with it, but there are many more tables available. The purpose of this application is to let people who don’t want to spend time digging through tables make use of the information. This isn’t necessarily just for people seeking care; it could also be useful to others doing research on hospitals in the US.

The R Shiny App allows you to filter by location and infection information. These are important in helping to actually find information on what you care about.

Three key tabs were created by (@Ginberg):

  • Sorting hospitals by infection score
  • Maps of hospitals in the area
  • Data table of hospital data

Sorting hospital data by score:
  • This is a tricky plot because “score” is different for each type of metric
  • Higher “scores” aren’t necessarily bad because they can be swayed by more heavily populated areas (or density)
  • Notice the use of plotly and its interactivity

Map of the data:
  • Location of the hospitals is incredibly important if you happen to live in a certain area
  • The radius of the circle allows for easy identification of places with larger “scores”
  • Hovering and clicking the circle will display more information

Displaying the data in table format:
  • This allows for visibility into the raw data that is populating the other tabs
  • It provides the ability to download a CSV or PDF file containing the relevant information
  • More information may be added such as the address and telephone number of the hospital – depending on the direction of the app

Again, I want to give a big thanks to Ginberg for essentially creating 95% of what makes this app great!

Here’s the link to the Medicare Shiny App.

You can find the code on my GitHub.

To leave a comment for the author, please follow the link and comment on their blog: R-Projects – Stoltzmaniac.

Summarizing big data in R

Tue, 05/30/2017 - 18:12

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Our next "R and big data tip" is: summarizing big data.

We always say "if you are not looking at the data, you are not doing science", and for big data you are very dependent on summaries (as you can’t actually look at everything).

Simple question: is there an easy way to summarize big data in R?

The answer is: yes, but we suggest you use the replyr package to do so.

Let’s set up a trivial example.

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.5.0'

library("sparklyr")
packageVersion("sparklyr")
## [1] '0.5.5'

library("replyr")
packageVersion("replyr")
## [1] '0.3.902'

sc <- sparklyr::spark_connect(version = '2.0.2', master = "local")
diris <- copy_to(sc, iris, 'diris')

The usual S3 summary() summarizes the handle, not the data.

summary(diris)
##     Length Class          Mode
## src 1      src_spark      list
## ops 3      op_base_remote list

tibble::glimpse() throws an error.

packageVersion("tibble")
## [1] '1.3.3'

# errors-out
glimpse(diris)
## Observations: 150
## Variables: 5
## Error in if (width[i] <= max_width[i]) next: missing value where TRUE/FALSE needed

broom::glance() throws an error.

packageVersion("broom")
## [1] '0.4.2'

broom::glance(diris)
## Error: glance doesn't know how to deal with data of class tbl_sparktbl_sqltbl_lazytbl

replyr_summary() works, and returns results in a data.frame.

replyr_summary(diris) %>%
  select(-nunique, -index, -nrows)
##         column     class nna min max     mean        sd lexmin    lexmax
## 1 Sepal_Length   numeric   0 4.3 7.9 5.843333 0.8280661
## 2  Sepal_Width   numeric   0 2.0 4.4 3.057333 0.4358663
## 3 Petal_Length   numeric   0 1.0 6.9 3.758000 1.7652982
## 4  Petal_Width   numeric   0 0.1 2.5 1.199333 0.7622377
## 5      Species character   0  NA  NA       NA        NA setosa virginica

sparklyr::spark_disconnect(sc)
rm(list = ls())
gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  762515 40.8    1442291 77.1  1168576 62.5
## Vcells 1394407 10.7    2552219 19.5  1820135 13.9

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

New package polypoly (helper functions for orthogonal polynomials)

Tue, 05/30/2017 - 07:00

(This article was first published on Higher Order Functions, and kindly contributed to R-bloggers)

Last week, I released a new package called polypoly to CRAN. It wraps up
some common tasks for dealing with orthogonal polynomials into a single package.
The README shows off the main
functionality, as well as the neat “logo” I made for the package.
In this post, I use the package on some word recognition data.

Demo: Growth curve analysis

I primarily use orthogonal polynomials to model data from eyetracking
experiments where growth curves describe how the probability of looking at an
image changes as the image is named. The analysis technique, including
orthogonal polynomials and mixed effects models of eyetracking data, is
described in Mirman’s 2014 book.

In our 2015 paper, toddlers saw
two images on a computer screen. The objects in the images started with
different consonants: for example, duck and ball. The toddlers heard
sentences like “find the ball”, and we measured how their gaze location onscreen
changed in response to speech. This setup is a pretty standard procedure for
studying spoken word recognition.

We manipulated the vowel in the word the. In the facilitating condition, the
vowel has acoustic information (via anticipatory coarticulation) which would
allow an adult listener to predict the upcoming consonant. In the neutral
condition, the vowel provides no cues about the upcoming consonant. The
scientific question is whether these kiddos can take advantage of these acoustic
cues during word recognition.

Here’s how the data look, both in R and in a plot.

library(ggplot2)
library(dplyr)

# The data
d
#> # A tibble: 986 x 6
#>     Subj    Condition  Time ToDistractor ToTarget Proportion
#>
#>  1     1 facilitating   200            9        9  0.5000000
#>  2     1 facilitating   250            9       10  0.5263158
#>  3     1 facilitating   300            6       12  0.6666667
#>  4     1 facilitating   350            6       12  0.6666667
#>  5     1 facilitating   400            6       12  0.6666667
#>  6     1 facilitating   450            6       12  0.6666667
#>  7     1 facilitating   500            6       12  0.6666667
#>  8     1 facilitating   550            6       12  0.6666667
#>  9     1 facilitating   600            4       12  0.7500000
#> 10     1 facilitating   650            3       15  0.8333333
#> # ... with 976 more rows

# Helper dataframe of where to put condition labels on the next plot
df_labs <- data_frame(
  Time = c(650, 800),
  Proportion = c(.775, .625),
  Condition = c("facilitating", "neutral"))

p <- ggplot(d) +
  aes(x = Time, y = Proportion, color = Condition) +
  geom_hline(yintercept = .5, size = 2, color = "white") +
  stat_summary(fun.data = mean_se) +
  geom_text(aes(label = Condition), data = df_labs, size = 6) +
  labs(x = "Time after noun onset [ms]",
       y = "Proportion looks to named image",
       caption = "Mean ± SE. N = 29 children.") +
  guides(color = "none")

p

Early on, children look equal amounts to both images on average (.5), and the
proportion of looks to the named image increase as the word unfolds. In the
facilitating condition, that rise happens earlier.

We fit a mixed-effects logistic regression model to estimate how the probability
of looking to the named image changes over time, across conditions, and within
children. We use cubic orthogonal polynomials to represent Time. For each time
point, we have three predictors available to us: Time^1, Time^2, and Time^3.
(Plus, there’s a constant “intercept” term.) Our model’s growth curve will be a
weighted combination of these polynomial curves. The code below shows off about
half the functionality of the package:

poly(unique(d$Time), 3) %>%
  # Force Time^1 term to range from -.5 to .5. Rescale others accordingly.
  polypoly::poly_rescale(scale_width = 1) %>%
  polypoly::poly_plot()

I think people sometimes describe the contributions of these curves to the
overall growth curve as trends: “A negative linear trend”, “a significant
quadratic trend”, etc. I like that word because it makes the terminology a
little less intimidating.

Quick aside: Why orthogonal polynomials?

Why do we use orthogonal polynomial terms? First, note that the simple
polynomials x, x^2, and x^3 are correlated. Orthogonal ones are not
correlated. (Hence, the name.)

# Simple
poly(1:10, 3, raw = TRUE) %>%
  cor() %>%
  round(2)
#>      1    2    3
#> 1 1.00 0.97 0.93
#> 2 0.97 1.00 0.99
#> 3 0.93 0.99 1.00

# Orthogonal
poly(1:10, 3, raw = FALSE) %>%
  cor() %>%
  round(2)
#>   1 2 3
#> 1 1 0 0
#> 2 0 1 0
#> 3 0 0 1

Adding new correlated predictors to a model is a problem. The parameter
estimates will change as different predictors are added. Here we simulate some
fake data, and fit three models with 1-, 2- and 3-degree raw polynomials.

x <- 1:10
y <- x +
  rnorm(1, mean = 100) * (x) +
  rnorm(1, mean = 0, sd = .01) * (x) ^ 2 +
  rnorm(1, mean = -1) * (x) ^ 3 +
  rnorm(10)

models <- list(
  m1 = lm(y ~ x),
  m2 = lm(y ~ x + I(x^2)),
  m3 = lm(y ~ x + I(x^2) + I(x^3))
)

As expected, the estimates for the effects change from model to model:

models %>%
  lapply(broom::tidy) %>%
  bind_rows(.id = "model") %>%
  select(model:estimate) %>%
  mutate(estimate = round(estimate, 2))
#>   model        term estimate
#> 1    m1 (Intercept)    75.69
#> 2    m1           x    72.91
#> 3    m2 (Intercept)   -23.91
#> 4    m2           x   122.72
#> 5    m2      I(x^2)    -4.53
#> 6    m3 (Intercept)     1.15
#> 7    m3           x   100.48
#> 8    m3      I(x^2)     0.29
#> 9    m3      I(x^3)    -0.29

But with orthogonal polynomials, the parameter estimates don’t change from model
to model.

models2 <- list(
  m1 = lm(y ~ poly(x, 1)),
  m2 = lm(y ~ poly(x, 2)),
  m3 = lm(y ~ poly(x, 3))
)

models2 %>%
  lapply(broom::tidy) %>%
  bind_rows(.id = "model") %>%
  select(model:estimate) %>%
  mutate(estimate = round(estimate, 2))
#>   model        term estimate
#> 1    m1 (Intercept)   476.72
#> 2    m1  poly(x, 1)   662.27
#> 3    m2 (Intercept)   476.72
#> 4    m2 poly(x, 2)1   662.27
#> 5    m2 poly(x, 2)2  -104.03
#> 6    m3 (Intercept)   476.72
#> 7    m3 poly(x, 3)1   662.27
#> 8    m3 poly(x, 3)2  -104.03
#> 9    m3 poly(x, 3)3   -16.24

That’s probably the simplest reason why orthogonal polynomials are preferred. (I
can’t remember any others right now.)

Back to the data

Before fitting the model, I use poly_add_columns() to add polynomial terms as
columns to the dataframe. (For speed here, I use a simplified random effects
structure, estimating growth curve parameters for each Child x Condition
combination.)
library(lme4)
#> Loading required package: Matrix
#> Loading required package: methods

d <- d %>%
  polypoly::poly_add_columns(Time, degree = 3, prefix = "ot", scale_width = 1) %>%
  # Change the reference level
  mutate(Condition = factor(Condition, c("neutral", "facilitating")))

m <- glmer(
  cbind(ToTarget, ToDistractor) ~ (ot1 + ot2 + ot3) * Condition +
    (ot1 + ot2 + ot3 | Subj:Condition),
  family = binomial,
  data = d)

We can confirm that the model captures the overall shape of the growth curves.

# The lines here are not quite the overall average, but the averages of 29
# individual fits (for each participant). That's why the caption is a little
# weird.
p +
  stat_summary(aes(y = fitted(m)), fun.y = mean, geom = "line") +
  labs(caption = "Line: Average of model-fitted values. Points: Mean ± SE.")

We can inspect the model summary as well.

arm::display(m)
#> glmer(formula = cbind(ToTarget, ToDistractor) ~ (ot1 + ot2 +
#>     ot3) * Condition + (ot1 + ot2 + ot3 | Subj:Condition), data = d,
#>     family = binomial)
#>                           coef.est coef.se
#> (Intercept)                0.47     0.10
#> ot1                        1.57     0.28
#> ot2                        0.45     0.11
#> ot3                       -0.34     0.09
#> Conditionfacilitating      0.23     0.14
#> ot1:Conditionfacilitating  0.45     0.39
#> ot2:Conditionfacilitating -0.44     0.16
#> ot3:Conditionfacilitating  0.11     0.13
#>
#> Error terms:
#>  Groups         Name        Std.Dev. Corr
#>  Subj:Condition (Intercept) 0.53
#>                 ot1         1.46     0.23
#>                 ot2         0.52    -0.05  0.31
#>                 ot3         0.39    -0.08 -0.64 0.09
#>  Residual                   1.00
#> ---
#> number of obs: 986, groups: Subj:Condition, 58
#> AIC = 4788.2, DIC = -3961.1
#> deviance = 395.6

The model summary indicates a significant Condition x Time^2
interaction, but really, only the intercept and Time^1 can ever be
interpreted directly. To understand the model fit, we visualize how each of the
polynomial terms is weighted.

Here we create a matrix of the polynomial terms plus a column of ones for the
constant (intercept) term:
time_mat <- poly(sort(unique(d$Time)), 3) %>%
  polypoly::poly_rescale(1) %>%
  cbind(constant = 1, .)

round(time_mat, 2)
#>       constant     1     2     3
#>  [1,]        1 -0.50  0.57 -0.57
#>  [2,]        1 -0.44  0.36 -0.14
#>  [3,]        1 -0.37  0.17  0.14
#>  [4,]        1 -0.31  0.01  0.30
#>  [5,]        1 -0.25 -0.11  0.36
#>  [6,]        1 -0.19 -0.22  0.34
#>  [7,]        1 -0.12 -0.29  0.26
#>  [8,]        1 -0.06 -0.33  0.14
#>  [9,]        1  0.00 -0.34  0.00
#> [10,]        1  0.06 -0.33 -0.14
#> [11,]        1  0.12 -0.29 -0.26
#> [12,]        1  0.19 -0.22 -0.34
#> [13,]        1  0.25 -0.11 -0.36
#> [14,]        1  0.31  0.01 -0.30
#> [15,]        1  0.37  0.17 -0.14
#> [16,]        1  0.44  0.36  0.14
#> [17,]        1  0.50  0.57  0.57

To compute the weighted values, we multiply by a diagonal matrix of the
model coefficients:
neut_coefs <- fixef(m)[1:4]
faci_coefs <- neut_coefs + fixef(m)[5:8]
faci_coefs
#> (Intercept)         ot1         ot2         ot3
#> 0.699926322 2.014186150 0.006646146 -0.226658408

set_colnames <- `colnames<-`

m_neut <- time_mat %*% diag(neut_coefs) %>%
  set_colnames(c("constant", "ot1", "ot2", "ot3"))
m_faci <- time_mat %*% diag(faci_coefs) %>%
  set_colnames(c("constant", "ot1", "ot2", "ot3"))

# Convince ourselves with an example
round(m_faci, 2)
#>       constant   ot1 ot2   ot3
#>  [1,]      0.7 -1.01   0  0.13
#>  [2,]      0.7 -0.88   0  0.03
#>  [3,]      0.7 -0.76   0 -0.03
#>  [4,]      0.7 -0.63   0 -0.07
#>  [5,]      0.7 -0.50   0 -0.08
#>  [6,]      0.7 -0.38   0 -0.08
#>  [7,]      0.7 -0.25   0 -0.06
#>  [8,]      0.7 -0.13   0 -0.03
#>  [9,]      0.7  0.00   0  0.00
#> [10,]      0.7  0.13   0  0.03
#> [11,]      0.7  0.25   0  0.06
#> [12,]      0.7  0.38   0  0.08
#> [13,]      0.7  0.50   0  0.08
#> [14,]      0.7  0.63   0  0.07
#> [15,]      0.7  0.76   0  0.03
#> [16,]      0.7  0.88   0 -0.03
#> [17,]      0.7  1.01   0 -0.13

Then, we can use the poly_melt() function to get a dataframe from each
weighted matrix and then plot each of the effects.

df_neut <- m_neut %>%
  polypoly::poly_melt() %>%
  tibble::add_column(Condition = "neutral")

df_faci <- m_faci %>%
  polypoly::poly_melt() %>%
  tibble::add_column(Condition = "facilitating")

df_both <- bind_rows(df_faci, df_neut) %>%
  mutate(Condition = factor(Condition, c("neutral", "facilitating")))

ggplot(df_both) +
  aes(x = observation, y = value, color = Condition) +
  geom_line() +
  facet_wrap("degree")

Visually, the quadratic effect on the neutral curve pulls down the values during
the center (when the curves are most different) and pushes the values in the
tails upwards (when the curves are closest). Although only the quadratic effect
is nominally significant, the constant and linear terms suggest other smaller
effects but they are too noisy to pin down.

It’s worth noting that the predictors and weights discussed above are on the
log-odds/logit scale used inside the model, rather than the proportion scale
used in the plots of the data and model fits. Basically, these weighted values
are summed together and then squeezed into the range [0, 1] with a nonlinear
transformation. For these data, the two scales produce similar-looking growth
curves, but you can notice that the right ends of the curves are pinched
slightly closer together in the probability-scale plot:

ggplot(df_both) +
  aes(x = observation, y = value, color = Condition) +
  stat_summary(fun.y = sum, geom = "line") +
  ggtitle("logit scale") +
  guides(color = "none")

ggplot(df_both) +
  aes(x = observation, y = value, color = Condition) +
  stat_summary(fun.y = function(xs) plogis(sum(xs)), geom = "line") +
  ggtitle("probability scale") +
  guides(color = "none")

To leave a comment for the author, please follow the link and comment on their blog: Higher Order Functions.

Mixed models for ANOVA designs with one observation per unit of observation and cell of the design

Mon, 05/29/2017 - 22:34

(This article was first published on Computational Psychology - Henrik Singmann, and kindly contributed to R-bloggers)

Together with David Kellen I am currently working on an introductory chapter to mixed models for a book edited by Dan Spieler and Eric Schumacher (the current version can be found here). The goal is to provide a theoretical and practical introduction that is targeted mainly at experimental psychologists, neuroscientists, and others working with experimental designs and human data. The practical part focuses obviously on R, specifically on lme4 and afex.

One part of the chapter was supposed to deal with designs that cannot be estimated with the maximal random effects structure justified by the design because there is only one observation per participant and cell of the design. Such designs are the classical repeated-measures ANOVA designs, as ANOVA cannot deal with replicates at the cell level (i.e., those are usually aggregated to yield one observation per cell and unit of observation). Based on my previous thoughts, which turned out to be wrong, we wrote the following:

Random Effects Structures for Traditional ANOVA Designs

The estimation of the maximal model is not possible when there is only one observation per participant and cell of a repeated-measures design. These designs are typically analyzed using a repeated-measures ANOVA. Currently, there are no clear guidelines on how to proceed in such situations, but we will try to provide some advice. If there is only a single random effects grouping factor, for example participants, we feel that instead of a mixed model, it is appropriate to use a standard repeated-measures ANOVA that addresses sphericity violations via the Greenhouse-Geisser correction.
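For readers who want to see what the Greenhouse-Geisser correction is doing, here is a minimal base R sketch of the epsilon estimate computed from the covariance matrix of the k repeated measures. The function name gg_epsilon() is my own; in practice packages such as afex compute and apply this correction for you.

```r
# Sketch of the Greenhouse-Geisser epsilon: with S the k x k covariance
# matrix of the repeated measures and C an orthonormal contrast matrix,
# epsilon = tr(D)^2 / ((k - 1) * tr(D %*% D)), where D = C %*% S %*% t(C).
gg_epsilon <- function(S) {
  k <- ncol(S)
  M <- contr.helmert(k)                       # k x (k - 1) contrast matrix
  C <- t(M %*% diag(1 / sqrt(colSums(M^2))))  # orthonormalize the columns
  D <- C %*% S %*% t(C)
  sum(diag(D))^2 / ((k - 1) * sum(D * D))     # tr(D %*% D) = sum(D * D) here
}

# Under compound symmetry sphericity holds and epsilon is 1 (no correction);
# departures push it toward its lower bound of 1 / (k - 1).
S_cs <- matrix(0.5, 3, 3)
diag(S_cs) <- 1
gg_epsilon(S_cs)
```

Epsilon then multiplies the numerator and denominator degrees of freedom of the F test, shrinking them when sphericity is violated.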

One alternative strategy that employs mixed models and that we \emph{do not recommend} consists of using the random-intercept-only model or removing the random slopes for the highest within-subject interaction. The resulting model assumes invariance of the omitted random effects across participants. If this assumption is violated, such a model produces results that cannot be trusted. […]

Fortunately, we asked Jake Westfall to take a look at the chapter and Jake responded:

I don’t think I agree with this. In the situation you describe, where we have a single random factor in a balanced ANOVA-like design with 1 observation per unit per cell, personally I am a proponent of the omit-the-highest-level-random-interaction approach. In this kind of design, the random slopes for the highest-level interaction are perfectly confounded with the trial-level error term (in more technical language, the model is only identifiable up to the sum of these two variance components), which is what causes the identifiability problems when one tries to estimate the full maximal model there. (You know all of this of course.) So two equivalent ways to make the model identifiable are to (1) omit the error term, i.e., force the residual variance to be 0, or (2) omit the random slopes for the highest-level interaction. Both of these approaches should (AFAIK) result in a statistically equivalent model, but lme4 does not provide an easy way to do (1), so I generally recommend (2). The important point here is that the standard errors should still be correct in either case — because these two variance components are confounded, omitting e.g. the random interaction slopes simply causes that omitted variance component to be implicitly added to the residual variance, where it is still incorporated into the standard errors of the fixed effects in the appropriate way (because the standard error of the fixed interaction looks roughly like sqrt[(var_error + var_interaction)/n_subjects]). I think one could pretty easily put together a little simulation that would demonstrate this.
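Jake's point about the perfect confounding can be illustrated with a quick base-R sketch (my own illustration, not part of the original chapter or of the simulation below): with one observation per participant and cell, an observation generated with interaction variance 1.0 plus residual variance 0.5 is distributionally indistinguishable from one generated with the components swapped, because only their sum enters the data.

```r
## Sketch: with one observation per participant and cell, only the SUM of the
## random-interaction variance and the residual variance is identifiable.
## Two decompositions with the same sum imply the same data distribution.
set.seed(1)
n <- 1e5  # number of simulated observations (one per participant-cell)

## decomposition 1: interaction variance 1.0, residual variance 0.5
dv1 <- rnorm(n, sd = sqrt(1.0)) + rnorm(n, sd = sqrt(0.5))
## decomposition 2: interaction variance 0.5, residual variance 1.0
dv2 <- rnorm(n, sd = sqrt(0.5)) + rnorm(n, sd = sqrt(1.0))

## both yield (up to simulation error) the same total variance of 1.5,
## so no model can tell the two decompositions apart from such data
c(var(dv1), var(dv2))
```

Because only the sum is identifiable, forcing either component to zero, as in Jake's approach (2), simply reallocates its variance to the other component without changing the fitted model.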

Hmm, that sounds very reasonable, but can my intuition about random effects structures and mixed models really be that wrong? To investigate this I followed Jake’s advice and coded a short simulation to test it. As it turns out, Jake is right and I was wrong.

In the simulation we will simulate a simple repeated-measures design with one factor with three levels. Importantly, each unit of observation will only have one observation per factor level. We will then fit the simulated data with both a repeated-measures ANOVA and a random-intercept-only mixed model and compare their p-values. Note again that for such a design we cannot estimate random slopes for the condition effect.

First, we need a few packages and set some parameters for our simulation:

require(afex)
set_sum_contrasts() # for orthogonal sum-to-zero contrasts
require(MASS)
NSIM <- 1e4   # number of simulated data sets
NPAR <- 30    # number of participants per cell
NCELLS <- 3   # number of cells (i.e., groups)

Now we need to generate the data. For this I employed an approach that is clearly not the most parsimonious, but one that most clearly follows the formulation of a mixed model with random variability in the condition effect plus residual variance on top (i.e., the two confounded variance components).

We first create a bare-bones data.frame with a participant id and a condition column and a corresponding model.matrix. Then we create the three random parameters (i.e., the intercept and the two parameters for the three conditions) using a zero-centered multivariate normal with the specified variance-covariance matrix. We then loop over the participants and compute the predictions derived from the three random-effects parameters. Only after this do we add uncorrelated residual variance to the observations for each simulated data set.

dat <- expand.grid(condition = factor(letters[seq_len(NCELLS)]),
                   id = factor(seq_len(NPAR)))
head(dat)
#   condition id
# 1         a  1
# 2         b  1
# 3         c  1
# 4         a  2
# 5         b  2
# 6         c  2
mm <- model.matrix(~condition, dat)
head(mm)
#   (Intercept) condition1 condition2
# 1           1          1          0
# 2           1          0          1
# 3           1         -1         -1
# 4           1          1          0
# 5           1          0          1
# 6           1         -1         -1
Sigma_c_1 <- matrix(0.6, NCELLS, NCELLS)
diag(Sigma_c_1) <- 1
d_c_1 <- replicate(NSIM, mvrnorm(NPAR, rep(0, NCELLS), Sigma_c_1),
                   simplify = FALSE)
gen_dat <- vector("list", NSIM)
for(i in seq_len(NSIM)) {
  gen_dat[[i]] <- dat
  gen_dat[[i]]$dv <- NA_real_
  for (j in seq_len(NPAR)) {
    gen_dat[[i]][(j-1)*3 + (1:3), "dv"] <- mm[1:3, ] %*% d_c_1[[i]][j, ]
  }
  gen_dat[[i]]$dv <- gen_dat[[i]]$dv + rnorm(nrow(mm), 0, 1)
}

Now we only need functions that estimate the ANOVA and the mixed model for each data set and return the p-values, and then loop over the simulated data sets.

## functions returning p-values for ANOVA and mixed model
within_anova <- function(data) {
  suppressWarnings(suppressMessages(
    a <- aov_ez(id = "id", dv = "dv", data, within = "condition",
                return = "univariate", anova_table = list(es = "none"))
  ))
  c(without = a[["univariate.tests"]][2, 6],
    gg = a[["pval.adjustments"]][1, 2],
    hf = a[["pval.adjustments"]][1, 4])
}

within_mixed <- function(data) {
  suppressWarnings(
    m <- mixed(dv ~ condition + (1|id), data, progress = FALSE)
  )
  c(mixed = anova(m)$`Pr(>F)`)
}

p_c1_within <- vapply(gen_dat, within_anova, rep(0.0, 3))
m_c1_within <- vapply(gen_dat, within_mixed, 0.0)

The following graph shows the results (GG denotes the results using the Greenhouse-Geisser adjustment for sphericity violations).

ylim <- c(0, 700)
par(mfrow = c(1, 3))
hist(p_c1_within[1,], breaks = 20, main = "ANOVA (default)",
     xlab = "p-value", ylim = ylim)
hist(p_c1_within[2,], breaks = 20, main = "ANOVA (GG)",
     xlab = "p-value", ylim = ylim)
hist(m_c1_within, breaks = 20, main = "Random-Intercept Model",
     xlab = "p-value", ylim = ylim)

What these graphs clearly show is that the p-value distributions for the standard repeated-measures ANOVA and the random-intercept mixed model are virtually identical. This confirms that my intuition was wrong and Jake was right.

We also see that for both the ANOVA and the mixed model the rate of significant findings with p < .05 is slightly above the nominal level. More specifically:

mean(p_c1_within[1,] < 0.05)  # ANOVA default
# [1] 0.0684
mean(p_c1_within[2,] < 0.05)  # ANOVA GG
# [1] 0.0529
mean(p_c1_within[3,] < 0.05)  # ANOVA HF
# [1] 0.0549
mean(m_c1_within < 0.05)      # random-intercept mixed model
# [1] 0.0701
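How far above nominal are these rates? With 10,000 simulated data sets the Monte Carlo standard error of a rejection rate under the nominal .05 level is small, so a quick back-of-the-envelope check (not part of the original script) shows that the deviation for the default ANOVA, a rate of .0684, cannot be simulation noise:

```r
## Monte Carlo standard error of a rejection rate under the nominal .05 level,
## given 10,000 independent simulated data sets
nsim <- 1e4
mc_se <- sqrt(0.05 * 0.95 / nsim)  # roughly 0.0022
mc_se

## the observed rate of .0684 (684 of 10,000 rejections) lies many standard
## errors above .05; an exact binomial test makes the same point
binom.test(684, nsim, p = 0.05)$p.value
```

The same reasoning applies to the mixed model's rate of .0701, while the Greenhouse-Geisser and Huynh-Feldt rates are within a few Monte Carlo standard errors of the nominal level.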

These additional results indicate that one may also need to adjust the degrees of freedom of mixed models for violations of sphericity. But this is not the topic of today’s post.

To sum this up, the simulation shows that removing the highest-order random slope is the right decision if one wants to use a mixed model for a design with only one observation per participant and cell while retaining the rest of the ‘maximal random effects structure’.

One more thing to note: Ben Bolker raised the same issue and pointed us to one of his example analyses of the starling data, which is relevant to the current question. We are very grateful that Jake and Ben took the time to go through our chapter!

You can also download the RMarkdown file of the simulation.



To leave a comment for the author, please follow the link and comment on their blog: Computational Psychology - Henrik Singmann.