
Simulations to explore excessive lagged X variables in time series modelling

Sat, 03/11/2017 - 12:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

I was once in a meeting discussing a time series modelling and forecasting challenge where it was suggested that “the beauty of regression is you just add in more variables and more lags of variables and try the combinations until you get something that fits really well”. Well, no, it doesn’t work like that; at least not in many cases with realistically small social science or public policy datasets. Today’s post looks at one particular trap – a modelling strategy that tries more lags of explanatory variables than are justified given the amount of data available.

The data

Imagine you have a time series y and two candidate explanatory variables x1 and x2. We have values of x1 and x2 for the future – perhaps because they are genuinely known in advance, perhaps generated by independent forecasts. Let’s say there is good theoretical or subject matter reason for thinking there is a relationship between them, but it’s clearly not a simple one. It’s possible the relationship is lagged – perhaps x1 and x2 are leading economic indicators and y is the result of decisions people take based on the history of the xs. The job is to forecast y conditional on X, and if possible generate some insight about the structural relationship of the xs and y, and be prepared for some scenarios where x1 or x2 might change and we want to see the impact on y.

Here’s an example of something we might see:

Those series were generated by me. We have 110 historical observations, and a forecast period of 12 for which there are observations for x1 and x2 but not y. Each series is in reality completely independent of the others, something I’ve done deliberately to illustrate the point I’m about to make. For those with an interest, they are all ARIMA(2,1,1) models with the same parameterisation but different random sequences behind them.
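For those who want to reproduce them, here is a minimal sketch of the data-generation step, using the same ARIMA parameters that appear in the full code at the end of the post (the `sim_arima211` wrapper is mine):

```r
# Each series is the cumulative sum of a stationary ARMA(2,1) simulation,
# i.e. an ARIMA(2,1,1) process; only the random sequences differ
set.seed(123)
sim_arima211 <- function(n, drift = 0) {
  cumsum(arima.sim(model = list(ar = c(0.5, -0.2), ma = 0.9), n = n) + drift)
}
n <- 110  # historical observations
h <- 12   # forecast period
y  <- ts(sim_arima211(n, drift = 0.2))
x1 <- sim_arima211(n + h, drift = -0.2)
x2 <- sim_arima211(n + h)
```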

Basic forecasting models Univariate

A simple strategy for forecasting y would be to find the best ARIMA model that fits the data using Rob Hyndman’s auto.arima algorithm, which has been shown to perform well. We can do this with or without considering the x variables at all. If we ignore the x variables, we get a forecast that looks like this:

Series: y 
ARIMA(0,1,2) 

Coefficients:
         ma1     ma2
      1.4250  0.4415
s.e.  0.0865  0.0867

sigma^2 estimated as 0.7736:  log likelihood=-141.77
AIC=289.54   AICc=289.77   BIC=297.62

In fact, because we secretly know how the data was generated, we know that this is a good model of this particular data. The auto.arima algorithm picked up that the data was non-stationary and needed to be differenced once before modelling; but it got the “wrong” specification of the number of auto-regressive and moving average terms to include (this isn’t unusual, nor is it necessarily a problem). By default it uses a stepwise method to identify the best combination of up to five auto-regressive parameters and five moving average parameters.

Linear regression with time series error terms

A second, nearly as simple method of forecasting is to fit a linear regression with an ARIMA time series error term. The auto.arima method will do this too, looking after all the required tests of non-stationarity and the differencing of the response. It gives easily interpretable coefficients for the relationship between y and x1 and x2. Here’s what we see in our case:

Series: y 
Regression with ARIMA(0,1,2) errors 

Coefficients:
         ma1     ma2      x1       x2
      1.4202  0.4349  0.0371  -0.0364
s.e.  0.0870  0.0878  0.0867   0.0796

sigma^2 estimated as 0.785:  log likelihood=-141.59
AIC=293.18   AICc=293.77   BIC=306.64

Basically, the model has correctly identified that there is a weak or non-existent relationship between the xs and y. The momentum of y itself is the main predictor of future y, and this shows in the forecasts, which are only marginally different from our first univariate case. On the Akaike Information Criterion, the model with x regressors is inferior to the simpler model.

More complex models

A problem with the above approaches is that they don’t take into account the (wrongly) suspected lagged relationship between the xs and y, other than through the current, simultaneous-in-time relationship. There’s an obvious solution – introduce lagged terms of x1 and x2 into the regression. There are a variety of ways of doing this, and specialist R packages like dynlm, dyn and dse, that I’m not going to introduce because I want things from here on to be simple and explicit. See the CRAN time series task view for a starting point to explore further.

I want to focus on how problematic this can be. How many parameters in our model are we prepared to estimate, given our 110 historical data points? Frank Harrell’s advice in Regression Modeling Strategies is:

“Studies in which models are validated on independent datasets have shown that in many situations a fitted regression model is likely to be reliable when the number of predictors (or candidate predictors if using variable selection) p is less than m/10 or m/20, where m is the limiting sample size”

Frank Harrell

Our “limiting sample size” is 110, and our 110 time series observations are actually worth less than 110 independent ones would be because each observation contains only marginal new information for us; knowing the value at time t gives us a good indication of where it will be at time t+1, something not the case in more independent sampling situations.
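Plugging our numbers into that rule of thumb is a one-liner:

```r
# Harrell's m/20 to m/10 rule of thumb applied to this dataset
m <- 110           # limiting sample size: our historical observations
c(m / 20, m / 10)  # at most roughly 5 to 11 candidate predictors
```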

This suggests we should be considering five or six candidate variables, or eleven at the very most. Just by using auto.arima we’ve already looked at ten parameters (five autoregressive and five moving average), and by adding in x1 and x2 we’ve definitely reached our maximum. Ouch. As we’re usually not interested in individual AR and MA parameters we can perhaps get away with stretching a point, but there’s no get-out-of-jail-free card here that lets us meaningfully keep adding explanatory variables.

So what happens if we proceed anyway? To check this out, I tried three strategies:

  1. Create variables for up to 20 lag periods for each of the three original variables, leaving us with 62 candidate predictors, and fit a model by ordinary least squares.
  2. Take the full model from above and use stepwise selection to knock out individual variables with p-values above 0.05 (the conventional cut-off point for a variable being “not significant”). I had to torture R’s step function to do this, because quite rightly it prefers to choose models based on AIC rather than individual p-values; I wanted to mimic the p-value method for demonstration purposes.
  3. Take the full model and use lasso-style shrinkage/regularisation on the 62 predictors’ coefficients.
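As an aside on step 1: the full code at the end of the post builds the lag matrix with an explicit loop, but base R’s embed() expresses the same idea more compactly. A small sketch:

```r
# embed() turns a series into a matrix whose columns are the series and its lags
z <- 1:10
lag_mat <- embed(z, 4)  # columns: z_t, z_{t-1}, z_{t-2}, z_{t-3}
colnames(lag_mat) <- paste0("z_lag_", 0:3)
lag_mat[1, ]            # first usable row: 4, 3, 2, 1
```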

Of those methods, I think #1 and #2 are both potentially disastrous and will lead to overfitting (meaning the model will perform badly when confronted with new data, i.e. in predicting our twelve future points). Stepwise variable selection is particularly pernicious because it can create the illusion that “significant” relationships between the variables have been found. At the end of a stepwise procedure, software can provide you with F tests, t tests and p-values, but they are invalid because you have chosen the variables remaining in the model precisely because they score well on those tests. In today’s case, here is what we end up with.

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.23003    0.10375   2.217 0.029642 *  
y_d_lag_1     1.15802    0.09837  11.772  < 2e-16 ***
x1_d_lag_1    0.27902    0.07908   3.528 0.000718 ***
y_d_lag_2    -1.04546    0.14885  -7.023 8.39e-10 ***
y_d_lag_3     0.78761    0.16493   4.775 8.68e-06 ***
y_d_lag_4    -0.44729    0.14928  -2.996 0.003704 ** 
x1_d_lag_4    0.24485    0.07325   3.342 0.001298 ** 
y_d_lag_5     0.33654    0.10118   3.326 0.001366 ** 
x1_d_lag_6    0.19763    0.06565   3.010 0.003555 ** 
x2_d_lag_8   -0.11246    0.05336  -2.107 0.038422 *  
x2_d_lag_16  -0.17330    0.06347  -2.731 0.007876 ** 
x1_d_lag_17   0.22910    0.07073   3.239 0.001789 ** 
x1_d_lag_18  -0.18724    0.07147  -2.620 0.010643 *  
x2_d_lag_19  -0.11310    0.05536  -2.043 0.044546 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8418 on 75 degrees of freedom
Multiple R-squared:  0.7594,    Adjusted R-squared:  0.7177 
F-statistic: 18.21 on 13 and 75 DF,  p-value: < 2.2e-16

A high R-squared, even a high “Adjusted R-squared” and many variables with p values well below the 0.05 cut-off. So (for example) the unwary analyst would conclude we have strong evidence that y is impacted by x2 16 periods ago and by x1 17 periods ago, but mysteriously not by anything from lags 9 through 15.

Remember that the true data generation process had zero influence from x1 and x2 on y or vice versa.

Working with lags in a time series situation does not give any wild cards that make stepwise selection valid. The usual problems with stepwise selection within a model building strategy apply:

  1. R^2 values are biased high
  2. The usual F and chi-squared test statistics of the final model do not have the claimed distribution
  3. The standard errors of coefficients are biased low and confidence intervals of effects are too narrow
  4. p-values are far too small, do not have the proper meaning, and correction is very difficult
  5. The coefficients of effects that remain in the model are strongly biased away from zero
  6. Variable selection is made arbitrary
  7. The whole process allows us to escape thinking about the problem

From Harrell’s Regression Modeling Strategies.
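These problems are easy to reproduce with pure noise. The sketch below (base R only; the data and variable names are made up for the demonstration) fits 30 irrelevant predictors and prunes them with step(), using k = 4.7 to roughly mimic selection on p < 0.05 as done above:

```r
set.seed(42)
n <- 100
noise <- as.data.frame(matrix(rnorm(n * 30), ncol = 30))
names(noise) <- paste0("x", 1:30)
noise$y <- rnorm(n)                        # the response is unrelated to every x
full    <- lm(y ~ ., data = noise)
slimmed <- step(full, k = 4.7, trace = 0)  # backward selection, p-value-like cutoff
# Any terms that survive look "significant" precisely because they were
# selected for testing well; their p-values cannot be trusted
summary(slimmed)$coefficients
```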

In the case of the full model, with 62 variables, the exact problems above do not apply. The p-values for example will be basically ok (and in fact in this case none of the x1 and x2 lags show up as significant until we start knocking out the variables one at a time). The problem is that with 64 parameters (including the intercept and variance) estimated from only 109 data points (after we have differenced the series to make them stationary), we have extremely high variance and instability in the model.

Shrinkage of the estimated coefficients via the lasso is one way of fixing this, and indeed the lasso is generally used in precisely this situation of excessive candidate explanatory variables when predictive power is important. We trade off some bias in the coefficients for better stability and predictive power. Using the lasso in this case reduces nearly all the x coefficients to zero, with lags 1 and 2 of the y variable staying in the model as they should.
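A minimal sketch of that lasso step, assuming the glmnet package is installed and using freshly simulated data rather than the post’s series:

```r
library(glmnet)

set.seed(123)
n <- 100
X <- matrix(rnorm(n * 60), ncol = 60)   # 60 candidate predictors, nearly all junk
y <- 2 * X[, 1] + rnorm(n)              # only the first predictor matters
cvfit <- cv.glmnet(x = X, y = y)        # cross-validation picks the shrinkage lambda
betas <- coef(cvfit, s = "lambda.min")  # sparse coefficient vector
sum(betas != 0)                         # most coefficients are shrunk exactly to zero
```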

Comparison of results

Here are the actual forecasts from the five methods we’ve tried so far:

What we see is that the two methods I’ve labelled “bad” give very similar results – a sharp decline. As a result of their overfitting they are both picking up structure in the xs that does not really impact on y.

The three “ok” methods give results that are more in tune with the secret data generating process. The most encouraging thing for me here is the success of the lasso in avoiding the pitfalls of the stepwise method. If there really is a complex lagged relationship between the xs and y, this method of looking at all 62 variables while forcing a relatively stable estimation process at least gives the analyst a sporting chance of finding it. The orderedLasso R package – which I haven’t tried using yet – looks to take this idea further in implementing an “Ordered Lasso and Sparse Time-lagged Regression”, and I’m pretty confident it is worth a look.

Final word – this is an illustration rather than a comprehensive demonstration. Obviously, the thing to do to really prove the issues would be to generate many such cases, including “real” future values of y for test purposes, and test the error rates of the forecasts from the different methods. I don’t particularly feel the need to do that, but maybe one day, if I run out of things to do, I’ll give it a go.

Code

library(tidyverse)
library(scales)
library(forecast)
library(glmnet)
library(forcats)
library(directlabels)

#===========generate data============
n <- 110
h <- 12
set.seed(123)
y <- ts(cumsum(arima.sim(model = list(ar = c(0.5, -0.2), ma = 0.9), n = n) + 0.2))
x1 <- cumsum(arima.sim(model = list(ar = c(0.5, -0.2), ma = 0.9), n = n + h) - 0.2)
x2 <- cumsum(arima.sim(model = list(ar = c(0.5, -0.2), ma = 0.9), n = n + h))

orig_data <- data_frame(
   Value = c(y, x1, x2),
   Variable = rep(c("y", "x1", "x2"), times = c(n, n + h, n + h)),
   TimePeriod = c(1:n, 1:(n + h), 1:(n + h)))

orig_data %>%
   ggplot(aes(x = TimePeriod, y = Value)) +
   geom_line() +
   facet_wrap(~Variable) +
   ggtitle("Original data",
           "Three unrelated (but that is unknown to us) univariate time series")

X_full <- cbind(x1, x2)
X_historical <- X_full[1:n, ]
X_future <- X_full[-(1:n), ]

# A version that has been differenced once....
YX_diff_lags <- data_frame(
   y_d = c(diff(y), rep(NA, h)),
   x1_d = diff(x1),
   x2_d = diff(x2))

# And has lagged variables up to lag period of 20 for each variable:
lagnumber <- 20

for (i in 1:lagnumber){
   YX_diff_lags[, ncol(YX_diff_lags) + 1] <- lag(YX_diff_lags$y_d, i)
   YX_diff_lags[, ncol(YX_diff_lags) + 1] <- lag(YX_diff_lags$x1_d, i)
   YX_diff_lags[, ncol(YX_diff_lags) + 1] <- lag(YX_diff_lags$x2_d, i)
}

names(YX_diff_lags)[-(1:3)] <- paste0(
   names(YX_diff_lags)[1:3], "_lag_", rep(1:lagnumber, times = rep(3, lagnumber)))

if (ncol(YX_diff_lags) != 3 + 3 * lagnumber){
   stop("Wrong number of columns; something went wrong")
}

#===================Modelling options========
#-------------univariate-----------------
mod1 <- auto.arima(y)
fc1 <- forecast(mod1, h = h)
autoplot(fc1)

#------------xreg, no lags----------------
mod2 <- auto.arima(y, xreg = X_historical)
fc2 <- forecast(mod2, xreg = X_future)
autoplot(fc2)

#-----------xreg, lags----------------
YX_hist <- YX_diff_lags[complete.cases(YX_diff_lags), ]
mod3 <- lm(y_d ~ ., data = YX_hist)

#' Forecast from a regression model with lags one row at a time
#' Assumes the existence of: YX_diff_lags & y.
#' Has to forecast one row at a time, then populate the explanatory data frame
#' with the lagged values possible after that forecast.
onerow_forecast <- function(model){
   prediction_data <- YX_diff_lags[is.na(YX_diff_lags$y_d), ]
   y_names <- names(YX_diff_lags)[grepl("y_d", names(YX_diff_lags))]
   for(i in 1:nrow(prediction_data)){
      for(j in 2:nrow(prediction_data)){
         for(k in 2:length(y_names)){
            prediction_data[j, y_names[k]] <- prediction_data[j - 1, y_names[k - 1]]
         }
      }
      if(inherits(model, "cv.glmnet")){
         new_y <- predict(model, newx = as.matrix(prediction_data[i, -1]))
      } else {
         new_y <- predict(model, newdata = prediction_data[i, -1])
      }
      prediction_data[i, "y_d"] <- new_y
   }
   # de-diff ie return to the original, non-differenced scale
   fc_y <- cumsum(c(as.numeric(tail(y, 1)), prediction_data$y_d))
   return(fc_y[-1])
}

fc3 <- onerow_forecast(mod3)

#-------xreg, lags, stepwise----------------
# high value of k to mimic stepwise selection based on p values:
mod4 <- step(mod3, k = 4.7, trace = 0)
fc4 <- onerow_forecast(mod4)

#-----------xreg, lags, lasso---------------
# Use cross-validation to determine best value of lambda, for how much
# shrinkage to force:
mod5 <- cv.glmnet(y = YX_hist$y_d, x = as.matrix(YX_hist[, -1]))
fc5 <- onerow_forecast(mod5)

#====================compare results==============
p <- data_frame(Univariate = fc1$mean,
                SimpleXreg = fc2$mean,
                AllLags = fc3,
                StepLags = fc4,
                LassoLags = fc5,
                TimePeriod = 1:h + n) %>%
   gather(Variable, Value, -TimePeriod) %>%
   mutate(Judge = ifelse(grepl("Lags", Variable) & !grepl("Lasso", Variable),
                         "Bad", "OK")) %>%
   mutate(Variable = fct_relevel(Variable, c("AllLags", "StepLags"))) %>%
   ggplot(aes(x = TimePeriod, y = Value, colour = Variable)) +
   facet_wrap(~Judge, ncol = 1) +
   geom_line(size = 1.2) +
   scale_colour_brewer(palette = "Set1") +
   theme(legend.position = "right") +
   geom_line(data = filter(orig_data, Variable == "y"),
             colour = "black", linetype = 3) +
   xlim(c(0, 130)) +
   annotate("text", x = 25, y = 15, label = "Historical data",
            colour = "grey40", family = "myfont") +
   ggtitle("Excessive inclusion of lagged explanatory variables leads to overfitting",
           "Regularization by lasso, or using a simpler model in the first place, gives much better results.") +
   labs(caption = "Simulations from forecasting one time series based on two unrelated time series.")

direct.label(p)


cricketr flexes new muscles: The final analysis

Sat, 03/11/2017 - 11:49

Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

Jabberwocky by Lewis Carroll

No analysis of cricket is complete without determining how players would perform in the host country. Playing Test cricket on foreign pitches, in the host country, is a ‘real test’ for both batsmen and bowlers. Players who can perform consistently on both domestic and foreign pitches are the genuinely ‘class’ players. Player performance on foreign pitches lets us differentiate the paper tigers and home-ground bullies among batsmen. Similarly, spinners who perform well only on rank turners at home, or pace bowlers who can only swing and generate bounce on specially prepared pitches, are neither genuine spinners nor real pace bowlers.

So this post helps identify those with real strengths, and those who play well only when conditions favour them at home. This post brings a certain level of finality to the analysis of players with my R package ‘cricketr’.

Besides, I also meant ‘final analysis’ in the literal sense, as I intend to take a long break from cricket analysis/analytics and focus on some other domains like Neural Networks, Deep Learning and Spark.

As already mentioned, my R package ‘cricketr’ uses the statistics info available in ESPN Cricinfo Statsguru. You should be able to install the package from CRAN and use many of the functions available in the package. Please be mindful of ESPN Cricinfo Terms of Use

(Note: This page is also hosted at RPubs as cricketrFinalAnalysis. You can download the PDF file at cricketrFinalAnalysis.)

For getting data for a player against a particular country, for matches played in the host country, I just had to add 2 extra parameters to the getPlayerData() function. The cricketr package has been updated with the changed functions getPlayerData() for Tests, getPlayerDataOD() for ODIs and getPlayerDataTT() for Twenty20s. The updated functions will be available in cricketr version 0.0.14.

The data for the following players have already been obtained with the new, changed getPlayerData() function and have been saved as *.csv files. I will be re-using these files instead of getting them all over again; hence the getPlayerData() lines have been commented out below.

library(cricketr)

1. Performance of a batsman against a host country in the host country

For example, we can get the data for Sachin Tendulkar for matches played against Australia, in Australia. Here opposition=2 and host=2 indicate that the opposition is Australia and the host country is also Australia.
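Based on the pattern of the commented-out calls later in the post (Tendulkar’s profile ID is 35320 there), the call presumably looks something like this; the variable name is hypothetical, and it is commented out, as elsewhere in the post, because it downloads data from ESPN Cricinfo:

```r
# Hypothetical reconstruction: fetch Tendulkar's Test batting record
# against Australia, in Australia
# tendulkarAus <- getPlayerData(35320, opposition = 2, host = 2,
#                               file = "tendulkarVsAusInAus.csv", type = "batting")
```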


All cricketr functions can be used with this data frame, as before. All the charts show the performance of Tendulkar in Australia against Australia.

par(mfrow=c(2,3))
par(mar=c(4,4,2,2))
batsman4s("./data/tendulkarVsAusInAus.csv","Tendulkar")
batsman6s("./data/tendulkarVsAusInAus.csv","Tendulkar")
batsmanRunsRanges("./data/tendulkarVsAusInAus.csv","Tendulkar")
batsmanDismissals("./data/tendulkarVsAusInAus.csv","Tendulkar")
batsmanAvgRunsGround("./data/tendulkarVsAusInAus.csv","Tendulkar")
batsmanMovingAverage("./data/tendulkarVsAusInAus.csv","Tendulkar")
## null device 
##           1 

2. Relative performances of international batsmen against England in England

While we can analyze the performance of a player against an opposition in some host country, I wanted to compare the relative performances of players, to see how players from different nations play in a host country which is not their home ground.

The following lines get each player’s data for matches played in England and against England. The Oval and Lord’s are famous for generating some dangerous swing and bounce. I chose the following players:

  1. Sir Don Bradman (Australia)
  2. Steve Waugh (Australia)
  3. Rahul Dravid (India)
  4. Vivian Richards (West Indies)
  5. Sachin Tendulkar (India)
#tendulkarEng=getPlayerData(35320,opposition=1,host=1,file="tendulkarVsEngInEng.csv",type="batting")
#bradmanEng=getPlayerData(4188,opposition=1,host=1,file="bradmanVsEngInEng.csv",type="batting")
#srwaughEng=getPlayerData(8192,opposition=1,host=1,file="srwaughVsEngInEng.csv",type="batting")
#dravidEng=getPlayerData(28114,opposition=1,host=1,file="dravidVsEngInEng.csv",type="batting")
#vrichardEng=getPlayerData(52812,opposition=1,host=1,file="vrichardsEngInEng.csv",type="batting")

frames <- list("./data/tendulkarVsEngInEng.csv","./data/bradmanVsEngInEng.csv",
               "./data/srwaughVsEngInEng.csv","./data/dravidVsEngInEng.csv",
               "./data/vrichardsEngInEng.csv")
names <- list("S Tendulkar","D Bradman","SR Waugh","R Dravid","Viv Richards")

Lord’s and the Oval in England are some of the best pitches in the world. Scoring on these pitches and in weather conditions with both swing and bounce really requires excellent batting skills. It can easily be seen that Don Bradman stands head and shoulders above everybody else, with a cumulative average close to 100. He is followed by Viv Richards, who averages around ~60. Interestingly, in English conditions, Rahul Dravid edges out Sachin Tendulkar.


# The other 2 plots, on relative strike rate and cumulative average strike rate,
# show that Viv Richards really blasts the bowling. Viv Richards has a strike
# rate of 70, while Bradman is at 62+, followed by Tendulkar.
relativeBatsmanSR(frames,names)


3. Relative performances of international batsmen against Australia in Australia

The following players from these countries were chosen

  1. Sachin Tendulkar (India)
  2. Viv Richard (West Indies)
  3. David Gower (England)
  4. Jacques Kallis (South Africa)
  5. Alastair Cook (England)
frames <- list("./data/tendulkarVsAusInAus.csv","./data/vrichardsVAusInAus.csv",
               "./data/dgowerVsAusInAus.csv","./data/kallisVsAusInAus.csv",
               "./data/ancookVsWIInWI.csv")
names <- list("S Tendulkar","Viv Richards","David Gower","J Kallis","AN Cook")

Alastair Cook of England has a fantastic cumulative average of 55+ on the pitches of Australia. There is a dip towards the end, but we cannot predict whether it would have continued. AN Cook is followed by Tendulkar, who has a steady average of 50+ runs, after which comes Viv Richards.


# With respect to cumulative or relative strike rate, Viv Richards is a class
# apart. He seems to really tear into bowlers. David Gower has an excellent
# strike rate and is followed by Tendulkar.
relativeBatsmanSR(frames,names)


4. Relative performances of international batsmen against India in India

While England & Australia are famous for bouncy tracks with swing, Indian pitches are renowned for being extraordinary turners. India has also always thrown up world-class spinners, from the spin quartet of BS Chandrasekhar, Bishan Singh Bedi, EAS Prasanna and S Venkataraghavan, to the times of the dangerous Anil Kumble, and now to the more recent Ravichandran Ashwin and Harbhajan Singh.

A batsman who can score runs in India against Indian spinners has to be really adept at handling all kinds of spin.

While Clive Lloyd & Alvin Kallicharan had the best performance against India, they have not been included as ESPN Cricinfo had many of the columns missing.

So I chose the following international players for the analysis against India

  1. Hashim Amla (South Africa)
  2. Alastair Cook (England)
  3. Matthew Hayden (Australia)
  4. Viv Richards (West Indies)
frames <- list("./data/amlaVsIndInInd.csv","./data/ancookVsIndInInd.csv",
               "./data/mhaydenVsIndInInd.csv","./data/vrichardsVsIndInInd.csv")
names <- list("H Amla","AN Cook","M Hayden","Viv Richards")

Excluding Clive Lloyd & Alvin Kallicharan, the next best performer against India is Hashim Amla, followed by Alastair Cook and Viv Richards.


# With respect to strike rate, there is no contest when Viv Richards is around.
# He is clearly the best striker of the ball, regardless of whether it is the
# pacy wickets of Australia/England or the spinning tracks of the subcontinent.
# After Viv Richards, Hayden and Alastair Cook have good cumulative strike
# rates in India.
relativeBatsmanSR(frames,names)


5. All time greats of Indian batting

I couldn’t resist checking out how the top Indian batsmen perform when playing in host countries. So here is a look at how the top Indian batsmen perform against different host countries.

6. Top Indian batsmen against Australia in Australia

The following Indian batsmen were chosen

  1. Sunil Gavaskar
  2. Sachin Tendulkar
  3. Virat Kohli
  4. Virendar Sehwag
  5. VVS Laxman
frames <- list("./data/tendulkarVsAusInAus.csv","./data/gavaskarVsAusInAus.csv",
               "./data/kohliVsAusInAus.csv","./data/sehwagVsAusInAus.csv",
               "./data/vvslaxmanVsAusInAus.csv")
names <- list("S Tendulkar","S Gavaskar","V Kohli","V Sehwag","VVS Laxman")

Virat Kohli has the best overall performance against Australia, with a current cumulative average of 60+ runs over the 15 innings he has played. With 15 matches, the 2nd best is Virendar Sehwag, followed by VVS Laxman. Tendulkar maintains a cumulative average of 48+ runs over more than 30 innings.


# Sehwag leads the strike rate against host Australia, followed by
# Tendulkar in Australia and then Kohli
relativeBatsmanSR(frames,names)


7. Top Indian batsmen against England in England

The top Indian batsmen’s performances against England are shown below:

  1. Sachin Tendulkar
  2. Rahul Dravid
  3. Dilip Vengsarkar
  4. Sourav Ganguly
  5. Sunil Gavaskar
  6. Virat Kohli
frames <- list("./data/tendulkarVsEngInEng.csv","./data/dravidVsEngInEng.csv",
               "./data/vengsarkarVsEngInEng.csv","./data/gangulyVsEngInEng.csv",
               "./data/gavaskarVsEngInEng.csv","./data/kohliVsEngInEng.csv")
names <- list("S Tendulkar","R Dravid","D Vengsarkar","S Ganguly","S Gavaskar","V Kohli")

Rahul Dravid has the best performance against England and edges out Tendulkar. He is followed by Tendulkar and then Sourav Ganguly. Note: Incidentally, Virat Kohli’s performance against England in England so far has been extremely poor; he averages around 13-15 runs per innings. However, he has a long way to go and I hope he catches up. In any case it will be an uphill climb for Kohli in England.


# Tendulkar, Ganguly and Dravid have the best strike rates, in that order.
relativeBatsmanSR(frames,names)


8. Top Indian batsmen against West Indies in West Indies

frames <- list("./data/tendulkarVsWInWI.csv","./data/dravidVsWInWI.csv",
               "./data/vvslaxmanVsWIInWI.csv","./data/gavaskarVsWIInWI.csv")
names <- list("S Tendulkar","R Dravid","VVS Laxman","S Gavaskar")

Against the West Indies, Sunil Gavaskar is head and shoulders above the rest. Gavaskar has a very impressive cumulative average against the West Indies.


# VVS Laxman, followed by Tendulkar and then Dravid, have a very
# good strike rate against the West Indies
relativeBatsmanCumulativeStrikeRate(frames,names)

9. World’s best spinners on tracks suited for pace & bounce

In this part I compare the performances of the top 3 spinners in recent years and check out how they perform on surfaces that are known for pace, and bounce. I have taken the following 3 spinners

  1. Anil Kumble (India)
  2. M Muralitharan (Sri Lanka)
  3. Shane Warne (Australia)
#kumbleEng=getPlayerData(30176,opposition=3,host=3,file="kumbleVsEngInEng.csv",type="bowling")
#muraliEng=getPlayerData(49636,opposition=3,host=3,file="muraliVsEngInEng.csv",type="bowling")
#warneEng=getPlayerData(8166,opposition=3,host=3,file="warneVsEngInEng.csv",type="bowling")

10. Top international spinners against England in England

frames <- list("./data/kumbleVsEngInEng.csv","./data/muraliVsEngInEng.csv","./data/warneVsEngInEng.csv")
names <- list("Anil Kumble","M Muralitharan","Shane Warne")

Against England and in England, Muralitharan shines, with a cumulative average of nearly 5 wickets per match and a peak of almost 8 wickets. Shane Warne has a steady average at 5 wickets, and then comes Anil Kumble.


# In terms of relative cumulative economy rate, Warne has the best figures,
# followed by Anil Kumble. Muralitharan is much more expensive.
relativeBowlerCumulativeAvgEconRate(frames,names)

11. Top international spinners against South Africa in South Africa

frames <- list("./data/kumbleVsSAInSA.csv","./data/muraliVsSAInSA.csv","./data/warneVsSAInSA.csv")
names <- list("Anil Kumble","M Muralitharan","Shane Warne")

In South Africa too, Muralitharan has the best wicket-taking performance, averaging about 4 wickets. Warne averages around 3 wickets and Kumble around 2.


# Muralitharan is expensive in South Africa too, while Kumble and Warne go
# neck-and-neck on economy rate. Kumble edges out Warne and has a better
# cumulative average economy rate.
relativeBowlerCumulativeAvgEconRate(frames,names)

11. Top international pacers against India in India

As a final analysis, I check how the world’s pacers perform against India in India. Indian pitches are supposed to be flat and devoid of bounce, while being terrific turners. Hence Indian pitches are more suited to spin bowling than pace bowling, though this is changing these days.

The best performers against India in India are mostly the deadly pacemen of yesteryear.

For this I have chosen the following bowlers

  1. Courtney Walsh (West Indies)
  2. Andy Roberts (West Indies)
  3. Malcolm Marshall (West Indies)
  4. Glenn McGrath (Australia)
#cawalshInd=getPlayerData(53216,opposition=6,host=6,file="cawalshVsIndInInd.csv",type="bowling")
#arobertsInd=getPlayerData(52817,opposition=6,host=6,file="arobertsIndInInd.csv",type="bowling")
#mmarshallInd=getPlayerData(52419,opposition=6,host=6,file="mmarshallVsIndInInd.csv",type="bowling")
#gmccgrathInd=getPlayerData(6565,opposition=6,host=6,file="mccgrathVsIndInInd.csv",type="bowling")

frames <- list("./data/cawalshVsIndInInd.csv","./data/arobertsIndInInd.csv",
               "./data/mmarshallVsIndInInd.csv","./data/mccgrathVsIndInInd.csv")
names <- list("C Walsh","A Roberts","M Marshall","G McGrath")

Courtney Walsh has the best performance, followed by Andy Roberts, and then Malcolm Marshall, who tips ahead of Glenn McGrath.


# On the other hand, McGrath has the best economy rate, followed by
# A Roberts and then Courtney Walsh
relativeBowlerCumulativeAvgEconRate(frames,names)

12. ODI performance of a player against a specific country in the host country

This gets the data for MS Dhoni in ODI matches against Australia and in Australia

#dhoniAusODI=getPlayerDataOD(28081,opposition=2,host=2,file="dhoniVsAusInAusODI.csv",type="batting")

13. Twenty20 performance of a player against a specific country in the host country

#dhoniAusTT=getPlayerDataTT(28081,opposition=2,host=2,file="dhoniVsAusInAusTT.csv",type="batting")

All the ODI and Twenty20 functions of cricketr can be used on the above dataframes of MS Dhoni.

Some key observations

Here are some key observations

  1. At the top of the batting spectrum is Don Bradman, with a very impressive average of 100-120 in matches played in England and Australia. Unfortunately there weren’t matches he played in other countries and on different pitches.
  2. Viv Richards has the best cumulative strike rate overall.
  3. Muralitharan strikes more often than Kumble or Warne, even on pitches in England, South Africa and the West Indies. However, Muralitharan is also the most expensive.
  4. Warne and Kumble have a much better economy rate than Muralitharan.
  5. Sunil Gavaskar has an extremely impressive performance in the West Indies.
  6. Rahul Dravid performs much better than Tendulkar in both England and the West Indies.
  7. Virat Kohli has the best performance against Australia so far, followed by Sehwag, and I hope he maintains his stellar form. However, Kohli’s performance in England has been very poor.
  8. West Indies batsmen and bowlers seem to thrive on Indian pitches, with Clive Lloyd and Alvin Kallicharan at the top of the list.

You may like my Shiny apps on cricket

  1. Inswinger – Analyzing International T20s
  2. GooglyPlus – An app for analyzing IPL
  3. Sixer – App based on R package cricketr

Also see

  1. Exploring Quantum Gate operations with QCSimulator
  2. Neural Networks: The mechanics of backpropagation
  3. Re-introducing cricketr! : An R package to analyze performances of cricketers
  4. yorkr crashes the IPL party ! – Part 1
  5. cricketr and yorkr books – Paperback now in Amazon
  6. Hand detection through Haartraining: A hands-on approach

To see all my posts see Index of posts

The R qgraph Package: Using R to Visualize Complex Relationships Among Variables in a Large Dataset, Part One

Sat, 03/11/2017 - 03:11

The R qgraph Package: Using R to Visualize Complex Relationships Among Variables in a Large Dataset, Part One

A Tutorial by D. M. Wiig, Professor of Political Science, Grand View University

In my most recent tutorials I have discussed the use of the tabplot package to visualize multivariate mixed data types in large datasets. This type of table display is a handy way to identify possible relationships among variables, but is limited in terms of interpretation and the number of variables that can be meaningfully displayed.

Social science research projects often start out with many potential independent predictor variables for a given dependent variable. If these variables are all measured at the interval or ratio level, a correlation matrix often serves as a starting point to begin analyzing relationships among variables.

In this tutorial I will use the R packages SemiPar, qgraph and Hmisc in addition to the basic packages loaded when R is started. The code is as follows:

#data from package SemiPar; dataset milan.mort
#dataset has 3652 cases and 9 vars
library(SemiPar)
library(qgraph)
library(Hmisc)
data(milan.mort)

One of the datasets contained in the SemiPar package is milan.mort. This dataset contains nine variables and data from 3652 consecutive days for the city of Milan, Italy. The nine variables in the dataset are as follows:

rel.humid (relative humidity)
tot.mort (total number of deaths)
resp.mort (total number of respiratory deaths)
SO2 (measure of sulphur dioxide level in ambient air)
TSP (total suspended particles in ambient air)
day.num (number of days since 31st December, 1979)
day.of.week (1=Monday; 2=Tuesday; 3=Wednesday; 4=Thursday; 5=Friday; 6=Saturday; 7=Sunday)
holiday (indicator of public holiday: 1=public holiday, 0=otherwise)
mean.temp (mean daily temperature in degrees celsius)

To look at the structure of the dataset use the following:

str(milan.mort)
Resulting in the output:

> str(milan.mort)
‘data.frame’: 3652 obs. of 9 variables:
$ day.num : int 1 2 3 4 5 6 7 8 9 10 …
$ day.of.week: int 2 3 4 5 6 7 1 2 3 4 …
$ holiday : int 1 0 0 0 0 0 0 0 0 0 …
$ mean.temp : num 5.6 4.1 4.6 2.9 2.2 0.7 -0.6 -0.5 0.2 1.7 …
$ rel.humid : num 30 26 29.7 32.7 71.3 80.7 82 82.7 79.3 69.3 …
$ tot.mort : num 45 32 37 33 36 45 46 38 29 39 …
$ resp.mort : int 2 5 0 1 1 6 2 4 1 4 …
$ SO2 : num 267 375 276 440 354 …
$ TSP : num 110 153 162 198 235 …

As is seen above, the dataset contains 9 variables all measured at the ratio level and 3652 cases.

In doing exploratory research a correlation matrix is often generated as a first attempt to look at inter-relationships among the variables in the dataset. In this particular case a researcher might be interested in looking at factors that are related to total mortality as well as respiratory mortality rates.

A correlation matrix can be generated using the cor function which is contained in the stats package. There are a variety of functions for various types of correlation analysis. The cor function provides a fast method to calculate Pearson’s r with a large dataset such as the one used in this example.

To generate a zero-order Pearson's correlation matrix use the following:

#round the corr output to 2 decimal places
#put output into variable cormatround
#coerce data to matrix

cormatround <- cor(as.matrix(milan.mort))
round(cormatround, 2)

The output is:

> cormatround <- cor(as.matrix(milan.mort))
> round(cormatround, 2)
            day.num day.of.week holiday mean.temp rel.humid tot.mort resp.mort   SO2   TSP
day.num        1.00        0.00    0.01      0.02      0.12    -0.28     -0.22 -0.34  0.07
day.of.week    0.00        1.00    0.00      0.00      0.00    -0.05     -0.03 -0.05 -0.05
holiday        0.01        0.00    1.00     -0.07      0.01     0.00     -0.01  0.00 -0.01
mean.temp      0.02        0.00   -0.07      1.00     -0.25    -0.43     -0.26 -0.66 -0.44
rel.humid      0.12        0.00    0.01     -0.25      1.00     0.01     -0.03  0.15  0.17
tot.mort      -0.28       -0.05    0.00     -0.43      0.01     1.00      0.47  0.44  0.25
resp.mort     -0.22       -0.03   -0.01     -0.26     -0.03     0.47      1.00  0.32  0.15
SO2           -0.34       -0.05    0.00     -0.66      0.15     0.44      0.32  1.00  0.63
TSP            0.07       -0.05   -0.01     -0.44      0.17     0.25      0.15  0.63  1.00
>

The matrix can be examined to look at intercorrelations among the nine variables, but it is very difficult to detect patterns of correlations within the matrix. Also, when using the cor() function, raw Pearson's coefficients are reported, but significance levels are not.

A correlation matrix with significance levels can be generated by using the rcorr() function, found in the Hmisc package. The code is:

rcorr(as.matrix(milan.mort), type="pearson")

The output is:

> rcorr(as.matrix(milan.mort), type="pearson")
            day.num day.of.week holiday mean.temp rel.humid tot.mort resp.mort   SO2   TSP
day.num        1.00        0.00    0.01      0.02      0.12    -0.28     -0.22 -0.34  0.07
day.of.week    0.00        1.00    0.00      0.00      0.00    -0.05     -0.03 -0.05 -0.05
holiday        0.01        0.00    1.00     -0.07      0.01     0.00     -0.01  0.00 -0.01
mean.temp      0.02        0.00   -0.07      1.00     -0.25    -0.43     -0.26 -0.66 -0.44
rel.humid      0.12        0.00    0.01     -0.25      1.00     0.01     -0.03  0.15  0.17
tot.mort      -0.28       -0.05    0.00     -0.43      0.01     1.00      0.47  0.44  0.25
resp.mort     -0.22       -0.03   -0.01     -0.26     -0.03     0.47      1.00  0.32  0.15
SO2           -0.34       -0.05    0.00     -0.66      0.15     0.44      0.32  1.00  0.63
TSP            0.07       -0.05   -0.01     -0.44      0.17     0.25      0.15  0.63  1.00

n= 3652

P
            day.num day.of.week holiday mean.temp rel.humid tot.mort resp.mort SO2    TSP
day.num             0.9771      0.5349  0.2220    0.0000    0.0000   0.0000    0.0000 0.0000
day.of.week 0.9771              0.7632  0.8727    0.8670    0.0045   0.1175    0.0024 0.0061
holiday     0.5349  0.7632              0.0000    0.4648    0.8506   0.6115    0.7793 0.4108
mean.temp   0.2220  0.8727      0.0000            0.0000    0.0000   0.0000    0.0000 0.0000
rel.humid   0.0000  0.8670      0.4648  0.0000              0.3661   0.1096    0.0000 0.0000
tot.mort    0.0000  0.0045      0.8506  0.0000    0.3661             0.0000    0.0000 0.0000
resp.mort   0.0000  0.1175      0.6115  0.0000    0.1096    0.0000             0.0000 0.0000
SO2         0.0000  0.0024      0.7793  0.0000    0.0000    0.0000   0.0000           0.0000
TSP         0.0000  0.0061      0.4108  0.0000    0.0000    0.0000   0.0000    0.0000
>

In a future tutorial I will discuss using significance levels and correlation strengths as methods of reducing complexity in very large correlation network structures.

The recently released qgraph package provides a number of interesting functions that are useful in visualizing complex inter-relationships among a large number of variables. To quote from the CRAN documentation, qgraph “Can be used to visualize data networks as well as provides an interface for visualizing weighted graphical models.” (see the CRAN documentation for qgraph version 1.4.2).

The qgraph() function has a variety of options that can be used to produce specific types of graphical representations. In this first tutorial segment I will use the milan.mort dataset and the most basic qgraph functions to produce a visual graphic network of intercorrelations among the 9 variables in the dataset.

The code is as follows:

#use cor function to create a correlation matrix with milan.mort dataset
#and put into cormat variable
cormat=cor(milan.mort)  #correlation matrix generated
#now plot a graph of the correlation matrix
qgraph(cormat, shape="circle", posCol="darkgreen", negCol="darkred", layout="groups", vsize=10)

This code produces the following correlation network:

The correlation network provides a very useful visual picture of the intercorrelations as well as positive and negative correlations. The relative thickness and color density of the bands indicates strength of Pearson’s r and the color of each band indicates a positive or negative correlation – red for negative and green for positive.

By changing the “layout=” option from “groups” to “spring” a slightly different perspective can be seen. The code is:

#Code to produce alternative correlation network:
#use cor function to create a correlation matrix with milan.mort dataset
#and put into cormat variable
cormat=cor(milan.mort) #correlation matrix generated
#now plot a circle graph of the correlation matrix
qgraph(cormat, shape="circle", posCol="darkgreen", negCol="darkred", layout="spring", vsize=10)

The graph produced is below:

Once again the intercorrelations, strength of r and positive and negative correlations can be easily identified. There are many more options, types of graph and procedures for analysis that can be accomplished with the qgraph() package. In future tutorials I will discuss some of these.

Madrid R User Group, A Brief History

Fri, 03/10/2017 - 18:00

by Carlos Ortega

(Editor's note: A Spanish version of the post follows the English text)

At the first meeting there were 5 of us; now we are consistently over 60. It was not difficult to start up the R users group of Madrid. Gregorio Serrano, Carlos Gil Bellosta, Pedro Concejero and I were the regulars answering questions on the R-help-es mailing list, and we agreed to meet in person. It was in March 2012 when we first got together, in a classroom of the Faculty of Economics of one of Madrid's public universities. We discussed: a) how to model different trading strategies with R, b) how to model the number of bugs present in R code, and c) a library for creating spatial graphs of the map of Spain.

The group of Users of R of Madrid was born as a group promoted by the Association of R of Spain. The Association allowed us to host the agendas of new calls and the content generated in the meetings in a space of its online portal, which helped to provide initial credibility to the group. From then on, we have held a monthly meeting on a regular basis, except the months of July and August and December – a total of about forty-five meetings in these five years.

In the beginning, we changed our headquarters several times. Then we found a place where we could meet on a regular basis, which was a classroom of the School of Sciences of the UNED. There we spent the first three years, approximately. It was there where the group forged its future in terms of creating a stable core team of members, and established a cadence for the meetings. We covered many of the emerging issues that were being addressed in the Community of R: Social Network Analysis, Spatial representation (maps), and different types of analysis on open sets.

From the first meeting, we agreed to record all the meetings in video and be very consistent in making accessible the presented material (video and presentations).

After this initial period, the leap that gave us greater visibility and notoriety was using the Meetup portal and getting a more central venue in Madrid. Meetup put us in immediate contact with other people with similar interests, and holding the meetings at MediaLab Prado, a space dedicated to new technologies promoted by the Madrid City Council, put us in touch with other groups interested in everything related to data: data journalists, analytics, Python users, etc. Interestingly, joining Meetup was also a requirement when we applied for the support that Revolution Analytics offered to R user groups. This sponsorship has helped us keep the group on Meetup and host all the videos on Vimeo.

Now that R has become a language of great success and traction, the group continues to evolve: big data is here, and we are sharing experiences of participating in Kaggle competitions, Bayesian statistics, and more. This evolution has also occurred in the profile of the attendees: we have started to see presentations from companies, and fortunately there are a lot of young people among both the attendees and the presenters.

Based on experience we’ve accumulated over the years, to those who want to create their own group, we would give these recommendations from Cultivating Communities of Practice:

  • Keep consistency: a monthly schedule works best, and attendees appreciate the regularity.
  • Using Meetup has been shown to be a key element of success by the visibility it offers and by reaching out to other groups and users with the same interests.
  • Another key element in becoming known and expanding the group's reach has been participating in meetings of other communities: Data Journalists, Machine Learning, Python, SQLSaturdays, etc.
  • Be rigorous in announcing the meetings in advance and sharing the material presented.
  • Recording the sessions on video (we use a tripod and a smartphone) has also helped to maintain a loyal community.

We have accomplished a lot, but certainly the best is yet to come. Long life to R.


En la primera reunión éramos 5, ahora somos más de 60 y de forma consistente. No nos costó mucho poner en marcha el grupo de usuarios de R de Madrid. A través de la propia lista de ayuda R-help-es los que solíamos responder de forma más habitual a las dudas de esta nueva comunidad (Gregorio Serrano, Carlos Gil Bellosta, Pedro Concejero y Carlos Ortega), acordamos juntarnos por primera vez. Fue en marzo de 2012, nos juntamos en un aula de la Facultad de Economia de uno de los Campus de Madrid. Hablamos de: a) Cómo modelizar diferentes estrategias de trading con R, b) Cómo modelizar el número de bugs presentes en el código R, y c) la presentación de una librería para crear gráficos espaciales del mapa de España.

El grupo de Usuarios de R de Madrid, nació como grupo promovido por la Asociación de R de España. La Asociación nos permitió alojar las agendas de nuevas convocatorias y el contenido generado en las reuniones en un espacio de su portal online. Este punto ayudó a dotar de un punto inicial de credibilidad sobre el grupo.

A partir de este momento, hemos mantenido una reunión mensual de forma constante, excepto los meses de julio y agosto y diciembre – un total de unas 45 reuniones.

Inicialmente cambiamos de sede varias veces, hasta que encontramos un sitio en donde reunirnos de forma estable en un aula de la Escuela de Ciencias de la UNED, donde estuvimos los tres primeros años aproximadamente. Fue aquí donde el grupo forjó su futuro. Cubrimos muchos de los temas emergentes que en la Comunidad de R se venían tratando: análisis de redes sociales, representación espacial (mapas), y diferentes tipos de análisis sobre conjuntos abiertos.

Desde un primer momento, acordamos grabar todas las reuniones en video y ser muy consistentes en hacer accesible el material presentado (video y presentaciones).

Después de esta primera época, el salto que nos permitió conseguir mayor visibilidad y notoriedad fue el utilizar el portal de Meetup y conseguir una sede más céntrica en Madrid.

Meetup por un lado, nos puso en contacto inmediato con otras personas con el mismo tipo de interés y el celebrar las reuniones en un espacio dedicado a las nuevas tecnologías (MediaLab Prado), promovido por el Ayuntamiento de Madrid nos puso en contacto con otros grupos con el mismo tipo de interés en todo lo relativo a los datos: Periodistas de Datos, Analytics, Usuarios de Python, etc.

Curiosamente, el entrar en Meetup fue también una necesidad cuando nos presentamos para conseguir la ayuda que a los diferentes grupos de usuarios de R recibían de Revolution Analytics. Esta esponsorización nos ha ayudado a mantener el grupo en Meetup y el poder alojar todos los videos en Vimeo.

Ahora que R se ha convertido en un lenguaje de gran éxito y tracción, el grupo sigue evolucionando: El Big Data ya está con nosotros, compartir las experiencias de haber participado en Kaggle, la estadística bayesiana, etc.

Igualmente esta evolución se ha producido en el perfil de los asistentes: comenzamos a ver presentaciones de empresas, y afortunadamente es mucha la gente joven entre los asistentes y los que presentan.

De la experiencia acumulada en estos años, a aquellos que quieran crear su propio grupo, les daríamos estas recomendaciones:

  • Mantener la consistencia, la periodicidad mensual es la más ajustada. Los asistentes valoran mantener la regularidad.
  • Utilizar Meetup se ha demostrado como un elemento clave del éxito por la visibilidad que ofrece y por el alcance a otros grupos y usuarios con los mismos intereses.
  • Otro elemento clave para darse a conocer y ampliar el alcance del grupo ha sido el participar en reuniones de otras comunidades: Periodistas de Datos, Analytics, Usuarios de Python, etc.
  • Ser rigurosos en el anuncio de las reuniones con cierta antelación y en compartir el material presentado.
  • El grabar las sesiones en video (usamos para ello un trípode y un Smartphone) también ha ayudado a mantener una comunidad fiel.

Hemos hecho mucho, pero sin duda lo mejor está todavía por llegar.

Updates to the Data Science Virtual Machine for Linux

Fri, 03/10/2017 - 16:47

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The Data Science Virtual Machine (DSVM) is a virtual machine image on the Azure Marketplace assembled for data scientists. The goal of the DSVM is to provide a broad array of popular data-oriented tools in a single environment, and make data scientists and developers highly productive in their work. It's available for both Windows and Linux, and the Linux edition has just received a major update. (The Windows edition was also updated recently.)

This latest update adds local standalone instances of Hadoop, YARN and Spark: ideal for prototyping code in your DSVM instance before deploying to a large cluster for production-level applications. This update also upgrades Microsoft R Server to version 9, adding the MicrosoftML package for machine learning, and the mrsdeploy package to operationalize R code on a production server.

You can find instructions for provisioning the Linux DSVM on Azure here, which also includes a detailed list of its components and how to launch them. For more on the latest updates to the DSVM, follow the link below.

Cortana Intelligence and Machine Learning Blog: Build & Deploy Machine Learning Apps on Big Data Platforms with Microsoft Linux Data Science Virtual Machine

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Making a Case for case_when

Fri, 03/10/2017 - 16:36

This is a brief (and likely obvious, for some folks) post on the dplyr::case_when() function. Part of my work-work is dealing with data from internet scans. When we’re performing a deeper inspection of a particular internet protocol or service we try to capture as much system and service metadata as possible. Sifting through said metadata… Continue reading →

RDocumentation: Scoring and Ranking

Fri, 03/10/2017 - 12:13

One of the core features of RDocumentation is its search functionality. From the start, we wanted to have a super simple search bar that finds what you are looking for, without a complex form asking for a package name, function name, versions or anything else. Just a simple search bar.

In today’s technical blog post, we will highlight the technologies and techniques that we use to provide relevant and meaningful results to our users.

RDocumentation uses Elasticsearch to index and search through all R packages and topics.

Elasticsearch is an open-source, scalable, distributed, enterprise-grade search engine.

Elasticsearch is well suited to querying documentation because it doesn't store data in conventional SQL tables; it stores documents in a JSON-like data structure instead. Each document is just a set of key-value pairs with simple data types (strings, numbers, lists, dates, …). Its distributed nature means Elasticsearch can be extremely fast.

An Elasticsearch cluster can have multiple indexes and each index can have multiple document types. A document type just describes what the structure of the document should look like. To learn more about Elasticsearch types, you can visit the Elasticsearch guide. RDocumentation uses three different types: package_version, topic, and package. The first 2 are the main ones; we'll discuss package later.

Because RDocumentation is open-source, you can see the Elasticsearch mappings in our GitHub repo.

package_version type

The package_version type is like a translation of the DESCRIPTION file of a package; it features the main fields that one can find there: package_name, version, title, description, release_date, license, url, copyright, created_at, updated_at, latest_version, maintainer and collaborators. The maintainer and collaborators are extracted from the Authors field in the DESCRIPTION file.
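To make this concrete, here is what a package_version document might look like. The field names come from the mapping described above; the package and all of its values are hypothetical, not a real indexed document:

```json
{
  "package_name": "examplepkg",
  "version": "1.0.2",
  "title": "An Example Package",
  "description": "Tools for illustrating what a package_version document holds.",
  "release_date": "2017-01-15",
  "license": "MIT",
  "url": "https://example.org/examplepkg",
  "latest_version": true,
  "maintainer": "Jane Doe",
  "collaborators": ["John Smith"]
}
```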

topic type

The topic documents are parsed from the Rd files, the standard format of documentation in R. The topic type has the following keys: name, title, description, usage, details, value, references, note, author, seealso, examples, created_at, updated_at, sections, aliases and keywords.

Scoring in Elasticsearch

Before doing any scoring, Elasticsearch first tries to reduce the set of candidates by checking if the document actually matches the query. Basically, a query is a word (or a set of words). Based on the query settings, Elasticsearch searches for a match in certain fields of certain types.

However, a match does not necessarily mean that the document is relevant; the same word can have different meanings in different contexts. Based on the query settings, we can filter by type and field, and include more contextual information. This contextual information will improve the relevancy and this is where scoring comes into place.

Elasticsearch uses Lucene under the hood, so the scoring is based on Lucene’s Practical Scoring Function which brings together some models like the TF-IDF, Vector Space Model and Boolean Model to score the document.
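To give a feel for the TF-IDF ingredient of that scoring function, here is a toy computation in base R. This is a sketch of the general idea only, not how Lucene actually computes its score:

```r
# Toy TF-IDF: three tiny "documents", each a bag of words.
docs <- list(
  c("fast", "search", "engine"),
  c("search", "bar"),
  c("fast", "car")
)
terms <- unique(unlist(docs))

# term frequency: how often each term appears in each document
tf <- sapply(docs, function(d) sapply(terms, function(t) sum(d == t)))
rownames(tf) <- terms

# inverse document frequency: terms appearing in fewer documents score higher
df <- rowSums(tf > 0)
idf <- log(length(docs) / df)

# TF-IDF weight of each term in each document
tfidf <- tf * idf
round(tfidf, 2)
```

A term like "engine", which appears in only one document, gets a higher weight there than "search", which appears in two; this is why matching a rare query word counts for more.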

If you want to lean more about how that function is used in Elasticsearch, you can check out this section of guide.

One way to improve relevancy is to apply a boost to some fields. For example, in the full search, we naturally boost fields like package_name and title for packages and aliases and name for topics.

Boosting the popular documents

Another effective way to improve relevancy is to boost documents based on their popularity. The idea is that if a package is more popular, the user is more likely to be searching for it. Showing the more popular packages first increases the probability that we show what the user is actually looking for.

Using downloads as a popularity measure

There are multiple ways to measure popularity. We could use direct measures like votes or rankings that users give (like ratings on Amazon products), or indirect measures like the number of items sold or the number of views (for YouTube videos).

At RDocumentation, we chose the latter. More specifically, we use the number of downloads as a measure of popularity. Indirect measures are typically easier to collect because they don't require active user input.


One problem that arises when using the number of downloads is that old packages will naturally have more total downloads than newer packages. That does not mean they are more popular, however; they have just been around longer. What if a package was very popular years ago, but has since become obsolete and is no longer actively used by the community?

To solve this problem, we only take into account the number of downloads in the last month. That way, older packages’ popularity score is not artificially boosted, and obsolete packages will quickly fade out.
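The windowing idea above is just a date filter over the raw download log. A base-R sketch with made-up data (the downloads table, its columns, and the reference date are all hypothetical):

```r
# Hypothetical download log -- one row per download event.
downloads <- data.frame(
  package = c("a", "a", "b", "a", "b"),
  date    = as.Date(c("2015-06-01", "2017-02-20", "2017-02-25",
                      "2017-03-01", "2014-01-01"))
)

# Only downloads in the 30 days before the reference date count.
ref_date <- as.Date("2017-03-10")
recent <- subset(downloads, date > ref_date - 30)

# Popularity measure per package: downloads in the last month.
last_month_downloads <- table(recent$package)
last_month_downloads
```

Package "a" has an old 2015 download that no longer contributes, so a once-popular but now-obsolete package quickly fades out of the ranking.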

Direct vs Indirect downloads

Another problem arises from reverse dependencies. R packages typically depend on a wide range of other packages. Packages with a lot of reverse dependencies will get downloaded way more than others. However, these packages are typically more low-level and are not used directly by the end-user. We have to watch out to not give their number of downloads too much weight.

As an example, take Rcpp. Over 70 percent of all packages on CRAN, the Comprehensive R Archive Network, depend on this package, which obviously makes it the most downloaded R package. However, rather few R users will directly use this package and search for its documentation.

To solve this problem, we needed to separate direct downloads (downloads that happen because a user requested them) from indirect downloads (downloads that happen because a dependent package was downloaded). To distinguish direct from indirect downloads in the CRAN logs, we use the heuristic described in the cran.stats package by Arun Srinivasan.

We now have a meaningful popularity metric: the number of direct downloads in the last month. Elasticsearch provides an easy way to inject this additional information into the score; for more details, see the Elasticsearch documentation on function score queries.

The score is modified as follows:

new_score = old_score * log(1 + number of direct downloads in the last month)

We use a log() function to smooth out the number of downloads, because each subsequent download has less weight; the difference between 0 and 1,000 downloads should have a bigger impact on a popularity score than the difference between 100,000 and 101,000 downloads.
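The dampening effect of the log is easy to see directly. The function below is a sketch of the re-scoring formula above (the function name is mine, not RDocumentation's code):

```r
# Popularity boost: multiply the text-relevance score by
# log(1 + direct downloads in the last month). Illustrative only.
popularity_boost <- function(old_score, direct_downloads_last_month) {
  old_score * log(1 + direct_downloads_last_month)
}

# The log flattens quickly: going from 0 to 1,000 downloads changes the
# multiplier far more than going from 100,000 to 101,000 downloads.
popularity_boost(1, 1000)    # multiplier of roughly 6.9
popularity_boost(1, 100000)  # multiplier of roughly 11.5
popularity_boost(1, 101000)  # still roughly 11.5
```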

This re-scoring improves the overall relevancy of the search results presented by RDocumentation, and as a result, users can focus on reading documentation instead of searching for it.

If you want to find out more about how exactly the Elasticsearch query is implemented, you can take a look at the RDocumentation project on GitHub. The query itself is located in the SearchController.

If you want to learn more about how is implemented, check out our repos:

About RDocumentation  

RDocumentation aggregates help documentation for R packages from CRAN, Bioconductor, and GitHub – the three most common sources of current R documentation. RDocumentation goes beyond simply aggregating this information, however, by bringing all of this documentation to your fingertips via the RDocumentation package. The RDocumentation package overwrites the basic help functions from the utils package and gives you access to RDocumentation from the comfort of your RStudio IDE. Look up the newest and most popular R packages, search through documentation and post community examples.

Create an RDocumentation account today!

New data and functions in nzelect 0.3.0 R package

Fri, 03/10/2017 - 12:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Polling data and other goodies ready for download

A new version, 0.3.0, of the nzelect R package is now available on CRAN. New in this version:

  • historical polling data from 2002 to February 2017, sourced from Wikipedia
  • some small functions to help convert voting numbers into seats in a New Zealand or similar proportional representation system; and to weight polling numbers in the way done by two of New Zealand’s polling aggregator websites.
  • some small bits of metadata on political parties, most notably named vectors of their colours for use in graphics.

nzelect was originally developed in response to Ari Lamstein’s R election analysis contest, and my series of blog posts drawing on its data won me that competition. The first major version included polling-place election results from the 2014 election. The main purpose of the package remains to make these data available in a tidier, more analysis-ready format than the Electoral Commission’s official election results site.

I still have plans to add the results from earlier elections, but with a New Zealand general election now scheduled for 23 September 2017 I thought I’d prioritise some polling data. I’m hoping to up the level of sophistication of at least a corner of the debate in the lead-up to the election. I’ve got some blog posts coming up on things like house effects (e.g. historical biases of different polling firms when confronted with actual election results) and probabilistic prediction, and needed a clean and tidy set of historical data to do this.

Historical polling data

The polling data was the main addition to this version of nzelect. The data have been scraped from a range of Wikipedia pages and subjected to some cleaning. With this somewhat sketchy provenance, they can’t be guaranteed but they look very plausible. All the data have been combined in a single data frame polls which looks like this:

     Pollster       WikipediaDates      StartDate  EndDate    MidDate    Party         VotingIntention Client   ElectionYear
5965 Colmar Brunton 11–15 February 2017 2017-02-11 2017-02-15 2017-02-13 United Future            0.00 One News 2017
5966 Colmar Brunton 11–15 February 2017 2017-02-11 2017-02-15 2017-02-13 Maori                    0.01 One News 2017
5967 Colmar Brunton 11–15 February 2017 2017-02-11 2017-02-15 2017-02-13 Destiny                    NA One News 2017
5968 Colmar Brunton 11–15 February 2017 2017-02-11 2017-02-15 2017-02-13 Progressive                NA One News 2017
5969 Colmar Brunton 11–15 February 2017 2017-02-11 2017-02-15 2017-02-13 Mana                     0.01 One News 2017
5970 Colmar Brunton 11–15 February 2017 2017-02-11 2017-02-15 2017-02-13 Conservative             0.00 One News 2017

Election results (actual total party vote by party) are also included for convenience. This doesn’t remove the need to bring in the detailed historical election results by polling place in future versions of nzelect, but gives a start on the historical perspective.

Combining multiple election cycles of polling data together makes it possible to see the longer game in party political change in New Zealand:

Here’s the code behind that graphic.

library(tidyverse)
library(scales)
library(nzelect)
library(forcats)
library(stringr)
library(rvest)
library(magrittr)

#=========polls demo=========
election_dates <- polls %>%
   filter(Pollster == "Election result") %>%
   select(MidDate) %>%
   distinct()

polls %>%
   filter(!Party %in% c("Destiny", "Progressive")) %>%
   mutate(Party = gsub("M.ori", "Maori", Party)) %>%
   mutate(Party = fct_reorder(Party, VotingIntention, .desc = TRUE),
          Pollster = fct_relevel(Pollster, "Election result")) %>%
   ggplot(aes(x = MidDate, y = VotingIntention, colour = Pollster)) +
   geom_line(alpha = 0.5) +
   geom_point(size = 0.7) +
   geom_smooth(aes(group = Party), se = FALSE, colour = "grey15", span = .20) +
   scale_y_continuous("Voting intention", label = percent) +
   scale_x_date("") +
   facet_wrap(~Party, scales = "free_y") +
   geom_vline(xintercept = as.numeric(election_dates$MidDate), colour = "grey80") +
   ggtitle("15 years of voting intention opinion polls in New Zealand") +
   labs(caption = "Source: nzelect #Rstats package on CRAN")

The CRAN version of an R package isn’t the appropriate way to make day to day updates available – upgrades on CRAN should only be every three months or so at the most. I will probably keep the GitHub version up to date as more polling data comes in, but I’m not in the position to give a service level commitment on timeliness.

Converting voting results to seats

One thing we need to be able to do efficiently, if we’re going to facilitate polling punditry, is convert election results – real or hypothetical – into actual seats in Parliament. Since 1996, New Zealand has had 120 or more seats in its single house of Parliament, allocated by a system known as “mixed-member proportional”. Each elector has two votes – an electorate vote and a party vote. 71 of the seats (at the time of writing) are “first past the post” electorates. However, these electorate votes have very little (not quite zero) impact on the total make-up of Parliament, because the remaining seats are allocated from “lists” provided by the parties in such a way as to keep Parliament in proportion to the electors’ party votes.

Only parties that received five percent of the party vote, or that have won at least one electorate, are counted in the proportional representation part.

The reason why there are 120 “or more” seats is the same as the reason why electorate votes have not quite exactly zero impact on the overall makeup. If a party wins more electorates than they would be entitled to from their party vote, the difference is translated into “overhang seats”. After the 2014 election there was one such seat; the leader of the United Future party won the Ōhariu electorate, but the party received only 0.22% of the party vote (much less than 1/120th), so he holds the 121st seat in Parliament.

The method of allocating the list seats is known as the Sainte-Laguë or Webster/Sainte-Laguë method. It’s well explained on Wikipedia. It’s now available in nzelect via the allocate_seats() function.

allocate_seats defaults to New Zealand settings (5% threshold, 120 seats if no overhang seats) but these can be set to other values for use in other electoral systems or for conducting thought experiments about New Zealand. For example, the 5% threshold acts as a barrier to small parties getting representation in Parliament proportionate to their support. A 2012 review recommended, amongst other things, reducing this to 4%, although this wasn't adopted by Parliament. Other countries with similar systems have lower thresholds; for example, Israel has had no fewer than four different threshold figures in the past thirty years (1%, 1.5%, 2%, and the current value of 3.25%).
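To make the mechanics concrete, here is a minimal sketch of the Sainte-Laguë divisor method – illustrative only, not the nzelect implementation, and ignoring electorate seats and overhang. At each round the next seat goes to the party with the largest quotient votes / (2 × seats already won + 1), among parties over the threshold. The vote counts below are hypothetical.

```r
# Minimal sketch of the Sainte-Lague divisor method (illustrative only;
# not the nzelect implementation). Each round, the next seat goes to the
# party with the largest quotient votes / (2 * seats_won + 1).
sainte_lague <- function(votes, n_seats, threshold = 0.05) {
  eligible <- votes / sum(votes) >= threshold
  seats <- integer(length(votes))
  for (i in seq_len(n_seats)) {
    quotients <- ifelse(eligible, votes / (2 * seats + 1), 0)
    winner <- which.max(quotients)
    seats[winner] <- seats[winner] + 1L
  }
  seats
}

# Hypothetical party votes; the fourth party is under the 5% threshold:
sainte_lague(c(470000, 310000, 150000, 40000), n_seats = 10)
# 5 3 2 0 -- the smallest party is shut out entirely
```

In nzelect itself, allocate_seats() additionally handles the electorate-seat and overhang rules that this sketch omits.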

Here’s how the New Zealand Parliament would have looked with different values of the threshold for getting access to the proportional representation part of the system:

We can see that the Conservative party were the big losers from the 5% threshold rule: with 3.97% of the party vote they won no seats in Parliament under the current rules (or indeed under the 4% threshold proposed and rejected in 2012), but would have won five seats under the Israeli threshold.

Using electoral results data and functions from the nzelect and other packages, here’s how that analysis was done:

#================2014 threshold scenarios=========
# Party vote results from the 2014 election
results <- GE2014 %>%
  mutate(VotingType = paste0(VotingType, "Vote")) %>%
  group_by(Party, VotingType) %>%
  summarise(Votes = sum(Votes, na.rm = TRUE)) %>%
  spread(VotingType, Votes) %>%
  select(Party, PartyVote) %>%
  ungroup() %>%
  # drop rows with missing party vote (this filter call was truncated in the
  # source; the condition is reconstructed):
  filter(! %>%
  filter(Party != "Informal Party Votes") %>%
  arrange(desc(PartyVote)) %>%
  mutate(Party = gsub(" Party", "", Party),
         Party = gsub("New Zealand First", "NZ First", Party),
         Party = gsub("Internet MANA", "Mana", Party)) %>%
  left_join(parties_df[, c("Shortname", "Colour")], by = c("Party" = "Shortname"))

# Electorate seats from the 2014 election
electorate = c(41, 27, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0)

# Seats awarded under different scenarios:
res2014 <- data_frame(
  party = results$Party,
  `5% threshold\n(actual result)` =
    allocate_seats(results$PartyVote, electorate = electorate)$seats_v,
  `3.25% threshold like Israel\n(hypothetical result)` =
    allocate_seats(results$PartyVote, electorate = electorate, threshold = 0.0325)$seats_v,
  `No threshold\n(hypothetical result)` =
    allocate_seats(results$PartyVote, electorate = electorate, threshold = 0)$seats_v
) %>%
  gather(scenario, seats, -party) %>%
  mutate(party = str_wrap(party, 15),
         party = fct_reorder(party, -seats, fun = mean),
         scenario = fct_relevel(scenario, "5% threshold\n(actual result)"),
         party = fct_other(party, keep = levels(party)[1:10])) %>%
  group_by(party, scenario) %>%
  summarise(seats = sum(seats))

res2014 %>%
  ggplot(aes(x = seats, y = scenario, label = seats)) +
  facet_wrap(~party, ncol = 4) +
  geom_segment(xend = 0, aes(yend = scenario, colour = party), size = 3, alpha = 0.3) +
  scale_colour_manual(values = parties_v) +
  scale_y_discrete("", limits = levels(res2014$scenario)[3:1]) +
  geom_text(size = 3) +
  theme(legend.position = "none") +
  ggtitle("Impact of changing minimum threshold for seats in Parliament",
          "New Zealand 2014 election seat allocation") +
  labs(x = "Seats",
       caption = "Analysis using the `nzelect` R package")

For comparison, here are the same scenarios with the 2011 election results:

Because the full 2011 election results aren’t yet available in nzelect, the code below needs to scrape them from a HTML table on the Electoral Commission’s site, using the very user-friendly rvest R package:

#================2011 threshold scenarios=========
res2011 <- "" %>%
  read_html() %>%
  # extract all the tables:
  html_nodes(css = "table") %>%
  # we only want the fourth table:
  extract2(4) %>%
  # turn it into a data frame:
  html_table() %>%
  # drop the first four rows:
  slice(-(1:4))

# give it some real names:
names(res2011) <- c("Party", "PartyVote", "PropVote",
                    "ElectorateSeats", "ListSeats", "TotalSeats")

res2011 <- res2011 %>%
  # turn into numbers:
  mutate(PartyVote = as.numeric(gsub(",", "", PartyVote)),
         ElectorateSeats = as.numeric(ElectorateSeats)) %>%
  # remove the "total seats" row (this filter call was truncated in the
  # source; the condition is reconstructed):
  filter(! %>%
  # estimate seats
  mutate(
    `5% threshold\n(actual result)` =
      allocate_seats(PartyVote, electorate = ElectorateSeats)$seats_v,
    `3.25% threshold like Israel\n(hypothetical result)` =
      allocate_seats(PartyVote, electorate = ElectorateSeats, threshold = 0.0325)$seats_v,
    `No threshold\n(hypothetical result)` =
      allocate_seats(PartyVote, electorate = ElectorateSeats, threshold = 0)$seats_v
  )

res2011 %>%
  select(-(PartyVote:TotalSeats)) %>%
  gather(scenario, seats, -Party) %>%
  mutate(Party = gsub(" Party", "", Party),
         Party = gsub("New Zealand First", "NZ First", Party)) %>%
  left_join(parties_df[, c("Shortname", "Colour")], by = c("Party" = "Shortname")) %>%
  mutate(Party = fct_reorder(Party, -seats),
         Party = fct_other(Party, keep = levels(Party)[1:11])) %>%
  ggplot(aes(x = seats, y = scenario, label = seats)) +
  facet_wrap(~Party, ncol = 4) +
  geom_segment(xend = 0, aes(yend = scenario, colour = Party), size = 3, alpha = 0.3) +
  scale_colour_manual(values = parties_v) +
  scale_y_discrete("", limits = levels(res2014$scenario)[3:1]) +
  geom_text(size = 3) +
  theme(legend.position = "none") +
  ggtitle("Impact of changing minimum threshold for seats in Parliament",
          "New Zealand 2011 election seat allocation") +
  labs(x = "Seats", y = "", caption = "Analysis using the `nzelect` R package")

New Zealand has a Westminster-like rather than USA-like relation of parliament to the executive, in that the government needs to command a majority in Parliament (indicated in budget or confidence votes) or resign. In practice, this generally leads to government by a coalition of parties, although this is not inevitable; in the 2014 election the National Party came within a whisker of being able to govern by itself (although political logic suggests they would have kept at least some junior coalition parties anyway).

nzelect is available for installation from CRAN the usual way (install.packages("nzelect")). Please report any bugs, enhancement requests or issues on GitHub. I already know about the wrong vignette title on CRAN :(… will fix it next release.


Hierarchical Clustering Nearest Neighbors Algorithm in R

Thu, 03/09/2017 - 21:00

Hierarchical clustering is a widely used and popular tool in statistics and data mining for grouping data into 'clusters' that expose similarities or dissimilarities in the data. There are many approaches to hierarchical clustering, as it is not feasible to investigate all possible clusterings. One set of approaches is known as agglomerative, whereby at each step of the clustering process an observation or cluster is merged into another cluster. The first approach we will explore is the single linkage method, also known as nearest neighbors.

The data we will cluster is seven different measures of air pollution of 41 cities throughout the United States (qtd. Rencher 502). The data were obtained from the companion FTP site of the book Methods of Multivariate Analysis and contains the variables as follows:

  • SO2 content of air in mg/cm
  • Average annual temperature in F
  • Number of manufacturing plants with 20 or more workers
  • Population size as recorded by 1970 census in thousands
  • Average annual wind speed in mph
  • Average annual precipitation in inches
  • Average number of days with precipitation per year
poll <- read.table('T15_13_POLLUTION.dat',
                   col.names = c('city', 'so2', 'tempf', 'manufacturing',
                                 'pop', 'wind', 'precip', 'days.w.precip'))

Variable Scaling Consideration

Although recommended by many authors, the question of scaling in the context of hierarchical clustering (particularly using the Euclidean distance measure) is not so black and white (Rencher 2002, pp. 454). The variables that best separate clusters might not do so after scaling. We will scale the pollution data because the variables’ scale of measurement is quite different from each other, but it should be noted standardization should be thoughtfully applied (or not applied) rather than proceeding to scale automatically.

poll.scale <- data.frame(scale(poll[,2:8]))

Measures of Similarity (or Dissimilarity) for Cluster Analysis

Before performing hierarchical clustering, we must find the similarity between each pair of observations, which is referred to as the distance. The distance is really a measure of dissimilarity, as it increases as the observations move farther apart. As with agglomerative hierarchical clustering itself, there are many approaches to measuring the distance, the most common of which is the Euclidean distance. There is no 'best' distance measure, and one must consider the data at hand and the assumptions of each measure to select an appropriate method. Euclidean distance is the straight-line distance between two observations and is defined as:

d(x, y) = \sqrt{(x - y)^\prime (x - y)} = \sqrt{\sum^p_{j=1} (x_j - y_j)^2}

The following function implements the Euclidean distance calculations for each pair of observations in the dataset.

euclidean.distance <- function(x) {
  n <- nrow(x)
  dist.mat <- matrix(0, n, n)
  xj <- x[1, ]
  for (i in 1:n) {
    for (j in 1:n) {
      yj <- x[j, ]
      d <- sqrt(as.matrix(xj - yj) %*% as.matrix(t(xj - yj)))
      dist.mat[j, i] <- d
    }
    xj <- x[1 + i, ]
  }
  return(dist.mat)
}

euclid.dist <- euclidean.distance(poll.scale)

The dist() function in R also calculates the distance.

poll.dist <- dist(poll.scale, method = 'euclidean')

The output of the dist() function is an atomic vector. We convert it to a matrix here to compare our results with the function.

poll.dist.mat <- as.matrix(poll.dist)

As the resulting matrices are 41 \times 41, we check the first and last six rows of the first three columns of each to verify they match.

cbind(head(poll.dist.mat[,1:3]), tail(poll.dist.mat[,1:3]))

##          1        2        3        1        2        3
## 1 0.000000 4.789018 3.171606 4.034076 3.264390 1.782405
## 2 4.789018 0.000000 3.009865 5.751432 2.007170 3.321471
## 3 3.171606 3.009865 0.000000 4.790368 1.171199 2.906303
## 4 3.871066 3.491389 1.262450 6.637764 3.208636 4.153093
## 5 6.230609 2.817075 3.800426 5.675892 2.526646 4.136598
## 6 5.305038 1.729939 2.964289 6.546541 4.008765 3.474188

cbind(head(euclid.dist[,1:3]), tail(euclid.dist[,1:3]))

##           [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
## [36,] 0.000000 4.789018 3.171606 4.034076 3.264390 1.782405
## [37,] 4.789018 0.000000 3.009865 5.751432 2.007170 3.321471
## [38,] 3.171606 3.009865 0.000000 4.790368 1.171199 2.906303
## [39,] 3.871066 3.491389 1.262450 6.637764 3.208636 4.153093
## [40,] 6.230609 2.817075 3.800426 5.675892 2.526646 4.136598
## [41,] 5.305038 1.729939 2.964289 6.546541 4.008765 3.474188

Hierarchical Clustering with Single Linkage

Johnson’s algorithm describes the general process of hierarchical clustering given N observations to be clustered and an N \times N distance matrix. The steps of Johnson’s algorithm as applied to hierarchical clustering are as follows:

  1. Begin with the disjoint clustering with level L(0) = 0 and sequence number m = 0.
  2. Find the pair of clusters with the minimum distance, with the pair denoted r and s; in the case of single linkage: d[(r), (s)] = min (d[(i),(j)])
  3. Add one to m, m = m + 1. Merge clusters r and s into one cluster to form the next clustering at m. L(m) then becomes: L(m) = d[(r), (s)]
  4. Update the distance matrix by removing the rows and columns corresponding to clusters r and s and inserting a row and column for the newly formed cluster. The distance between the newly formed cluster (r,s) and an old cluster k is again calculated with the minimum distance (in the case of single linkage): d[(k), (r,s)] = min (d[(k),(r)], d[(k),(s)])
  5. Stop if all N observations are in one cluster; otherwise repeat starting at Step 2.
Implementing the Single Linkage Hierarchical Clustering Technique

Although hierarchical clustering with a variety of different methods can be performed in R with the hclust() function, we can also replicate the routine to an extent to better understand how Johnson’s algorithm is applied to hierarchical clustering and how hclust() works. To plot our clustering results, we will rely on one output of the hclust() function, the order value, which is purely used for plotting. To plot and compare our clustering results, the best method (that I know of) is to create an hclust object, which according to ?hclust, requires merge, height and the aforementioned order components.

Quoting from the hclust documentation of the required components:

merge is an n-1 by 2 matrix. Row i of merge describes the merging of clusters at step i of the clustering. If an element j in the row is negative, then observation -j was merged at this stage. If j is positive then the merge was with the cluster formed at the (earlier) stage j of the algorithm. Thus negative entries in merge indicate agglomerations of singletons, and positive entries indicate agglomerations of non-singletons.

height is a set of n-1 real values (non-decreasing for ultrametric trees). The clustering height: that is, the value of the criterion associated with the clustering method for the particular agglomeration.

order is a vector giving the permutation of the original observations suitable for plotting, in the sense that a cluster plot using this ordering and matrix merge will not have crossings of the branches.

Therefore we need to build the merge and height objects. We will only use the order output of hclust() to plot.

To start, initialize the merge and height objects and set the diagonal of the distance matrix to Inf to avoid including 0s in our calculations.

Lm <- data.frame(euclid.dist)
N <- nrow(Lm)
merge <- matrix(0, N - 1, 2)
height <- vector(length = N - 1)
diag(Lm) <- Inf

The row and column names of the distance matrix are set to -1 through -N, which allows us to note whether the cluster merged at a particular step was an individual observation.

colnames(Lm) <- -(1:N)
rownames(Lm) <- -(1:N)

The following loop builds the merge and height objects needed for plotting the hierarchical clustering results. A visual example of how clustering is performed shows how the rows and columns are merged with a simple dataset.

for (m in 1:(N - 1)) {
  cols <- colnames(Lm)

  # Johnson's algorithm Step 2: find the pair with the minimum distance.
  # The which() function returns the row and column position of the pair
  d <- which(Lm == min(Lm), arr.ind = TRUE)[1, , drop = FALSE]

  # The height is the value of the pair with the minimum distance
  height[m] <- min(Lm)

  # The row and column position of the minimum pair is stored as sequence m
  # in the merge object
  merge[m, ] <- as.numeric(cols[d])

  # Johnson's algorithm Step 3: the pair with the minimum distance is merged.
  # The cluster object is used to find previous clusters that the pair belong
  # to (if they exist). Does this by finding any columns above 0 (since all
  # column names are negative, a positive column value implies it has been
  # clustered)
  cluster <- c(d, which(cols %in% cols[d[1, cols[d] > 0]]))

  # Rename the columns indicated by cluster to the sequence number, m
  colnames(Lm)[cluster] <- m

  # Merge the pairs according to Johnson's algorithm and the single linkage method
  sl <- apply(Lm[d, ], 2, min)

  # Johnson's algorithm Step 4: remove the columns and rows corresponding to
  # the old clusters and insert a new column and row for the newly formed
  # cluster. The insertion is done by setting the first sequential row and
  # column of the minimum pair in the distance matrix (top to bottom, left to
  # right) as the cluster resulting from the single linkage step
  Lm[min(d), ] <- sl
  Lm[, min(d)] <- sl

  # Make sure the minimum distance pair is not used again by setting it to Inf
  Lm[min(d), min(d)] <- Inf

  # The removal step is done by setting the second sequential row and column
  # of the minimum pair (farthest right, farthest down) to Inf
  Lm[max(d), ] <- Inf
  Lm[, max(d)] <- Inf
}

Plotting the Hierarchical Clustering as a Dendrogram

We now have the merge and height components of an hclust object. The order component comes from the hclust() function. As noted, the order component is used just by hclust to plot and has no bearing on our understanding of cluster analysis.

poll.clust <- hclust(poll.dist, method = 'single')

According to the documentation in ?hclust, an hclust object is a list with class hclust of the above components. We construct the hclust class with the following:

hclust.obj <- list()                  # Initialize an empty list
hclust.obj$merge <- merge             # Add the merge component obtained earlier
hclust.obj$order <- poll.clust$order  # The order component from hclust is added to the list
hclust.obj$height <- height           # The height component determines the lengths of the dendrogram nodes
hclust.obj$labels <- poll$city        # Add the city names to the labels component
class(hclust.obj) <- 'hclust'         # The list is set to class hclust

plot(hclust.obj)

Plot the hierarchical clustering as given by the hclust() function to verify our results.

plot(poll.clust, labels = poll$city)

Interpretation of the Cluster Plot

Moving from left to right along the x-axis, we see that Chicago, Phoenix, and Philadelphia are separated from the other cities to a decreasing degree. Otherwise, there isn’t much distinction between clusters. At a height of slightly more than two we have a three-cluster solution (counting the number of lines crossed going left to right along the x-axis), but the third cluster would contain the bulk of the cities and thus isn’t particularly useful. Using a lower height increases the number of clusters significantly; for example, a height of about 1.9 gives a six-cluster solution. However, the final cluster still contains a large share of the cities and therefore isn’t very descriptive of how the cities group themselves. This lack of separation might indicate there isn’t much variation in the cities’ pollution levels to begin with, that the method of clustering is a poor fit, or a combination of the two. The former point is further evidenced by the small distances on the y-axis between many of the clusters.

We will see in later posts whether using a different distance measure or clustering method improves the clustering of the cities. The hierarchical clustering could be ‘cut’ to the number of groups we are interested in, but since the clustering isn’t that great, we will save this step for when a better clustering and distance method are used.
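For reference, the ‘cutting’ mentioned above is done with the base-R cutree() function on an hclust object. A minimal sketch on small synthetic data (two well-separated Gaussian blobs, not the pollution dataset):

```r
# Sketch: cutting a dendrogram into k groups with stats::cutree(),
# using synthetic two-blob data rather than the pollution dataset.
set.seed(42)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),   # blob around (0, 0)
           matrix(rnorm(20, mean = 5), ncol = 2))   # blob around (5, 5)
hc <- hclust(dist(x), method = "single")
groups <- cutree(hc, k = 2)  # one cluster label per observation
table(groups)
```

cutree(hc, h = ...) can alternatively cut at a given height rather than at a given number of groups.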

Disadvantages of the Single Linkage Approach

The single linkage (nearest neighbors) approach has several disadvantages compared to other clustering methods and is therefore not recommended by many authors (Rencher, 2002, pp. 475). Single linkage is rather prone to chaining (also known as space-contracting), which is the tendency for newly formed clusters to move closer to individual observations, so observations end up joining other clusters rather than another individual observation. We can see this markedly in the hierarchical cluster plot above. The single linkage method is also sensitive to errors in distances between observations.

It should be noted there is no clear ‘best’ clustering method and often a good approach is to try several different methods. If the resulting clusters are somewhat similar, that is evidence there may be natural clusters in the data. Many studies, however, conclude Ward’s method and the average linkage method are the overall best performers (Rencher, 2002, pp. 479).


Gan, Guojun, Chaoqun Ma, and Jianhong Wu, Data Clustering: Theory, Algorithms, and Applications, ASA-SIAM Series on Statistics and Applied Probability, SIAM, Philadelphia, ASA, Alexandria, VA, 2007.

Murtagh, F. (n.d.). Multivariate Data Analysis with Fortran C and Java. Queen’s University Belfast.

Rencher, A. (2002). Methods of Multivariate Analysis (2nd ed.). Brigham Young University: John Wiley & Sons, Inc.

The post Hierarchical Clustering Nearest Neighbors Algorithm in R appeared first on Aaron Schlegel.

Online data science (with R) courses at Udemy for only $10 (from March 9th until the 14th)

Thu, 03/09/2017 - 17:28

In order to get the discount, simply choose a link below and use the promo code CHANGEIT when paying.

Udemy is offering readers of R-bloggers access to its global online learning marketplace for only $10 per course! This deal (a discount of 50%–90%) applies to hundreds of their courses – including many on R programming, data science, machine learning, and more.

Click here to browse ALL (R and non-R) courses

Advanced R courses: 

Introductory R courses: 

Non R courses for data science: 


The Rise of Civilization, Visualized with R

Thu, 03/09/2017 - 17:09

This animation by geographer James Cheshire shows something at once simple and profound: the founding and growth of the cities of the world since the dawn of civilization.

Dr Cheshire created the animation using R and the rworldmap package, using data from this Nature dataset. The complete R code is provided in the blog post linked below, and you can easily tweak the code to change the projection of the map. Check out the complete post below to see two other versions of the animation: one focusing on the birth of civilization in the Old World, and another depicting the pre-colonial civilizations and the subsequent colonization of the Americas.

Mapping 5,000 Years of City Growth

Some Win-Vector R packages

Thu, 03/09/2017 - 16:35

This post concludes our mini-series of Win-Vector open source R packages. We end with WVPlots, a collection of ready-made ggplot2 plots we find handy.

Please read on for list of some of the Win-Vector LLC open-source R packages that we are pleased to share.

For each package we have prepared a short introduction, so you can quickly check if a package suits your needs using the links below.

Forecasting Stock Returns using ARIMA model

Thu, 03/09/2017 - 16:30

By Milind Paradkar

“Prediction is very difficult, especially about the future”. Many of you must have come across this famous quote by Niels Bohr, a Danish physicist. Prediction is the theme of this blog post. In this post, we will cover the popular ARIMA forecasting model to predict returns on a stock and demonstrate a step-by-step process of ARIMA modeling using R programming.

What is a forecasting model in Time Series?

Forecasting involves predicting values for a variable using its historical data points, or predicting the change in one variable given a change in the value of another variable. Forecasting approaches are primarily categorized into qualitative forecasting and quantitative forecasting. Time series forecasting falls under quantitative forecasting, wherein statistical principles and concepts are applied to the historical data of a variable to forecast its future values. Some time series forecasting techniques include:

  • Autoregressive Models (AR)
  • Moving Average Models (MA)
  • Seasonal Regression Models
  • Distributed Lags Models
What is Autoregressive Integrated Moving Average (ARIMA)?

ARIMA stands for Autoregressive Integrated Moving Average. ARIMA is also known as the Box-Jenkins approach. Box and Jenkins claimed that non-stationary data can be made stationary by differencing the series, Y_t. The general model for Y_t is written as,

Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}

where Y_t is the differenced time series value, \phi and \theta are unknown parameters, and the \epsilon are independent identically distributed error terms with zero mean. Here, Y_t is expressed in terms of its past values and the current and past values of the error terms.
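As a quick illustration of this class of model, a stationary series of the form above can be simulated with stats::arima.sim(); the coefficients below are arbitrary values chosen to satisfy stationarity and invertibility, not estimates from any data:

```r
# Sketch: simulating an ARMA(2,1) series of the form above.
# The phi (ar) and theta (ma) values are illustrative choices only.
set.seed(123)
y <- arima.sim(model = list(ar = c(0.5, -0.3), ma = 0.4), n = 200)
length(y)  # 200 simulated observations
```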

The ARIMA model combines three basic methods:

  • AutoRegression (AR) – In auto-regression the values of a given time series data are regressed on their own lagged values, which is indicated by the “p” value in the model.
  • Differencing (I-for Integrated) – This involves differencing the time series data to remove the trend and convert a non-stationary time series to a stationary one. This is indicated by the “d” value in the model. If d = 1, it looks at the difference between two time series entries, if d = 2 it looks at the differences of the differences obtained at d =1, and so forth.
  • Moving Average (MA) – The moving average nature of the model is represented by the “q” value which is the number of lagged values of the error term.
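The differencing step described above is just base R’s diff(); a toy example on a small made-up series with a trend:

```r
# Sketch of the differencing ("I") step on a toy series:
x <- c(2, 5, 9, 14, 20)
diff(x)                   # first differences: 3 4 5 6
diff(x, differences = 2)  # second differences: 1 1 1
```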

This model is called Autoregressive Integrated Moving Average or ARIMA(p,d,q) of Yt.  We will follow the steps enumerated below to build our model.

Step 1: Testing and Ensuring Stationarity

To model a time series with the Box-Jenkins approach, the series has to be stationary. A stationary time series means a time series without trend, one having a constant mean and variance over time, which makes it easy for predicting values.

Testing for stationarity – We test for stationarity using the Augmented Dickey-Fuller unit root test. The p-value resulting from the ADF test has to be less than 0.05 or 5% for a time series to be stationary. If the p-value is greater than 0.05 or 5%, you conclude that the time series has a unit root which means that it is a non-stationary process.

Differencing – To convert a non-stationary process to a stationary process, we apply the differencing method. Differencing a time series means finding the differences between consecutive values of a time series data. The differenced values form a new time series dataset which can be tested to uncover new correlations or other interesting statistical properties.

We can apply the differencing method consecutively more than once, giving rise to the “first differences”, “second order differences”, etc.

We apply the appropriate differencing order (d) to make a time series stationary before we can proceed to the next step.

Step 2: Identification of p and q

In this step, we identify the appropriate order of Autoregressive (AR) and Moving average (MA) processes by using the Autocorrelation function (ACF) and Partial Autocorrelation function (PACF).  Please refer to our blog, “Starting out with Time Series” for an explanation of ACF and PACF functions.

Identifying the p order of AR model

For AR models, the ACF will dampen exponentially and the PACF will be used to identify the order (p) of the AR model. If we have one significant spike at lag 1 on the PACF, then we have an AR model of the order 1, i.e. AR(1). If we have significant spikes at lag 1, 2, and 3 on the PACF, then we have an AR model of the order 3, i.e. AR(3).

Identifying the q order of MA model

For MA models, the PACF will dampen exponentially and the ACF plot will be used to identify the order of the MA process. If we have one significant spike at lag 1 on the ACF, then we have an MA model of the order 1, i.e. MA(1). If we have significant spikes at lag 1, 2, and 3 on the ACF, then we have an MA model of the order 3, i.e. MA(3).

Step 3: Estimation and Forecasting

Once we have determined the parameters (p,d,q) we estimate the accuracy of the ARIMA model on a training data set and then use the fitted model to forecast the values of the test data set using a forecasting function. In the end, we cross check whether our forecasted values are in line with the actual values.

Building ARIMA model using R programming

Now, let us follow the steps explained to build an ARIMA model in R. There are a number of packages available for time series analysis and forecasting. We load the relevant R package for time series analysis and pull the stock data from yahoo finance.

library(quantmod)
library(tseries)
library(timeSeries)
library(forecast)
library(xts)

# Pull data from Yahoo finance
getSymbols('TECHM.NS', from = '2012-01-01', to = '2015-01-01')

# Select the relevant close price series
stock_prices = TECHM.NS[, 4]

In the next step, we compute the logarithmic returns of the stock as we want the ARIMA model to forecast the log returns and not the stock price. We also plot the log return series using the plot function.

# Compute the log returns for the stock
stock = diff(log(stock_prices), lag = 1)
# Drop the leading NA created by differencing (this call was truncated in
# the source; the condition is reconstructed):
stock = stock[!]

# Plot log returns
plot(stock, type = 'l', main = 'log returns plot')

Next, we call the ADF test on the returns series data to check for stationarity. The p-value of 0.01 from the ADF test tells us that the series is stationary. If the series were to be non-stationary, we would have first differenced the returns series to make it stationary.

# Conduct ADF test on log returns series
print(adf.test(stock))

In the next step, we fix a breakpoint, which will be used further down in the code to split the returns dataset into two parts.

# Split the dataset in two parts - training and testing
breakpoint = floor(nrow(stock) * (2.9/3))

We truncate the original returns series till the breakpoint, and call the ACF and PACF functions on this truncated series.

# Apply the ACF and PACF functions
par(mfrow = c(1, 1))
acf.stock = acf(stock[c(1:breakpoint), ], main = 'ACF Plot', lag.max = 100)
pacf.stock = pacf(stock[c(1:breakpoint), ], main = 'PACF Plot', lag.max = 100)

We can observe these plots and arrive at the Autoregressive (AR) order and Moving Average (MA) order.

We know that for AR models, the ACF will dampen exponentially and the PACF plot will be used to identify the order (p) of the AR model. For MA models, the PACF will dampen exponentially and the ACF plot will be used to identify the order (q) of the MA model. From these plots let us select AR order = 2 and MA order = 2. Thus, our ARIMA parameters will be (2,0,2).

Our objective is to forecast the entire returns series from breakpoint onwards. We will make use of the For Loop statement in R and within this loop we will forecast returns for each data point from the test dataset.

In the code given below, we first initialize a series which will store the actual returns and another series to store the forecasted returns.  In the For Loop, we first form the training dataset and the test dataset based on the dynamic breakpoint.

We call the arima function on the training dataset with the order specified as (2, 0, 2). We use this fitted model to forecast the next data point using the forecast.Arima function, with the confidence level set at 99%; the level argument can be adjusted as needed. We will be using the forecasted point estimate from the model. The “h” argument in the forecast function indicates the number of values that we want to forecast; in this case, the next day’s returns.

We can use the summary function to confirm the results of the ARIMA model are within acceptable limits. In the last part, we append every forecasted return and the actual return to the forecasted returns series and the actual returns series respectively.

# Initializing an xts object for Actual log returns
Actual_series = xts(0, as.Date("2014-11-25", "%Y-%m-%d"))

# Initializing a dataframe for the forecasted return series
forecasted_series = data.frame(Forecasted = numeric())

for (b in breakpoint:(nrow(stock) - 1)) {
  stock_train = stock[1:b, ]
  stock_test = stock[(b + 1):nrow(stock), ]

  # Summary of the ARIMA model using the determined (p,d,q) parameters
  fit = arima(stock_train, order = c(2, 0, 2), include.mean = FALSE)
  summary(fit)

  # Plotting an ACF plot of the residuals
  acf(fit$residuals, main = "Residuals plot")

  # Forecasting the log returns
  arima.forecast = forecast.Arima(fit, h = 1, level = 99)
  summary(arima.forecast)

  # Plotting the forecast
  par(mfrow = c(1, 1))
  plot(arima.forecast, main = "ARIMA Forecast")

  # Creating a series of forecasted returns for the forecasted period
  forecasted_series = rbind(forecasted_series, arima.forecast$mean[1])
  colnames(forecasted_series) = c("Forecasted")

  # Creating a series of actual returns for the forecasted period
  Actual_return = stock[(b + 1), ]
  Actual_series = c(Actual_series, xts(Actual_return))
  rm(Actual_return)

  print(stock_prices[(b + 1), ])
  print(stock_prices[(b + 2), ])
}

Before we move to the last part of the code, let us check the results of the ARIMA model for a sample data point from the test dataset.

From the coefficients obtained, the return equation can be written as:

Yt = 0.6072*Y(t-1) - 0.8818*Y(t-2) - 0.5447*ε(t-1) + 0.8972*ε(t-2) + ε(t)

The standard errors are given for the coefficients, and these need to be within acceptable limits. The Akaike information criterion (AIC) score is a good indicator of ARIMA model fit: the lower the AIC score, the better the model. We can also view the ACF plot of the residuals; a good ARIMA model will have its autocorrelations below the threshold limit. The forecasted point return is -0.001326978, which is given in the last row of the output.
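As a side illustration (not part of the original stock analysis), the idea of ranking candidate orders by AIC can be seen on a simulated ARMA(2,2) series; the coefficient values below are arbitrary simulation settings:

```r
# Illustrative only: compare candidate ARIMA orders by AIC on a simulated series
set.seed(42)
sim <- arima.sim(model = list(ar = c(0.6, -0.3), ma = c(-0.5, 0.4)), n = 500)

orders <- list(c(1, 0, 1), c(2, 0, 2), c(3, 0, 3))
aics <- sapply(orders, function(o) arima(sim, order = o)$aic)
names(aics) <- sapply(orders, paste, collapse = ",")
print(aics)  # the order with the lowest AIC balances fit and complexity best
```

The same comparison is what auto.arima automates internally when searching over (p, d, q).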

Let us check the accuracy of the ARIMA model by comparing the forecasted returns versus the actual returns. The last part of the code computes this accuracy information.

# Adjust the length of the Actual return series
Actual_series = Actual_series[-1]

# Create a time series object of the forecasted series
forecasted_series = xts(forecasted_series, index(Actual_series))

# Create a plot of the two return series - Actual versus Forecasted
plot(Actual_series, type = 'l', main = 'Actual Returns Vs Forecasted Returns')
lines(forecasted_series, lwd = 1.5, col = 'red')
legend('bottomright', c("Actual", "Forecasted"), lty = c(1, 1), lwd = c(1.5, 1.5), col = c('black', 'red'))

# Create a table for the accuracy of the forecast
comparison = merge(Actual_series, forecasted_series)
comparison$Accuracy = sign(comparison$Actual_series) == sign(comparison$Forecasted)
print(comparison)

# Compute the accuracy percentage metric
Accuracy_percentage = sum(comparison$Accuracy == 1) * 100 / length(comparison$Accuracy)
print(Accuracy_percentage)

If the sign of the forecasted return matches the sign of the actual return, we assign it a positive accuracy score. The accuracy percentage of the model comes to around 55%, which is a decent number. One can try running the model for other combinations of (p, d, q), or instead use the auto.arima function, which selects the optimal parameters automatically.
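The sign-comparison logic can be checked on its own with a small toy example (the return values below are made up, not taken from the stock data):

```r
# Toy example: directional accuracy = share of days where forecast sign matches actual sign
actual   <- c( 0.011, -0.024,  0.007, -0.015,  0.002, -0.009)
forecast <- c( 0.004, -0.010, -0.003, -0.012,  0.001,  0.008)

hits <- sign(actual) == sign(forecast)
accuracy_percentage <- sum(hits) * 100 / length(hits)
accuracy_percentage  # 4 of 6 signs agree, i.e. about 66.7%
```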


To conclude, in this post we covered the ARIMA model and applied it to forecasting stock price returns using the R programming language. We also cross-checked our forecasted results against the actual returns. In our upcoming posts, we will cover other time series forecasting techniques and try them in Python/R.

Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to build a promising career in algorithmic trading. Enroll now!

The post Forecasting Stock Returns using ARIMA model appeared first on .

[R] Kenntnis-Tage 2017: Save the date, save your ticket, save your talk

Thu, 03/09/2017 - 14:01

(This article was first published on eoda english R news, and kindly contributed to R-bloggers)

When data science success stories meet practical R tutorials, the result is a one-of-a-kind event for data analysts with a business background. On November 8 and 9, the [R] Kenntnis-Tage 2017 will be the platform for exchanging everyday challenges and solutions when working with the programming language R.

[R] Kenntnis-Tage-2017 in Kassel – Germany

Diverse program and inspiring networking

A rich program around data science and R awaits the participants on those two days:

  • Practical and inspiring talks, case studies and keynote speeches which offer insights into the various fields of application
  • Tutorials for deepening the participants’ knowledge of R
  • Networking and exchange with other data scientists, R users, developers and IT decision makers

Participants can gather valuable inspiration for writing their own success story with data science and R in times of data-driven business processes.

Call for papers: Submit your topic and be part of the [R] Kenntnis-Tage 2017

It is especially the exchange across industries and topics that provides participants with valuable best practices for their own issues. Data science specialist eoda therefore invites users, developers, admins and IT decision makers to submit their topics for the call for papers by April 30.

Last year’s event a total success

With the [R] Kenntnis-Tage 2017, eoda continues its own success story. At last year's event, around 50 participants, some of them international, enhanced their knowledge of R in workshops and tutorials and gained new inspiration for use cases from guest lectures by companies such as Lufthansa Industry Solutions and TRUMPF Laser.

Participants from different industries praised the event for giving innovative impulses for applying data science and R in the future.

Registration now open: book early and benefit from early bird discount

For more information about the [R] Kenntnis-Tage 2017 and the registration with special conditions for early bookers visit

To leave a comment for the author, please follow the link and comment on their blog: eoda english R news. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Negative Binomial Regression for Complex Samples (Surveys) #rstats

Thu, 03/09/2017 - 09:57

The survey-package from Thomas Lumley is a great toolkit when analyzing complex samples. It provides svyglm(), to fit generalised linear models to data from a complex survey design. svyglm() covers all families that are also provided by R’s glm() – however, the survey-package has no function to fit negative binomial models, which might be useful for overdispersed count models. Yet, the package provides a generic svymle() to fit user-specified likelihood estimations. In his book, Appendix E, Thomas Lumley describes how to write your own likelihood-function, passed to svymle(), to fit negative binomial models for complex samples. So I wrote a small „wrapper“ and implemented a function svyglm.nb() in my sjstats-package.

# ------------------------------------------
# This example reproduces the results from
# Lumley 2010, figure E.7 (Appendix E, p256)
# ------------------------------------------
library(sjstats)
library(survey)
data(nhanes_sample)

# create survey design
des <- svydesign(
  id = ~ SDMVPSU,
  strata = ~ SDMVSTRA,
  weights = ~ WTINT2YR,
  nest = TRUE,
  data = nhanes_sample
)

# fit negative binomial regression
fit <- svyglm.nb(total ~ factor(RIAGENDR) * (log(age) + factor(RIDRETH1)), des)

# print coefficients and standard errors
round(cbind(coef(fit), survey::SE(fit)), 2)
#>                                          [,1] [,2]
#> theta.(Intercept)                        0.81 0.05
#> eta.(Intercept)                          2.29 0.16
#> eta.factor(RIAGENDR)2                   -0.80 0.18
#> eta.log(age)                             1.07 0.23
#> eta.factor(RIDRETH1)2                    0.08 0.15
#> eta.factor(RIDRETH1)3                    0.09 0.18
#> eta.factor(RIDRETH1)4                    0.82 0.30
#> eta.factor(RIDRETH1)5                    0.06 0.38
#> eta.factor(RIAGENDR)2:log(age)          -1.22 0.27
#> eta.factor(RIAGENDR)2:factor(RIDRETH1)2 -0.18 0.26
#> eta.factor(RIAGENDR)2:factor(RIDRETH1)3  0.60 0.19
#> eta.factor(RIAGENDR)2:factor(RIDRETH1)4  0.06 0.37
#> eta.factor(RIAGENDR)2:factor(RIDRETH1)5  0.38 0.44

The function returns an object of class svymle, so all methods provided by the survey-package for this class work – it’s just that there are only a few, and common methods like predict() are currently not implemented. Maybe, hopefully, future updates of the survey-package will include such features.

Tagged: R, rstats, Statistik

Euler Problem 15: Pathways Through a Lattice

Thu, 03/09/2017 - 08:52

Euler Problem 15 analyses taxicab geometry. This system replaces the usual distance function with the sum of the absolute differences of their Cartesian coordinates. In other words, the distance a taxi would travel in a grid plan. The fifteenth Euler problem asks to determine the number of possible routes a taxi can take in a city of a certain size.

Euler Problem 15 Definition

Starting in the top left corner of a 2×2 grid, and only being able to move to the right and down, there are exactly 6 routes to the bottom right corner. How many possible routes are there through a 20×20 grid?


The defined lattice is one larger than the number of squares in each direction. Along the bottom and right edges of the matrix, only one pathway is possible: straight to the right or straight down. For each remaining point, we can calculate the number of possible pathways by adding the number of pathways at the point to its right and at the point below it.

For the two by two lattice the solution space is:

6  3  1
3  2  1
1  1  0

The total number of pathways from the upper left corner to the lower right corner is thus 6. This logic can now be applied to a grid of any arbitrary size using the following code.

The code defines the lattice and initiates the boundary conditions. The bottom row and the right column are filled with 1 as there is only one path from these points. The code then calculates the pathways by working backwards through the matrix. The final solution is the number in the first cell.

# Define lattice
nLattice <- 20
lattice <- matrix(ncol = nLattice + 1, nrow = nLattice + 1)

# Boundary conditions
lattice[nLattice + 1, -(nLattice + 1)] <- 1
lattice[-(nLattice + 1), nLattice + 1] <- 1

# Calculate pathways
for (i in nLattice:1) {
    for (j in nLattice:1) {
        lattice[i, j] <- lattice[i + 1, j] + lattice[i, j + 1]
    }
}
answer <- lattice[1, 1]

[Figure: Taxicab Geometry]
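The dynamic-programming result can be cross-checked against the closed form: a path through an n×n grid is a sequence of 2n moves of which exactly n go right, so the count is the binomial coefficient C(2n, n):

```r
# Closed-form check: the number of lattice paths through an n x n grid is choose(2n, n)
choose(4, 2)    # 2 x 2 grid: 6 routes, matching the worked example above
choose(40, 20)  # 20 x 20 grid: 137846528820
```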

The post Euler Problem 15: Pathways Through a Lattice appeared first on The Devil is in the Data.

Changing names in the tidyverse: An example for many regressions

Thu, 03/09/2017 - 07:17

(This article was first published on R – Stat Bandit, and kindly contributed to R-bloggers)

A collaborator posed an interesting R question to me today. She wanted to do
several regressions using different outcomes, with models being computed on
different strata defined by a combination of experimental design variables. She then just wanted to extract the p-values for the slopes for each of the models, and then
filter the strata based on p-value levels.

This seems straightforward, right? Let’s set up a toy example:

library(tidyverse)
dat <- as_tibble(expand.grid(letters[1:4], 1:5))
d <- vector('list', nrow(dat))
set.seed(102)
for (i in 1:nrow(dat)) {
  x <- rnorm(100)
  d[[i]] <- tibble(x = x, y1 = 3 - 2 * x + rnorm(100), y2 = -4 + 5 * x + rnorm(100))
}
dat <- as_tibble(bind_cols(dat, tibble(dat = d))) %>% unnest()
knitr::kable(head(dat), format = 'html')

Var1  Var2           x          y1         y2
a        1   0.1805229   4.2598245  -3.004535
a        1   0.7847340   0.0023338  -2.104949
a        1  -1.3531646   3.1711898  -9.156758
a        1   1.9832982  -0.7140910   5.966377
a        1   1.2384717   0.3523034   2.131004
a        1   1.2006174   0.6267716   1.752106

Now we’re going to perform two regressions, one using y1 and one using y2 as the dependent variables, for each stratum defined by Var1 and Var2.

out <- dat %>%
  nest(-Var1, -Var2) %>%
  mutate(model1 = map(data, ~lm(y1 ~ x, data = .)),
         model2 = map(data, ~lm(y2 ~ x, data = .)))

Now conceptually, all we do is tidy up the output for the models using the broom package, filter on the rows containing the slope information, and extract the p-values, right? Not quite….

library(broom)
out_problem <- out %>%
  mutate(output1 = map(model1, ~tidy(.)),
         output2 = map(model2, ~tidy(.))) %>%
  select(-data, -model1, -model2) %>%
  unnest()
names(out_problem)

[1] "Var1" "Var2" "term" "estimate" "std.error"
[6] "statistic" "p.value" "term" "estimate" "std.error"
[11] "statistic" "p.value"

We've got two sets of output, but with the same column names!!! This is a problem! An easy solution would be to preface the column names with the name of the response variable. I struggled with this today until I discovered the setNames function.

out_nice <- out %>%
  mutate(output1 = map(model1, ~tidy(.)),
         output2 = map(model2, ~tidy(.)),
         output1 = map(output1, ~setNames(., paste('y1', names(.), sep = '_'))),
         output2 = map(output2, ~setNames(., paste('y2', names(.), sep = '_')))) %>%
  select(-data, -model1, -model2) %>%
  unnest()

This is a compact representation of the results of both regressions by strata, and we can extract the information we would like very easily. For example, to extract the stratum-specific slope estimates:

out_nice %>%
  filter(y1_term == 'x') %>%
  select(Var1, Var2, ends_with('estimate')) %>%
  knitr::kable(digits = 3, format = 'html')

Var1  Var2  y1_estimate  y2_estimate
a        1       -1.897        5.036
b        1       -2.000        5.022
c        1       -1.988        4.888
d        1       -2.089        5.089
a        2       -2.052        5.015
b        2       -1.922        5.004
c        2       -1.936        4.969
d        2       -1.961        4.959
a        3       -2.043        5.017
b        3       -2.045        4.860
c        3       -1.996        5.009
d        3       -1.922        4.894
a        4       -2.000        4.942
b        4       -2.000        4.932
c        4       -2.033        5.042
d        4       -2.165        5.049
a        5       -2.094        5.010
b        5       -1.961        5.122
c        5       -2.106        5.153
d        5       -1.974        5.009


To leave a comment for the author, please follow the link and comment on their blog: R – Stat Bandit.

Being successful on Kaggle using `mlr`

Thu, 03/09/2017 - 01:00

(This article was first published on mlr-org, and kindly contributed to R-bloggers)

Achieving a good score on a Kaggle competition is typically quite difficult.
This blog post outlines 7 tips for beginners to improve their ranking on the Kaggle leaderboards.
For this purpose, I also created a Kernel
for the Kaggle bike sharing competition
that shows how the R package mlr can be used to tune an xgboost model with random search in parallel (using 16 cores). The R script scores rank 90 (of 3251) on the Kaggle leaderboard.

7 Rules
  1. Use good software
  2. Understand the objective
  3. Create and select features
  4. Tune your model
  5. Validate your model
  6. Ensemble different models
  7. Track your progress
1. Use good software

Whether you choose R, Python or another language to work on Kaggle, you will
most likely need to leverage quite a few packages to follow best practices in
machine learning. To save time, you should use ‘software’
that offers a standardized and well-tested interface for the important steps
in your workflow:

  • Benchmarking different machine learning algorithms (learners)
  • Optimizing hyperparameters of learners
  • Feature selection, feature engineering and dealing with missing values
  • Resampling methods for validation of learner performance
  • Parallelizing the points above

Examples of ‘software’ that implement the steps above and more:

  • For python: scikit-learn (
  • For R: mlr ( or caret.
2. Understand the objective

To develop a good understanding of the Kaggle challenge, you should:

  • Understand the problem domain:
    • Read the description and try to understand the aim of the competition.
    • Keep reading the forum and looking into scripts/kernels of others, learn from them!
    • Domain knowledge might help you (i.e., read publications about the topic, wikipedia is also ok).
    • Use external data if allowed (e.g., google trends, historical weather data).
  • Explore the dataset:
    • Which features are numerical, categorical, ordinal or time dependent?
    • Decide how to handle missing values. Some options:
      • Impute missing values with the mean, median or with values that are out of range (for numerical features).
      • Interpolate missing values if the feature is time dependent.
      • Introduce a new category for the missing values or use the mode (for categorical features).
    • Do exploratory data analysis (for the lazy: wait until someone else uploads an EDA kernel).
    • Insights you learn here will inform the rest of your workflow (creating new features).
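The missing-value options listed above can be sketched in a few lines of base R (the data frame below is hypothetical toy data, only to show the mechanics):

```r
# Toy data with missing values in a numeric and a categorical column
df <- data.frame(
  age  = c(23, NA, 31, 47, NA, 52),
  city = factor(c("A", "B", NA, "A", "B", NA))
)

# Numeric: impute with the median (an out-of-range sentinel such as -999 also works)
df$age_imputed <- ifelse(is.na(df$age), median(df$age, na.rm = TRUE), df$age)

# Categorical: introduce an explicit "missing" level
df$city_imputed <- factor(df$city, levels = c(levels(df$city), "missing"))
df$city_imputed[is.na(df$city_imputed)] <- "missing"

# Time-dependent numeric: linear interpolation with approx()
y <- c(1.0, NA, 3.0, NA, 5.0)
ok <- !is.na(y)
y_interp <- approx(which(ok), y[ok], xout = seq_along(y))$y  # 1, 2, 3, 4, 5
```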

Make sure you choose an approach that directly optimizes the measure of interest!

  • The median minimizes the mean absolute error (MAE) and
    the mean minimizes the mean squared error (MSE).
  • By default, many regression algorithms predict the expected mean but there
    are counterparts that predict the expected median
    (e.g., linear regression vs. quantile regression).

  • For strange measures: Use algorithms where you can implement your own objective
    function, see e.g.
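The median/MAE and mean/MSE claim above is easy to verify on a small skewed sample (illustrative numbers):

```r
# Skewed toy sample: the median minimizes MAE, the mean minimizes MSE
x <- c(1, 2, 3, 4, 100)

mae <- function(c) mean(abs(x - c))
mse <- function(c) mean((x - c)^2)

mae(median(x))  # 20.2, smaller than mae(mean(x)) = 31.2
mae(mean(x))
mse(mean(x))    # 1522, smaller than mse(median(x)) = 1883
mse(median(x))
```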

3. Create and select features:

In many Kaggle competitions, finding a “magic feature” can dramatically increase your ranking.
Sometimes, better data beats better algorithms!
You should therefore try to introduce new features containing valuable information
(which can’t be found by the model) or remove noisy features (which can decrease model performance):

  • Concat several columns
  • Multiply/Add several numerical columns
  • Count NAs per row
  • Create dummy features from factor columns
  • For time series, you could try
    • to add the weekday as new feature
    • to use rolling mean or median of any other numerical feature
    • to add features with a lag…
  • Remove noisy features: Feature selection / filtering
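The time-series suggestions above (weekday, lags, rolling statistics) can be sketched in base R on hypothetical daily data:

```r
# Hypothetical daily series: add a weekday feature, a 1-day lag and a trailing 3-day mean
d <- data.frame(
  date = seq(as.Date("2017-03-01"), by = "day", length.out = 7),
  y    = c(5, 7, 6, 9, 8, 10, 12)
)

d$weekday <- weekdays(d$date)      # calendar feature
d$y_lag1  <- c(NA, head(d$y, -1))  # lag-1 feature

# Trailing 3-day rolling mean (zoo::rollmean is the usual shortcut)
d$y_roll3 <- c(NA, NA, sapply(3:nrow(d), function(i) mean(d$y[(i - 2):i])))
```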
4. Tune your model

Typically you can focus on a single model (e.g. xgboost) and tune its hyperparameters for optimal performance.

5. Validate your model

Good machine learning models not only work on the data they were trained on, but
also on unseen (test) data that was not used for training the model. When you use training data
to make any kind of decision (like feature or model selection, hyperparameter tuning, …),
the data becomes less valuable for generalization to unseen data. So if you just use the public
leaderboard for testing, you might overfit to the public leaderboard and lose many ranks once the private
leaderboard is revealed.
A better approach is to use validation to get an estimate of performance on unseen data:

  • First figure out how the Kaggle data was split into train and test data. Your resampling strategy should follow the same method if possible. So if Kaggle uses, e.g., a feature for splitting the data, you should not use random samples for creating cross-validation folds.
  • Set up a resampling procedure, e.g., cross-validation (CV) to measure your model performance
  • Improvements on your local CV score should also lead to improvements on the leaderboard.
  • If this is not the case, you can try
    • several CV folds (e.g., 3-fold, 5-fold, 8-fold)
    • repeated CV (e.g., 3 times 3-fold, 3 times 5-fold)
    • stratified CV
  • mlr offers nice visualizations to benchmark different algorithms.
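The fold mechanics behind CV can be sketched by hand in base R (mlr and caret provide resampling out of the box; this is only to show what a fold assignment looks like):

```r
# Minimal 5-fold CV assignment: every observation gets a shuffled, balanced fold id
set.seed(1)
n <- 100
k <- 5
folds <- sample(rep(1:k, length.out = n))

# Each fold serves once as the validation set
for (f in 1:k) {
  train_idx <- which(folds != f)
  valid_idx <- which(folds == f)
  # fit the model on train_idx, evaluate it on valid_idx ...
}
table(folds)  # 20 observations per fold
```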
6. Ensemble different models (see, e.g. this guide):

After training many different models, you might want to ensemble them into one strong model using one of these methods:

  • simple averaging or voting
  • finding optimal weights for averaging or voting
  • stacking
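Finding optimal weights for averaging two models can be as simple as a grid search on a validation set; the sketch below uses toy simulated predictions, not real competition data:

```r
# Toy validation data: two imperfect predictions of the same target
set.seed(7)
y  <- rnorm(200)
p1 <- y + rnorm(200, sd = 0.50)  # model 1 predictions
p2 <- y + rnorm(200, sd = 0.35)  # model 2 predictions

rmse <- function(pred) sqrt(mean((y - pred)^2))

# Grid search over the blending weight w in [0, 1]
w_grid     <- seq(0, 1, by = 0.01)
blend_rmse <- sapply(w_grid, function(w) rmse(w * p1 + (1 - w) * p2))
w_best     <- w_grid[which.min(blend_rmse)]

# The blend is never worse than the better single model,
# because w = 0 and w = 1 are themselves in the grid
min(blend_rmse) <= min(rmse(p1), rmse(p2))  # TRUE
```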
7. Track your progress

A Kaggle project can get quite messy very quickly, because you might try and prototype
many different ideas. To avoid getting lost, make sure to keep track of:

  • What preprocessing steps were used to create the data
  • What model was used for each step
  • What values were predicted in the test file
  • What local score did the model achieve
  • What public score did the model achieve

If you do not want to use a tool like git, at least make sure you create subfolders
for each prototype. This way you can later analyse which models you might want to ensemble
or use for your final commits for the competition.

To leave a comment for the author, please follow the link and comment on their blog: mlr-org.

The nhmrcData package: NHMRC funding outcomes data made tidy

Thu, 03/09/2017 - 00:00

Do you like R? Information about Australian biomedical research funding outcomes? Tidy data? If the answers to those questions are “yes”, then you may also like nhmrcData, a collection of datasets derived from funding statistics provided by the Australian National Health & Medical Research Council. It’s also my first R package (more correctly, R data package).

Read on for the details.

1. Installation

The package is hosted at Github and is in a subdirectory of a top-level repository, so it can be installed using the devtools package, then loaded in the usual way:

devtools::install_github("neilfws/politics", subdir = "nhmrcData", build_vignettes = TRUE)
library(nhmrcData)

There are currently 14 datasets in the package. No need to type data(); they are “lazy-loaded”, so just start typing “nhmrc” then hit Tab to see their names. The datasets are (somewhat) documented and briefly described in the repository README so for this post, I’ll focus on just four examples: those related to gender.

2. Examples
Example code is (or will be) in the package vignette (I’m a bit over copy-pasting code to WordPress). Here are the results.

2.1 nhmrcOutcomesGenderBRA
The nhmrcOutcomesGenderBRA dataset contains information on funding outcomes by gender, fellowship scheme and broad research area for 2013 – 2015. Here’s an attempt to capture all of that in an area plot.

Mmm, look at those success rates. My qualitative first impression is that there are more women applicants in several categories, without a corresponding increase in funded proposals.

2.2 nhmrcOutcomesGenderFellows
The nhmrcOutcomesGenderFellows dataset also summarises outcomes by fellowship scheme, this time broken down into levels.

What’s interesting here is the difference in numbers by gender after the career development stage.

2.3 nhmrcOutcomesGenderPartTime
The nhmrcOutcomesGenderPartTime dataset looks at the relatively-small number of part-time fellowship applications.

I tried to create this chart in the style of a “population pyramid”, to contrast directly the numbers by gender. Looks rather like women could use more part-time opportunities earlier in their careers.

2.4 nhmrcOutcomesGenderScheme
Finally, the nhmrcOutcomesGenderScheme dataset: yet another summary by fellowship scheme but this time, broken down into 29 categories, such as “Australian Biomedical (Peter Doherty)”.

That’s the category I chose to display here and again, it indicates that more women are trying as hard or harder, yet this is not reflected in funding rates.

Of course, these charts are exploratory qualitative observations and it would be useful to apply some statistical tests where appropriate.

3. Thoughts on publicly-available data and tidyness
It’s great when organisations make their data available, but the next step is making it usable. Excel files are a start – but why are they always so dreadfully formatted? The answer of course is that people cannot decide whether Excel is a database or a presentation tool (it is neither), so we end up with the worst of all worlds. PDF as a data source is just plain wrong, end of story. For this project I used the excellent tabulizer package to extract PDF tables into R matrices, followed by a lot of customised, manual cleaning which was impossible to automate or to make reproducible. The best that I could do was to dump the results as CSV files in the package data-raw directory.

Given that organisations struggle to get even the basics of data correct, I suppose hoping for some consideration of tidyness is wildly optimistic. So many spreadsheets with a header row of years (2000, 2001, 2002…) as opposed to a column named year. The tidy philosophy certainly makes manipulation and visualisation using R much easier, though I note that the occasional “spread” to wide format is still required for certain calculations. For example, funding outcomes are often expressed as total applications and successful applications. When those values are stored in the same column, it’s easier to calculate the number of unsuccessful applications (total – successful) by spreading to two columns, then gathering back again.
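That spread-then-calculate pattern looks like this with hypothetical funding counts (tidyr's spread() was the function of the time; newer code would use pivot_wider()):

```r
library(tidyr)
library(dplyr)

# Hypothetical long-format funding outcomes, one row per (scheme, measure)
outcomes <- data.frame(
  scheme  = rep(c("Fellowship A", "Fellowship B"), each = 2),
  measure = rep(c("total", "successful"), times = 2),
  n       = c(120, 30, 85, 17)
)

outcomes %>%
  spread(measure, n) %>%                      # one column per measure
  mutate(unsuccessful = total - successful)   # now a simple column difference
```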

4. Thoughts on package writing
I’ve long wanted to write an R package but like many people, I kept putting it into the “too hard just now” basket. This is no longer the case thanks to some excellent resources. The first of these of course is RStudio itself, which includes several tools to facilitate package writing, building, testing and documentation. For an overview, start here. There are also many excellent online guides and tutorials. One such resource (which includes links to others) is Karl Broman’s R package primer.

Time will tell how easy it will be to update the package with new NHMRC data. In the meantime, I hope some of you find this useful and welcome your constructive feedback.

Filed under: R, statistics

Meetup: Stream processing with R in AWS

Wed, 03/08/2017 - 18:29

Update: Slides click here.

Stream processing with R in AWS
by Gergely Daróczi

Abstract: R is rarely mentioned among the big data tools, although it’s fairly well scalable for most data science problems and ETL tasks. This talk presents an open-source R package to interact with Amazon Kinesis via the MultiLangDaemon bundled with the Amazon KCL to start multiple R sessions on a machine or cluster of nodes to process data from theoretically any number of Kinesis shards. Besides the technical background and a quick introduction on how Kinesis works, this talk will feature some stream processing use-cases at, and will also provide an overview and hands-on demos on the related data infrastructure built on the top of Docker, Amazon ECS, ECR, KMS, Redshift and a bunch of third-party APIs.

Bio: Gergely Daróczi is an enthusiast R user and package developer, founder of an R-based web application at, Ph.D. candidate in Sociology, Director of Analytics at with a strong interest in designing a scalable data platform built on the top of R, AWS and a dozen APIs. He maintains some CRAN packages mainly dealing with reporting and API integrations, co-authored a number of journal articles in social and medical sciences, and recently published the “Mastering Data Analysis with R” book at Packt.


– 6:30pm arrival, food/drinks and networking
– 7:30pm talks
– 9 pm more networking

You must have a confirmed RSVP and please arrive by 7:25pm the latest. Please RSVP here on Eventbrite.

Venue: Edmunds, 2401 Colorado Avenue (this is Edmunds’ NEW LOCATION, don’t go to the old one)
Park underneath the building (Colorado Center), Edmunds will validate.