R news and tutorials contributed by (750) R bloggers
Updated: 3 hours 16 min ago

### Tic Tac Toe Part 3: The Minimax Algorithm

Thu, 06/08/2017 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

In two previous posts, I presented code to teach R to play the trivial game of Tic Tac Toe. I thought this was an unbeatable algorithm. Alas, a comment from Alberto shattered my pride as he was able to beat my code.

The reason for the demise of my code was that I didn’t implement a full minimax algorithm, but instead looked only two moves ahead. I thought the common strategy rules (start in the centre and occupy the corners) would make the program unbeatable. When I simulated the game by instructing the computer to play against itself, Alberto’s strategy never arose because the code forces the centre field. Alberto’s code shows that you need to look at least three moves ahead for a perfect game. He has been so kind to share his code and gave me permission to publish it.

Alberto recreated two functions, for completeness I have added the complete working code that merges his improvements with my earlier work. The first two functions are identical to the previous post. These functions draw the game board and process the human player’s move by waiting for a mouse click.

# Draw the game board draw.board <- function(game) { xo <- c("X", " ", "O") # Symbols par(mar = rep(1,4)) plot.new() plot.window(xlim = c(0,30), ylim = c(0,30)) abline(h = c(10, 20), col="darkgrey", lwd = 4) abline(v = c(10, 20), col="darkgrey", lwd = 4) text(rep(c(5, 15, 25), 3), c(rep(25, 3), rep(15,3), rep(5, 3)), xo[game + 2], cex = 4) # Identify location of any three in a row square <- t(matrix(game, nrow = 3)) hor <- abs(rowSums(square)) if (any(hor == 3)) hor <- (4 - which(hor == 3)) * 10 - 5 else hor <- 0 ver <- abs(colSums(square)) if (any(ver == 3)) ver <- which(ver == 3) * 10 - 5 else ver <- 0 diag1 <- sum(diag(square)) diag2 <- sum(diag(t(apply(square, 2, rev)))) # Draw winning lines if (all(hor > 0)) for (i in hor) lines(c(0, 30), rep(i, 2), lwd = 10, col="red") if (all(ver > 0)) for (i in ver) lines(rep(i, 2), c(0, 30), lwd = 10, col="red") if (abs(diag1) == 3) lines(c(2, 28), c(28, 2), lwd = 10, col = "red") if (abs(diag2) == 3) lines(c(2, 28), c(2, 28), lwd = 10, col = "red") } # Human player enters a move move.human <- function(game) { text(4, 0, "Click on screen to move", col = "grey", cex=.7) empty <- which(game == 0) move <- 0 while (!move %in% empty) { coords <- locator(n = 1) # add lines coords$x <- floor(abs(coords$x) / 10) + 1 coords$y <- floor(abs(coords$y) / 10) + 1 move <- coords$x + 3 * (3 - coords$y) } return (move) }

Alberto rewrote the functions that analyse the board and determine the move of the computer. The ganador (Spanish for winning) function assesses the board condition by assigning -10 or + 10 for a winning game and 0 for any other situation.

ganador <- function(juego, player) { game <- matrix(juego, nrow = 3, byrow = T) hor <- rowSums(game) ver <- colSums(game) diag <- c(sum(diag(game)), sum(diag(apply(game, 1, rev)))) if (-3 %in% c(hor, ver, diag)) return(-10) if (3 %in% c(hor, ver, diag)) return(10) else return(0) }

The next function is the actual minimax algorithm. If the computer starts then the first move ( options to assess) takes a little while. The commented lines can be used to force a corner and make the games faster by forcing a random corner.

The minimax function returns a list with the move and its valuation through the ganador function. The function works recursively until it has filled the board and retains the best scoring move using the minimax method. To avoid the computer always playing the same move in the same situation random variables are added.

minimax <- function(juego, player) { free <- which(juego == 0) if(length(free) == 1) { juego[free] <- player return(list(move = free, U = ganador(juego, player))) } poss.results <- rep(0, 9) for(i in free) { game <- juego game[i] <- player poss.results[i] <- ganador(game, player) } mm <- ifelse(player == -1, "which.min", "which.max") if(any(poss.results == (player * 10))) { move <- do.call(mm, list(poss.results)) return(list(move = move, U = poss.results[move])) } for(i in free) { game <- juego game[i] <- player poss.results[i] <- minimax(game, -player)$U } random <- runif(9, 0, 0.1) poss.results[-free] <- 100 * -player poss.results <- poss.results + (player * random) move <- do.call(mm, list(poss.results)) return(list(move = move, U = poss.results[move])) } This final function stitches everything together and lets you play the game. Simply paste all functions in your R console and run them to play a game. The tic.tac.toe function can take two parameters, “human” and/or “computer”. The order of the parameters determines who starts the game. # Main game engine tic.tac.toe <- function(player1 = "human", player2 = "computer") { game <- rep(0, 9) # Empty board winner <- FALSE # Define winner player <- 1 # First player players <- c(player1, player2) draw.board(game) while (0 %in% game & !winner) { # Keep playing until win or full board if (players[(player + 3) %% 3] == "human") # Human player move <- move.human(game) else { # Computer player move <- minimax(game, player) move <- move$move } game[move] <- player # Change board draw.board(game) winner <- max(eval.game(game, 1), abs(eval.game(game, -1))) == 6 # Winner, winner, chicken dinner? player <- -player # Change player } } tic.tac.toe()

This is my last word on Tic Tac Toe but now that the minimax conundrum is solved I could start working on other similar games such as Connect Four, Draughts or even the royal game of Chess.

The post Tic Tac Toe Part 3: The Minimax Algorithm appeared first on The Devil is in the Data.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### The Ghost Printer Behind Top-level R Expressions

Thu, 06/08/2017 - 02:00

(This article was first published on English Blog on Yihui Xie | 谢益辉, and kindly contributed to R-bloggers)

For any developers who have ever written an S3 method for the print() function, they probably know what a top-level R expression means, but this is a very confusing concept to non-developers. I have to explain this every now and then, so I decided to write a short post about it.

Yesterday I saw a Github issue in the rmarkdown repository, and you can see that there are still users confused by the fact that ggplot2 plots are not rendered in certain cases. I have seen similar questions perhaps hundreds of times. Such questions have been answered in the R FAQ 7.22 “Why do lattice/trellis graphics not work?”, but the answer didn’t explain the root reason in detail.

A top-level R expression is usually implicitly printed. Both words can cause confusion: printing is implicit so that you probably don’t consciously know it, and printing means the print() function is called on the object returned by the expression. For example, when you type 1 + 1 in the R console, and press Enter/Return, what actually happens is print(2), where 2 is the value returned by 1 + 1.

The reason that you create a ggplot() object in your R console and it shows up after you press Enter is that ggplot2 has defined a print.ggplot method on such objects. ggplot() does not actually create the plot – it only creates a ggplot object. It is the print method that actually creates the plot in a graphical device (as a side-effect).

Expressions nested in other expressions are not top-level expressions. For example, in a for loop, ggplot objects or HTML widgets or knitr::kable() cannot be displayed because they are not printed, unless you explicitly print() them.

Top-level R expressions are not always printed. They are not printed if they are marked as invisible. For example, if you run invisible(1 + 1) in the R console, you won’t see any value printed.

Many R functions return invisible values wrapped in the invisible() function. I want to show a few functions that you may have never recognized: if, (, and assignment operators such as = and <- are all functions. Let’s start from the assignment operator:

library(ggplot2) p = ggplot(mpg, aes(cty, hwy)) + geom_point()

This creates a ggplot object but you do not see the plot even though this is a top-level expression, and that is because the assignment operator = returns the value invisibly. If you type p in the console later, you will see the plot since p is implicitly printed.

You can force a top-level expression to be visible (hence printed) by wrapping it in (), e.g., (invisible(1 + 1)) will be printed, and so will the ggplot object below:

library(ggplot2) (p = ggplot(mpg, aes(cty, hwy)) + geom_point())

The if function returns the value of a last expression depending on the condition. For example, if (TRUE) A else B returns A, and if (FALSE) {A} else {B; C} returns C. With that in mind, you will understand why the ggplot object is not displayed in the if statement below:

library(ggplot2) if (TRUE) { ggplot(mpg, aes(cty, hwy)) + geom_point() 1 + 1 # the value of this expression is returned (and printed) }

Note for, while, and repeat loops always return an invisible NULL, so nothing will be printed automatically, including the last expression in the loop, e.g.,

library(ggplot2) for (i in 1:10) { ggplot(mpg, aes(cty, hwy)) + geom_point() }

However, base R graphics is a different animal. All base R graphics functions will create plots no matter if they are called in top-level R expressions. For example, this will create 10 plots:

for (i in 1:10) { plot(rnorm(100)) }

WAT?

WAT!

Lastly, there is a function withVisible() that can tell you whether an expression returns a visible value or not, e.g.,

withVisible(1 + 1) # $value # [1] 2 # #$visible # [1] TRUE withVisible({x = 1 + 1}) # assignment in {} # $value # [1] 2 # #$visible # [1] FALSE

I hope you could remember the “ghost printer” the next time you run any R code in your R console or R Markdown or any knitr documents.

location.href/*Tal, could u pls not modify my script? It is not cool. Thanks!*/='https://yihui.name/en/2017/06/top-level-r-expressions/';

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: English Blog on Yihui Xie | 谢益辉. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Non gaussian time-series, let’s handle it with score driven models!

Wed, 06/07/2017 - 22:54

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Henrique Helfer

Motivation

Until very recently, only a very limited classes of feasible non Gaussian time series models were available. For example, one could use extensions of state space models to non Gaussian environments (see, for example, Durbin and Koopman (2012)), but extensive Monte Carlo simulation is required to numerically evaluate the conditional densities that define the estimation process of such models.

The high technicalities involved in implementing these algorithms and its accompanying computational cost have not helped its widespread use by practitioners. On the other hand, different attempts to extend ARMA type models with conditional non Gaussian distributions have been more successful. For example, the use of GARCH type models to deal with heavy tailed distributions in finance (Engle and Bollerslev (1986)), the Autoregressive Conditional Duration (ACD) model of Engle and Russell (1998) to tackle asymmetric distributions in time duration and the Poisson count models of Davis et al (2003) for the modelling of discrete events in time. But, so far, these extensions have lacked an unifying framework that would allow the specification, estimation and forecasting of a model based on an arbitrary non Gaussian distribution. The recently proposed Generalized Autoregressive Scores (GAS) models by Creal et al (2008, 2013), or dynamic conditional score (DCS) from Harvey (2013), offer an unifying framework to derive and estimate time series model with any conditional non-Gaussian distribution, either discrete or continuous, univariate or multivariate. More specifically, in GAS models, conditional on past observations, a proper probability model, possibly non Gaussian, is chosen for the response variable at time . Then, by construction, time varying parameters can be accommodated according to an updating mechanism that uses the score as its driving force. The use of the score for updating time-varying parameters is intuitive given that it defines the steepest ascent direction for improving the model’s local fit in terms of the likelihood or density at time , given the current parameter position. In such an updating mechanism information from the whole density is used to track the evolution of time varying parameters.

Of course, in this post I will briefly explain the estimation framework of such models for our community, however I deeply encourage you our fellow “insighteR” to pay a visit to gasmodel.com, where you can find a whole section devoted to GAS models papers and see for yourself the diversity of applications.

Score driven models

First of all, one should choose an specific distribution which support accommodates the range of values assumed by the time series of interest ,

where is the time varying parameter vector, while makes reference to the fixed parameter vector that will be estimated by maximum likelihood and collects all relevant information up to time .

Next, to fully specify a GAS model, one has to choose which parameters of the distribution will evolve in time and those that will be fixed. The time varying parameters will then follow a GAS(p,q) updating mechanism given by:

where .

To complete the description of the the updating mechanism of GAS models it is necessary to define the score given by,

where is the observation density function and is a scaling matrix with appropriate dimensions. Different choices of results in different GAS models: means that it will be use the inverse of Fisher Information matrix, the pseudo-inverse of Fisher Information matrix the identity matrix.

Parametrization

If we assume for instance the intensity of a poisson density, 0" title="\lambda_{t}>0" class="latex" />, it is natural to choose . From this, the previous Equations should also be re-parametrized with regard to . Considering a monotonically increasing mapping function , then . Taking , the poisson GAS specification updating mechanism, will be

Application

To elucidate a potential application, in this section we will generate and model a cont time series using GAS framework. More specifically, we will adopt a poisson density with its intensity parameter being updated trough a GAS mechanism. Also we will fix as scaling matrix. Following Table 1 from Creal et al (2008), the score is given by,

First of all, let’s create a function which simulates a cont time series from a poisson model with its intensity parameter following a GAS mechanism.

################################## #### GENERATING DGP ################################## A = 0.101 B = -0.395 w = 0.183 GAS_POISSON_GENERATE = function (w,A,B,size){ y = f = s = NULL ############################################### # Initial conditions for the recursion: ############################################### t=1 f[1] = 1 y[1] = rpois(1,f[1]) ############################################### # GAS mechanism: ############################################### for (t in 1:size){ s[t] = (y[t]-exp(f[t]))*(exp(f[t])) f[t+1] = w + A * s[t] + B * f[t] y[t+1] = rpois(1,exp(f[t+1])) } return(list("y"=y,"f"=f)) }

In the sequel, one can simulate the values of a cont time-series using the previous function. For instance, let us simulate a time series of size 1000 with , and .

set.seed(201930) serie.sim = GAS_POISSON_GENERATE(w,A,B,1000) par(mfrow=c(2,1)) ts.plot(serie.sim$y,ylab = "y") ts.plot(exp(serie.sim$f),ylab = "lambda")

To estimate the values of , we will use Maximum likelihood as described in the original paper of Creal et al. (2013). To do this in R, we need to create a function which will maximize some quantity, i.e., log-likelihood.

################################## #### ESTIMATING ML ################################## GAS_POISSON = function (q, y=y){ A = q[1] B = q[2] w = q[3] ##### n = length(y) s = NULL f = NULL ##### f[1] = 0 for (t in 1 : n){ s[t] = (y[t]-exp(f[t]))*(exp(f[t])) f[t+1] = w + A * s[t] + B * f[t] } # Here we have the loglikelihood. sum = 0 for (t in 2:length(y)){ sum = sum + dpois(y[t], lambda = exp(f[t]), log = TRUE) } return(sum/n) }

With the Maximum likelihood function in hands, we can do the parameter optimization, using as initial condition for the parameters in a uniform between 0 and 0.1.

f = NULL; s= NULL y = serie.sim$y set.seed(8456) result = optim(runif(3,0,0.1), y=y,GAS_POISSON , method="BFGS", hessian=TRUE, control=list(fnscale=-1)) param = result$par

The standard errors can be calculated using the Hessian inverse evaluated at the optimum point, i.e., under some mild conditions, the maximum likelihood estimator of is consistent and converges in distribution to

hessian = -solve(as.matrix((length(y))*result$hessian)) inv.hessian = hessian stderrors = array(0,length(param)) for (t in 1:length(param)){stderrors[t] = sqrt(hessian[t,t])} estim = cbind(param,stderrors) estim ## param stderrors ## [1,] 0.07543154 0.01999293 ## [2,] -0.56880833 0.11422324 ## [3,] 0.20280978 0.05132407 Regarding the forecasting, similar to GARH models, the steps ahead distribution conditional on observations up to time , , is only analytical for , when it coincides with the chosen probability model. However, for 1" title="k>1" class="latex" /> it has to be evaluated using Monte Carlo simulation. But first, let us create a function to calculate the series of time-varying parameters with our estimated parameters, . GAS_POISSON_CALCULATE = function (par,y){ A = par[1] ; B = par[2] ; w=par[3] f = s = NULL f[1] = 0 n = length(y) ##### for (t in 1 : n){ s[t] = (y[t]-exp(f[t]))*(exp(f[t])) f[t+1] = w + A * s[t] + B * f[t] } ##### return(list("y"=y,"f"=f)) } # Here we save the time varying parameter series model.values = GAS_POISSON_CALCULATE(result$par,y)

Now we can use the last point of series to obtain an “initial condition” for the forecasting simulation. Let us create a function to calculate our predictions with confidence intervals.

GAS_POISSON_FORECATING = function(f.mod,steps.ahead,par){ f = f.mod[length(f.mod)] ################################################## n.sim=2000 # Number of Monte Carlo Simulations dens.pred = matrix(NA,steps.ahead,n.sim) # Matrix of simulated values with dimension steps ahead x MC simulations f.prev = NULL # simulated series of time-varying parameter f for each MC s.prev = NULL # simulated series of score s # Estimated parameters from theta A = par[1] ; B = par[2] ; w=par[3] ################################################## for(j in 1:n.sim){ #f.prev[1] = f[length(y)+1] f.prev[1] = f for (t in 1:steps.ahead){ dens.pred[t,j] = rpois(1, lambda = exp(f.prev[t])) # generate a random poisson value with intensity parameter modeled by GAS s.prev[t] = (dens.pred[t,j]-exp(f.prev[t]))*(exp(f.prev[t])) # calculate the score f.prev[t+1] = w + A*s.prev[t] + B*f.prev[t] # update lambda } f.prev = NULL s.prev = NULL } ################################################## # Here we calculate the expected value of the predictive density and calculate the confidence intervals forecasting = NULL CI.sup = NULL CI.inf = NULL for(i in 1:steps.ahead){ forecasting[i]=mean(dens.pred[i,]) CI.sup[i] = quantile(dens.pred[i,],probs=0.975) CI.inf[i] = quantile(dens.pred[i,],probs=0.025) } return(list("forecasting"=forecasting,"CI.sup" = CI.sup, "CI.inf" = CI.inf)) }

With the forecasting function already defined, one should only use as input the last value of the series , the number of steps ahead and .

forecast = GAS_POISSON_FORECATING(model.values$f,5,result$par) final.pred = cbind(forecast$forecasting,forecast$CI.sup,forecast$CI.inf) colnames(final.pred) = c("Mean", "UB", "LB") print(final.pred) ## Mean UB LB ## [1,] 1.0320 3 0 ## [2,] 1.2320 4 0 ## [3,] 1.1270 4 0 ## [4,] 1.1615 4 0 ## [5,] 1.1125 3 0 References Creal, D., Koopman, S. J., & Lucas, A. (2008). A general framework for observation driven time-varying parameter models, Tinbergen Institute,Tech. Rep. Creal, D., Koopman, S. J., & Lucas, A. (2013). Generalized autoregressive score models with applications. Journal of Applied Econometrics, 28(5), 777-795. Davis, R. A., Dunsmuir, W. T., & Streett, S. B. (2003). Observation-driven models for Poisson counts. Biometrika, 777-790. Durbin, J., & Koopman, S. J. (2012). Time series analysis by state space methods (Vol. 38). OUP Oxford. Engle, R. F., & Bollerslev, T. (1986). Modelling the persistence of conditional variances. Econometric reviews, 5(1), 1-50. Engle, R. F., & Russell, J. R. (1998). Autoregressive conditional duration: a new model for irregularly spaced transaction data. Econometrica, 1127-1162. Harvey, A. C. (2013). Dynamic models for volatility and heavy tails: with applications to financial and economic time series (Vol. 52). Cambridge University Press. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – insightR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### How to create dot-density maps in R Wed, 06/07/2017 - 22:17 (This article was first published on Revolutions, and kindly contributed to R-bloggers) Choropleths are a common approach to visualizing data on geographic maps. But choropleths — by design or necessity — aggregate individual data points into a single geographic region (like a country or census tract), which is all shaded a single colour. This can introduce interpretability issues (are we seeing changes in the variable of interest, or just population density?) and can fail to express the richness of the underlying data. For an alternative approach, take a look at the recent Culture of Insight blog post which provides a tutorial on creating dot-density maps in R. The chart below is based on UK Census data. Each point represents 10 London residents, with the colour representing one of five ethnic categories. Now, the UK census only reports ethnic ratios on a borough-by-borough basis, so the approach here is to simulate the individual resident data (which is not available) by randomly distributing points across the borough following the reported distribution. In a way, this is suggesting a level of precision which isn't available in the source data, but it does provide a visualization of London's ethnic diversity that isn't confounded with the underlying population distribution. Follow the link below to the detailed blog post, which includes R code (in both base and ggplot2 graphics) for creating density dot-charts like these. Also be sure to check out the zoomable version of the chart at the top of the page, which used Microsoft's Deep Zoom Composer in conjunction with OpenSeadragon to provide the zooming capability. Culture of Insight: Building Dot Density Maps with UK Census Data in R var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### More on safe substitution in R Wed, 06/07/2017 - 19:30 (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers) Let’s worry a bit about substitution in R. Substitution is very powerful, which means it can be both used and mis-used. However, that does not mean every use is unsafe or a mistake. We can confirm the above code performs no substitution: a <- 1 b <- 2 substitute(a + b + z) ## a + b + z And it appears the effect is that substitute is designed to not take values from the global environment. So, as we see below, it isn’t so much what environment we are running in that changes substitute’s behavior, it is what environment the values are bound to that changes things. (function() { a <- 1 substitute(a + b + z, environment()) })() ## 1 + b + z We can in fact find many simple variations of substitute that work conveniently. substitute(a + b + z, list(a=1, b=2)) ## 1 + 2 + z substitute(a + b + z, as.list(environment())) ## 1 + 2 + z Often R‘s documentation is a bit terse (or even incomplete) and functions (confusingly) change behavior based on type of arguments and context. I say: always try a few variations to see if some simple alteration can make "base-R" work for you before giving up and delegating everything to an add-on package. However, we in fact found could not use substitute() to implement wrapr::let() effects (that is re-mapping non-standard interfaces to parametric interfaces). There were some avoidable difficulties regarding quoting and un-quoting of expressions. But the killing issue was: substitute() apparently does not re-map left-hand sides: # function that print all of its arguments (including bindings) f <- function(...) { args <- match.call() print(paste("f() call is:", capture.output(str(args)))) } # set up some global variables X <- 2 B <- 5 # try it f(X=7, Y=X) ## [1] "f() call is: language f(X = 7, Y = X)" # use substitute to capture an expression captured <- substitute(f(X=7, Y=X)) # print the captured expression print(captured) ## f(X = 7, Y = X) # evaluate the captured expression eval(captured) ## [1] "f() call is: language f(X = 7, Y = X)" # notice above by the time we get into the function # the function arguments have taken there value first # from explicit argument assignment (X=7) and then from # the calling environment (Y=X goes to 2). # now try to use substitute() to re-map values xform1 <- substitute(captured, list(X= as.name('B'))) # doesn't look good in printing print(xform1) ## captured # and substitutions did not happen as the variables we # are trying to alter are not free in the word "captured" # (they are in the expression the name captured is referring to) eval(xform1) ## f(X = 7, Y = X) # can almost fix that by calling substitute on the value # of captured (not the word "captured") with do.call() subs <- do.call(substitute, list(captured, list(X= as.name('B')))) print(subs) ## f(X = 7, Y = B) eval(subs) ## [1] "f() call is: language f(X = 7, Y = B)" # notice however, only right hand side was re-mapped # we saw "f(X = 7, Y = B)", not "f(B = 7, Y = B)" # for some packages (such as dplyr) re-mapping # left-hand sides is important # this is why wrapr::let() exists wrapr::let( c(X= 'B'), f(X=7, Y=X) ) ## [1] "f() call is: language f(B = 7, Y = B)" Re-mapping left hand sides is an important capability when trying to program over dplyr: suppressPackageStartupMessages(library("dplyr")) d <- data.frame(x = 1:3) mapping <- c(OLDCOL= 'x', NEWCOL= 'y') wrapr::let( mapping, d %>% mutate(NEWCOL = OLDCOL*OLDCOL) ) ## x y ## 1 1 1 ## 2 2 4 ## 3 3 9 wrapr::let() is based on string substitution. This is considered risky. Consider help(substitute, package='base') Note substitute works on a purely lexical basis. There is no guarantee that the resulting expression makes any sense. And that is why wrapr::let() takes a large number of precautions and vets user input before performing any substitution. The idea is: wrapr::let() is more specialized than substitute() so in addition to attempting extra effects (re-mapping left hand sides) it can introduce a lot of checks to ensure safe invariants. And that is a bit of my point: when moving to a package look for specificity and safety in addition to "extra power." That is how wrapr::let() is designed and whey wrapr::let() is a safe and effective package to add to your production work-flows. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### quantmod 0.4-9 on CRAN Wed, 06/07/2017 - 19:25 (This article was first published on FOSS Trading, and kindly contributed to R-bloggers) A new release of quantmod is now on CRAN! The only change was to address changes to Yahoo! Finance and their effects on getSymbols.yahoo(). GitHub issue #157 contains some details about the fix implementation. Unfortunately, the URL wasn’t the only thing that changed. The actual data available for download changed as well. The most noticeable difference is that the adjusted close column is no longer dividend-adjusted (i.e. it’s only split-adjusted). Also, only the close price is unadjusted; the open, high, and low are split-adjusted. There also appear to be issues with the adjusted prices in some instruments. For example, users reported issues with split data for XLF and SPXL in GitHub issue #160. For XLF, there a split and a dividend on 2016-09-16, even on the Yahoo! Finance historical price page for XLF. As far as I can tell, there was only a special dividend. The problem with SPXL is that the adjusted close price isn’t adjusted for the 4/1 split on 2017-05-01, which is also reflected on the Yahoo! Finance historical prices page for SPXL. Another change is that the downloaded data may contain rows where all the values are “null”. These appear on the website as “0”. This is a major issue for some instruments. Take XLU for example; 188 of the 624 days of data are missing between 2014-12-04 and 2017-05-26 (ouch!). You can see this is even true on the Yahoo! Finance historical price page for XLU. If these changes have made you look for a new data provider, see my post: Yahoo! Finance Alternatives. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: FOSS Trading. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### Data wrangling : I/O (Part-2) Wed, 06/07/2017 - 18:00 (This article was first published on R-exercises, and kindly contributed to R-bloggers) Data wrangling is a task of great importance in data analysis. Data wrangling, is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process which is estimated to take about 60-80% of analyst’s time. In this series we will go through this process. It will be a brief series with goal to craft the reader’s skills on the data wrangling task. This is the first part of this series and it aims to cover the importing of data from the web. In many cases, downloading data in order to process them can be time consuming, therefore being able to import the data straight from the web is a ‘nice-to-have’ skill. Moreover, data isn’t always not saved in structured files, but they are on the web in forms of text and tables, in this set of exercise we will go through the latter case. In case you want me to go through the former case as well, please let me know at the comment section. Before proceeding, it might be helpful to look over the help pages for the getURL, fromJSON, ldply, xmlToList, read_html, html_nodes, html_table, readHTMLTable, htmltab. Moreover please load the following libraries. install.packages("RCurl") library(RCurl) install.packages("rjson") library(rjson) install.packages("XML") library(XML) install.packages("plyr") library(plyr) install.packages("rvest") library(rvest) install.packages("htmltab") library(htmltab) Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Exercise 1 Retrieve the source of the web page “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-1/data.csv” and assign it to the object “url” Exercise 2 Read the csv file and assign it to the “csv_file” object. Exercise 3 Do the same as exercise 1, but with the url: “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.txt” and then assign it to the “txt_file” object. Note: it is a txt file, so you should use the adequate function in order to import it. Exercise 4 Do the same as exercise 1, but with the url: “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.json” and then assign it to the “json_file” object. Note: it is a json file, so you should use the adequate function in order to import it. Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to: • import data into R in several ways while also beeing able to identify a suitable import tool • use SQL code within R • And much more Exercise 5 Do the same as exercise 1, but with the url: “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.xml” and then assign it to the “xml_file” object. Note: it is a xml file, so you should use the adequate function in order to import it. Exercise 6 We will go through web scraping now. Read the html file “http://www.worldatlas.com/articles/largest-cities-in-europe-by-population.html” and assign it to the object “url”. hint: consider using read_html Exercise 7 Select the “table” nodes from the html document you retrieved before. hint: consider using html_nodes Exercise 8 Convert the node you retrieved at exercise 7, to an actionable list for processing. hint: consider using html_table Exercise 9 Let’s go to a faster and more straight forward function, retrieve the html document like you did at exercise 6 and make it an actionable list using the function readHTMLTable. Exercise 10 This may be a bit tricky, but give it a try. Retrieve the html document like you did at exercise 6 and make it an actionable data frame using the function htmltab. Related exercise sets: var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### Get the best from ggplotly Wed, 06/07/2017 - 15:16 (This article was first published on Blog – The R graph Gallery, and kindly contributed to R-bloggers) Plotly is an R library allowing to make amazing interactive charts in a minute. This blog post shows how to get the best from ggplotly, the function of plotly allowing you to turn any ggplot2 chart interactive. A special care is given on features that are not present in ggplot2 (like hovering), and features that are not well translated by ggplotly (like legend position). Let’s start with a usual ggplot2 chart. Note that the ggplot2 section of the R graph gallery contains dozens of examples with the reproducible code, so you will most likely find the graphic you need to visualize your data. # Libraries library(tidyverse) library(plotly) # Scatterplot p=ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species, shape=Species)) + geom_point(size=6, alpha=0.6) p Then, we can apply the magic of plotly in just one line of code! We get an interactive chart on which we can zoom, hover, play with axis, export, click on legend to make group appear/disappear … This is great and works with any ggplot2 graphic. ggplotly(p) Since ggplot2 creates static charts, it does not offer any option to custom hover text. This can be done as follow: 1/ create a text vector with the text you want to show 2/ use plotly_build() to make the interactive chart 3/ pass the text to the interactive version with style() mytext=paste("Sepal Length = ", iris$Sepal.Length, "\n" , "Sepal Width = ", iris$Sepal.Width, "\n", "Row Number: ",rownames(iris), sep="") pp=plotly_build(p) style( pp, text=mytext, hoverinfo = "text", traces = c(1, 2, 3) ) Unfortunately, some ggplot2 parameters are sometimes misunderstood by ggplotly. For example, let’s place the ggplot2 legend inside the plot: p_changed=ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species, shape=Species)) + geom_point(size=6, alpha=0.6) + theme(legend.position = c(.83, .9)) p_changed Using a basic ggplotly() call will not change the legend position. To do so, we need to enter parameters using the real plotly syntax as follow. Note that all plotly arguments can be found here. pp_changed=plotly_build(p_changed) style( pp_changed ) %>% layout( legend = list(x = 0.01, y = 0.95) ) Note that the R graph gallery offers a lot of other examples of interactive R charts. You can follow the gallery on Twitter or on Facebook to be notified of recent updates. 1 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Blog – The R graph Gallery. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### Security: the dangers of copying and pasting R code Wed, 06/07/2017 - 12:11 (This article was first published on R – Why?, and kindly contributed to R-bloggers) Most of the time when we stumble across a code snippet online, we often blindly copy and paste it into the R console. I suspect almost everyone does this. After all, what’s the harm? Consider this simple piece of R code that performs simple linear regression # Generate data x = rnorm(10) y = rnorm(10) message(“All your base are belong to us.”) # Simple linear regression m = lm(y ~ x) Now highlight the above piece of R code and copy and paste it into your console; look carefully at what you’ve pasted. A new line has magically appeared. # Generate data x = rnorm(10) y = rnorm(10) message("All your base are belong to us.") # Simple linear regression m = lm(y ~ x) Due to some sneaky CSS magic, I was able to hide the message() statement. If I was evil, I could have changed this to a system, source, or any other command. The CSS code simply sets the message() function to the background color, changes the font size and makes it un-selectable (see this post for details). So remember, be careful with your copy and pasting! var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Why?. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### R and Data Science activities in London, June 27th – 29th Wed, 06/07/2017 - 11:27 (This article was first published on R – Locke Data, and kindly contributed to R-bloggers) Locke Data will be up to some shenanigans of various stripes in the big smoke. We hope to see you at some of them! June 26th — Monday Introduction to R (Newcastle) I won’t be in London for this but I’ll be doing a day of Introduction to R in Newcastle. This is supporting the local user groups and costs up to £90 for the whole day. Intro to R in Newcastle, June 26th June 27th — Tuesday Data Science Executive Briefing I’ll be teaming up with onezeero to host an Executive Briefing, offering senior management the opportunity to get a buzzword-free and FUD-free overview of what data science is, ways it can help organisations, and what sorts of things would help ensure they can implement data science with minimal risk. If this sounds like something you want to attend, or get some senior people at your organisation along to, send Dominic an email at dom@onezeero.co.uk. Data Science Technologies event In the evening, I’ll be talking R and SQL Server for real-time predictions at a new meetup being hosted by Winton called Data Science Technologies. If you’re interested, you can register on the Meetup Data Science Technologies https://www.jumpingrivers.com/ June 28th — Wednesday R and Microsoft training day I’ll be delivering a training day helping people put R into production using Microsoft products. Working with Jumping Rivers, there’s a week of training available in London. On my R with Microsoft training day we’ll look at how we can use R Server to cope with large data volumes and parallel computations, then we’ll look at embedding R in various services to enable applications and users to make use of our R script. The day is very practical with exercises along the way to ensure you come out of the training having experience doing. As a result, you’ll need to bring along a laptop and you should already be able to write R code at least at a beginner level. Find out more about the course and register on JumpingRivers.com Jumping Rivers Training June 29th — Thursday Chat over coffee Fancy grabbing some coffee? Help me recover from some hard work by grabbing a coffee with me. Up to 4 people can book the same slot so that I can introduce folks – but if you want to talk one on one, let me know. Go Coffee!! Coffee time! PowerBI User Group In the evening, I’ll be delivering a session on how R can solve PowerBI pain points at the London PowerBI User Group. If this is of interest, you can RSVP on the meetup. London PowerBI User Group The post R and Data Science activities in London, June 27th – 29th appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Locke Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### 14 Jobs for R users (2017-05-06) – from all over the world Wed, 06/07/2017 - 10:20 To post your R job on the next post Just visit this link and post a new R job to the R community. You can post a job for free (and there are also “featured job” options available for extra exposure). Current R jobs Job seekers: please follow the links below to learn more and apply for your R job of interest: Featured Jobs 1. Part-Time Big Data Discovery Automation in Digital Health (5-10 hours per week) MedStar Institutes for Innovation – Posted by Praxiteles Anywhere 22 May2017 2. Full-Time Data Scientist / Analytics Consultant Hedera Insights – Posted by HederaInsights Antwerpen Vlaanderen, Belgium 18 May2017 3. Full-Time Data Scientist (m/f) Meritocracy – Posted by arianna@meritocracy Hamburg Hamburg, Germany 11 May2017 4. Full-Time Data Scientist One Acre Fund – Posted by OAF Nairobi Nairobi County, Kenya 9 May2017 5. Full-Time Back-End Developer – Sacramento Kings (Sacramento, CA) Sacramento Kings – Posted by khavey Sacramento California, United States 9 May2017 6. Full-Time Data Analyst, Chicago Violence Reduction Strategy National Network for Safe Communities (NNSC) – Posted by nnsc Chicago Illinois, United States 5 May2017 More New Jobs 1. Full-Time Phd opportunity @ Barcelona / Moorepark Teagasc (Ireland), Universitat Autònoma de Barcelona (Spain) and CEDRIC (Centre d’études et de recherché en informatique et communications) of CNAM (Paris) – Posted by emmanuel.serrano.ferron Anywhere 25 May2017 2. Full-Time Research Software Engineer @ Bailrigg, England Lancaster University – Posted by killick Bailrigg England, United Kingdom 24 May2017 3. Full-Time Senior Data Scientist @ Minneapolis, Minnesota, U.S. General Mills – Posted by dreyco676 Minneapolis Minnesota, United States 23 May2017 4. Part-Time Big Data Discovery Automation in Digital Health (5-10 hours per week) MedStar Institutes for Innovation – Posted by Praxiteles Anywhere 22 May2017 5. Full-Time Data Scientist / Analytics Consultant Hedera Insights – Posted by HederaInsights Antwerpen Vlaanderen, Belgium 18 May2017 6. Full-Time Data Scientists for ArcelorMittal @ Avilés, Principado de Asturias, Spain ArcelorMittal – Posted by JuanM Avilés Principado de Asturias, Spain 12 May2017 7. Full-Time Data Scientist (m/f) Meritocracy – Posted by arianna@meritocracy Hamburg Hamburg, Germany 11 May2017 8. Full-Time Data Scientist Prospera Technologies – Posted by Prospera Technologies Tel Aviv-Yafo Tel Aviv District, Israel 11 May2017 9. Full-Time Data Scientist One Acre Fund – Posted by OAF Nairobi Nairobi County, Kenya 9 May2017 10. Full-Time Back-End Developer – Sacramento Kings (Sacramento, CA) Sacramento Kings – Posted by khavey Sacramento California, United States 9 May2017 11. Full-Time Data Analyst, Chicago Violence Reduction Strategy National Network for Safe Communities (NNSC) – Posted by nnsc Chicago Illinois, United States 5 May2017 12. Full-Time Transportation Market Research Analyst @ Arlington, Virginia, U.S. RSG – Posted by patricia.holland@rsginc.com Arlington Virginia, United States 4 May2017 13. Full-Time Data Manager @ Los Angeles, California, U.S. Virtual Pediatric Systems, LLC – Posted by gsotocampos Los Angeles California, United States 3 May2017 14. Full-Time Sr. Manager, Analytics @ Minneapolis, Minnesota, U.S. Korn Ferry – Posted by jone1087 Minneapolis Minnesota, United States 3 May 2017 In R-users.com you can see all the R jobs that are currently available. R-users Resumes R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free. (you may also look at previous R jobs posts). var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### Unconf projects 3: available, miner, rcheatsheet, ponyexpress Wed, 06/07/2017 - 09:00 (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers) Continuing our series of blog posts (day 1, day 2) this week about unconf 17. available Summary: Ever have trouble naming your software package? Find a great name and realize it's already taken on CRAN, or further along in development on GitHub? The available package makes it easy to check for valid, available names, and also checks various sources for any unintended meanings. The package can also suggest names based on the description and title of your package. Team: Jim Hester, Carl Ganz, Gábor Csárdi, Molly Lewis, Rachel Tatman miner Summary: The miner package provides a nice interface to the Minecraft API and allows users to manipulate the Minecraft world from a few simple functions. The package is designed with the intention of encouraging new users to learn R by writing scripts to automate activities inside Minecraft. Team: Brooke Anderson, Karl Broman, Gergely Daróczi, Mario Inchiosa, David Smith, Ali Zaidi rcheatsheet Summary: RStudio pioneered the idea of having beautiful cheatsheets for R packages. Creating them however, is quite time consuming and requires effort to maintain as packages evolve. The rcheatsheet package makes it easy for any package author to quickly generate beautufil cheatsheets from a simple table stored on a Google sheet or a local table. This short slide deck provides a quick overview Team: Diana Ly, Ramnath Vaidyanathan ponyexpress Summary: Need to write custom emails to a large group of people? ponyexpress makes it possible to create custom emails based on various fields of a data frame (final grades for a class, abstract acceptances for a workshop, project assignments for a hackathon, or just about anything else) and quickly sends them from your Gmail account. The package also provides custom template to make it easy to write rich html emails. Team: Karthik Ram, Lucy D'Agostino McGowan var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### In case you missed it: May 2017 roundup Tue, 06/06/2017 - 21:33 (This article was first published on Revolutions, and kindly contributed to R-bloggers) In case you missed them, here are some articles from May of particular interest to R users. Many interesting presentations recorded at the R/Finance 2017 conference in Chicago are now available to watch. A review of some of the R packages and projects implemented at the 2017 ROpenSci Unconference. An example of applying Bayesian Learning with the "bnlearn" package to challenge stereotypical assumptions Data from the Billboard Hot 100 chart used to find the most popular words in the titles of pop hits. Microsoft R Open 3.4.0 is now available for Windows, Mac and Linux. How to use the "tweenr" package to create smooth transitions in data animations. A preview of some of the companies and R applications to be presented at the EARL conference in San Francisco. The AzureDSVM package makes it easy to spawn and manage clusters of the Azure Data Science Virtual Machine. An online course on spatial data analysis in R, from the Consumer Data Research Centre in the UK. Video and slides of Lixun Zhang's presentation "R in Financial Services: Challenges and Opportunity" from the New York R Conference. Visual Studio 2017 now features built-in support for both R and Python development. Quantifying the home-field advantage in English Premier League football. Using the new CRAN_package_db function to analyze data about CRAN packages. Stack Overflow Trends tracks the trajectory of questions about R and Python. A recorded webinar on using Microsoft R to predict length of stay in hospitals. The new "Real-Time Scoring" capability in Microsoft R Server creates a service to generate predictions from certain models in milliseconds. "Technical Foundations of Informatics" is an open course guide on data analysis and visualization with R with a modern slant. The Datasaurus Dozen generalizes Anscombe's Quartet with a process to create datasets of any shape with (nearly) the same summary statistics. CareerCast ranks Statistician as the best job to have in 2017. You can now use Microsoft R within Alteryx Designer. How to clean messy data in Excel by providing just a few examples of transformations. And some general interest stories (not necessarily related to R): As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### Package “SentimentAnalysis” released on CRAN Tue, 06/06/2017 - 21:09 (This article was first published on R Blog, and kindly contributed to R-bloggers) Authors: Stefan Feuerriegel, Nicolas Pröllochs This report introduces sentiment analysis in R and shows how to use our package “SentimentAnalysis”. What is sentiment analysis? Sentiment analysis is a research branch located at the heart of natural language processing (NLP), computational linguistics and text mining. It refers to any measures by which subjective information is extracted from textual documents. In other words, it extracts the polarity of the expressed opinion in a range spanning from positive to negative. As a result, one may also refer to sentiment analysis as opinion mining. What is the the package “SentimentAnalysis” about? Our package “SentimentAnalysis” performs a sentiment analysis of textual contents in R. This implementation utilizes various existing dictionaries, such as QDAP or Loughran-McDonald. Furthermore, it can also create customized dictionaries. The latter uses LASSO regularization as a statistical approach to select relevant terms based on an exogenous response variable. How to use the package “SentimentAnalysis”? This simple example shows how to perform a sentiment analysis of a single string. The result is a two-level factor with levels “positive” and “negative.” # Analyze a single string to obtain a binary response (positive / negative) sentiment <- analyzeSentiment("Yeah, this was a great soccer game of the German team!") convertToBinaryResponse(sentiment)$SentimentGI #> [1] positive #> Levels: negative positive

The following shows a larger example:

# Create a vector of strings documents <- c("Wow, I really like the new light sabers!", "That book was excellent.", "R is a fantastic language.", "The service in this restaurant was miserable.", "This is neither positive or negative.", "The waiter forget about my a dessert -- what a poor service!") # Analyze sentiment sentiment <- analyzeSentiment(documents) # Extract dictionary-based sentiment according to the Harvard-IV dictionary sentiment$SentimentGI #> [1] 0.3333333 0.5000000 0.5000000 -0.6666667 0.0000000 -0.6000000 # View sentiment direction (i.e. positive, neutral and negative) convertToDirection(sentiment$SentimentGI) #> [1] positive positive positive negative neutral negative #> Levels: negative neutral positive response <- c(+1, +1, +1, -1, 0, -1) compareToResponse(sentiment, response)

Vignette: https://cran.r-project.org/web/packages/SentimentAnalysis/vignettes/SentimentAnalysis.html

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Manipulate Biological Data Using Biostrings Package Exercises(Part 3)

Tue, 06/06/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Bioinformatics is an amalgamation Biology and Computer science.Biological Data is manipulated using Computers and Computer software’s in Bioinformatics. Biological Data includes DNA; RNA & Proteins. DNA & RNA is made of Nucleotide which are our genetic material in which we are coded.Our Structure and Functions are done by protein, which are build of Amino acids
In this exercise we try compare between DNAs, RNAs & Amino Acid Sequences to fund out the relationships.
Comparison is done using sequence alignment or sequence comparison techniques.
There are two types of sequence alignment exists.
1.Pairwise alignment
2.Multiple Sequence Alignment

Pairwise alignment refers to comparison between two sequences, where as Multiple Sequence Alignment refers to comparing more than two sequences.

In the exercises below we cover how we can do pairwise alignment using Biostrings package in Bioconductor.

Install Packages
Biostrings

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Create two DNA strings and do pairwise alignment using local, global and overlap alignment techniques and print the score.

Exercise 2

Create two DNA strings and do pairwise alignment and write the alignment to an .aln file.

Exercise 3

Create two Amino acid strings and do pairwise alignment

Exercise 4

Create two Amino acid strings and do pairwise alignment using BLOSUM62 substitution matrix.

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

• import data into R in several ways while also beeing able to identify a suitable import tool
• use SQL code within R
• And much more

Exercise 5

Create two Amino acid strings and do pairwise alignment using BLOSUM100 substitution matrix

Exercise 6

Create two Amino acid strings and do pairwise alignment using PAM250 substitution matrix

Exercise 7

Compare between BLOSUM62 substitution matrix of R and that of the NCBI Database using any two amino acid of your choice.

Exercise 8

Do pairwise alignment using Needlemann Wunch Alignment algorithm and print the score, suppress any warnings.

Exercise 9

Create two DNA Strings and translate the same to amino acids and do pairwise alignment between the amino acid sequences

Exercise 10

Create two RNA Strings and translate the same to amino acids and do pairwise alignment between the amino acid sequences

Related exercise sets: var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Web data acquisition: from database to dataframe for data analysis and visualization (Part 4)

Tue, 06/06/2017 - 11:54

The previous post described how the deeply nested JSON data on fligths were parsed and stored in an R-friendly database structure. However, looking into the data, the information is not yet ready for statistical analysis and visualization and some further processing is necessary before extracting insights and producing nice plots.

In the parsed batch, it is clearly visible the redundant structure of the data with the flight id repeted for each segment of each flight. This is also confirmed with the following simple check as the rows of the dataframe are more than the unique counts of the elements in the id column.

dim(data_items) [1] 397 15 length(unique(data_items$id)) 201 # real time changes of data could produce different results This implies that the information of each segment of each flight has to be aggregated and merged in a dataset as single observations of a statistical analysis between, for example, price and distance. First, a unique primary key for each observation has to be used as reference variable to uniquely identify each element of the dataset. library(plyr) # sql like functions library(readr) # parse numbers from strings data_items <- data.frame(data_items) # id (primary key) data <- data.frame(unique(data_items$id)) colnames(data) <- c('id') # n° of segment n_segment <- aggregate(data_items['segment.id'], by=data_items['id'], length) data <- join(data, n_segment, by='id', type='left', match='first') # sql left join # mileage mileage <- aggregate(data_items['segment.leg.mileage'], by=data_items['id'], sum) data <- join(data, mileage, by='id', type='left', match='first') # sql left join # price price <- data.frame('id'=data_items$id, 'price'=parse_number(data_items$saleTotal)) data <- join(data, price, by='id', type='left', match='first') # sql left join # dataframe colnames(data) <- c('id','segment', 'mileage', 'price') head(data)

The aggregation of mileage and price using the unique primary key allows to set up a dataframe ready for statistical analysis and data visualization. Current data tells us that there is a maximum of three segments in the connection between FCO and LHR with a minimum price of around EUR 122 and a median around EUR 600.

# descriptive statistics summary(data) # histogram price & distance g1 <- ggplot(data, aes(x=price)) + geom_histogram(bins = 50) + ylab("Distribution of the Price (EUR)") + xlab("Price (EUR)") g2 <- ggplot(data, aes(x=mileage)) + geom_histogram(bins = 50) + ylab("Distribution of the Distance") + xlab("Distance (miles)") grid.arrange(g1, g2) # price - distance relationship s0 <- ggplot(data = data, aes(x = mileage, y = price)) + geom_smooth(method = "lm", se=FALSE, color="black") + geom_point() + labs(x = "Distance in miles", y = "Price (EUR)") s0 <- ggMarginal(s0, type = "histogram", binwidth = 30) s0

Of course, plenty of other analysis and graphical representations using flights features are possible given the large set of variables available in QPX Express API and the availability of data in real time.

To conclude the 4-step (flight) trip from data acquisition to data analysis, let's recap the most important concepts described in each of the post:

That's all folks!

#R #rstats #maRche #json #curl #qpxexpress #Rbloggers

This post is also shared in www.r-bloggers.com

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Unconf projects 2: checkers, gramr, data-packages, exploRingJSON

Tue, 06/06/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

Following up on Stefanie's recap of unconf 17, we are following up this entire week with summaries of projects developed at the event. We plan to highlight 4-5 projects each day, with detailed posts from a handful of teams to follow.

checkers

Summary: checkers is a framework for reviewing analysis projects. It provides automated checks for best practices, using extensions on the goodpractice package.

In addition, checkers includes a descriptive guide for best practices.

gramr

Summary: gramr is a package that wraps the command line tool write-good to provide grammar checking functions for Rmd or md documents. It can be used as an RStudio Addin, or from the console or command line by supplying an Rmd or md filename.

p.s. Two of the three (Ben/Gordon) were remote participants.

data-packages

Summary: The data-packages team created a suite of outputs, including a flexdashboard with a searchable table to query dataset metadata.

In addition, the team created a best practices document for data packages, a Twitter bot @rstatsdata that tweets about datasets in R packages on CRAN, and they have plans for future work.

exploRingJSON

Summary: The exploRingJSON project includes a survey of R packages that deal with JSON data, a package (https://github.com/sctyner/JSOmetaN) to get high level metadata for JSON data, and a Shiny app to explore JSON.

Team:

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### The Many-Faced Future

Tue, 06/06/2017 - 08:21

(This article was first published on jottR, and kindly contributed to R-bloggers)

The future package defines the Future API, which is a unified, generic, friendly API for parallel processing. The Future API follows the principle of write code once and run anywhere – the developer chooses what to parallelize and the user how and where.

The nature of a future is such that it lends itself to be used with several of the existing map-reduce frameworks already available in R. In this post, I’ll give an example of how to apply a function over a set of elements concurrently using plain sequential R, the parallel package, the future package alone, as well as future in combination of the foreach, the plyr, and the purrr packages.

You can choose your own future and what you want to do with it. Example: Multiple Mandelbrot sets

The Julia package provides the JuliaImage() function for generating a Julia set for a given set of start parameters (centre, L, C) where centre specify the center point in the complex plane, L specify the width and height of the square region around this location, and C is a complex coefficient controlling the “shape” of the generated Julia set. For example, to generate one of the above Julia set images (1000-by-1000 pixels), you can use: library("Julia")
set <- JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = -0.4 + 0.6i)
plot_julia(set)

with plot_julia <- function(img, col = topo.colors(16)) {
par(mar = c(0, 0, 0, 0))
image(img, col = col, axes = FALSE)
}

For the purpose of illustrating how to calculate different Julia sets in parallel, I will use the same (centre, L) = (0 + 0i, 3.5) region as above with the following ten complex coeffients (from Julia set): Cs <- list(
a = -0.618 + 0i,
b = -0.4 + 0.6i,
c = 0.285 + 0i,
d = 0.285 + 0.01i,
e = 0.45 + 0.1428i,
f = -0.70176 - 0.3842i,
g = 0.835 - 0.2321i,
h = -0.8 + 0.156i,
i = -0.7269 + 0.1889i,
j = 0 - 0.8i
)

Now we’re ready to see how we can use futures in combination of different map-reduce implementations in R for generating these ten sets in parallel. Note that all approaches will generate the exact same ten Julia sets. So, feel free to pick your favorite approach.
Sequential

To process the above ten regions sequentially, we can use the lapply() function: library("Julia")
sets <- lapply(Cs, function(C) {
JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C)
})
Parallel library("parallel")
ncores <- future::availableCores() ## a friendly version of detectCores()
cl <- makeCluster(ncores)

clusterEvalQ(cl, library("Julia"))
sets <- parLapply(cl, Cs, function(C) {
JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C)
})
Futures (in parallel) library("future")
plan(multisession) ## defaults to availableCores() workers

library("Julia")
sets <- future_lapply(Cs, function(C) {
JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C)
})

We could also have used the more explicit setup plan(cluster, workers = makeCluster(availableCores())), which is identical to plan(multisession).
Futures with foreach library("doFuture")
registerDoFuture() ## tells foreach futures should be used
plan(multisession) ## specifies what type of futures

library("Julia")
sets <- foreach(C = Cs) %dopar% {
JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C)
}

Note that I didn’t pass .packages = "Julia" to foreach() because the doFuture backend will do that automatically for us – that’s one of the treats of using futures. If we would have used doParallel::registerDoParallel(cl) or similar, we would have had to worry about that.
Futures with plyr

The plyr package will utilize foreach internally if we pass .parallel = TRUE. Because of this, we can use plyr::llply() to parallelize via futures as follows: library("plyr")
library("doFuture")
registerDoFuture() ## tells foreach futures should be used
plan(multisession) ## specifies what type of futures

library("Julia")
sets <- llply(Cs, function(C) {
JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = C)
}, .parallel = TRUE)

For the same reason as above, also here we don’t have to worry about global variables and making sure needed packages are attached; that’s all handles by the future packages.
Futures with purrr (= furrr)

As a final example, here is how you can use futures to parallelize your purrr::map() calls: library("purrr")
library("future")
plan(multisession)

library("Julia")
sets <- Cs %>%
map(~ future(JuliaImage(1000, centre = 0 + 0i, L = 3.5, C = .x))) %>%
values

Comment: This latter approach will not perform load balance (“scheduling”) across backend workers; that’s a feature that ideally would be taken care of by purrr itself. However, I have some ideas for future versions of future (pun…) that may achieve this without having to modify the purrr package.
Got compute?

If you have access to one or more machines with R installed (e.g. a local or remote cluster, or a Google Compute Engine cluster), and you’ve got direct SSH access to those machines, you can have those machines calculate the above Julia sets for you; just change future plan, e.g. plan(cluster, workers = c("machine1", "machine2", "machine3.remote.org"))

If you have access to a high-performance compute (HPC) cluster with a HPC scheduler (e.g. Slurm, TORQUE / PBS, LSF, and SGE), you can harness its power by switching to: library("future.batchtools")
plan(batchtools_sge)

For more details, see the vignettes of the future.batchtools and batchtools packages.

Happy futuring!

See also var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: jottR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### anytime 0.3.0

Tue, 06/06/2017 - 03:05

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version of the anytime package is now on CRAN. It marks the eleventh release since the inaugural version late last summer.

anytime is a very focused package aiming to do just one thing really well: to convert anything in integer, numeric, character, factor, ordered, … format to either POSIXct or Date objects — and to do so without requiring a format string. See the anytime page, or the GitHub README.md for a few examples.

This release brings a little more consistency to how numeric or integer arguments are handled. Previously, we were overly eager in accepting something such as 20150605 (i.e. today) as a (numerical or integer) input to both anytime() and anydate(). That is well-intentioned, but ultimately foolish. We relied on heuristic cutoffs to determine whether input was "meant to be" a date or time offset. There lies madness. We now differentiate whether we were called via anytime() (in which case numerical data is second offset to the epoch, just as.POSICct()) or anytime() (in which case it is days offset to the (date) epoch, just like as.Date()). The previous behaviour can be restored via a options, both function-local as well as global are supported. And of course, there is no change for all other (and more common) input formats, notably character or factor. A full list of changes follows.

Changes in anytime version 0.3.0 (2017-06-05)
• Numeric input is now always an offset to epoch, with anytime() using seconds, and anydate() using dates. (#65 fixing #63).

• Old behaviour can be re-enabled with an option also supporting a global setting getOption("anytimeOldHeuristic")

• RStudio versions 1.1.129 or later can run all functions without fear of crashing due to a change in their use of Boost.

• Replaced init.c with registration code inside of RcppExports.cpp thanks to Rcpp 0.12.11.

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the anytime page.

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Who Survives Riddler Nation?

Tue, 06/06/2017 - 02:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

Introduction

Last week, I published an article on learning to fight in the Battle for Riddler Nation. Here’s a refresher of the rules:

In a distant, war-torn land, there are 10 castles. There are two warlords: you and your archenemy. Each castle has its own strategic value for a would-be conqueror. Specifically, the castles are worth 1, 2, 3, …, 9, and 10 victory points. You and your enemy each have 100 soldiers to distribute, any way you like, to fight at any of the 10 castles. Whoever sends more soldiers to a given castle conquers that castle and wins its victory points. If you each send the same number of troops, you split the points. You don’t know what distribution of forces your enemy has chosen until the battles begin. Whoever wins the most points wins the war.

Submit a plan distributing your 100 soldiers among the 10 castles. Once I receive all your battle plans, I’ll adjudicate all the possible one-on-one match-ups. Whoever wins the most wars wins the battle royale and is crowned king or queen of Riddler Nation!

This battle took place in two rounds. In the first round, I submitted the solution (1, 1, 1, 1, 1, 1, 23, 23, 24, 24) using intuition, and did not do particularly well. In the second round, I used agent-based modelling and an evolutionary model to get a solution. The strategy (3, 3, 13, 2, 4, 1, 27, 30, 10, 7), performed best in the last round of the simulated 100-years war of Riddler Nation, so I submitted it to the contest.

The results of the second round are now out, and although Mr. Roeder enjoyed my article enough to give it a shout-out (sweet, my name is on FiveThirtyEight, which is among my favorite news sites!), I did not win. Now, this is not shocking; with 931 participants, and game theory essentially reducing every game to either domination by one player or an optimal gamble (the latter being true here), the chances of my coming out on top are slim. With that in mind, I cannot be disappointed. In fact, I have some reasons to feel good; my strategy beats the top performers, who seemed to expect others would play the optimal strategy from the first round (Mr. Roeder’s statistics revealed a downward shift in troop allocations to lower-valued castles) and thus chose a strategy to beat it. Perhaps my strategy was ahead of its time? Who knows.

Mr. Roeder has published the strategies his readers submitted for the second round, and I will be rebuilding a strategy from that data using Mesa and agent-based modeling again, but we will be doing more.

First, let’s see how well my strategy did in the last round, because I want to know.

Second, I’m going to explore the idea that my method generates a mixture of strategies that reflects the game-theoretic optimal strategy. (See the conclusion of my previous post to see the definition of this strategy and its properties.) In some sense, this allows me to determine if I lost the last round because of bad luck or because other players were choosing superior strategies.

Third, I want to cluster the strategies so I can see what “species” of strategies people chose. I use the excellent R package mclust for this clustering.

Fourth and finally, I visualize the evolution of strategies as my simulation plays out, using ggplot2, dimensionality reduction, and gganimate.

Results of the Last Battle for Riddler Nation

The winning strategy of the last round, submitted by Vince Vatter, was (0, 1, 2, 16, 21, 3, 2, 1, 32, 22), with an official record1 of 751 wins, 175 losses, and 5 ties. Naturally, the top-performing strategies look similar. This should not be surprising; winning strategies exploit common vulnerabilities among submissions.

I’ve downloaded the submitted strategies for the second round (I already have the first round’s strategies). Lets load them in and start analyzing them.

import pandas as pd from pandas import Series, DataFrame import numpy as np from numpy.random import poisson, uniform import mesa, mesa.time, mesa.datacollection import pymysql, sqlalchemy import random # Code from the last post def gen_strat(obj): """Generating a Series consisting of castle troop assignments args: obj: An iterable (NumPy array, Series, list, tuple, etc.) containing troop assignment return: Series; The generated strategy """ srs = Series(obj).apply(int) if (len(srs) != 10) or (srs.sum() != 100) or ((srs < 0).values.any()): raise ValueError("A valid strategy consists of ten integers between 0 and 100 that sum to 100.") srs.index = pd.Index(np.arange(10) + 1) return srs def check_strat(strat): """Checks that strat is a valid strategy, like that created by gen_strat args: strat: Series; An object to check return: bool; True if valid """ if str(type(strat)) == "": # All the other conditions that must hold if len(strat) == 10 and strat.sum() == 100 and (strat >= 0).values.all() and \ (strat == strat.apply(int)).values.all() and (strat.index.values == (np.arange(10) + 1)).all(): return True return False def battle_strat(strat1, strat2): """Determine which strategy between two wins args: strat1: Series; generated by gen_strat (or obeying its rules) strat2: Series; generated by gen_strat (or obeying its rules) return: dict; with the following fields: * "winner": int; 1 for strat1, -1 for strat2, 0 for tie * "pts": list; Item 0 is strat1's points, Item1 strat2's points """ if not check_strat(strat1) or not check_strat(strat2): raise ValueError("Both strat1 and strat2 must be valid strategies") castle_margins = strat1 - strat2 win1 = (castle_margins > 0).apply(int) win2 = (castle_margins < 0).apply(int) tie = (castle_margins == 0).apply(int) castle_points = Series(strat1.index, index=strat1.index) tie_pts = (tie * castle_points).sum() / 2 pts1 = (win1 * castle_points).sum() + tie_pts pts2 = (win2 * castle_points).sum() + tie_pts pts_list = [pts1, pts2] if pts1 > pts2: return {"winner": 1, "pts": pts_list} elif pts1 < pts2: return {"winner": -1, "pts": pts_list} else: return {"winner": 0, "pts": pts_list} def battle_strat_df(df1, df2): """Adjudicate all battles to see who wins args: df1: DataFrame; A data frame of warlords' strategies df2: DataFrame; A data frame of warlords' strategies return: DataFrame; Contains columns for Win, Tie, and Lose, the performance of the warlords of df1 against those of df2 """ df1_mat = df1.values df2_mat = df2.values score_mat = np.ones(df2_mat.shape, dtype=np.int8) res = list() for i in range(df1_mat.shape[0]): vec = df1_mat[i, np.newaxis, :] margin = vec - df2_mat df1_castles = (margin > 0) * (np.arange(10) + 1) df2_castles = (margin < 0) * (np.arange(10) + 1) tie_castles = (margin == 0) * (np.arange(10) + 1) tie_scores = tie_castles.sum(axis=1) / 2 df1_scores = df1_castles.sum(axis=1) + tie_scores df2_scores = df2_castles.sum(axis=1) + tie_scores res.append({"Win": (df1_scores > df2_scores).sum(), "Tie": (df1_scores == df2_scores).sum(), "Losses": (df1_scores < df2_scores).sum()}) return DataFrame(res, index=df1.index) ########################### ## Definition of the ABM ## ########################### class Warlord(mesa.Agent): """A warlord of Riddler Nation, who starts with a war doctrine, provinces, and fights other warlords""" def __init__(self, model, unique_id, doctrine, provinces, max_provinces, mutate_prob, mutate_amount): """Create a warlord args: model: mesa.Model; The model to which the agent belongs unique_id: int; Identifies the agent doctrine: Series; Describes the doctrine, and must be like one generated by gen_strat() provinces: int; The number of provinces the warlord starts with max_provinces: int; The maximum number of provinces a warlord can have before a split mutate_prob: float; A number between 0 and 1 that represents the probability of a mutation in offspring mutate_ammount: float; A number greater than 0 representing average number of mutations in doctrine of offspring return: Warlord """ self.model = model if not check_strat(doctrine): raise ValueError("doctrine must be a valid strategy") self.doctrine = doctrine self.unique_id = int(unique_id) self.provinces = max(0, int(provinces)) # provinces at least 0 self.max_provinces = max(2, int(max_provinces)) # max_provinces at least 2 self.mutate_prob = max(min(mutate_prob, 1), 0) # between 0 and 1 self.mutate_amount = max(mutate_amount, 0) self.opponent = None # In future, will be the id of this warlord's opponent def step(self): """In a step, find a new opponent to battle if not assigned""" if self.opponent is None: other = self.model.get_random_warlord() self.opponent = other other.opponent = self def advance(self): """Fight, and handle death or split of provinces""" if self.opponent is not None: # Resolve battle other = self.opponent res = battle_strat(self.doctrine, other.doctrine)["winner"] self.provinces += res other.provinces -= res # Clear opponents self.opponent = None other.opponent = None # Resolve possible death if self.provinces <= 0: self.model.schedule.remove(self) # If too many provinces, split, create new warlord if self.provinces >= self.max_provinces: son_doctrine = self.doctrine son_mutate_prob = self.mutate_prob son_mutate_amount = self.mutate_amount son_provinces = int(self.provinces / 2) # Is there a mutation? if uniform() < self.mutate_prob: # Yes! Apply the mutation son_mutate_prob = (self.mutate_prob + 1) / 2 son_mutate_amount = self.mutate_amount * 2 # Change doctrine mutations = min(poisson(self.mutate_amount), 100) for _ in range(mutations): son_doctrine[random.choice(son_doctrine.index)] += 1 # Add 1 to a random castle son_doctrine[random.choice(son_doctrine[son_doctrine > 0].index)] -= 1 # Subtract 1 from a random # castle that has at least # one soldier # Create the son son = Warlord(model = self.model, unique_id = self.model.next_id(), doctrine = son_doctrine, provinces = son_provinces, max_provinces = self.max_provinces, mutate_prob = son_mutate_prob, mutate_amount = son_mutate_amount) self.model.schedule.add(son) # Changes to the warlord self.mutate_prob /= 2 self.mutate_amount /= 2 self.provinces = self.max_provinces - son_provinces class RiddlerBattle(mesa.Model): """The Model that handles the simulation of the Battle for Riddler Nation""" def __init__(self, doctrines, N=None, max_provinces=6, mutate_prob=0.9, mutate_amount=10, debug=False): """Create an instance of the Battle for Riddler Nation args: doctrines: DataFrame; Contains all the doctrines for the warlords to create N: int; The number of warlords to create; if None, the number of rows of doctrines max_provinces: int; The maximum number of provinces each warlord can have mutate_prob: float; Each warlord's initial mutation probability mutate_amount: float; Each warlord's initial rate of number of mutations debug: bool; Debug mode return: RiddlerBattle """ if N is None: N = doctrines.shape[0] if debug: strat = {"Doctrine": lambda w: w.doctrine.values, "Provinces": lambda w: w.provinces} else: strat = {"Doctrine": lambda w: w.doctrine.values} self.schedule = mesa.time.SimultaneousActivation(self) # Battles are determined simultaneously self._max_id = 0 # The highest id currently used by the model self.dc = mesa.datacollection.DataCollector(agent_reporters=strat) # Data collection self.create_warlords(doctrines, N, max_provinces, mutate_prob, mutate_amount) self._gen_warlord_list() def create_warlords(self, doctrines, N, max_provinces, mutate_prob, mutate_amount): """Populate the model with warlords args: doctrines: DataFrame; Contains all the doctrines for the warlords to create N: int; The number of warlords to create max_provinces: int; The maximum number of provinces each warlord can have mutate_prob: float; Each warlord's initial mutation probability mutate_amount: float; Each warlord's initial rate of number of mutations """ i = 0 init_provinces = int(max_provinces / 2) for _ in range(N): w = Warlord(model = self, unique_id = self.next_id(), doctrine = doctrines.iloc[i, :], provinces = init_provinces, max_provinces = max_provinces, mutate_prob = mutate_prob, mutate_amount = mutate_amount) self.schedule.add(w) i += 1 if (i >= doctrines.shape[0]): i = 0 # We've run out of available doctrines, so start at the beginning of the DataFrame def _gen_warlord_list(self): """Generate a list of warlords used for assigning opponents""" self._warlord_list = [w for w in self.schedule.agents] def get_random_warlord(self): """Pull a random warlord without an opponent""" i = random.choice([i for i in range(len(self._warlord_list))]) return self._warlord_list.pop(i) def next_id(self): """Get the next valid id for the model, and update the max id of the model return: int; Can be used as an id """ self._max_id += 1 return self._max_id def step(self): """Take a step""" self.dc.collect(self) self.schedule.step() self._gen_warlord_list() # Reset the list of available warlords def run_model(self, steps): """Run the model args: steps: int; The number of steps to run the model """ steps = int(steps) for _ in range(steps): self.step() # Read in data, and create a DataFrame containing the strategies from both rounds rs1 = pd.read_csv("castle-solutions.csv", encoding='latin-1').iloc[:, 0:10].applymap(int) rs1.columns = pd.Index(np.arange(10) + 1) rs1 = rs1.loc[rs1.apply(check_strat, axis=1), :] # Use only valid strategies rs1.index = pd.Index(np.arange(rs1.shape[0])) rs2 = pd.read_csv("castle-solutions-2.csv", encoding='latin-1').iloc[:, 0:10].applymap(int) rs2.columns = rs1.columns.copy() rs2 = rs2.loc[rs2.apply(check_strat, axis=1), :] # Use only valid strategies rs2.index = pd.Index(np.arange(rs2.shape[0])) # Combine into one DataFrame, with a multi-index for distinguishing participants riddler_strats = pd.concat([rs1, rs2]) riddler_strats.index = pd.MultiIndex(levels=[[1, 2], list(set(rs1.index.values.tolist()).union(rs2.index.values.tolist()))], labels=[([0] * rs1.shape[0]) + ([1] * rs2.shape[0]), rs1.index.values.tolist() + rs2.index.values.tolist()], names=["round", "idx"]) riddler_strats.head() 1 2 3 4 5 6 7 8 9 10 round idx 1 0 100 0 0 0 0 0 0 0 0 0 1 52 2 2 2 2 2 2 12 12 12 2 26 26 26 16 1 1 1 1 1 1 3 25 0 0 0 0 0 0 25 25 25 4 25 0 0 0 0 0 0 25 25 25 del rs1, rs2 # No longer needed

Let’s see how my strategy fared.

# See how my strategy did in the second round myStrat = gen_strat((3, 3, 13, 2, 4, 1, 27, 30, 10, 7)) myStrat_vs_r2 = battle_strat_df(DataFrame([myStrat]), riddler_strats.loc[2]) myStrat_vs_r2 Losses Tie Win 0 358 10 534 myStrat_vs_r2.Win / myStrat_vs_r2.values.sum() 0 0.592018 Name: Win, dtype: float64

You may recall that with my first strategy I was winning 48% of my battles, and now I won 59% of battles, an improvement.

Let’s now get an overall ranking for the round 2 strategies.

r2_res = battle_strat_df(riddler_strats.loc[2], riddler_strats.loc[2]) r2_res.sort_values("Win", ascending=False).head() Losses Tie Win idx 0 169 6 727 1 179 9 714 2 182 8 712 4 193 6 703 3 194 10 698

This is an exact match to Mr. Roeder’s results (there’s one extra tie than there should be, since a strategy always ties against itself), which is good news. (It shows I’m not crazy!)

How did second-round strategies fare against the first?

r2_vs_r1 = battle_strat_df(riddler_strats.loc[2], riddler_strats.loc[1]) r2_vs_r1.sort_values("Win", ascending=False).head() Losses Tie Win idx 299 107 4 1238 400 107 4 1238 300 107 4 1238 297 107 4 1238 379 108 3 1238

I would also like to summarize the strategies in round 2. One way is to take the mean of the troop allocations.

riddler_strats.loc[2].mean().round(0) 1 3.0 2 4.0 3 6.0 4 8.0 5 10.0 6 12.0 7 14.0 8 17.0 9 14.0 10 12.0 dtype: float64 # How did my strategy fair against the mean? battle_strat(myStrat, gen_strat(riddler_strats.loc[2].mean().round(0))) {'pts': [18.5, 36.5], 'winner': -1}

Strange… my strategy does not beat the mean strategy.

Mixed Strategy

Unfortunately, I did not save the data I generated from the last post; it’s gone forever. Below, I recreate the simulation I developed on the data from the first Battle for Riddler Nation. Furthermore, because I want to use this data later, I save the simulations in a MySQL database.

%time rb1 = RiddlerBattle(riddler_strats.loc[(1), :]) CPU times: user 3.42 s, sys: 32 ms, total: 3.45 s Wall time: 3.56 s %time rb1.run_model(100) CPU times: user 17min 44s, sys: 1.83 s, total: 17min 46s Wall time: 18min 9s warlords1 = rb1.dc.get_agent_vars_dataframe() warlords1 = DataFrame([w[1]["Doctrine"] for w in warlords1.iterrows()], columns=np.arange(10) + 1, index=warlords1.index) def pymysql_sqlalchemy_stringgen(user, passwd, host, dbname): """Generate a connection string for use with SQLAlchemy for MySQL and PyMySQL connections Args: user (str): The username of the connecting user passwd (str): The user's password host (str): The host for where the database is located dbname (str): The name of the database to connect with Returns: (str) A SQLAlchemy connection string suitable for use with create_engine() Additional options for the connection are not supported with this function. """ return "mysql+pymysql://" + user + ":" + passwd + "@" + host + "/" + dbname conn = sqlalchemy.create_engine(pymysql_sqlalchemy_stringgen("root", pswd, "localhost", "riddlerbattles")).connect() try: conn.execute("DROP TABLE round1simulation;") except: pass make_table = """CREATE TABLE round1simulation ( step int(4), agentid int(8), castle_1 int(8), castle_2 int(8), castle_3 int(8), castle_4 int(8), castle_5 int(8), castle_6 int(8), castle_7 int(8), castle_8 int(8), castle_9 int(8), castle_10 int(8), PRIMARY KEY (step,agentid) );""" conn.execute(make_table) <sqlalchemy.engine.result.ResultProxy at 0xa979946c> # Put the data in the table temp_table = warlords1.copy() temp_table.columns = pd.Index(["castle_" + str(i) for i in temp_table.columns]) temp_table.columns Index(['castle_1', 'castle_2', 'castle_3', 'castle_4', 'castle_5', 'castle_6', 'castle_7', 'castle_8', 'castle_9', 'castle_10'], dtype='object') temp_table.to_sql("round1simulation", con=conn, if_exists='append') del temp_table # Not needed any longer

As mentioned in my previous post, the optimal strategy is a random strategy. No choice of troop allocation that always win against other allocations exists. Instead, a player employing optimal play randomly allocates troops to castles, with some probability distribution. The question is: what is that probability distribution?

I hypothesize that the distribution of pure strategies used by the warlords generated by the ABM is close to whatever the optimal distribution is. In particular, random warlords picked from the final population of warlords produced by the ABM tie or beat random strategies picked by players at least half of the time.

I test this hypothesis below.

def random_battle(df1, df2): """Pick a random warlord from df1 and have him battle a random warlord from df2 Args: df1: DataFrame; A data frame of warlords' strategies df2: DataFrame; A data frame of warlords' strategies Returns: int; The result of the battle, with 1 for victory for the warlord from df1, -1 for defeat, 0 for tie """ return battle_strat(gen_strat(df1.sample(n=1).values[0]), gen_strat(df2.sample(n=1).values[0]))['winner'] warlords1.head() 1 2 3 4 5 6 7 8 9 10 Step AgentID 0 1 100 0 0 0 0 0 0 0 0 0 2 52 2 2 2 2 2 2 12 12 12 3 26 26 26 16 1 1 1 1 1 1 4 24 0 0 0 0 2 0 22 27 25 5 26 0 0 0 0 1 0 24 25 24 # Do 1000 random battles, picking from my warlords found with the ABM and those from the second round of the # Battle for Riddler Nation rand1 = Series([random_battle(warlords1.loc[99], riddler_strats.loc[2]) for _ in range(1000)]) rand1.head() 0 -1 1 1 2 -1 3 -1 4 -1 dtype: int64 rand1.value_counts() -1 508 1 469 0 23 dtype: int64 # Winning proportion rand1.value_counts().loc[1] / rand1.value_counts().sum() 0.46899999999999997

That is not good news for my hypothesis. The strategies found via the ABM lose against human opponents more frequently than they win. This should not be happening if my belief were correct.

Perhaps ABMs cannot be used to solve this problem after all, or at least for my approach. Then again, maybe I need a longer "training" period; that is, 100 rounds are not enough to learn a strategy. Perhaps 1,000 or 3,000 are needed (though completing that many round would take a long time), or I need to seed the system with more agents.

Strategy Clusters

I believe that humans are going to be prone to choosing strategies that follow similar themes. Perhaps some players believe in a balanced approach to all castles, some emphasize high-point castles, others low-point castles, and so on. Here, I use cluster analysis to examine this hypothesis.

I will be switching from Python to R from this point on, but there's a few things I want to do in Python first: submit strategies from the second round to my database, and generate strategies via the ABM when using strategies from both rounds of the Battle for Riddler Nation as the initial seed.

# Submit round 2 strategies try: conn.execute("DROP TABLE round2strats;") except: pass make_table = """CREATE TABLE round2strats ( idx int(4), castle_1 int(8), castle_2 int(8), castle_3 int(8), castle_4 int(8), castle_5 int(8), castle_6 int(8), castle_7 int(8), castle_8 int(8), castle_9 int(8), castle_10 int(8), PRIMARY KEY (idx) );""" conn.execute(make_table) temp_table = riddler_strats.loc[2].copy() temp_table.columns = pd.Index(["castle_" + str(i) for i in temp_table.columns]) temp_table.to_sql("round2strats", con=conn, if_exists='append') del temp_table # Generate combined strategies ABM results %time rb2 = RiddlerBattle(DataFrame(riddler_strats.values, columns=np.arange(10) + 1)) CPU times: user 5.86 s, sys: 12 ms, total: 5.88 s Wall time: 6.12 s %time rb2.run_model(100) CPU times: user 29min 44s, sys: 1.77 s, total: 29min 46s Wall time: 30min 17s warlords2 = rb2.dc.get_agent_vars_dataframe() warlords2 = DataFrame([w[1]["Doctrine"] for w in warlords2.iterrows()], columns=np.arange(10) + 1, index=warlords2.index) # FYI: I'm aware this is bad database design, and I don't care. try: conn.execute("DROP TABLE round2simulation;") except: pass make_table = """CREATE TABLE round2simulation ( step int(4), agentid int(8), castle_1 int(8), castle_2 int(8), castle_3 int(8), castle_4 int(8), castle_5 int(8), castle_6 int(8), castle_7 int(8), castle_8 int(8), castle_9 int(8), castle_10 int(8), PRIMARY KEY (step,agentid) );""" conn.execute(make_table) temp_table = warlords2.copy() temp_table.columns = pd.Index(["castle_" + str(i) for i in temp_table.columns]) temp_table.to_sql("round2simulation", con=conn, if_exists='append') del temp_table # We're done in Python now; only remaining business is closing the MySQL connection conn.close()

Now we can start using R.

# Loading packages pckgs_cran <- c("dplyr", "ggplot2", "mclust", "devtools", "RMySQL") for (pckg in pckgs_cran) { if (!require(pckg, character.only = TRUE)) { install.packages(pckg) require(pckg, character.only = TRUE) } } ## Loading required package: dplyr ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union ## Loading required package: ggplot2 ## Loading required package: mclust ## Package 'mclust' version 5.3 ## Type 'citation("mclust")' for citing this R package in publications. ## Loading required package: devtools ## Loading required package: RMySQL ## Loading required package: DBI if (!require("gganimate")) { devtools::install_github("dgrtwo/gganimate") library(gganimate) } ## Loading required package: gganimate %s% <- function(x, y) paste(x, y) %s0% <- function(x, y) paste0(x, y)

Now let's connect to the database so we can work with the data.

db <- src_mysql("riddlerbattles", host = "127.0.0.1", user = "root", password = pswd) castle_str <- paste0("castle_", 1:10) round1_strats <- tbl(db, sql("SELECT" %s% paste(castle_str, collapse = ", ") %s% "FROM round1simulation WHERE step = 0")) round2_strats <- tbl(db, sql("SELECT" %s% paste(castle_str, collapse = ", ") %s% "FROM round2strats")) round1_sim <- tbl(db, sql("SELECT step," %s% paste(castle_str, collapse = ", ") %s% "FROM round1simulation")) round2_sim <- tbl(db, sql("SELECT step," %s% paste(castle_str, collapse = ", ") %s% "FROM round2simulation"))

I first will embed the strategies from round 1 into a 2D space, just for the sake of visualization. I do this by finding a distance matrix, then using classical multidimensional scaling to find points in a 2D plane that closely preserve the distance matrix. Here, I feel that Manhattan distance is the best way to describe distances between strategies (it basically translates to how many troops need to be moved for one strategy to become another).

round1_strats_mat % as.data.frame %>% as.matrix round2_strats_mat % as.data.frame %>% as.matrix strats_mds % dist("manhattan") %>% cmdscale round1_strats_mds <- strats_mds[1:nrow(round1_strats_mat), ] round2_strats_mds <- strats_mds[-(1:nrow(round1_strats_mat)), ] round1_strats_plot <- data.frame("x" = round1_strats_mds[, 1], "y" = round1_strats_mds[, 2]) round2_strats_plot <- data.frame("x" = round2_strats_mds[, 1], "y" = round2_strats_mds[, 2]) ggplot(round1_strats_plot, aes(x = x, y = y)) + geom_point() + theme_bw()

ggplot(round2_strats_plot, aes(x = x, y = y)) + geom_point() + theme_bw()

Clusters definitely exist in the first round. In my opinion, at least five, maybe six clusters exist (I want to consider stragglers a separate cluster). In the second round, I observe more uniformity; my guess is there are between one and four clusters, but most likely one.

I want to know how performance in the Battle is distributed. For this, I'll need an adjudicator, but I can write one for R similarly to how I wrote one in Python with NumPy.

# Interesting: this is barely slower than the Python original battle_strat_df <- function(df1, df2) { # Adjudicate all battles to see who wins # # args: # df1: matrix; A matrix of warlords' strategies # df2: matrix; A matrix of warlords' strategies # return: # matrix; Contains columns for Win, Tie, and Lose, the performance # of the warlords of df1 against those of df2 score_mat <- matrix(rep(1:10, times = nrow(df2)), nrow = nrow(df2), byrow = TRUE) res <- sapply(1:nrow(df1), function(i) { vec <- df1[i, ] margin <- matrix(rep(vec, times = nrow(df2)), nrow = nrow(df2)) - df2 df1_castles 0) * score_mat df2_castles <- (margin < 0) * score_mat tie_castles <- (margin == 0) * score_mat tie_scores <- rowSums(tie_castles) / 2 df1_scores <- rowSums(df1_castles) + tie_scores df2_scores df2_scores), "Tie" = sum(df1_scores == df2_scores), "Losses" = sum(df1_scores < df2_scores))) }) return(t(res)) } round1_res <- battle_strat_df(round1_strats_mat, round1_strats_mat) round1_strats_plot$win <- (round1_res[, "Win"] / nrow(round1_strats_mat)) round2_res <- battle_strat_df(round2_strats_mat, round2_strats_mat) round2_strats_plot$win <- (round2_res[, "Win"] / nrow(round2_strats_mat)) ggplot(round1_strats_plot, aes(x = x, y = y, color=win)) + geom_point() + scale_color_gradient(low = "dark blue", high = "orange") + theme_bw()

ggplot(round2_strats_plot, aes(x = x, y = y, color=win)) + geom_point() + scale_color_gradient(low = "dark blue", high = "orange") + theme_bw()

Judging from these charts, strategies near the periphery don't do very well, while those near the center tend to do better. A peripheral strategy probably resembles (100, 0, 0, 0, 0, 0, 0, 0, 0, 0), but the winning strategies look, in some sense, "balanced". With that in mind, these pictures are not too shocking. Looking at the first chart for the first round one may guess that some clusters consist of well performing strategies, while other clusters consist of more poor performers.

Let's see if we can use this embedding and mclust to determine what the clusters are. mclust is an R package for model-based clustering; that is, it applies the EM-algorithm to find a clustering scheme that works well for data that was drawn from Normally distributed clusters. If you're familiar with k-means clustering, model-based clustering via the EM-algorithm and a Normal model is a much more sophisticated version of the same thing.

mclust's principal function, Mclust(), takes automated steps to find a good clustering scheme. I simply run it out of the box to find clusters for both data sets.

clust1 <- Mclust(round1_strats_mds, G=1:6) summary(clust1) ## ---------------------------------------------------- ## Gaussian finite mixture model fitted by EM algorithm ## ---------------------------------------------------- ## ## Mclust VEV (ellipsoidal, equal shape) model with 5 components: ## ## log.likelihood n df BIC ICL ## -13023.54 1349 25 -26227.25 -26471.11 ## ## Clustering table: ## 1 2 3 4 5 ## 75 575 123 310 266 round1_strats_plot$clust <- as.factor(clust1$classification) ggplot(round1_strats_plot, aes(x = x, y = y, color=clust)) + geom_point() + scale_color_brewer(palette = "Dark2") + theme_bw()

clust2 <- Mclust(round2_strats_mds, G=1:4) summary(clust2) ## ---------------------------------------------------- ## Gaussian finite mixture model fitted by EM algorithm ## ---------------------------------------------------- ## ## Mclust VVI (diagonal, varying volume and shape) model with 3 components: ## ## log.likelihood n df BIC ICL ## -8741.765 902 14 -17578.79 -17678.37 ## ## Clustering table: ## 1 2 3 ## 735 25 142 round2_strats_plot$clust <- as.factor(clust2$classification) ggplot(round2_strats_plot, aes(x = x, y = y, color=clust)) + geom_point() + scale_color_brewer(palette = "Dark2") + theme_bw()

I would say reasonable clustering were found both times, and they largely affirm what I believed; there's five good clusters in the first round, but by the second round players started to converge to variations on one strategy (it may have found three clusters, but they're not very convincing).

Let's characterize the clusters even more.

clust_strats1 <- t(sapply(unique(clust1$classification), function(c) colMeans(round1_strats_mat[which( clust1$classification == c), ]))) clust_strats1 %>% round(digits = 0) ## castle_1 castle_2 castle_3 castle_4 castle_5 castle_6 castle_7 ## [1,] 3 3 4 6 7 10 12 ## [2,] 4 7 8 10 9 10 6 ## [3,] 7 8 9 13 17 20 24 ## [4,] 1 2 2 5 7 12 17 ## [5,] 2 4 4 8 13 18 23 ## castle_8 castle_9 castle_10 ## [1,] 17 19 19 ## [2,] 3 12 31 ## [3,] 1 1 1 ## [4,] 25 27 1 ## [5,] 27 1 1 clust_strats1 %>% heatmap(Colv=NA)

clust_wins1 <- sapply(unique(clust1$classification), function(c) { res <- battle_strat_df(round1_strats_mat[which( clust1$classification == c ), ], round1_strats_mat) return(mean(res[, "Win"] / nrow(round1_strats_mat))) }) names(clust_wins1) <- 1:5 clust_wins1 ## 1 2 3 4 5 ## 0.3497831 0.3156116 0.3306334 0.3121045 0.3097644 clust_strats2 <- t(sapply(unique(clust2$classification), function(c) colMeans(round2_strats_mat[which( clust2$classification == c), ]))) clust_strats2 %>% round(digits = 0) ## castle_1 castle_2 castle_3 castle_4 castle_5 castle_6 castle_7 ## [1,] 3 4 6 8 10 13 13 ## [2,] 7 1 0 0 0 0 0 ## [3,] 2 4 5 7 11 10 24 ## castle_8 castle_9 castle_10 ## [1,] 14 15 13 ## [2,] 31 30 29 ## [3,] 30 4 4 clust_strats2 %>% heatmap(Colv=NA)

clust_wins2 <- sapply(unique(clust2$classification), function(c) { res <- battle_strat_df(round2_strats_mat[which( clust2$classification == c ), ], round2_strats_mat) return(mean(res[, "Win"] / nrow(round2_strats_mat))) }) names(clust_wins2) <- 1:3 clust_wins2 ## 1 2 3 ## 0.3464018 0.1466962 0.3247166

In the first round, we can see that there were five doctrines for troop allocation. Doctrines 1 and 2 bet heavy on high-value castles, and the other three were varying degrees of betting on low-value castles, with the most successful of the three groups betting almost exclusively on low-value castles, ignoring those with a value greater than 7. The other two were less successful the more they bet on high-value castles.

In the second round, people tended to spread out their troop allocations more. Most gravitated towards an even distribution of troops among castles with values of at least 6. The least successful group, and also the smallest group, continued to bet heavy on high-value castles, and a subset bet slightly more heavily on lower-value castles.

Strategy Evolution

Finally, I want to look at how strategies developed during the simulation evolved.

In spirit, I repeat the multidimensional scaling, clustering, and plotting I did above, but clusters are found on data that extends across time (I hope this shows, say, a speciation process, or perhaps homogenization), and the plot is animated.

This is all done for the first simulation, using strategies submitted in only the first round, below. Unfortunately, the classical multidimensional scaling method and model-based clustering methods that I implemented before won't work here; they're too computationally taxing, and simply fail and crash R. Since the matrix off strategies is large, I'm not surprised, but I am disappointed. I had to resort to PCA and using the first two principal components, and using k-means for clustering instead of model-based clustering.

round1_sims_mat <- round1_sim %>% as.data.frame %>% as.matrix round1_sims_mds <- cbind(princomp(round1_sims_mat[, 2:11], scores = TRUE)$scores[, 1:2], round1_sims_mat[, "step"]) colnames(round1_sims_mds) <- c("x", "y", "step") sims_clust1 <- kmeans(round1_sims_mds[,1:2], centers = 10)$cluster round1_sims_plot <- data.frame("x" = round1_sims_mds[,"x"], "y" = round1_sims_mds[,"y"], "step" = round1_sims_mds[,"step"], "clust" = as.factor(sims_clust1)) p <- ggplot(round1_sims_plot, aes(x = x, y = y, color = clust, frame = step)) + geom_point() + theme_bw() gganimate(p, "out.gif", interval = .2)

I was disappointed by the result. I expected to see exploration of the feature space, species being born and going extinct, but all I see is mass population die-off as the algorithm converges on a small handful of survivors. Flimsy strategy choices are weened out, and eventually more and more weak choices die out, but there's no reason to think that the algorithm is allowing individuals with good characteristics (that is, good choice of doctrine) not necessarily included in the initial set of strategies to be born and challenge incumbents.

If my strategy is not performing well, this is likely the explanation: not enough exploration of the possible space of strategies. This suggests allow for more radical evolution in the ABM as opposed to what is currently being employed.

Here is the simulation on a larger starting strategy set, including the strategies from both round 1 and round 2. The result is basically the same.

round2_sims_mat <- round2_sim %>% as.data.frame %>% as.matrix round2_sims_mds <- cbind(princomp(round2_sims_mat[, 2:11], scores = TRUE)$scores[, 1:2], round2_sims_mat[, "step"]) colnames(round2_sims_mds) <- c("x", "y", "step") sims_clust2 <- kmeans(round2_sims_mds[,1:2], centers = 10)$cluster round2_sims_plot <- data.frame("x" = round2_sims_mds[,"x"], "y" = round2_sims_mds[,"y"], "step" = round2_sims_mds[,"step"], "clust" = as.factor(sims_clust2)) p <- ggplot(round2_sims_plot, aes(x = x, y = y, color = clust, frame = step)) + geom_point() + theme_bw() gganimate(p, "out2.gif", interval = .2)

Conclusion

There's a lot to learn from these data sets. I want to see what words appear in players' description of their strategies' motivation, allowing me to discover thought patterns in strategy clusters. I also want to see if a neural network could be used to learn good strategies.

That said, the ABM method appears to need tweaking. The winning strategy's author said he was using genetics-based algorithms to find his solution, a commonly recurring theme. I thought I was using a genetic algorithm, but it seems that although the weak die off and the strong reproduce, there are not enough mutations in my model to allow finding a good strategy. Maybe this could be fixed by parameter tuning. I don't know. We will need to see.

That said, this has been a fun project, and I have learned a lot in the process. Perhaps next time I will reign supreme over Riddler Nation.

1. As mentioned in my first article, I was not able to replicate Mr. Roeder’s records, and I did not find the same winning strategies as he did. I do not know where the source of the error is, but I looked over my code a lot and cannot find errors in it.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));