R-bloggers: R news and tutorials contributed by hundreds of R bloggers

Data Science for Operational Excellence (Part-1)

Thu, 04/06/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

R has many powerful libraries for operations research. This exercise set demonstrates some basic functionality of R for linear programming.
Linear programming is a technique for optimizing a linear objective function subject to linear equality and inequality constraints.
The lpSolve package in R provides a set of functions to create models from scratch or to use prebuilt formulations such as the assignment and transportation problems.
Answers to the exercises are available here. If you obtained a different (correct) answer than those
listed on the solutions page, please feel free to post your answer as a comment on that page.
Please install and load the packages lpSolve and igraph before starting the exercises.


Exercise 1
Load packages lpSolve and igraph. Then, take a look at lp.assign to see how it works.

Exercise 2
Create a matrix representing the cost of assigning 4 tasks (rows) to 4 workers (cols) by generating random integers between 50 and 100, with replacement. To make this exercise reproducible, set the seed to 1234.

Exercise 3
Who should be assigned to each task to get all the work done at minimal cost?

Exercise 4
Based on the resource allocation plan, how much will we spend to get all this work done?
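For readers who get stuck, Exercises 1-4 can be approached along these lines. This is a minimal sketch, not the official solutions; the exact numbers depend on how the answer page draws its random sample.

```r
library(lpSolve)

# Exercise 2: 4x4 cost matrix, integers 50-100 with replacement, seed 1234
set.seed(1234)
cost <- matrix(sample(50:100, 16, replace = TRUE), nrow = 4)

# Exercises 3-4: solve the assignment problem and inspect the result
fit <- lp.assign(cost)
fit$solution  # binary matrix: which worker gets which task
fit$objval    # total cost of the optimal assignment
```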

Exercise 5
Take a look at lp.transport to see how it works. Set up the cost matrix by generating random integers between 0 and 1000, without replacement. Consider that there will be 8 factories (rows) serving 5 warehouses (cols).

Exercise 6
Set up the offer (supply) constraint by generating random integers between 50 and 300, without replacement.

Exercise 7
Set up the demand constraint by generating random integers between 100 and 500, without replacement.

Exercise 8
Find out which factory will not use all its capacity at the optimal cost solution.

Exercise 9
What is the cost associated with the optimal distribution?
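A hedged sketch for Exercises 5-9, again not the official solutions. Note that with purely random draws total supply is not guaranteed to cover total demand, in which case lp.transport reports an infeasible problem.

```r
library(lpSolve)

set.seed(1234)
cost   <- matrix(sample(0:1000, 40), nrow = 8)  # 8 factories x 5 warehouses
offer  <- sample(50:300, 8)                     # factory capacities
demand <- sample(100:500, 5)                    # warehouse demands

fit <- lp.transport(cost, "min",
                    row.signs = rep("<=", 8), row.rhs = offer,
                    col.signs = rep(">=", 5), col.rhs = demand)

fit$objval                            # total cost of the optimal distribution
which(rowSums(fit$solution) < offer)  # factories with spare capacity
```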

Exercise 10
Create an adjacency matrix from your solution in order to build a graph with the igraph package.

Related exercise sets:
  1. Data science for Doctors: Inferential Statistics Exercises(Part-4)
  2. Data Science for Doctors – Part 1 : Data Display
  3. Data Science for Doctors – Part 2 : Descriptive Statistics
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. offers daily e-mail updates about R news and tutorials on topics such as: data science, big data, R jobs, visualization (ggplot2, boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, Hadoop, web scraping), statistics (regression, PCA, time series, trading) and more...

Fitting a rational function in R using ordinary least-squares regression

Thu, 04/06/2017 - 18:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Srini Kumar, VP of Product Management and Data Science, LevaData; and Bob Horton, Senior Data Scientist, Microsoft

A rational function is defined as the ratio of two functions. The Padé Approximant uses a ratio of polynomials to approximate functions:

R(x)= \frac{\sum_{j=0}^m a_j x^j}{1+\sum_{k=1}^n b_k x^k}=\frac{a_0+a_1x+a_2x^2+\cdots+a_mx^m}{1+b_1 x+b_2x^2+\cdots+b_nx^n}
Here we show a way to fit this type of function to a set of data points using the lm function in R.

We’ll use a data-simulating function based on a ratio of polynomials to generate a set of y values for some random x values, add some noise, then see how well we can fit a curve to these noisy points.

set.seed(123)
N <- 1000
x <- rnorm(N)
f <- function(x) 50*x^2/(1 + 4*x) # data-simulating function
y <- f(x) + rnorm(N, sd = 3)
point_data <- data.frame(x, y)

library("tidyverse")
ggplot(point_data, aes(x = x, y = y)) +
  geom_point() +
  ylim(-100, 100) +
  ggtitle("simulated data points")

Note that in the formula for the Padé approximant the numerator polynomial has a zeroth-order coefficient (\(a_0\)), whereas the denominator has a constant term of \(1\). This makes it easy to rearrange the equation to express \(y\) as a linear combination of terms.

y = \frac{a_0 + a_1 x + a_2 x^2}{1 + b_1 x + b_2 x^2}
y (1 + b_1 x + b_2 x^2) = a_0 + a_1 x + a_2 x^2
y = a_0 + a_1 x + a_2 x^2 - b_1 x y - b_2 x^2 y

Now we have a form that lm can work with. We just need to specify a set of inputs that are powers of x (as in a traditional polynomial fit), and a set of inputs that are y times powers of x. This may seem like a strange thing to do, because we are making a model where we would need to know the value of y in order to predict y. But the trick here is that we will not try to use the fitted model to predict anything; we will just take the coefficients out and rearrange them in a function. The fit_pade function below takes a dataframe with x and y values, fits an lm model, and returns a function of x that uses the coefficients from the model to predict y:

fit_pade <- function(point_data){
  fit <- lm(y ~ x + I(x^2) + I(y*x) + I(y*x^2), point_data)
  lm_coef <- as.list(coef(fit))
  names(lm_coef) <- c("a0", paste0(rep(c('a', 'b'), each = 2), 1:2))
  with(lm_coef, function(x) (a0 + a1*x + a2*x^2)/(1 - b1*x - b2*x^2))
}

We'll use ggplot2 to visualize the results. Since it drops points outside the plotted range, it does not try to connect the points on either side of the discontinuity, provided the fitted values near the discontinuity fall outside those limits.

plot_fitted_function <- function(x_data, fitted_fun, title){
  x_data$y_hat <- fitted_fun(x_data$x)
  g <- ggplot(x_data, aes(x = x, y = y)) +
    geom_point() +
    ylim(-100, 100) +
    geom_line(aes(y = y_hat), col = "red", size = 1) +
    ggtitle(title)
  plot(g)
}

When we draw the fitted values over the input data points, we see that in this case the function describes the points quite well:

pade_approx <- fit_pade(point_data) plot_fitted_function(point_data, pade_approx, title="fitted function")

Here are some more examples, showing that this approach can work with some interesting patterns of points:

function_list <- list(
  function(x) (100 - 50*x - 100*x^2)/(1 - 10*x - 10*x^2),
  function(x) (100 - 50*x - 100*x^2)/(1 - 10*x - 5*x^2),
  function(x) (100 - 50*x - 100*x^2)/(1 - 50*x - 5*x^2)
)

for (f in function_list){
  sim_data <- point_data %>% mutate(y = f(x) + rnorm(nrow(point_data), sd = 3))
  plot_fitted_function(sim_data, fit_pade(sim_data), title = as.character(deparse(f))[2])
}

Here we have used linear regression by ordinary least squares (with lm) to fit distinctly nonlinear rational functions. For simplicity, these examples focus on equations of second order (or less) in both numerator and denominator, but the idea extends to higher orders. On a cautionary note, this approach seems to have numerical stability issues with some inputs. For example, if you take one of the data-simulating equations above and make selected coefficients much larger, you can create datasets that are fitted poorly by this method. And if the data-simulating function does not have the correct form (for example, if the zeroth order term in the denominator is not 1), the fitted curves can be completely wrong. For practical purposes it might be preferable to use a nonlinear least squares approach (e.g., the nls function). Still, this approach works well in many examples, and it lets you fit some curves that cannot be represented by ordinary polynomials.
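For comparison, here is a hedged nls sketch on a simulated dataset of my own choosing (the coefficients are picked so the denominator never hits zero, and the starting values are assumed to be roughly in the right neighbourhood):

```r
set.seed(123)
x <- rnorm(1000)
# simulated rational function plus noise; denominator 1 + 0.5*x^2 > 0 everywhere
y <- (100 - 50 * x) / (1 + 0.5 * x^2) + rnorm(1000, sd = 3)

# nonlinear least squares fit of the rational form directly
fit <- nls(y ~ (a0 + a1 * x) / (1 + b2 * x^2),
           start = list(a0 = 90, a1 = -40, b2 = 0.4))
coef(fit)  # estimates should land near a0 = 100, a1 = -50, b2 = 0.5
```

Unlike the lm trick, nls estimates the denominator coefficients directly, at the price of needing sensible starting values.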

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Statlearn17, Lyon

Thu, 04/06/2017 - 14:18

Today and tomorrow, I am attending the Statlearn17 conference in Lyon, France, a workshop with one-hour talks on statistics and machine learning, and the second machine learning workshop in two weeks! Yesterday there were two tutorials in R, but I only took the train to Lyon this morning. It will be a pleasant opportunity to run tomorrow through a city I have never truly visited, despite having passed it so many times driving to the Alps.

Filed under: Kids, pictures, R, Statistics, Travel, University life Tagged: Berlin, conference, France, French Alps, Lyon, machine learning, R, SFDS, Statlearn 2017, train, Université Lumière Lyon 2

Plotly charts in nteract notebooks using R

Thu, 04/06/2017 - 07:21

(This article was first published on R – Modern Data, and kindly contributed to R-bloggers)

The Plotly R package can now be used within nteract notebooks. Below are some examples. Visit the links below for installation-related information.


Ensure pandoc.exe and pandoc-citeproc.exe are available in the same folder where R.exe resides.

To leave a comment for the author, please follow the link and comment on their blog: R – Modern Data.

Tic Tac Toe War Games: The Intelligent Minimax Algorithm

Thu, 04/06/2017 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

In a previous post, I shared how to build a randomised Tic Tac Toe simulation. The computer plays against itself playing at random positions. In this post, I will share how to teach the computer to play the game strategically.

I love the 1983 classic movie War Games. In this film, a computer plays Tic Tac Toe against itself and learns that the game cannot be won, which prevents a nuclear war.

Back in those days, I devoured the wonderful book Writing Strategy Games on your Atari by John White which contains an algorithm to play Tic Tac Toe War Games. This is my attempt to relive the eighties using R.

You can find the code on my GitHub page.

Drawing the Board

A previous post describes the function that draws the Tic Tac Toe board. For completeness, the code is replicated below. The game board is a vector of length nine consisting of either -1 (X), 0 (empty field) or 1 (O). The vector indices correspond with locations on the game board:

1 2 3
4 5 6
7 8 9

draw.board <- function(board) { # Draw the board
  xo <- c("X", " ", "O") # Symbols
  par(mar = rep(0, 4))
  plot.window(xlim = c(0, 30), ylim = c(0, 30))
  abline(h = c(10, 20), col = "darkgrey", lwd = 4)
  abline(v = c(10, 20), col = "darkgrey", lwd = 4)
  pieces <- xo[board + 2]
  text(rep(c(5, 15, 25), 3), c(rep(25, 3), rep(15, 3), rep(5, 3)), pieces, cex = 6)
  # Identify location of any three in a row
  square <- t(matrix(board, nrow = 3))
  hor <- abs(rowSums(square))
  if (any(hor == 3)) hor <- (4 - which(hor == 3)) * 10 - 5 else hor <- 0
  ver <- abs(colSums(square))
  if (any(ver == 3)) ver <- which(ver == 3) * 10 - 5 else ver <- 0
  diag1 <- sum(diag(square))
  diag2 <- sum(diag(t(apply(square, 2, rev))))
  # Draw winning lines
  if (hor > 0) lines(c(0, 30), rep(hor, 2), lwd = 10, col = "red")
  if (ver > 0) lines(rep(ver, 2), c(0, 30), lwd = 10, col = "red")
  if (abs(diag1) == 3) lines(c(2, 28), c(28, 2), lwd = 10, col = "red")
  if (abs(diag2) == 3) lines(c(2, 28), c(2, 28), lwd = 10, col = "red")
}

Human Players

This second code snippet lets a human player move by clicking anywhere on the graphic display using the locator function. The click location is converted to a number to denote the position on the board. The entered field is only accepted if it has not yet been used (the empty variable contains the available fields).

# Human player enters a move
move.human <- function(game) {
  text(4, 0, "Click on screen to move", col = "grey", cex = .7)
  empty <- which(game == 0)
  move <- 0
  while (!move %in% empty) {
    coords <- locator(n = 1) # Read the click location
    coords$x <- floor(abs(coords$x) / 10) + 1
    coords$y <- floor(abs(coords$y) / 10) + 1
    move <- coords$x + 3 * (3 - coords$y)
  }
  return(move)
}

Evaluate the Game

This code snippet defines the function which assesses the current board and assigns a score. Zero means no outcome, -6 means that the X player has won and +6 implies that the O player has won.

# Evaluate the board position (the function name was lost in the feed; eval.game is used consistently below)
eval.game <- function(game, player) {
  # Determine game score
  square <- t(matrix(game, nrow = 3))
  hor <- rowSums(square)
  ver <- colSums(square)
  diag1 <- sum(diag(square))
  diag2 <- sum(diag(t(apply(square, 2, rev))))
  eval <- c(hor, ver, diag1, diag2)
  # Determine best score
  minimax <- ifelse(player == -1, "min", "max")
  best.score <-, list(eval))
  if (abs(best.score) == 3) best.score <- best.score * 2
  return(best.score)
}

Computer Moves

The computer uses a modified Minimax Algorithm to determine its next move. This article from the Never Stop Building blog and the video below explain this method in great detail.

The next function determines the computer’s move. I have not used a brute-force minimax algorithm, to save running time. I struggled to build a fully recursive minimax function; perhaps somebody can help me with this. This code looks only two steps ahead and contains a strategic rule to maximise the score.

The first row of the evaluation matrix stores the value of the player’s move; the remainder of the matrix holds the evaluations of all the opponent’s replies. The code adds a randomised variable based on the strategic value of a field. The centre has the highest value because it is part of four winning lines. Corners are part of three winning lines and the remaining fields of only two. This means that, all things being equal, the computer favours the centre over the corners and the other fields least. The randomised variables ensure that the computer does not always pick the same field in a similar situation.

# Determine the computer's move (function name lost in the feed; is used consistently below) <- function(game, player) {
  empty <- which(game == 0)
  eval <- matrix(nrow = 10, ncol = 9, data = 0)
  for (i in empty) {
    game.tmp <- game
    game.tmp[i] <- player
    eval[1, i] <- eval.game(game.tmp, player)
    empty.tmp <- which(game.tmp == 0)
    for (j in empty.tmp) {
      game.tmp1 <- game.tmp
      game.tmp1[j] <- -player
      eval[(j + 1), i] <- eval.game(game.tmp1, -player)
    }
  }
  if (!any(abs(eval[1, ]) == 6)) { # No immediate winning move: analyse opponent moves
    minimax <- ifelse(player == -1, "max", "min") # Minimax
    best.opponent <- apply(eval[-1, ], 1, minimax)
    eval[1, ] <- eval[1, ] * -player * best.opponent
  }
  # Add randomisation and strategic values
  board <- c(3, 2, 3, 2, 4, 2, 3, 2, 3) # Strategic values
  board <- sapply(board, function(x) runif(1, 0.1 * x, (0.1 * x) + 0.1)) # Randomise
  eval[1, empty] <- eval[1, empty] + player * board[empty] # Randomise moves
  # Pick the best game
  minimax <- ifelse(player == -1, "which.min", "which.max") # Minimax
  move <-, list(eval[1, ])) # Select best move
  return(move)
}

This last code snippet enables computers and humans to play each other or themselves. The players vector contains the identity of the two players so that a human can play a computer or vice versa. The human player moves by clicking on the screen.

The loop keeps running until the board is full or a winner has been identified. A previous Tic Tac Toe post explains the draw.board function.

# Main game engine
tic.tac.toe <- function(player1 = "human", player2 = "computer") {
  game <- rep(0, 9) # Empty board
  winner <- FALSE # Define winner
  player <- 1 # First player
  players <- c(player1, player2)
  draw.board(game)
  while (0 %in% game & !winner) { # Keep playing until win or full board
    if (players[(player + 3) %% 3] == "human") # Human player
      move <- move.human(game)
    else # Computer player
      move <-, player)
    game[move] <- player # Change board
    draw.board(game)
    winner <- max(eval.game(game, 1), abs(eval.game(game, -1))) == 6 # Winner, winner, chicken dinner?
    player <- -player # Change player
  }
}

You can play the computer by running all functions and then entering tic.tac.toe().

I am pretty certain this simplified minimax algorithm is unbeatable—why don’t you try to win and let me know when you do.

Tic Tac Toe War Games

Now that this problem is solved, I can finally recreate the epic scene from the WarGames movie. The Tic Tac Toe War Games code uses the functions explained above and the animation package. Unfortunately, there are not many opportunities to create sound in R.

# WAR GAMES TIC TAC TOE
source("Tic Tac Toe/Tic Tac Toe.R")

# Draw the game board
draw.board.wargames <- function(game) {
  xo <- c("X", " ", "O") # Symbols
  par(mar = rep(1, 4), bg = "#050811")
  plot.window(xlim = c(0, 30), ylim = c(0, 30))
  abline(h = c(10, 20), col = "#588fca", lwd = 20)
  abline(v = c(10, 20), col = "#588fca", lwd = 20)
  text(rep(c(5, 15, 25), 3), c(rep(25, 3), rep(15, 3), rep(5, 3)), xo[game + 2], cex = 20, col = "#588fca")
  text(1, 0, "", col = "#588fca", cex = 2)
  # Identify location of any three in a row
  square <- t(matrix(game, nrow = 3))
  hor <- abs(rowSums(square))
  if (any(hor == 3)) hor <- (4 - which(hor == 3)) * 10 - 5 else hor <- 0
  ver <- abs(colSums(square))
  if (any(ver == 3)) ver <- which(ver == 3) * 10 - 5 else ver <- 0
  diag1 <- sum(diag(square))
  diag2 <- sum(diag(t(apply(square, 2, rev))))
  # Draw winning lines
  if (all(hor > 0)) for (i in hor) lines(c(0, 30), rep(i, 2), lwd = 10, col = "#588fca")
  if (all(ver > 0)) for (i in ver) lines(rep(i, 2), c(0, 30), lwd = 10, col = "#588fca")
  if (abs(diag1) == 3) lines(c(2, 28), c(28, 2), lwd = 10, col = "#588fca")
  if (abs(diag2) == 3) lines(c(2, 28), c(2, 28), lwd = 10, col = "#588fca")
}

library(animation)
player <- -1
games <- 100
saveGIF({
  for (i in 1:games) {
    game <- rep(0, 9) # Empty board
    winner <- 0 # Define winner
    #draw.board.wargames(game)
    while (0 %in% game & !winner) { # Keep playing until win or full board
      empty <- which(game == 0)
      move <-, player)
      game[move] <- player
      if (i <= 12) draw.board.wargames(game)
      winner <- max(eval.game(game, 1), abs(eval.game(game, -1))) == 6
      player <- -player
    }
    if (i > 12) draw.board.wargames(game)
  }
}, interval = c(unlist(lapply(seq(1, 0, -.2), function(x) rep(x, 9))), rep(0, 9 * 94)), = "wargames.gif", ani.width = 1024, ani.height = 1024)

The post Tic Tac Toe War Games: The Intelligent Minimax Algorithm appeared first on The Devil is in the Data.

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.

The Run(chart)ing Man

Thu, 04/06/2017 - 01:00

(This article was first published on HighlandR, and kindly contributed to R-bloggers)

Measurement for improvement in healthcare with R and Qlik –

My not so secret obsession

Anyone in the small subset of the Earth’s population who both reads this blog AND follows me on Twitter will know that I spend a lot of time at work either producing or analysing run charts, which we use to determine whether our quality improvement work is genuinely helping us achieve our target outcomes.

A run chart is essentially a line chart, with an initial baseline median plotted and extended into the future. Then, as the improvement work continues, you continue to plot the data and look for certain patterns within the data which are statistically unlikely to occur by chance.
Here’s an example (minus the date axis and the y-scale):

Take a look at the link above and you will see graphics demonstrating the 4 rules (shift, trend, too few runs, extreme data point).

For a long time I’d been thinking about how to automate the process of identifying sustained runs of improvement (signified in this case using the IHI run chart rules of 9 consecutive points on the appropriate side of the median), and then calculating a new improvement median from these points.
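The shift rule itself is straightforward to express in base R. This is a hypothetical helper (find_shift is my name, not from any package) that flags nine or more consecutive points on one side of the baseline median, ignoring points that land exactly on it:

```r
# Hypothetical helper: flag a "shift" of run_length consecutive points
# on one side of the baseline median (points ON the median are skipped)
find_shift <- function(y, baseline_median, run_length = 9) {
  side <- sign(y - baseline_median)
  side <- side[side != 0]          # IHI rules: ignore points on the median
  runs <- rle(side)                # lengths of consecutive same-side runs
  any(runs$lengths >= run_length)
}

set.seed(1)
baseline <- rnorm(20, mean = 10)  # stable process
improved <- rnorm(12, mean = 6)   # sustained drop below the baseline median
find_shift(c(baseline, improved), median(baseline))  # flags the improvement
```

Re-phasing the median from the start of a flagged run is the harder part, but rle() also gives you the positions needed to do that.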

I’ve actually written SQL queries to automate most of this, including finding the first improvement median, but not automatically finding every re-phased run beyond that (e.g. phase 2, then phase 3, etc.) without resorting to recursion or looping, which, as in R, is frowned upon. However, unlike R, calculating medians when there is no native median function (SQL Server 2008 R2) is a bit of a challenge, along with finding trends and every run for comparison with the “too few runs” rule. I got there in the end, and it improved my SQL skills no end, but SQL is not the right tool for the job.

I should mention the fantastic qicharts package, which allows you to create virtually every chart you would require, with a host of options. I’ve used it a lot, and recommend it. It has some very robust rules built in to identify improvements. It will produce faceted plots, and allows you to specify different periods over which to calculate medians: so, if an improvement is noted, you can calculate a new median from that point onwards.

Because of the volume of metrics I have to track, it’s not possible (or at least, not desirable) for me to manually specify different improvement ranges for each set of metrics.

So, for the last wee while, I’ve been working away at automating all this using dplyr and ggplot2.

I wondered if this could be done in a one-shot, vectorised approach.
I googled. I considered data.table, which may very well be the ideal package to do the work.
I saw various other possible approaches and solutions. I obsessed over whether I could vectorise it, and if not, would it make me a Bad Person.

And then I stopped worrying about it and just cracked on with dplyr.

I wrote many functions.
I learned some basic R functions that I didn’t know existed (need to RTFM at some point).
I wrote complex base R that future me is going to hate me for, even with the comments.
I grappled with Non Standard Evaluation.

I laughed.
I cried.
I nearly hurled (see Non Standard Evaluation).

And finally, I came up with not one but 2 different R based approaches.

The first one is less dplyr intense and works for just one plot at a time – ideal for ad-hoc analysis and plotting. And it’s quick too:

I also produced a slightly more heavy duty approach for faceted plots within ggplot2.
For one of my metrics I now have run charts across 99 separate locations, all plotted nicely in A3 with re-phased medians.
I was literally punching the air the first time I saw this magic plot appear on my monitor.
From start to finish – the analysis and plotting/saving to pdf took around 20 secs. Not too shabby.

I’ve also been digging in to QlikView a bit more recently. Just over a week ago I sat down to see if I could crack this in QlikView. Here is my progress so far:

A simple run of 9 in a row, identified and highlighted:

These pesky points on the median gave me some grief. Ignoring them, in various combinations, was by far the hardest part to implement:

From my work with QlikView, I’m going to go back to my R code and see if I can make any more improvements. It’s great how different tools can inspire you to new ideas and looking at things a different way.

I’m not going to stick any code up just yet – but maybe might have a package in the works. I understand that NSE in dplyr is getting revised so I’m going to wait to see what transpires.

In the meantime though – for blazing fast interactive analysis on the fly for any combination within my data set – I can use QlikView.

And for static reports at scale, or simple ad-hoc analysis, it’s once again R, and the ggplot2/dplyr super-combo, that gets the job done.

To leave a comment for the author, please follow the link and comment on their blog: HighlandR.

Read elsewhere: Organizing DataFest the tidy way

Wed, 04/05/2017 - 18:40

Part of the reason why we have been somewhat silent at Citizen Statistician is that it’s DataFest season, and that means a few weeks (months?) of all-consuming organization followed by a weekend of super fun data immersion and exhaustion… Each year that I organize DataFest I tell myself “next year, I’ll do [blah] to make my life easier”. This year I finally did it! Read about how I’ve been streamlining the process of registrations, registration confirmations, and dissemination of information prior to the event in my post titled “Organizing DataFest the tidy way” on the R Views blog.

Stay tuned for an update on ASA DataFest 2017 once all 31 DataFests around the globe have concluded!

Shiny 1.0.1

Wed, 04/05/2017 - 17:28

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Shiny 1.0.1 is now available on CRAN! This release primarily includes bug fixes and minor new features.

The most notable additions in this version of Shiny are the introduction of the reactiveVal() function (like reactiveValues(), but it only stores a single value), and that the choices of radioButtons() and checkboxGroupInput() can now contain HTML content instead of just plain text. We’ve also added compatibility for the development version of ggplot2.

Breaking changes

We unintentionally introduced a minor breaking change in that checkboxGroupInput used to accept choices = NULL to create an empty input. With Shiny 1.0.1, this throws an error; using choices = character(0) works. We intend to eliminate this breakage in Shiny 1.0.2.

Also, the selected argument for radioButtons, checkboxGroupInput, and selectInput once upon a time accepted the name of a choice, instead of the value of a choice; this behavior has been deprecated with a warning for several years now, and in Shiny 1.0.1 it is no longer supported at all.

Storing single reactive values with reactiveVal

The reactiveValues object has been a part of Shiny since the earliest betas. It acts like a reactive version of an environment or named list, in that you can store and retrieve values using names:

rv <- reactiveValues(clicks = 0)

observeEvent(input$button, {
  currentValue <- rv$clicks
  rv$clicks <- currentValue + 1
})

If you only have a single value to store, though, it’s a little awkward that you have to use a data structure designed for multiple values.

With the new reactiveVal function, you can now create a reactive object for a single variable:

clicks <- reactiveVal(0)

observeEvent(input$button, {
  currentValue <- clicks()
  clicks(currentValue + 1)
})

As you can see in this example, you can read the value by calling it like a function with no arguments; and you set the value by calling it with one argument.

This has the added benefit that you can easily pass the clicks object to another function or module (no need to wrap it in a reactive()).

More flexible radioButtons and checkboxGroupInput

It’s now possible to create radio button and checkbox inputs with arbitrary HTML as labels. To do so, however, you need to pass different arguments to the functions. Now, when creating (or updating) either of radioButtons() or checkboxGroupInput(), you can specify the options in one of two (mutually exclusive) ways:

  • What we’ve always had:
    Use the choices argument, which must be a vector or list. The names of each element are displayed in the app UI as labels (i.e. what the user sees in your app), and the values are used for computation (i.e. the value is what’s returned by input$rd, where rd is a radioButtons() input). If the vector (or list) is unnamed, the values provided are used for both the UI labels and the server values.
  • What’s new and allows HTML:
    Use both the choiceNames and the choiceValues arguments, each of which must be an unnamed vector or list (and both must have the same length). The elements in choiceValues must still be plain text (these are the values used for computation). But the elements in choiceNames (the UI labels) can be constructed out of HTML, either using the HTML() function, or an HTML tag generation function, like tags$img() and icon().

Here’s an example app that demos the new functionality (in this case, we have a checkboxGroupInput() whose labels include the flag of the country they correspond to):
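Since the embedded demo app does not render in this feed, here is a minimal hedged sketch of the idea (the flag image filenames are placeholders, not assets from the original app):

```r
library(shiny)

# checkboxGroupInput with HTML labels via choiceNames/choiceValues
# (new in Shiny 1.0.1); the .png paths are hypothetical
countries <- checkboxGroupInput(
  "countries", "Select countries:",
  choiceNames = list(
    tagList(tags$img(src = "fr_flag.png", width = 20), "France"),
    tagList(tags$img(src = "de_flag.png", width = 20), "Germany")
  ),
  choiceValues = list("FR", "DE")
)
```

The server still sees only the plain choiceValues; here input$countries would return "FR" and/or "DE".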

ggplot2 > 2.2.1 compatibility

The development version of ggplot2 has some changes that break compatibility with earlier versions of Shiny. The fixes in Shiny 1.0.1 will allow it to work with any version of ggplot2.

A note on Shiny v1.0.0

In January of this year, we quietly released Shiny 1.0.0 to CRAN. A lot of work went into that release, but other than minor bug fixes and features, it was mostly laying the foundation for some important features that will arrive in the coming months. So if you’re wondering if you missed the blog post for Shiny 1.0.0, you didn’t.

Full changes

As always, you can view the full changelog for Shiny 1.0.1 (and 1.0.0!) in our file.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

JAGS tutorial at useR! 2017

Wed, 04/05/2017 - 17:15

(This article was first published on R – JAGS News, and kindly contributed to R-bloggers)

I am giving a pre-conference tutorial on JAGS at useR! 2017 in Brussels on 4 July. You can see the outline of the tutorial on the conference website along with the other tutorials being given the same day.

This is my first pre-conference tutorial. Last year I gave a three-day course on JAGS at the University of Zurich, so I have plenty of material. My main problem will be not over-filling the 3-hour time slot.

To leave a comment for the author, please follow the link and comment on their blog: R – JAGS News.

Forecasting Markets using eXtreme Gradient Boosting (XGBoost)

Wed, 04/05/2017 - 16:03

(This article was first published on R programming, and kindly contributed to R-bloggers)

By Milind Paradkar

In recent years, machine learning has been generating a lot of curiosity for its profitable applications in trading. Numerous machine learning models like linear/logistic regression, support vector machines, neural networks, tree-based models etc. are being tried and applied in an attempt to analyze and forecast the markets. Researchers have found that some models have a higher success rate than others. eXtreme Gradient Boosting, also called XGBoost, is one such machine learning model that has received rave reviews from machine learning practitioners.

In this post, we will cover the basics of XGBoost, a winning model in many Kaggle competitions. We then attempt to develop an XGBoost stock forecasting model using the “xgboost” package in R.

Basics of XGBoost and related concepts

Developed by Tianqi Chen, the eXtreme Gradient Boosting (XGBoost) model is an implementation of the gradient boosting framework. Gradient Boosting algorithm is a machine learning technique used for building predictive tree-based models. (Machine Learning: An Introduction to Decision Trees).

Boosting is an ensemble technique in which new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made.

The ensemble technique uses the tree ensemble model which is a set of classification and regression trees (CART). The ensemble approach is used because a single CART, usually, does not have a strong predictive power. By using a set of CART (i.e. a tree ensemble model) a sum of the predictions of multiple trees is considered.

Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction.

The objective of the XGBoost model is given as:

Obj = L + Ω

where:

L is the loss function, which controls the predictive power, and
Ω is the regularization component, which controls simplicity and overfitting

The loss function (L) which needs to be optimized can be Root Mean Squared Error for regression, Logloss for binary classification, or mlogloss for multi-class classification.

The regularization component (Ω) is dependent on the number of leaves and the prediction score assigned to the leaves in the tree ensemble model.

It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models. The Gradient boosting algorithm supports both regression and classification predictive modeling problems.
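To make the residual-fitting idea concrete, here is a toy gradient boosting loop in R using rpart stumps. This is an illustrative sketch of the general technique (all variable names are ours), not the xgboost implementation:

```r
# Toy gradient boosting for regression: each new stump is fit to the
# residuals of the current ensemble, and predictions are added up.
library(rpart)

set.seed(42)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)
df <- data.frame(x = x, y = y)

pred <- rep(mean(y), nrow(df))  # start from a constant model
eta  <- 0.3                     # learning rate

for (i in 1:50) {
  df$resid <- y - pred          # errors of the current ensemble
  stump <- rpart(resid ~ x, data = df,
                 control = rpart.control(maxdepth = 1))
  pred  <- pred + eta * predict(stump, df)  # additive update
}

mean((y - pred)^2)  # training MSE decreases as stumps are added
```

Each iteration moves the predictions in the direction of the negative gradient of the loss, which for squared error is simply the residuals.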

Sample XGBoost model:

We will use the “xgboost” R package to create a sample XGBoost model. You can refer to the documentation of the “xgboost” package here.

Install and load the xgboost library –

We install the xgboost library using the install.packages function. To load this package we use the library function. We also load other relevant packages required to run the code.

install.packages("xgboost")
# Load the relevant libraries
library(quantmod); library(TTR); library(xgboost)


Create the input features and target variable – We take the 5-year OHLC and volume data of a stock and compute the technical indicators (input features) using this dataset. The indicators computed include Relative Strength Index (RSI), Average Directional Index (ADX), and Parabolic SAR (SAR). We create a lag in the computed indicators to avoid the look-ahead bias. This gives us our input features for building the XGBoost model. Since this is a sample model, we have included only a few indicators to build our set of input features.

# Read the stock data
symbol = "ACC"
fileName = paste(getwd(), "/", symbol, ".csv", sep = "")
df = read.csv(fileName)
colnames(df) = c("Date", "Time", "Close", "High", "Low", "Open", "Volume")
# Define the technical indicators to build the model
rsi = RSI(df$Close, n = 14, maType = "WMA")
adx = data.frame(ADX(df[, c("High", "Low", "Close")]))
sar = SAR(df[, c("High", "Low")], accel = c(0.02, 0.2))
trend = df$Close - sar
# Create a lag in the technical indicators to avoid look-ahead bias
rsi = c(NA, head(rsi, -1))
adx$ADX = c(NA, head(adx$ADX, -1))
trend = c(NA, head(trend, -1))

Our objective is to predict the direction of the daily stock price change (Up/Down) using these input features. This makes it a binary classification problem. We compute the daily price change and assign a value of 1 if the change is positive, and 0 if it is negative.

# Create the target variable
price = df$Close - df$Open
class = ifelse(price > 0, 1, 0)


Combine the input features into a matrix – The input features and the target variable created in the above step are combined to form a single matrix. We use the matrix structure in the XGBoost model since the xgboost library allows data in the matrix format.

# Create a matrix of the input features and the target variable
model_df = data.frame(class, rsi, adx$ADX, trend)
model = matrix(c(class, rsi, adx$ADX, trend), nrow = length(class))
model = na.omit(model)
colnames(model) = c("class", "rsi", "adx", "trend")


Split the dataset into training data and test data – In the next step, we split the dataset into training and test data. Using this training and test dataset we create the respective input features set and the target variable.

# Split data into train and test sets
train_size = 2/3
breakpoint = floor(nrow(model) * train_size)
training_data = model[1:breakpoint, ]
test_data = model[(breakpoint + 1):nrow(model), ]
# Split the training and test data into input features (X) and target (Y)
X_train = training_data[, 2:4]; Y_train = training_data[, 1]
class(X_train)[1]; class(Y_train)
X_test = test_data[, 2:4]; Y_test = test_data[, 1]
class(X_test)[1]; class(Y_test)


Train the XGBoost model on the training dataset –

We use the xgboost function to train the model. Its main arguments are described below.

The data argument in the xgboost function is for the input features dataset. It accepts a matrix, a dgCMatrix, or a local data file. The nrounds argument is the maximum number of boosting iterations (i.e. the number of trees added to the model). The obj argument is for a customized objective function, which returns the gradient and second-order gradient for a given prediction and dtrain.

# Train the xgboost model using the "xgboost" function
dtrain = xgb.DMatrix(data = X_train, label = Y_train)
xgModel = xgboost(data = dtrain, nround = 5, objective = "binary:logistic")


Output – The output is the classification error on the training data set.


We can also use the cross-validation function of xgboost, xgb.cv. In this case, the original sample is randomly partitioned into nfold equal size subsamples. Of the nfold subsamples, a single subsample is retained as the validation data for testing the model, and the remaining (nfold – 1) subsamples are used as training data. The cross-validation process is then repeated nrounds times, with each of the nfold subsamples used exactly once as the validation data.

# Using cross validation
dtrain = xgb.DMatrix(data = X_train, label = Y_train)
cv = xgb.cv(data = dtrain, nround = 10, nfold = 5, objective = "binary:logistic")


Output – The xgb.cv function returns a data.table object containing the cross validation results.

Make predictions on the test data

To make predictions on the unseen data set (i.e. the test data), we apply the trained XGBoost model on it which gives a series of numbers.

# Make the predictions on the test data
preds = predict(xgModel, X_test)
# Determine the size of the prediction vector
print(length(preds))
# Limit display of predictions to the first 6
print(head(preds))



These numbers do not look like binary classification {0, 1}. We therefore have to perform a simple transformation before we can use these results. In the example code shown below, we compare each predicted number to the threshold of 0.5. The threshold value can be changed depending upon the objective of the modeler and the metrics (e.g. F1 score, precision, recall) that the modeler wants to track and optimize.

prediction = as.numeric(preds > 0.5)
print(head(prediction))



Measuring model performance

Different evaluation metrics can be used to measure the model performance. In our example, we will compute a simple metric, the average error. It compares the predicted score with the threshold of 0.50.

For example: If the predicted score is less than 0.50, then the (preds > 0.5) expression gives a value of 0. If this value is not equal to the actual result from the test data set, then it is taken as a wrong result.

We compare all the preds with the respective data points in the Y_test and compute the average error. The code for measuring the performance is given below. Alternatively, we can use hit rate or create a confusion matrix to measure the model performance.

# Measuring model performance
error_value = mean(as.numeric(preds > 0.5) != Y_test)
print(paste("test-error=", error_value))



Plot the feature importance set – We can find the top important features in the model by using the xgb.importance function.

# View feature importance from the learnt model
importance_matrix = xgb.importance(model = xgModel)
print(importance_matrix)


Plot the XGBoost Trees

Finally, we can plot the XGBoost trees using the xgb.plot.tree function. To limit the plot to a specific number of trees, we can use the n_first_tree argument. If NULL, all trees of the model are plotted.

# View the trees from a model
xgb.plot.tree(model = xgModel)
# View only the first tree in the XGBoost model
xgb.plot.tree(model = xgModel, n_first_tree = 1)



This post covered the popular XGBoost model along with a sample code in R programming to forecast the daily direction of the stock price change. Readers can catch some of our previous machine learning blogs (links given below). We will be covering more machine learning concepts and techniques in our coming posts.

Predictive Modeling in R for Algorithmic Trading
Machine Learning and Its Application in Forex Markets

Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to be a successful trader. Enroll now!

The post Forecasting Markets using eXtreme Gradient Boosting (XGBoost) appeared first on .

To leave a comment for the author, please follow the link and comment on their blog: R programming.

Renthop: predicting interest level for rental listings

Wed, 04/05/2017 - 10:13

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Renthop is a New York based apartment listing website. It uses an innovative algorithm to sort and present the listings to its visitors. As part of our third project at NYCDSA, we participated in Kaggle’s Two Sigma Connect: Rental Listing Inquiries machine learning competition. Our challenge was to predict the interest level in an apartment rental listing. We had 49,000 labeled entries and made predictions on 74,000. Target labels were high, medium and low.

The scope of this problem represents a typical machine learning challenge faced by many organizations. The data consisted of numerical, continuous, ordinal, and categorical variables, along with pictures, descriptions of the units, and geospatial information. The distribution of the classes was also imbalanced. Considering all of the above, it’s safe to conclude that this problem is a much better proxy for real-world situations in which data science can and must be employed, a somewhat uncommon characterization for Kaggle competitions, which tend to supply cleaner data and focus mainly on predictive models.

Feature Engineering

Our initial approach to this problem was to slice and dice the given variables in a variety of ways so that we could capture the information contained within them. The almost readily available ones, which didn’t require much treatment, are the following:

  • Price (continuous): Log transformation
  • Bedrooms (integer): Unchanged
  • Bathrooms (integer): Unchanged
  • Price/Bedrooms and Price/Bathrooms: Used by adding 1 to the denominator and taking the logarithm
  • Latitude and Longitude (continuous): After an initial analysis, it was determined that, although not all coordinates pointed to the New York City area, the ones that didn’t were usually incorrect and in fact referred to listings from NYC. Thus, we forced all coordinates into a rectangular area around the city.
  • Photos (list of urls): As a first approach, used simply the number of photos for each listing.
  • Description and Features (text, list): As a first approach, extracted simple metrics such as the length and number of words.
  • Listing creation date/time: Extracted features such as the hour of the day, day of the week and of the month.
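A hedged sketch of these transformations in base R, assuming a data frame `listings` with the named columns (all column names and the NYC bounding box below are illustrative, not the project's exact values):

```r
# Illustrative versions of the feature transformations listed above
listings$log_price      <- log(listings$price)
listings$price_per_bed  <- log(listings$price / (listings$bedrooms + 1))
listings$price_per_bath <- log(listings$price / (listings$bathrooms + 1))

# Snap stray coordinates into a rectangle around New York City
listings$latitude  <- pmin(pmax(listings$latitude,  40.4), 41.0)
listings$longitude <- pmin(pmax(listings$longitude, -74.3), -73.6)

# Simple text metrics
listings$desc_chars <- nchar(listings$description)
listings$desc_words <- lengths(strsplit(listings$description, "\\s+"))

# Time-based features from the listing creation timestamp
created <- as.POSIXct(listings$created)
listings$hour <- as.integer(format(created, "%H"))
listings$mday <- as.integer(format(created, "%d"))
listings$wday <- as.integer(format(created, "%u"))
```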

These initial features were already very informative, in that it was possible to obtain cross-entropy losses below 0.6 (the top score is still above 0.500 as we write this article) with certain models. Of these features, the most significant was found to be the price.
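The cross-entropy (multi-class log loss) metric referenced here can be computed directly; a small sketch (the function and argument names are ours):

```r
# Multi-class log loss: actual holds integer labels 1..K,
# probs is an n x K matrix of predicted class probabilities
mlogloss <- function(actual, probs, eps = 1e-15) {
  probs <- pmin(pmax(probs, eps), 1 - eps)  # clip to avoid log(0)
  -mean(log(probs[cbind(seq_along(actual), actual)]))
}

# e.g. three listings; classes: 1 = high, 2 = medium, 3 = low
p <- matrix(c(0.7, 0.2, 0.1,
              0.1, 0.6, 0.3,
              0.2, 0.2, 0.6), nrow = 3, byrow = TRUE)
mlogloss(c(1, 2, 3), p)  # ~0.46
```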

Density plot of prices for different interest levels. Price is on a logarithmic scale.

The listing creation hour can also tell us something about the interest level.

Interest levels by listing creation hour

Other Categorical Variables

An approach sometimes employed on categorical variables consists of encoding them (transforming them to integer values) and turning them into dummy variables. The predictors in question are Manager ID, Building ID, Street Address and Display Address.
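For instance, with base R (the IDs below are made up):

```r
# One-hot (dummy) encoding of a categorical predictor
mgr <- factor(c("a12", "b07", "a12", "c33"))
dummies <- model.matrix(~ mgr - 1)  # the -1 drops the intercept,
head(dummies)                       # giving one 0/1 column per level
```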

Listing Features

When it comes to using apartment features as a predictor, we had to start by taking a good look at the data. In our raw data, the listing features were either presented in a very structured way, such as [Elevator, Laundry in Building, Hardwood Floors], or in a very unclean way, such as [** LIFE OF LUXURY FOR NO FEE! * SPRAWLING 2BR/2BA MANSION * WALLS OF WINDOWS * ...].

Having this, we determined the best approach to be capturing features by using regular expressions, while being careful to check that all matches were relevant to the intended feature. We extracted 56 different listing features in total. The code that achieves this can be found here.
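As an illustration of the regular-expression approach (the patterns below are ours, not the project's exact ones):

```r
# Messy raw listing features
feats <- c("Hardwood Floors",
           "** LIFE OF LUXURY FOR NO FEE! * hardwood flrs **",
           "Elevator, Laundry in Building")

# Robust matching with case-insensitive regular expressions
has_hardwood <- grepl("hardwood", feats, ignore.case = TRUE)
has_no_fee   <- grepl("no\\s*fee", feats, ignore.case = TRUE)
has_elevator <- grepl("elevator",  feats, ignore.case = TRUE)

has_hardwood  # TRUE  TRUE FALSE
```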

Out of the features extracted, the ones that seem to have the greatest impact in terms of the interest levels, taking into consideration the amount of listings they affect, are Hardwood Floors and No Fee.

Distribution of interest levels with respect to hardwood floors listings

Distribution of interest levels with respect to no fee listings


Photos

After obtaining all the photos, taking over 80GB of storage space in total, we decided to extract a few useful values from them.

First, we looked at their dimensions and included them as predictors in our models (namely, average dimensions per listing, as well as the dimensions of the first photo).

We also extracted the sharpness of each image, which led to a somewhat surprising conclusion: contrary to our initial assumptions, high interest listings have slightly less sharp photos than low interest listings.

Average photo sharpness per listing for each interest level

We also used the Clarifai API to extract the top 15 labels from each image. There, convolutional neural networks are used to learn features present in the image, and a probability estimate is given for each extracted label. Though the image features showed some positive significance under gradient boosting relative influence, during training and testing they showed very little sign of improving the model.

Finally, in order to actually make use of the photo contents, we resized all photos to 100×100 squares, to be fed into a convolutional neural network model (more details on that below).

Sentiment Analysis from Description (NLP – Natural Language Processing)

Syuzhet is an R package for the extraction of sentiment and sentiment-based plot arcs from text.

The name “Syuzhet” comes from the Russian Formalists Victor Shklovsky and Vladimir Propp who divided narrative into two components, the “fabula” and the “syuzhet.” Syuzhet refers to the “device” or technique of a narrative whereas fabula is the chronological order of events. Syuzhet, therefore, is concerned with the manner in which the elements of the story (fabula) are organized (syuzhet).

The Syuzhet package attempts to reveal the latent structure of narrative by means of sentiment analysis. Instead of detecting shifts in the topic or subject matter of the narrative (as Ben Schmidt has done), the Syuzhet package reveals the emotional shifts that serve as proxies for the narrative movement between conflict and conflict resolution.

  • Afinn – developed by Finn Arup Nielsen as the AFINN WORD DATABASE
  • Bing – developed by Minqing Hu and Bing Liu as the OPINION LEXICON
  • NRC – developed by Mohammad, Saif M. and Turney, Peter D. as the NRC EMOTION LEXICON
Structure the Unstructured

Syuzhet is concerned with the linear progression of narrative from beginning (first page) to the end (last page), whereas fabula is concerned with the specific events of a story, events which may or may not be related in chronological order … When we study the syuzhet, we are not so much concerned with the order of the fictional events but specifically interested in the manner in which the author presents those events to readers.

What is Sentiment Analysis?

Opinion mining, or sentiment analysis, is the computational study of opinions, sentiments, subjectivity, evaluations, attitudes, appraisals, affects, views, emotions, etc., expressed in text such as reviews, blogs, discussions, news, comments, and feedback. Opinions are key influencers of our behavior: our beliefs and perceptions of reality are conditioned on how others see the world, and when we need to make a decision, we often seek out the opinions of others. In the past, individuals sought opinions from friends and family, while organizations used surveys, focus groups, opinion polls, and consultants.

Using NLP for Feature Extraction

We used the syuzhet package to take the descriptions from the Kaggle dataset and produce a scoring mechanism that valued each description on a range of emotions:

anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

In addition, we composed features for these datasets based on valence, classified as either negative or positive.
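A minimal example of this scoring with the syuzhet package (the descriptions below are made up):

```r
library(syuzhet)

desc <- c("Beautiful sunny apartment with amazing views, steps from the park!",
          "Dark basement unit, broken windows, noisy street.")

# One row per description; columns include the eight NRC emotions
# plus the negative and positive valence counts
nrc <- get_nrc_sentiment(desc)
nrc[, c("joy", "fear", "negative", "positive")]
```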

Models Implemented

Several machine learning models were implemented, including XGBoost and other decision trees based models, as well as neural networks.

Neural Networks

Since more than 99% of the data made available to us (in terms of size) was in the form of photos, we had to make use of it in our predictions. The challenge here is to extract from them more features than the readily available ones: the number of photos per listing and their dimensions.

We identified four possible approaches (not mutually exclusive):

  • Feature Engineering: extract some manually chosen statistics, such as brightness, sharpness, contrast, etc.
  • Train a separate convolutional neural network model that classifies the images based on what they show. Possible categories could include kitchen, bathroom, floor plan, fitness center, street view, etc.
  • Train a separate model with only the photos, then feed the results to our main model.
  • Extract anonymous features by training on a unified model.

Initially, as a preparation step to implementing the last approach, we trained a simple neural network model with four dense layers, taking as input the basic listing features already provided to us with just a few tweaks, and gradually developed our feature engineering and saw our predictions improve.

Simple neural network model

Ultimately, the model yielding the best results, achieving a Kaggle score of 0.58854, used the following feature transformations:

  • Coordinates (longitude/latitude) outside the New York City area snapped to a rectangle around it,
  • Logarithm of price, price per bedroom and price per bathroom,
  • Count of words/characters in description, words and quantity of apartment features,
  • Time of day, day of month, day of week,
  • Sentiment analysis,
  • Parsed apartment features as 56 dummy variables,
  • Dimensions and sharpness (log) of first photo, as well as average dimension and sharpness (log) of all the photos per listing,
  • Manager id: encoded the top 999 managers in terms of the amount of listings, mapping all the remaining ones to a common category, and then applied an embedding to 10 activations. Similar to generating 1000 dummy variables and then applying a dense layer with 10 activations.

Neural network embedding for Manager ID
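The lumping step before the embedding can be sketched in base R (assuming a `manager_id` column; the cutoff of 999 is the one described above):

```r
# Keep the 999 managers with the most listings; lump the rest together
counts   <- sort(table(listings$manager_id), decreasing = TRUE)
top_mgrs <- names(counts)[seq_len(min(999, length(counts)))]

listings$manager_cat <- ifelse(listings$manager_id %in% top_mgrs,
                               as.character(listings$manager_id), "other")

# Integer index (1..1000) to feed into the 10-activation embedding layer
mgr_index <- as.integer(factor(listings$manager_cat))
```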

When it came to using the actual photos, we were faced with a few obstacles. First, the fact that they had many different sizes and aspect ratios. Second, that each listing had a different number of photos. And finally, storage restrictions on the GPU servers we used. By resizing all of them to 100×100 square thumbnails, we were able to squeeze them into about 2.3GB of storage. To deal with the fact that different listings had different numbers of photos, and also given the memory space inflated photos take as inputs to a convolutional network (100x100x3x4 ~ 120KB), we decided to take only the first photo as a first approach, and to use only 20 thousand training samples.

Convolutional neural network (layers refers to the commonly used term channels in this context)

After training for a few hours, the results from the validation set didn’t seem very promising. Given that good training data is key to a good predictor, we decided to extract the activations from the convolutional part of the network, freezing its (trained) weights and therefore no longer requiring the photos for training, and to expand training, up to the validation data, to the entire training set. This did give us an improvement but, oddly, not enough to beat our previous Kaggle score that didn’t use the photos as input. This seems to go against the simple intuitive idea that, given more data and freedom (weights), a model should be able to make better predictions if properly trained. Truth be told, almost no tuning went into perfecting our model, especially when it comes to its architecture, so there’s certainly a lot of room for improvement. We did use validation loss to drive the learning rate decay, as well as early stopping and the choice of weights to be used for the final prediction.

Looking forward, potential improvements could be obtained by training with photos on all listings by loading them as the training occurs (already implemented), by considering all the photos from each listing, and possibly by having a model train exclusively on the photos (with its activations used as an input to another model). Also, extracting more information from the descriptions, applying some transformations to the coordinates, and having shifted versions of the time-based features (as they are all periodic) could all lead to better predictions.


These neural networks were implemented using the Keras framework over the Theano backend. All the code can be found on our project’s github.

In order to have a highly automated and controlled environment for our features, where we ensure that training and test data go through the same transformations from raw data to becoming inputs for our neural networks, we developed a preprocessing framework, with many possible transformations, that delivers on that promise. After setting up the preprocessor, with all the different pipelines for the different types of data processed, getting the data is as simple as calling load_and_transform(test), with test being False for train and True for test data.

With the generator module, this framework was extended so that data loading/generation runs in parallel as the network is being trained. This is a critical functionality when all the photo data, once loaded, would exceed the memory capacity of our system.

The post Renthop: predicting interest level for rental listings appeared first on NYC Data Science Academy Blog.

To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.

Discrete Event Simulation in R (and, Why R Is Different)

Wed, 04/05/2017 - 09:19

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

I was pleased to see the announcement yesterday of simmer 3.6.1, a discrete-event simulation (DES) package for R. I’ve long had an interest in DES, and as I will explain below, implementing DES in R brings up interesting issues about R that transcend the field of DES. I had been planning to discuss them in the context of my own DES package for R, and the above announcement makes a good springboard for that.

First, what is DES? It is simulation of stochastic processes with discrete state space. Consider for instance the classic M/M/1 queue: exponential interarrival and service times, with a single server. The state space can be defined as the queue length, which is integer-valued and thus “discrete.” This contrasts to, say, simulating a weather system, where state such as temperature is continuous.

The key component of such a simulator, no matter which programming world view the software takes (see below), is the event list. At any simulated time t, the event list records all the events that are supposed to happen after time t. In the M/M/1 queue example, for instance, at time 168.0 there might be a service completion scheduled for time 169.1, and an arrival for time 182.2.

The main loop of a DES program then consists of:

  • Remove the earliest event e in the event list; e will be represented as some kind of data structure, including event time, event type and so on.
  • Update the current simulated time to the time in e.
  • Simulate the execution of e.

The looping continues until the user-specified maximum simulation time is exceeded.

The M/M/1 case is simple (not to mention having a closed-form theoretical solution), but in complex systems the event list can be quite large, say thousands of events. In such cases, good performance means executing the first of the above three bulleted items efficiently. Classically, there has been much research on this (including theoretical models using renewal theory and the like). But what about doing this in R?

The simmer package handles this by…NOT doing it in R. If one needs the performance, this is the obvious route to take. (This is not quite true in a certain sense, also explained below.)

I developed my DES package entirely in R, mainly because I aimed it only as proof-of-concept. It’s been around for a few years, first on my own Web page then more recently on GitHub. I did it for use by my students, and because it seemed that periodically there have been questions on r-help along the lines of “Is there a DES package available for R?”

Most algorithms for handling event lists use some kind of priority queue, implemented as a binary tree. Since R lacks pointers, it is not easy to develop such a thing, much less do it efficiently. So I just chose to implement the event list in DES as a straight R vector, maintained in sorted order. But when a new event is inserted, a time-consuming operation ensues, due to the need to keep the event list in ascending order. Originally, I implemented this insertion via binary search.

Later I realized that this was anti-R. I knew it would be slow, of course, but didn’t think much about alternatives. Then later it occurred to me:

  • Just add new events at the tail end.
  • Don’t keep the event list in sorted order.
  • In the first bullet of the event loop above, simply find the earliest event by calling R’s which.min().

True, which.min() does an inefficient linear search. But it does it at C (not sea) level! Unless the event list is larger than any one I know of in practice, this will be a win.
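In miniature, the unsorted-event-list approach looks like this (an illustrative sketch, not the DES package's actual code; the event times echo the M/M/1 example above):

```r
# Event list as a plain matrix: one row per event (time, event type)
evnts <- matrix(c(169.1, 1,
                  182.2, 2), ncol = 2, byrow = TRUE)

# Scheduling: just append at the tail -- no sorted insertion
evnts <- rbind(evnts, c(175.0, 1))

# Extracting the earliest event: a linear scan, but at C level
i <- which.min(evnts[, 1])
e <- evnts[i, ]                       # c(169.1, 1)
evnts <- evnts[-i, , drop = FALSE]    # remove it from the list
```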

Now, what about my pure-R DES package vs. simmer, which does its core operations in C++? The simmer package ought to be faster, right? Actually, yes, but one must first understand that the two are not directly comparable, as follows.

There are two main programming paradigms (“world views”) in DES. Let’s illustrate that with M/M/1:

  • Event-oriented: Here the code explicitly recognizes how one event triggers others. For a job arrival in M/M/1, the code reacting to that arrival will see whether to add the job to the server queue, vs. starting it now if the queue is empty, and that same code will schedule the next job arrival.
  • Process-oriented. Here each entity more or less “minds its own business,” with fewer lines of code that explicitly show interactions between entities. In M/M/1 we might have an arrival process function and a server process function. The latter function might watch the queue continually to see when it becomes nonempty, and the former function might add the newly-arriving job to the queue, but NOT start service for the job in the case of an empty queue, as would be the case for the event-oriented approach.

The pros and cons are: The event-oriented approach is much easier to implement, but arguably less clear. The process-oriented approach requires threading of some kind (not necessarily the “classical” kind), but most people, including me, consider it clearer and more elegant.

The simmer package is process-oriented, and in fact is modeled on SimPy, a popular DES library written in Python. I’m a big fan of SimPy, which is another reason why I like simmer.

HOWEVER, the process-oriented approach, ceteris paribus, tends to be slow. This is due to various reasons related to the threading, but at any rate, the event-oriented approach, for all its inelegance, does tend to excel somewhat in this regard.

My DES package is event-oriented. So, I was curious which package would be faster, the pure-R event-oriented code or the R-calls-C++ process-oriented code. So I did a small experiment. (Disclaimer: Not claimed to generalize.) Both packages include M/M/1 examples, so I ran them for various values of the mean interarrival and service rates. I won’t enumerate the results here, but generally the C++/process-oriented runs were about 25-50% faster than the R/event-oriented ones.

There may be many issues here. For instance, DES’ deletion of an event involves code like

simlist$evnts <- simlist$evnts[-i,,drop=F]

This may involve reallocating memory for the entire matrix.

I will leave all this as an exercise for the reader.


To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.

Understanding the Tucker decomposition, and compressing tensor-valued data (with R code)

Wed, 04/05/2017 - 07:50

In many applications, data naturally form an n-way tensor with n > 2, rather than a “tidy” table.
As mentioned in the beginning of my last blog post, a tensor is essentially a multi-dimensional array:

  • a tensor of order one is a vector, which simply is a column of numbers,
  • a tensor of order two is a matrix, which is basically numbers arranged in a rectangle,
  • a tensor of order three looks like numbers arranged in rectangular box (or a cube, if all modes have the same dimension),
  • an nth order (or n-way) tensor looks like numbers arranged in an n-hyperrectangle… you get the idea…

In this post I introduce the Tucker decomposition (Tucker (1966) “Some mathematical notes on three-mode factor analysis”). The Tucker decomposition family includes methods such as

  1. the higher-order SVD, or HOSVD, which is a generalization of the matrix SVD to tensors (De Lathauwer, De Moor, and Vanderwalle (2000) “A multilinear singular value decomposition”),
  2. the higher order orthogonal iteration, or HOOI, which delivers the best approximation to a given tensor by another tensor with prescribed mode-1 rank, mode-2 rank, etc. (De Lathauwer, De Moor, and Vanderwalle (2000) “On the Best Rank-1 and Rank-(R1,R2,…,RN) Approximation of Higher-Order Tensors”).

I introduce both approaches, and in order to demonstrate the usefulness of these concepts, I present a simple data compression example using The World Bank’s World Development Indicators dataset (though I use the version available on Kaggle).

However, before we can get started with the decompositions, we need to look at and understand the k-mode tensor product.

Throughout this post, I will also introduce the R functions from the package rTensor, which can be used to perform all of the presented computations.

Tensor times matrix: the k-mode product

The $k$-mode product of a tensor $X \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ with a matrix $A \in \mathbb{R}^{J \times I_k}$ is written as

Y = X \times_k A.

The resulting tensor $Y$ is of size $I_1 \times \ldots \times I_{k-1} \times J \times I_{k+1} \times \ldots \times I_N$, and contains the elements

y_{i_1 \cdots i_{k-1} \, j \, i_{k+1} \cdots i_N} = \sum_{i_k = 1}^{I_k} x_{i_1 i_2 \cdots i_N} \, a_{j i_k}.

It can be hard, at first, to understand what that definition really means, or to visualize it in your mind. I find that it becomes easier once you realize that the k-mode product amounts to multiplying each mode-k fiber of $X$ by the matrix $A$.

We can demonstrate that in R:

library(rTensor)
tnsr <- as.tensor(array(1:12, dim = c(2, 2, 3)))
mat <- matrix(1:6, 3, 2)
# 1-mode product performed via the function ttm in rTensor
tnsr_times_mat <- ttm(tnsr = tnsr, mat = mat, m = 1)

Now, for example, the first slice of tnsr_times_mat is the same as the matrix product of mat with the first slice of tnsr:

tnsr_times_mat@data[ , , 1]
#      [,1] [,2]
# [1,]    9   19
# [2,]   12   26
# [3,]   15   33
mat %*% as.matrix(tnsr@data[ , , 1])
#      [,1] [,2]
# [1,]    9   19
# [2,]   12   26
# [3,]   15   33

You might want to play around some more with the function ttm in R to get a better understanding of the k-mode product.

A few important facts about the k-mode product:

  • $X \times_m A \times_n B = X \times_n B \times_m A$ if $n \neq m$,
  • but $X \times_n A \times_n B = X \times_n (BA)$ (in general $\neq X \times_n B \times_n A$).
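To make these identities concrete, here is a small base-R implementation of the $k$-mode product (no rTensor required, just aperm and matrix multiplication on plain arrays; the helper name kmode_prod is mine), followed by a numerical check of both facts:

```r
# Permute mode k to the front, multiply every mode-k fiber by A, fold back.
kmode_prod <- function(X, A, k) {
  dims <- dim(X)
  perm <- c(k, seq_along(dims)[-k])
  Xk <- matrix(aperm(X, perm), nrow = dims[k])   # mode-k fibers as columns
  new_dims <- dims
  new_dims[k] <- nrow(A)
  aperm(array(A %*% Xk, dim = new_dims[perm]), order(perm))
}

# Reproduces the ttm example above:
tnsr <- array(1:12, dim = c(2, 2, 3))
mat  <- matrix(1:6, 3, 2)
kmode_prod(tnsr, mat, 1)[ , , 1]   # same as mat %*% tnsr[ , , 1]

# Fact 1: products along different modes commute.
set.seed(1)
X <- array(rnorm(24), dim = c(2, 3, 4))
A <- matrix(rnorm(6), 2, 3)   # acts on mode 2
B <- matrix(rnorm(8), 2, 4)   # acts on mode 3
Y1 <- kmode_prod(kmode_prod(X, A, 2), B, 3)
Y2 <- kmode_prod(kmode_prod(X, B, 3), A, 2)
max(abs(Y1 - Y2))             # ~ 0

# Fact 2: two products along the same mode compose as one matrix product
# (apply C first, then A: this equals a single product by A %*% C).
C <- matrix(rnorm(9), 3, 3)   # also acts on mode 2
Z1 <- kmode_prod(kmode_prod(X, C, 2), A, 2)
Z2 <- kmode_prod(X, A %*% C, 2)
max(abs(Z1 - Z2))             # ~ 0
```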
Tucker decomposition

The Tucker decomposition (Tucker (1966)) decomposes a tensor into a core tensor multiplied by a matrix along each mode (i.e., transformed via a $k$-mode product for every $k = 1, 2, \ldots, N$):

X = G \times_1 A^{(1)} \times_2 A^{(2)} \cdots \times_N A^{(N)}.

Note that $G$ might be much smaller than the original tensor $X$ if we accept an approximation instead of an exact equality.

In case of three-way tensors, we can hold on to the following mental image:

It is interesting to note that the CP decomposition, that I introduced in a previous blog post, is a special case of the Tucker decomposition, where the core tensor $G$ is constrained to be superdiagonal.
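To spell that out in the three-way case (a sketch of the standard argument): if the core tensor $G$ is superdiagonal, with entries $\lambda_r$ where all three indices coincide and zeros elsewhere, then

X = G \times_1 A \times_2 B \times_3 C, \qquad g_{r_1 r_2 r_3} = \lambda_r \text{ if } r_1 = r_2 = r_3 = r, \text{ and } 0 \text{ otherwise},

collapses to a sum of rank-one terms,

X = \sum_{r=1}^{R} \lambda_r \, a_r \circ b_r \circ c_r,

where $a_r$, $b_r$, $c_r$ are the columns of $A$, $B$, $C$ and $\circ$ denotes the outer product, which is exactly the CP form.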

Higher-order SVD (HOSVD)

So, how do you compute the Tucker decomposition?

Many algorithms rely on the following fundamental equivalence:


The above equation uses some notation that was not introduced yet:

  • $\otimes$ denotes the Kronecker product.
  • $X_{(k)}$ is the mode-$k$ unfolding (or mode-$k$ matricization) of the tensor $X$. The mode-$k$ unfolding arranges the mode-$k$ fibers (a fiber is a generalization of column to tensors) of $X$ as columns into a matrix. The concept may be easiest to understand by looking at an example. The following R code shows a 3-way tensor and all three of its mode-$k$ unfoldings (using the k_unfold function from the rTensor package):

    tnsr <- as.tensor(array(1:12, dim = c(2, 3, 2)))
    tnsr@data
    # , , 1
    #
    #      [,1] [,2] [,3]
    # [1,]    1    3    5
    # [2,]    2    4    6
    #
    # , , 2
    #
    #      [,1] [,2] [,3]
    # [1,]    7    9   11
    # [2,]    8   10   12

    # mode-1 unfolding:
    k_unfold(tnsr, 1)@data
    #      [,1] [,2] [,3] [,4] [,5] [,6]
    # [1,]    1    3    5    7    9   11
    # [2,]    2    4    6    8   10   12

    # mode-2 unfolding:
    k_unfold(tnsr, 2)@data
    #      [,1] [,2] [,3] [,4]
    # [1,]    1    2    7    8
    # [2,]    3    4    9   10
    # [3,]    5    6   11   12

    # mode-3 unfolding:
    k_unfold(tnsr, 3)@data
    #      [,1] [,2] [,3] [,4] [,5] [,6]
    # [1,]    1    2    3    4    5    6
    # [2,]    7    8    9   10   11   12
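For readers curious about the mechanics: the mode-k unfolding is just a dimension permutation followed by a reshape. A small base-R sketch (the helper name unfold is mine) reproduces the k_unfold results above:

```r
# Permute mode k to the front, then flatten the remaining modes column-major.
unfold <- function(X, k) {
  dims <- dim(X)
  matrix(aperm(X, c(k, seq_along(dims)[-k])), nrow = dims[k])
}

X <- array(1:12, dim = c(2, 3, 2))
unfold(X, 2)
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    7    8
# [2,]    3    4    9   10
# [3,]    5    6   11   12
```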

A straightforward approach to solve the Tucker decomposition would be to solve each mode-$k$ matricized form of the Tucker decomposition (shown in the equivalence above) for $A^{(k)}$. This approach is known as the higher order SVD, or HOSVD. It can be regarded as a generalization of the matrix SVD, because the matrices $A^{(k)}$ are orthogonal, while the tensor $G$ is “ordered” and “all-orthogonal” (see De Lathauwer et al. (2000) for details). The resulting algorithm is shown below.
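As a sketch of that algorithm in base R (helper names are mine, and the helpers are repeated here so the snippet runs standalone; it assumes each mode dimension is at most the product of the others, so every factor is square and the reconstruction is exact):

```r
# Mode-k unfolding: permute mode k to the front, then reshape.
unfold <- function(X, k) {
  dims <- dim(X)
  matrix(aperm(X, c(k, seq_along(dims)[-k])), nrow = dims[k])
}
# k-mode product: multiply every mode-k fiber by A, then fold back.
kmode_prod <- function(X, A, k) {
  dims <- dim(X)
  perm <- c(k, seq_along(dims)[-k])
  new_dims <- dims
  new_dims[k] <- nrow(A)
  aperm(array(A %*% matrix(aperm(X, perm), nrow = dims[k]),
              dim = new_dims[perm]), order(perm))
}
# HOSVD: factor k = left singular vectors of the mode-k unfolding;
# core G = X multiplied by t(A_k) on every mode.
hosvd_base <- function(X) {
  N <- length(dim(X))
  U <- lapply(seq_len(N), function(k) svd(unfold(X, k))$u)
  G <- X
  for (k in seq_len(N)) G <- kmode_prod(G, t(U[[k]]), k)
  list(G = G, U = U)
}

set.seed(42)
X <- array(rnorm(2 * 3 * 4), dim = c(2, 3, 4))
d <- hosvd_base(X)
# the reconstruction X = G x1 U1 x2 U2 x3 U3 holds up to numerical error
Xhat <- d$G
for (k in 1:3) Xhat <- kmode_prod(Xhat, d$U[[k]], k)
max(abs(X - Xhat))   # tiny (machine precision)
```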


In R we can perform HOSVD using the function hosvd from rTensor:

tnsr <- rand_tensor(modes = c(30, 40, 50))
hosv_decomp <- hosvd(tnsr)

Now hosv_decomp$Z is our core tensor $G$, and hosv_decomp$U is a list containing all the matrices $A^{(k)}$. We can use the function ttl, which successively performs multiple k-mode products on multiple modes given a tensor and a list of matrices, to check that, up to numerical error, the equation
$X = G \times_1 A^{(1)} \times_2 A^{(2)} \cdots \times_N A^{(N)}$
is satisfied:

HOSVD_prod <- ttl(hosv_decomp$Z, hosv_decomp$U, 1:3)
error <- tnsr - HOSVD_prod
table(abs(error@data) < 1e-12)
#
#  TRUE
# 60000

Higher order orthogonal iteration (HOOI)

Note that we can also use HOSVD to compress $X$ by truncating the matrices $A^{(k)}$. The truncated HOSVD, however, is known to not give the best fit, as measured by the norm of the difference

\lVert X - G \times_1 A^{(1)} \times_2 A^{(2)} \cdots \times_N A^{(N)} \rVert.

The higher order orthogonal iteration, or HOOI, algorithm finds the optimal approximation $\widehat{X}$ (with respect to the Frobenius norm loss) by, essentially, iterating the alternating truncation and SVD until convergence. If we truncate $A^{(k)}$ to have $r_k$ columns, then the HOOI solution can be obtained by the following algorithm.
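A base-R sketch of that alternating scheme (again with self-contained helpers and names of my choosing; a fixed number of sweeps stands in for a proper convergence check):

```r
unfold <- function(X, k) {
  dims <- dim(X)
  matrix(aperm(X, c(k, seq_along(dims)[-k])), nrow = dims[k])
}
kmode_prod <- function(X, A, k) {
  dims <- dim(X)
  perm <- c(k, seq_along(dims)[-k])
  new_dims <- dims
  new_dims[k] <- nrow(A)
  aperm(array(A %*% matrix(aperm(X, perm), nrow = dims[k]),
              dim = new_dims[perm]), order(perm))
}

hooi <- function(X, ranks, sweeps = 10) {
  N <- length(dim(X))
  # initialise with the truncated HOSVD factors
  U <- lapply(seq_len(N), function(k)
    svd(unfold(X, k))$u[, seq_len(ranks[k]), drop = FALSE])
  for (s in seq_len(sweeps)) {
    for (k in seq_len(N)) {
      Y <- X
      for (m in seq_len(N)[-k])           # project every mode except k
        Y <- kmode_prod(Y, t(U[[m]]), m)
      # update factor k from the projected tensor's mode-k unfolding
      U[[k]] <- svd(unfold(Y, k))$u[, seq_len(ranks[k]), drop = FALSE]
    }
  }
  G <- X                                   # core: project every mode
  for (k in seq_len(N)) G <- kmode_prod(G, t(U[[k]]), k)
  list(G = G, U = U)
}

set.seed(7)
X <- array(rnorm(4 * 5 * 6), dim = c(4, 5, 6))
fit <- hooi(X, ranks = c(2, 2, 2))
Xhat <- fit$G
for (k in 1:3) Xhat <- kmode_prod(Xhat, fit$U[[k]], k)
sqrt(sum((X - Xhat)^2)) / sqrt(sum(X^2))   # relative approximation error
```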


Application of HOOI to data compression

The example considered below is somewhat silly, given that the tensor I’m compressing isn’t very big, and thus there isn’t much of a point in compressing it. However, I think the example still shows how useful the algorithm can be when the data size is much bigger (or the available storage much smaller).

I have downloaded from Kaggle the World Development Indicators dataset, originally collected and published by The World Bank (the original dataset is available here).

The data can be arranged into a three-way tensor with the three modes corresponding to country (list of available countries), indicator (list of available indicators), and year (1960-2014). Since I didn’t have any time to deal with NA values in any creative way, I have kept only three indicators in the dataset. And I have replaced the remaining NAs with a country-wise average value for each particular indicator. Also, I have forgotten to normalize the data :disappointed:. The preprocessing resulted in a tensor of size 247-countries-by-3-indicators-by-55-years, that looks sort of like this:

In particular, large stretches of the data within a given country tend to be nearly constant, or nearly piece-wise constant.

We use the function tucker from rTensor to obtain a Tucker decomposition via HOOI, where we set the ranks to the value 3 at each mode.

dim(wdi_tnsr)
# [1] 247   3  55
tucker_decomp <- tucker(wdi_tnsr, ranks = c(3, 3, 3))
str(tucker_decomp)
# List of 7
#  $ Z           :Formal class 'Tensor' [package "rTensor"] with 3 slots
#   .. ..@ num_modes: int 3
#   .. ..@ modes    : int [1:3] 3 3 3
#   .. ..@ data     : num [1:3, 1:3, 1:3] -6.60e+10 -1.13e+05 6.24e+05 -7.76e+05 -1.93e+08 ...
#  $ U           :List of 3
#   ..$ : num [1:247, 1:3] -0.02577 -0.00065 -0.01146 -0.19637 -0.17317 ...
#   ..$ : num [1:3, 1:3] -1.00 -6.97e-10 -2.08e-02 2.08e-02 -4.70e-08 ...
#   ..$ : num [1:55, 1:3] -0.0762 -0.0772 -0.0785 -0.0802 -0.082 ...
#  $ conv        : logi TRUE
#  $ est         :Formal class 'Tensor' [package "rTensor"] with 3 slots
#   .. ..@ num_modes: int 3
#   .. ..@ modes    : int [1:3] 247 3 55
#   .. ..@ data     : num [1:247, 1:3, 1:55] 9.83e+07 4.44e+06 8.81e+07 1.05e+09 8.97e+08 ...
#  $ norm_percent: num 99.4
#  $ fnorm_resid : num 3.9e+08
#  $ all_resids  : num [1:2] 3.9e+08 3.9e+08
# NULL

To see how well the tensor decomposition approximates the original tensor, we can look at the relative error

wdi_tnsr_approx <- ttl(tucker_decomp$Z, tucker_decomp$U, 1:3)
fnorm(wdi_tnsr - wdi_tnsr_approx) / fnorm(wdi_tnsr)
# [1] 0.005908934

and at the percentage of the norm of the original tensor explained by the Tucker decomposition

tucker_decomp$norm_percent
# [1] 99.40911

We observe that we indeed achieve a recovery with an accuracy of over 99%. For comparison, the original tensor contains 247 * 3 * 55 = 40755 entries, while the computed Tucker decomposition consists of only 247 * 3 + 3 * 3 + 55 * 3 + 3 * 3 * 3 = 942 numbers. That’s a reduction in size by a factor of about 43.
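As a quick check of the storage arithmetic (tucker_size is an ad-hoc helper name): a rank-(r1, …, rN) Tucker decomposition stores sum(I_k * r_k) numbers for the factor matrices plus prod(r_k) for the core.

```r
tucker_size <- function(dims, ranks) sum(dims * ranks) + prod(ranks)

dims  <- c(247, 3, 55)   # country x indicator x year
ranks <- c(3, 3, 3)
prod(dims)                             # 40755 entries in the full tensor
tucker_size(dims, ranks)               # 942 numbers in the decomposition
prod(dims) / tucker_size(dims, ranks)  # compression factor of about 43
```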

Even though data compression does not make much sense for the size of the dataset considered here, it clearly shows potential to be very useful for purposes of data distribution and data storage, when the data size far exceeds the terabyte range.

Software for honours students

Wed, 04/05/2017 - 01:00

(This article was first published on R – Hyndsight, and kindly contributed to R-bloggers)

I spoke to our new crop of honours students this morning. Here are my slides, example files and links.


  • Managing References
  • Data analysis and computation
  • Writing your thesis
  • LaTeX
  • RMarkdown template

To leave a comment for the author, please follow the link and comment on their blog: R – Hyndsight.

Publish R functions as stored procedures with the sqlrutils package

Wed, 04/05/2017 - 00:19

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you've created an R function (say, a routine to clean up missing values in a data set, or a function to make forecasts using a machine learning model), and you want to make it easy for DBAs to use it, it's now possible to publish R functions as a SQL Server 2016 stored procedure. The sqlrutils package provides tools to convert an existing R function to a stored procedure which can then be executed by anyone with authenticated access to the database — even if they don't know any R.

To use an R function as a stored procedure, you'll need SQL Server 2016 with R Services installed. You'll also need to use the sqlrutils package to publish the function as a stored procedure: it's included with both Microsoft R Client (available free) and Microsoft R Server (included with SQL Server 2016), version 9.0 or later.

With that set up, the process is fairly simple:

  • Using Microsoft R (Client or Server), call library(sqlrutils) to make the publishing functions available in R.
    • Call help(package="sqlrutils") to see the list of functions provided.
  • Create the R function you want to publish. It should include one data.frame argument (this will be the input from the database) and return a data.frame as the value of the stored procedure.
    • The function can also return NULL, if the goal of the function is to modify the database directly.
    • You can use functions from the RevoScaleR package. For example, use RxSqlServerData to access data in the database with a SQL query, use RxDataStep to transform data using R, or use functions like rxLogit to train a predictive model.
    • Note that the function will only access data passed in as inputs — don't try reaching out to global variables, as they won't be there when the function runs within SQL Server.
  • Declare the input parameter for the stored procedure with the InputData function. In addition to naming the parameter, you can specify a default query to use if none is provided.
  • Optionally, define one or more additional input parameters for the stored procedure. These will become inputs to the R function defined above.
    • Only R objects of type POSIXct, numeric, character, integer, logical, and raw are allowed as inputs.
  • Prepare the R function for publishing with the StoredProcedure function. This is where you name the stored procedure and define the input parameters declared above.
  • Publish the stored procedure to the database with registerStoredProcedure. 
    • You'll need to provide a connection string with location and authentication information for the database. You can also provide this in the previous step, when you create the stored procedure.
    • You can test out the stored procedure from your R console with executeStoredProcedure.

That's it! Now your R function is available as an ordinary stored procedure, and can be executed (with its provided input data and optional parameters) like any other stored procedure.  

This is just an outline, and I've omitted some of the options for clarity. For more details on the process, you can find complete documentation at the link below.

Microsoft Docs: Generating an R Stored Procedure for R Code using the sqlrutils Package


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R track on Exercism

Wed, 04/05/2017 - 00:00

(This article was first published on Jon Calder's R Blog, and kindly contributed to R-bloggers)

The devil is in the details! –

As I’ve said before, when it comes to programming I’m a firm believer in the “learn by doing” approach. Exercism is a project which exemplifies this.

I came across Exercism earlier this year while exploring GitHub looking for an open source project to contribute to. The premise is fairly simple:

  • pick a language that you’d like to learn or improve on
  • fetch an exercise via the Command Line Interface (CLI)
  • code up and submit a solution for the exercise
  • return to the site to get feedback on your solution and/or to compare it to the solutions others have come up with

The problems are very simple to begin with, (usually starting out with the traditional “Hello, World!”), progressing to more difficult problems further on. It’s aimed at both newbies and experienced developers, with the philosophy being that for newer programmers the exercises are achievable but “with enough complexity to uncover bite-sized knowledge gaps”, whereas for more experienced developers, the problems “provide a balance of constraints and freedom and encourage you to explore trade-offs and best-practices”.

More experienced developers are also encouraged to get involved by reviewing the solutions of others and/or contributing to the project, which is completely open source, whether it is the website itself, APIs, the CLI, documentation or track content. The site is also well suited to those with experience in one or more languages who want to ramp up in a new language, or to get a sense of how the idiomatic approach to a problem might differ from language to language.

There are currently 35 active language tracks, with another 20+ language tracks planned or upcoming (essentially in an incubation stage). One of those which is hopefully soon-to-be-launched is the R language track, which I’ve been contributing to over the past two months and am pretty excited about.

If you’re interested, there are a number of ways you can get involved:

  • Firstly, give the R language track a try! (you’ll need to log in on Exercism using your GitHub account)
  • Once you’ve submitted a solution to a problem, you’ll be able to see other people’s submissions, and are encouraged to comment on these and get discussions going around style and best practice, pros and cons of different approaches, etc.
  • If you encounter any problems along the way (e.g. setup/install instructions, fetching or submitting exercises, running tests etc) then raise it via the online chat support in order to get help and/or bring it to someone’s attention if something needs to be fixed
  • With regard to the above, if you encounter an issue which applies specifically to the R track, then please open an issue on the R track repo
  • Consider becoming a mentor for the R track (if you’re interested, please reach out to me on Twitter or via my contact page)
  • Check out the list of unimplemented exercises for the R track and follow the instructions there to submit a pull request
  • Lastly, look through Exercism's contributing doc, which outlines a number of other ways you can get involved (either with a specific language track, or across other areas of the project as well)

Based on my involvement so far the community seems very friendly and open, and I think it’s a great open source initiative, so there’s no reason why the #rstats community shouldn’t be well represented there.

On that note, a big thank you to exercism’s contributors, and especially to Jonathan Boiser and Katrina Owen for being friendly, helpful and supportive of my involvement in the project thus far. You guys are excellent role models for the greater open source community.

To leave a comment for the author, please follow the link and comment on their blog: Jon Calder's R Blog.

R with GRASS GIS – easiest Munros

Tue, 04/04/2017 - 22:27

(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

I’ve been having a more sustained play with R and GRASS together. I’ve previously used them, in tandem, for processing satellite images, but haven’t been much further. In this post I’ll look at finding the nearest roads to mountain summits, querying their elevations and presenting summary statistics.

First off, you can run R in GRASS or GRASS in R. I prefer the former as it avoids RAM issues. To do this, fire up GRASS, ignore the GUI (for now) and type R at the command line. You can also run RStudio instead, by typing rstudio &. Amazing! For a full breakdown, see this wiki.

For this work I’ve used the Ordnance Survey Terrain50 elevation model and the OpenRoads dataset. Before you can use the OpenRoads data you’ll need to change the column names; I’ve already solved this problem. Shamefully, I downloaded a Munro dataset some years ago and have no idea where from…

You can import your vector and raster data (roads, summits and elevation model) into GRASS using the GUI, this is probably the easiest method. Following this, you’ll need to filter the OpenRoads data to remove “Not Classified” roads, as these are generally not public and usually dirt tracks (769633 of them in GB). Do this with:

v.extract input=roads@network where="class != 'Not Classified'" output=roads_public

Note: any call wrapped in system() can be executed at the GRASS console or in a terminal, without R. The above has a lot of quotes, so it is easiest run in the console or terminal!

Next we can run the following code for our analysis (the v.out.ascii and following cleaning calls would likely be better in readVECT):

library(rgrass7)
library(rgdal)

# Add new columns to vector
system("v.db.addcolumn map=Munros@munro columns='dist INT,to_x INT,to_y INT'")

# Get distance and coordinates of nearest road point
system("v.distance from=Munros@munro from_type=point to=roads_public@munro to_type=line upload=dist,to_x,to_y column=dist,to_x,to_y")

# Read results into R
Munro = execGRASS("v.out.ascii",
                  parameters=list(input="Munros@munro", type="point",
                                  columns="NAME,HT,dist,to_x,to_y"),
                  intern=T)

# Clean results
Munro = strsplit(Munro, split="|", fixed=T)
Munro ="", Munro)[, -3],
                   stringsAsFactors=FALSE)
colnames(Munro) = c("Easting", "Northing", "Name", "Height",
                    "Road_dist", "Road_x", "Road_y")
Munro[, -3] = apply(Munro[, -3], 2, as.numeric)
Munro$Easting = round(Munro$Easting)
Munro$Northing = round(Munro$Northing)

# Write data to csv
# Awkward, but easiest way to get point file of road locations!
write.csv(Munro, "~/Munro.csv", row.names=F)

# Make point file of nearest road locations
system(" --overwrite input=/home/melville/Munro.csv output=temp separator=comma x=6 y=7 skip=1")

# Interrogate dtm at nearest road points
# Annoyingly, this writes to another temp file
system("r.what --overwrite map=Terrain50@PERMANENT points=temp@munro output=temp.csv separator=comma")

# Read temp file of road heights and extract elevations
x = read.csv("~/temp.csv", header=F)
Munro$Road_height = round(x$V4)

# Calculate some statistics
Munro$Diff = Munro$Height - Munro$Road_height
Munro$Gradient = round(100 * Munro$Diff / Munro$Road_dist)

# Convert munro summit and nearest road points into lines
# (do this before dropping the coordinate columns below)
x = lapply(1:nrow(Munro), function(i){
  rbind(paste(Munro[i, c(1, 2)], collapse=","),
        paste(Munro[i, c(6, 7)], collapse=","),
        paste("NaN,NaN"))
})
x ="", x)
write.table(x, "~/temp.txt", col.names=F, row.names=F, quote=F)
system("v.in.lines input=/home/melville/temp.txt output=Munro_lines separator=comma")

# Keep the columns of interest and make them integer to avoid
# decimal places when exported to a shp file
Munro = Munro[, c(3:5, 8:10)]
for(i in 2:ncol(Munro)){
  class(Munro[, i]) = "integer"
}

# Read lines from GRASS
l = readVECT("Munro_lines")

# Add data frame to lines
l@data = Munro

# Write shp file
writeOGR(l, ".", "Munro", "ESRI Shapefile", overwrite_layer=T)

# Clean up
system("rm ~/Munro*")
system("rm ~/temp*")

Now that we have a dataset called Munro in our R environment, we can make some summary plots (code below):

WordPress helpfully tiled these plots for me, but they were created with:

# each png device is closed with so the file gets written
png("~/Cloud/Michael/websites/blog/Munro_height.png")
hist(Munro$Height, main="Munro elevation", xlab="Elevation (m)")

png("~/Cloud/Michael/websites/blog/Munro_dist.png")
hist(Munro$Road_dist, main="Munro distance from road", xlab="Distance (m)")

png("~/Cloud/Michael/websites/blog/Munro_road_ht.png")
hist(Munro$Road_height, main="Nearest road elevation", xlab="Elevation (m)")

png("~/Cloud/Michael/websites/blog/Munro_diff.png")
hist(Munro$Diff, main="Difference between Munro and road elevation", xlab="Elevation (m)")

png("~/Cloud/Michael/websites/blog/Munro_grad.png")
hist(Munro$Gradient, main="Average gradient between road and Munro", xlab="Gradient (%)")

png("~/Cloud/Michael/websites/blog/Munro_diff_dist.png")
plot(Diff ~ Road_dist, Munro, main="Elevation difference and distance from road for Munros", xlab="Distance to road (m)", ylab="Elevation difference (m)")

We can also find the “easiest” hills and those furthest from a road:

# Easiest hills
head(Munro[order(Munro$Diff), c(1, 5)])
head(Munro[order(Munro$Gradient), c(1, 6)])

# Furthest from a road
head(Munro[order(Munro$Road_dist, decreasing=T), c(1, 3)])

Name                  Elev diff (m)
The Cairnwell                   266
Carn Aosda                      308
An Socach: W summit             315
Meall a’Choire Leith            378
Tom Buidhe                      430
Tolmount                        434

Name               Distance to road (m)
Carn an Fhidhleir                 14413
An Sgarsoch                       14139
Ben Alder                         12844
Beinn Bheoil                      12113
Beinn Brohtain                    10980
Beinn Dearg                       10857

Finally, I used the Carto plugin for QGIS to upload the dataset to Carto. You can view and interact with it on my Carto map.


To leave a comment for the author, please follow the link and comment on their blog: R – scottishsnow.

R – iteratively changing column classes

Tue, 04/04/2017 - 20:46

(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

When you write an ESRI Shapefile from R, writeOGR assumes your numeric fields are real numbers and so you get a lot of redundant zeros appended. This is misleading and distracting when you present data to third parties. You can fix this by changing column class from numeric to integer:

class(df$col) = "integer"

However, this only changes one column at a time. You can’t change the whole data frame (S4 spatial, or whatever) using the same method, or you’ll no longer have a data frame. Usually I’d solve this with an *apply method (mapply or apply), but these don’t work here because the data frame gets coerced into a matrix. This is one of the rare occasions when you need to use a for loop in R:

cols_to_change = c(1, 3, 5:9)
for(i in cols_to_change){
  class(df[, i]) = "integer"
}
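That said, lapply (unlike apply or mapply) treats a data frame as a list of columns and never coerces it to a matrix, so it offers a loop-free alternative. A small sketch with a toy data frame (conversion via as.integer, which is equivalent here because the values are whole numbers):

```r
df <- data.frame(a = c(1, 2), b = c("x", "y"), c = c(3, 4),
                 stringsAsFactors = FALSE)
cols_to_change <- c(1, 3)
# replace the selected columns in place; df stays a data frame
df[cols_to_change] <- lapply(df[cols_to_change], as.integer)
sapply(df, class)
#           a           b           c
#   "integer" "character"   "integer"
```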

To leave a comment for the author, please follow the link and comment on their blog: R – scottishsnow.

Visualizing relational joins

Tue, 04/04/2017 - 18:16

I want to discuss a nice series of figures used to teach relational join semantics in R for Data Science by Garrett Grolemund and Hadley Wickham, O’Reilly 2016. Below is an example from their book illustrating an inner join:

Please read on for my discussion of this diagram and teaching joins.

Teaching joins

In the above diagram two tables are laid out at angles, lines are extended from every row in each table, and a subset of the line intersections are marked as obeying the join condition and hence being in the result. It is a great diagram for discussing the meaning of joins. Being able to organize data transforms in terms of joins is a critical data science skill, so there is great value in being able to teach join theory.

I’ve been trying to teach joins as notional expansion followed by selection, as shown in my recent Strata Spark Workshop (material developed in cooperation with Garrett Grolemund and others):

However, de-emphasizing the sequence of operations and the rejected join possibilities is an attractive alternative.
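The expansion-then-selection reading is easy to demonstrate in base R: merge with by = NULL produces the full cross product, and the join condition is then an ordinary row filter (the column names and data here are made up for illustration):

```r
x <- data.frame(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"),
                stringsAsFactors = FALSE)
y <- data.frame(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"),
                stringsAsFactors = FALSE)

# notional expansion: every row of x paired with every row of y
cross <- merge(x, y, by = NULL)
nrow(cross)   # 9

# selection: keep only the pairs satisfying the join condition
inner <- cross[cross$key.x == cross$key.y, ]
nrow(inner)   # 2  (keys 1 and 2 match; keys 3 and 4 do not)

# the same result set as a direct inner join
merge(x, y, by = "key")
```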

Deriving the diagram

Many relational joins can be thought of as conditions (often denoted as “theta”) under an outer or full-join operator “⟗” of appropriately augmented tables (adding “no-match” rows to each table before taking the cross product). The join symbol and theory are supposed to evoke the notion of a cross product, so teaching in terms of that seems sensible.

There are many methods of illustrating a set cross product including:

  • As a menu of possible combinations:

    (expanding this was the basis of my diagram).

  • As a complete bipartite graph:

  • As a matrix or grid:

If you flip the grid to an angle where both sets of source nodes have equivalent roles then you are getting back to the diagram of Grolemund and Wickham:

The idea is that the very many pairs induced by the full cross product are illustrated, but they are decorations on the crossing lines. This makes it easy to believe these induced nodes are notional, and (as is the case with real databases) only the ones needed are actually produced.

Other diagrams

I looked around for a short while for common SQL diagrams.

Venn diagrams are typically over-promoted as join mnemonics:

(Also discussed here.)

However there were some interesting illustrations trying both the grid and bipartite graph styles.


Bipartite Graphs



I like the idea of teaching all joins as filters (or theta-conditions) of the outer join. The Grolemund/Wickham diagrams are a good tool and have a style that reminds me of diagrammatic proofs of classic geometric theorems.

Python and R Vie for Top Spot in Kaggle Competitions

Tue, 04/04/2017 - 14:44

(This article was first published on R –, and kindly contributed to R-bloggers)

I’ve just updated the Competition Use section of The Popularity of Data Science Software. Here’s just that section for your convenience.

Competition Use

Kaggle is a web site that sponsors data science contests. People post problems there, along with the amount of money they are willing to pay the person or team who solves the problem best. Both money and the competitors’ reputations are on the line, so there’s strong motivation to use the best possible tools. Figure 9 compares the usage of the top two tools chosen by the data scientists working on the problems. From April 2015 through July 2016, we see the usage of both R and Python growing at a similar rate. At the most recent time point Python has pulled ahead slightly. Much more detail is available here.

Figure 9. Software used in data science competitions on Kaggle in 2015 and 2016.

To leave a comment for the author, please follow the link and comment on their blog: R –.