
### Classification from scratch, neural nets 6/8

Tue, 06/05/2018 - 10:43

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

Sixth post of our series on classification from scratch. The most recent one was on lasso regression, which was still based on a logistic regression model, assuming that the variable of interest has a Bernoulli distribution. From now on, we will discuss techniques that did not originate from those probabilistic models, even if they might still have a probabilistic interpretation. Somehow. Today, we will start with neural nets.

Maybe I should start with a disclaimer. The goal is not to replicate well-designed R functions used for predictive modeling. It is simply to get a basic understanding of what’s going on.

Networks, nodes and edges

First of all, neural nets are nets, or networks. I will skip the parallel with the “neural” stuff because it does not help me understand what is happening (all apologies for my poor knowledge of biology, and cells).

So, it’s about networks. Networks have nodes, and edges that connect the nodes,

or maybe, to be more specific (at least it helped me understand what’s going on), some sort of flow network,

In such a network, we usually have sources on the left (here, several of them, the \(x_j\)’s) and a sink on the right (here, \(y\)). To continue with this metaphorical introduction, information from the sources should reach the sink. And usually, the sources are explanatory variables, \(\mathbf{x}\), and the sink is our variable of interest, \(y\). We want to create a graph from the sources to the sink. We will have directed edges, with only one (unique) direction, on which we will put weights. It is not a flow; the parallel with flow networks stops here. For instance, the simplest network is the following one, with no layer (i.e. no node between the sources and the sink),

The output here is a binary variable, \(y\in\{0,1\}\) (it could also be \(y\in\{-1,+1\}\), but it’s not a big deal here). In our network, the output will be \(\widehat{y}\in[0,1]\), because it is easier to handle. For instance, consider \(\widehat{y}=f(\text{something})\), for some function \(f\) taking values in \([0,1]\). One can consider the sigmoid function\[f(x)=\frac{1}{1+e^{-x}}\]which is actually the logistic function (so we should not be surprised to get results somehow close to a logistic regression…). This function is called the activation function, and there are thousands of such functions. If \(y\in\{-1,+1\}\), people consider the hyperbolic tangent\[f(x)=\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}\]or the inverse tangent function, \(f(x)=\tan^{-1}(x)\).
And as input to such a function, we consider a weighted sum of the incoming nodes. So here\[\widehat{y}=f\Big(\sum_{j}\omega_j x_j\Big)\]We can also add a constant, actually,\[\widehat{y}=f\Big(\omega_0+\sum_{j}\omega_j x_j\Big)\]So far, we are not far away from the logistic regression. Except that our starting point there was a probabilistic model, in the sense that the latter was interpreted as a probability (the probability that \(y=1\)) and we wanted the model with the highest likelihood. But we’ll talk about the selection of weights later on. First, let us construct our first (very simple) neural network. We start with the sigmoid function

```r
sigmoid = function(x) 1 / (1 + exp(-x))
```
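Other classical activation functions can be written the same way. A quick sketch (the rescaled arctangent version, mapping into \((0,1)\), is my own choice here, not something used later in the post):

```r
sigmoid = function(x) 1 / (1 + exp(-x))
# hyperbolic tangent, taking values in (-1,1)
tanh_act = function(x) tanh(x)
# rescaled inverse tangent, taking values in (0,1)
atan_act = function(x) atan(x)/pi + 1/2
c(sigmoid(0), tanh_act(0), atan_act(0))  # 0.5, 0, 0.5 at the origin
```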

Then consider some weights. In our model with seven explanatory variables, we need 7 weights. Or 8 if we include the constant term. Let us consider \(\boldsymbol{\omega}=\mathbf{1}\),

```r
weights_0 = rep(1,8)
X = as.matrix(cbind(1,myocarde[,1:7]))
y_5_1 = sigmoid(X %*% weights_0)
```

That’s kind of stupid, because all our predictions are 1 here. Let us try something else, like the OLS estimates \(\widehat{\boldsymbol{\omega}}^{\text{ols}}\). They are optimized, somehow, but we need something to visualize what’s going on,

```r
weights_0 = lm(PRONO~.,data=myocarde)$coefficients
```
then use
```r
y_5_1 = sigmoid(X %*% weights_0)
```
In order to see if we get a “good” prediction, let us plot the ROC curve, and compare it with the one we got with a (simple) logistic regression,
```r
library(ROCR)
pred = ROCR::prediction(y_5_1,myocarde$PRONO)
perf = ROCR::performance(pred,"tpr", "fpr")
plot(perf,col="blue",lwd=2)
reg = glm(PRONO~.,data=myocarde,family=binomial(link = "logit"))
y_0 = predict(reg,type="response")
pred0 = ROCR::prediction(y_0,myocarde$PRONO)
perf0 = ROCR::performance(pred0,"tpr", "fpr")
plot(perf0,add=TRUE,col="red")
```
That’s not bad for a very first attempt. Except that we’ve been cheating here, since we did use the OLS estimates. How, for real, should we choose those weights?

Using a loss function

Well, if we want an “optimal” set of weights, we need to “optimize” an objective function. So we need to quantify the loss of a mistake, between the prediction and the observation. Consider here a quadratic loss function
```r
loss = function(weights){
  mean( (myocarde$PRONO-sigmoid(X %*% weights))^2) }
```

It might be stupid to use a quadratic loss function for a classification problem, but here, that’s not the point. We just want to understand the algorithm we use; the loss function is just one parameter. Then we want to solve\[\min_{\boldsymbol{\omega}}\left\lbrace\frac{1}{n}\sum_{i=1}^n\big(y_i-f(\boldsymbol{\omega}^T\mathbf{x}_i)\big)^2\right\rbrace\]Thus, consider

```r
weights_1 = optim(weights_0,loss)$par
```
(where the starting point is the OLS estimate). Again, to see what’s going on, let us visualize the ROC curve,
```r
y_5_2 = sigmoid(X %*% weights_1)
pred = ROCR::prediction(y_5_2,myocarde$PRONO)
perf = ROCR::performance(pred,"tpr", "fpr")
plot(perf,col="blue",lwd=2)
plot(perf0,add=TRUE,col="red")
```
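As a side note, the ROC curve that ROCR computes can also be obtained by hand, by moving a classification threshold over the scores. A minimal sketch, on made-up scores (not the myocarde data):

```r
# ROC points by hand: for each threshold s, classify score >= s as 1, then compute (FPR, TPR)
roc_points = function(score, y){
  thresholds = sort(unique(score), decreasing = TRUE)
  t(sapply(thresholds, function(s){
    yhat = as.numeric(score >= s)
    c(fpr = sum(yhat == 1 & y == 0) / sum(y == 0),
      tpr = sum(yhat == 1 & y == 1) / sum(y == 1))
  }))
}
score = c(.9, .8, .7, .6, .4, .3)
y     = c( 1,  1,  0,  1,  0,  0)
roc_points(score, y)
```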

That’s not amazing, but again, that’s only a first step.

A single layer

Let us add a single layer in our network.

Those nodes are connected to the sources (incoming edges) on the left, and then connected to the sink, on the right. Those nodes are not inter-connected. And again, for that network, we need edges (i.e. series of weights). For instance, on the network above, we did add one single layer, with (only) three nodes.

For such a network, the prediction formula is\[\widehat{y}=f\left(\omega_0+\sum_{h}\omega_h\, f\Big(\omega_{0,h}+\sum_{j}\omega_{j,h}x_j\Big)\right)\]or, more synthetically,\[\widehat{y}=f\Big(\omega_0+\sum_{h}\omega_h\, f(\boldsymbol{\omega}_h^T\mathbf{x})\Big)\]Usually, we consider the same activation function \(f\) everywhere. Don’t ask me why, I find that weird.
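In code, that prediction formula is a simple forward pass. A minimal sketch (the weights here are illustrative zeros, not the myocarde estimates):

```r
sigmoid = function(x) 1 / (1 + exp(-x))
# one hidden layer: W is h x (p+1) (hidden weights, first column for the constant),
# w is a vector of length h+1 (output weights, first entry for the constant)
forward = function(x, W, w){
  z = sigmoid(W %*% c(1, x))   # hidden nodes: f(omega_{0,h} + sum_j omega_{j,h} x_j)
  sigmoid(sum(w * c(1, z)))    # sink: f(omega_0 + sum_h omega_h z_h)
}
W = matrix(0, nrow = 3, ncol = 4)  # 3 hidden nodes, 3 inputs, all-zero weights
w = rep(0, 4)
forward(c(1, 2, 3), W, w)  # 0.5, since all the weights are zero
```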

Now, we have a lot of weights to choose. Let us again use the OLS estimates,

```r
weights_1 <- lm(PRONO~1+FRCAR+INCAR+INSYS+PAPUL+PVENT,data=myocarde)$coefficients
X1 = as.matrix(cbind(1,myocarde[,c("FRCAR","INCAR","INSYS","PAPUL","PVENT")]))
weights_2 <- lm(PRONO~1+INSYS+PRDIA,data=myocarde)$coefficients
X2 = as.matrix(cbind(1,myocarde[,c("INSYS","PRDIA")]))
weights_3 <- lm(PRONO~1+PAPUL+PVENT+REPUL,data=myocarde)$coefficients
X3 = as.matrix(cbind(1,myocarde[,c("PAPUL","PVENT","REPUL")]))
```
In that case, we did specify the edges, i.e. which sources (explanatory variables) should be used for each additional node. Actually, other techniques could have been used here, like a PCA: each node would then be one of the principal components. But we’ll use that idea later on…
```r
X = cbind(sigmoid(X1 %*% weights_1), sigmoid(X2 %*% weights_2), sigmoid(X3 %*% weights_3))
```
But we’re not done here. Those were the weights from the sources to the nodes in the layer. We still need the weights from the nodes to the sink. Here, let us use a simple average,
```r
weights = c(1/3,1/3,1/3)
y_5_3 <- sigmoid(X %*% weights)
```
Again, we can plot the ROC curve to see what we’ve done…
```r
pred = ROCR::prediction(y_5_3,myocarde$PRONO)
perf = ROCR::performance(pred,"tpr", "fpr")
plot(perf,col="blue",lwd=2)
plot(perf0,add=TRUE,col="red")
```

On back propagation

Now, we need some optimal selection of those weights. Observe that with only 3 nodes, there are already \(3\times 7+4=25\) parameters in that model! Clearly, parsimony is not the major issue when you start using neural nets. We want to solve\[\min_{\boldsymbol{\omega}}\left\lbrace\frac{1}{n}\sum_{i=1}^n \ell(y_i,\widehat{y}_i)\right\rbrace\]for some loss function \(\ell\), which is\[\ell(y,\widehat{y})=(y-\widehat{y})^2\]for the quadratic norm, or\[\ell(y,\widehat{y})=-\big(y\log\widehat{y}+(1-y)\log(1-\widehat{y})\big)\]if we want to use cross-entropy.
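The two loss functions can be written explicitly. A quick sketch, where `p` denotes the predicted value \(\widehat{y}\) (toy numbers, not the myocarde data):

```r
# average quadratic loss and average cross-entropy, for predictions p in (0,1)
quad_loss = function(y, p) mean((y - p)^2)
xent_loss = function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))
y = c(1, 0, 1); p = c(.9, .2, .6)
quad_loss(y, p)  # 0.07
xent_loss(y, p)
```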

For convenience, let us center and scale all the variables we create; otherwise, we get numerical problems.

```r
center = function(z) (z-mean(z))/sd(z)
loss = function(weights){
  weights_1 = weights[0+(1:7)]
  weights_2 = weights[7+(1:7)]
  weights_3 = weights[14+(1:7)]
  weights_  = weights[21+1:4]
  X1=X2=X3=as.matrix(myocarde[,1:7])
  Z1 = center(X1 %*% weights_1)
  Z2 = center(X2 %*% weights_2)
  Z3 = center(X3 %*% weights_3)
  X = cbind(1,sigmoid(Z1), sigmoid(Z2), sigmoid(Z3))
  mean( (myocarde$PRONO-sigmoid(X %*% weights_))^2)}
```
Now that we have our objective function, consider some starting points. We can take weights from a PCA, and then use a gradient descent algorithm,
```r
library(factoextra)   # for get_pca_var()
pca = princomp(myocarde[,1:7])
W = get_pca_var(pca)$contrib
weights_0 = c(W[,1],W[,2],W[,3],c(-1,rep(1,3)/3))
weights_opt = optim(weights_0,loss)$par
```
The prediction is then obtained using
```r
weights_1 = weights_opt[0+(1:7)]
weights_2 = weights_opt[7+(1:7)]
weights_3 = weights_opt[14+(1:7)]
weights_  = weights_opt[21+1:4]
X1=X2=X3=as.matrix(myocarde[,1:7])
Z1 = center(X1 %*% weights_1)
Z2 = center(X2 %*% weights_2)
Z3 = center(X3 %*% weights_3)
X = cbind(1,sigmoid(Z1), sigmoid(Z2), sigmoid(Z3))
y_5_4 = sigmoid(X %*% weights_)
```
And as previously, why not plot the ROC curve of that model,
```r
pred = ROCR::prediction(y_5_4,myocarde$PRONO)
perf = ROCR::performance(pred,"tpr", "fpr")
plot(perf,col="blue",lwd=2)
plot(perf0,add=TRUE,col="red")
```

That’s not too bad. But with 25 coefficients, that’s what we would expect, no?
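Note that `optim()` uses the Nelder-Mead simplex method by default; a plain fixed-step gradient descent is a simple alternative. A toy sketch on a one-dimensional quadratic objective (not the network loss itself):

```r
# fixed-step gradient descent on g(w) = (w - 3)^2, whose gradient is 2(w - 3)
grad_descent = function(w0, lr = 0.1, n = 200){
  w = w0
  for(i in 1:n) w = w - lr * 2 * (w - 3)
  w
}
grad_descent(0)  # converges to the minimum, w = 3
```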

Using nnet() function

That’s more or less what is done in neural nets functions. Let us now have a look at some dedicated R functions.

```r
library(nnet)
myocarde_minmax = myocarde
minmax = function(z) (z-min(z))/(max(z)-min(z))
for(j in 1:7) myocarde_minmax[,j] = minmax(myocarde_minmax[,j])
```

Here, variables are linearly transformed to take values in \([0,1]\). Then we can construct a neural network with one single layer, and three nodes,

```r
model_nnet = nnet(PRONO~.,data=myocarde_minmax,size=3)
summary(model_nnet)
a 7-3-1 network with 28 weights
options were -
  b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1 
  -9.60  -1.79  21.00  14.72 -20.45  -5.05  14.37 -17.37 
  b->h2 i1->h2 i2->h2 i3->h2 i4->h2 i5->h2 i6->h2 i7->h2 
   4.72   2.83  -3.37  -1.64   1.49   2.12   2.31   4.00 
  b->h3 i1->h3 i2->h3 i3->h3 i4->h3 i5->h3 i6->h3 i7->h3 
  -0.58  -6.03  25.14  18.03  -1.19   7.52 -19.47 -12.95 
   b->o  h1->o  h2->o  h3->o 
  -1.32  29.00 -10.32  26.27 
```
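The 28 weights reported by `summary()` are consistent with a fully connected 7-3-1 network with bias terms; a quick check:

```r
# (p inputs + bias) weights per hidden node, then (h hidden nodes + bias) for the output
count_weights = function(p, h) (p + 1) * h + (h + 1)
count_weights(7, 3)  # 28
```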

Here is the complete network. And actually, there are (online) some functions that can be used to visualize that network,

```r
library(devtools)
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
plot.nnet(model_nnet)
```

Nice, isn’t it? We clearly see the intermediary layer, with three nodes, and on top the constants. Edges are the plain lines, the darker, the heavier (in terms of weights).

Using neuralnet()

Other R functions can actually be considered.

```r
library(neuralnet)
model_nnet = neuralnet(formula(glm(PRONO~.,data=myocarde_minmax)),
                       myocarde_minmax,hidden=3, act.fct = sigmoid)
plot(model_nnet)
```

Again, for the same network structure, with one (hidden) layer, and three nodes in it.

Network with multiple layers

The good thing is that it is now possible to add more layers. Like two layers: nodes from the first layer are no longer connected with the sink, but with nodes in the second layer, and those nodes are then connected to the sink. We now have something like\[\widehat{y}=f\Big(\omega_0+\sum_{h}\omega_h\, z_h^{(2)}\Big)\]where each node \(z_h^{(2)}\) of the second layer aggregates, in the same way, the nodes of the first layer. I may be rambling here (a little bit) but that’s a lot of parameters. Here is the visualization of such a network,

```r
library(neuralnet)
model_nnet = neuralnet(formula(glm(PRONO~.,data=myocarde_minmax)),
                       myocarde_minmax,hidden=c(3,2), act.fct = sigmoid)  # two hidden layers (sizes illustrative)
plot(model_nnet)
```

Application

Let us get back to our simple dataset, with only two covariates.

```r
library(neuralnet)
df_minmax = df
df_minmax$y = (df_minmax$y=="1")*1
minmax = function(z) (z-min(z))/(max(z)-min(z))
for(j in 1:2) df_minmax[,j] = minmax(df[,j])
X = as.matrix(cbind(1,df_minmax[,1:2]))
```

Consider only one layer, with two nodes

```r
model_nnet = neuralnet(formula(lm(y~.,data=df_minmax)),
                       df_minmax,hidden=c(2))
plot(model_nnet)
```

Here, we did not specify it, but the activation function is the sigmoid (actually, it is called logistic here)

```r
model_nnet$act.fct
function (x) 
{
    1/(1 + exp(-x))
}
attr(,"type")
[1] "logistic"
f = model_nnet$act.fct
```

The weights (on the figure) can be obtained using

```r
w0 = model_nnet$weights[[1]][[2]][,1]
w1 = model_nnet$weights[[1]][[1]][,1]
w2 = model_nnet$weights[[1]][[1]][,2]
```
Now, to get our prediction, we should use\[\widehat{y}=f\big(\omega_0+\omega_1\, f(\mathbf{x}^T\mathbf{w}_1)+\omega_2\, f(\mathbf{x}^T\mathbf{w}_2)\big)\]which can be obtained using
```r
f(cbind(1,f(X%*%w1),f(X%*%w2))%*%w0)
              [,1]
 [1,] 0.7336477343
 [2,] 0.7317999050
 [3,] 0.7185803540
 [4,] 0.7404005280
 [5,] 0.7518482779
 [6,] 0.4939774149
 [7,] 0.4965876378
 [8,] 0.7101714888
 [9,] 0.5050760026
[10,] 0.5049877644
```
Unfortunately, it is not the output of the model here,
```r
neuralnet::prediction(model_nnet)
Data Error:  0;
$rep1
       x1           x2              y
1  0.1250 0.0000000000  0.02030470787
2  0.0625 0.1176470588  0.89621706711
3  0.9375 0.2352941176  0.01995171956
4  0.0000 0.4705882353  1.10849420363
5  0.5000 0.4705882353 -0.01364966058
6  0.3125 0.5294117647 -0.02409150561
7  0.6875 0.8235294118  0.93743057765
8  0.3750 0.8823529412  1.01320924782
9  1.0000 0.9058823529  1.04805134309
10 0.5625 1.0000000000  1.00377379767
```

If anyone has a clue, I’d be glad to know what went wrong here… I find it odd to have outputs outside the unit interval. And the output is not the following either (although, up to the ordering of the rows, these are exactly the values returned by prediction() above; my guess is that the activation function is simply not applied at the output node, since linear.output=TRUE is the default in neuralnet()),

```r
cbind(1,f(X%*%w1),f(X%*%w2))%*%w0
               [,1]
 [1,]  1.01320924782
 [2,]  1.00377379767
 [3,]  0.93743057765
 [4,]  1.04805134309
 [5,]  1.10849420363
 [6,] -0.02409150561
 [7,] -0.01364966058
 [8,]  0.89621706711
 [9,]  0.02030470787
[10,]  0.01995171956
```

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### epubr 0.4.0 CRAN release

Tue, 06/05/2018 - 02:00

(This article was first published on R on Matthew Leonawicz | Blog, and kindly contributed to R-bloggers)

The epubr package provides functions supporting the reading and parsing of internal e-book content from EPUB files. E-book metadata and text content are parsed separately and joined together in a tidy, nested tibble data frame.
E-book formatting is non-standard enough across literature that no single function can parse an arbitrary collection of e-books, in completely general form, into one consistently formatted output containing all the same variables.



### Animating Changes in Football Kits using R

Tue, 06/05/2018 - 02:00

(This article was first published on R-bloggers on Guy Abel, and kindly contributed to R-bloggers)

Background

I am enjoying the magick package at the moment. Reading through the vignette I spotted the image_morph() function. In this post I experiment with the function to build the GIF below that shows the changes in the England football first kit over time, using images from the excellent Historical Football Kits website.

Scraping

The Historical Football Kits website has a detailed section on England kits spread over six pages, starting from the first outfits used in 1872. Each page includes some interesting discussion and, importantly for this post, images of the kits.

We can use read_html() from the xml2 package and map() from purrr to read and save the source code of each page.

```r
library(rvest)
library(tidyverse)
htmls <- c(
  "http://www.historicalkits.co.uk/international/england/england-1872-1939.html",
  "http://www.historicalkits.co.uk/international/england/england-1946-1960.html",
  "http://www.historicalkits.co.uk/international/england/england-1960-1983.html",
  "http://www.historicalkits.co.uk/international/england/england-1984-1997.html",
  "http://www.historicalkits.co.uk/international/england/england-1997-2010.html",
  "http://www.historicalkits.co.uk/international/england/england-2010-2019.html"
) %>%
  map(read_html)
```

From the source code we can then find the URLs of the kit image files, using html_nodes() and html_attr() from rvest. I used purrr’s map_dfr() to store the links in a tibble, and then dropped rows that do not contain kit image links, or that are images of away kits, kits used in a single game, or links to shops selling replicas. This filtering was based on the image label or image URL, and performed with the aid of the str_detect() function from stringr.

```r
library(stringr)
scrape_img_url <- function(html){
  html %>%
    html_nodes(".float p , .float img") %>%
    html_attr("src") %>%
    tbl_df() %>%
    set_names("img_url") %>%
    mutate(label = html %>%
             html_nodes(".float p , .float img") %>%
             html_text() %>%
             c(., NA) %>%
             .[-1])
}

d1 <- htmls %>%
  map_dfr(scrape_img_url) %>%
  filter(str_detect(string = img_url, pattern = "/international/england"),
         !str_detect(string = label, pattern = "change|alternate|Alternate|Change"),
         !str_detect(string = label, pattern = " v |Third"),
         !str_detect(string = img_url, pattern = "lithuania|italy|yellow|red"))

head(d1)
##                                               img_url     label
## 1      /international/england/images/england-1872.gif      1872
## 2      /international/england/images/england-1882.gif 1879-1900
## 3      /international/england/images/england-1900.gif 1900-1914
## 4 /international/england/images/england-1920-1932.gif 1920-1930
## 5      /international/england/images/england-1921.gif 1930-1934
## 6      /international/england/images/england-1934.gif      1934
```
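The same kind of label filtering can be sketched with base R’s grepl(); the labels below are made up for illustration, not the scraped data:

```r
labels = c("1872", "1990 away kit", "1966 World Cup Final v Germany", "2010-2012")
# drop away/change/alternate kits and one-off matches, as in the str_detect() filters above
keep = !grepl("away|change|alternate| v |Third", labels, ignore.case = TRUE)
labels[keep]  # "1872" and "2010-2012"
```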

Given these URLs, I then downloaded each of the images, storing them in a single R object, kits.

```r
library(magick)
kits <- d1 %>%
  mutate(img_url = paste0("http://www.historicalkits.co.uk", img_url),
         img_url = str_replace(string = img_url, pattern = " ", replacement = "%20")) %>%
  select(img_url) %>%
  map(image_read) %>%
  set_names("img")
```

Typing kits into R will display each kit in the RStudio viewer (it will quickly run through each image). The console displays summary information for each image in the kits object.

```r
> kits
$img
   format width height colorspace filesize
1     GIF   170    338       sRGB        0
2     GIF   170    338       sRGB        0
3     GIF   170    338       sRGB        0
4     GIF   170    338       sRGB        0
5     GIF   170    338       sRGB        0
6     GIF   170    338       sRGB        0
7     GIF   170    338       sRGB        0
8     GIF   170    338       sRGB        0
9     GIF   170    338       sRGB        0
10    GIF   170    338       sRGB        0
```

Annotating Images

Before creating any GIF I wanted to add annotations for the year and the copyright information. To do this I first created a border using image_border() in magick, and then image_annotate() to add the text. I wrapped these edits into an add_text() function and then applied it to each of the kit images.

```r
add_text <- function(img, label){
  img %>%
    image_border(geometry = "10x60", color = "white") %>%
    image_chop("0x45") %>%
    image_annotate(text = label, gravity = "north") %>%
    image_annotate(
      text = "Animation by @guyabelguyabel",
      gravity = "south", location = "+0+45"
    ) %>%
    image_annotate(
      text = "Images are Copyright of Historical\nFootball Kits and reproduced by\nkind permission.",
      gravity = "south"
    )
}

for(i in 1:length(kits$img)){
  kits$img[i] <- add_text(img = kits$img[i], label = d1$label[i])
}
```

Creating a GIF

The final step was to bind together the set of images in an animated GIF, with smooth transition images between each frame. To do this I used image_morph() twice. First, to repeat the same image, so that the GIF would remain stable for a few frames (kits_morph1 below). Then again, to create a set of morphing images between successive kits (kits_morph0 below). The full set of frames was stored in kits_ani.

```r
kits_ani <- image_morph(c(kits$img[1], kits$img[1]), frames = 5)
for(i in 2:length(kits$img)){
  kits_morph0 <- image_morph(c(kits$img[i-1], kits$img[i]), frames = 5)
  kits_morph1 <- image_morph(c(kits$img[i], kits$img[i]), frames = 5)
  kits_ani <- c(kits_ani, kits_morph0)
  kits_ani <- c(kits_ani, kits_morph1)
}
```

To create the animation I passed the set of frames in the kits_ani object to the image_animate() and image_write() functions, to give the image above.

```r
kits_ani %>%
  image_animate(fps = 10) %>%
  image_write(path = "england.gif")
```

Club Teams

Similar code to the above can be used to create images for club teams. I tried this out for the mighty Reading. As the Reading kits on Historical Football Kits are on only one page, and include only home kits, finding the image URLs was much easier…

```r
d1 <- read_html("http://www.historicalkits.co.uk/Reading/Reading.htm") %>%
  scrape_img_url() %>%
  filter(str_detect(string = img_url, pattern = "/Reading"),
         !str_detect(string = img_url, pattern = "unknown")) %>%
  mutate(
    label = str_replace_all(string = label, pattern = "[:alpha:]|\\s", replacement = "")
  )
```

I could then run the same code as above to scrape the images, annotate the year and copyright information and build the GIF.

Ian Holloway – we had hoops first.

I did have trouble creating GIFs when I used more frames to morph images between successive kits, for example when using frames = 10. I could not consistently replicate the error, but I suspect it is related to memory: my R session would freeze when calling image_animate() or image_write() on the kits_ani object when it contained a large number of images.



### Classification from scratch, penalized Lasso logistic 5/8

Mon, 06/04/2018 - 16:50

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

Fifth post of our series on classification from scratch. Following the previous post on penalization using the \(\ell_2\) norm (the so-called Ridge regression), this time we will discuss penalization based on the \(\ell_1\) norm (the so-called Lasso regression).

First of all, one should admit that if the name stands for least absolute shrinkage and selection operator, that’s actually a very cool name… Funny story: a few years before, Leo Breiman introduced the concept of the garrote technique… “The garrote eliminates some variables, shrinks others, and is relatively stable”.

I guess that, somehow, the lasso is an extension of the garrote technique,

Normalization of the covariates

As previously, the first step will be to consider linear transformations of all covariates to get centered and scaled variables (with unit variance)
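Such a transformation produces variables with zero mean and unit variance; a toy check on a made-up vector, before applying it to the myocarde data:

```r
standardize = function(x) (x - mean(x)) / sd(x)
z = standardize(c(2, 4, 6, 8))
c(mean(z), sd(z))  # 0 and 1
```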

```r
y = myocarde$PRONO
X = myocarde[,1:7]
for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
X = as.matrix(X)
```

Lasso Regression (from scratch)

The heuristics about Lasso regression can be seen on the following graph. In the background, we can visualize the (two-dimensional) log-likelihood of the logistic regression, and the blue square is the constraint we have, if we rewrite the optimization problem as a constrained optimization problem:\[\max_{\boldsymbol{\beta}:\|\boldsymbol{\beta}\|_{\ell_1}\leq s}\log\mathcal{L}(\boldsymbol{\beta})\]
```r
LogLik = function(bbeta){
  b0 = bbeta[1]
  beta = bbeta[-1]
  sum(-y*log(1 + exp(-(b0+X%*%beta))) - (1-y)*log(1 + exp(b0+X%*%beta)))}
u = seq(-4,4,length=251)
v = outer(u,u,function(x,y) LogLik(c(1,x,y)))
image(u,u,v,col=rev(heat.colors(25)))
contour(u,u,v,add=TRUE)
polygon(c(-1,0,1,0),c(0,1,0,-1),border="blue")
```
The nice thing here is that it works as a variable selection tool, since some components of the estimate can be null. That’s the idea behind the following (popular) graph (with the lasso on the left, and the ridge on the right).

Heuristically, the maths explanation is the following. Consider a simple regression \(y_i=x_i\beta+\varepsilon_i\), with an \(\ell_1\)-penalty and an \(\ell_2\)-loss function. The optimization problem becomes\[\min\Big\lbrace\sum_{i}(y_i-x_i\beta)^2+\lambda|\beta|\Big\rbrace\]The first order condition can be written\[-2\sum_i x_i(y_i-x_i\beta)\pm\lambda=0\](the sign in \(\pm\) being the sign of \(\beta\)). Assume that the least-squares estimate is positive; then the solution is\[\widehat{\beta}^{\text{lasso}}=\max\left\lbrace\widehat{\beta}^{\text{ols}}-\frac{\lambda}{2\sum_i x_i^2},\,0\right\rbrace\]and we get a corner solution, \(\widehat{\beta}^{\text{lasso}}=0\), when \(\lambda\) is large.
Optimization routine

As in our previous post, let us start with standard (R) optimization routines, such as BFGS,
```r
PennegLogLik = function(bbeta,lambda=0){
  b0 = bbeta[1]
  beta = bbeta[-1]
  -sum(-y*log(1 + exp(-(b0+X%*%beta))) - (1-y)*log(1 + exp(b0+X%*%beta))) +
    lambda*sum(abs(beta))
}
opt_lasso = function(lambda){
  beta_init = lm(PRONO~.,data=myocarde)$coefficients
  logistic_opt = optim(par = beta_init*0, function(x) PennegLogLik(x,lambda),
                       hessian=TRUE, method = "BFGS", control=list(abstol=1e-9))
  logistic_opt$par[-1]
}
v_lambda = c(exp(seq(-4,2,length=61)))
est_lasso = Vectorize(opt_lasso)(v_lambda)
library("RColorBrewer")
colrs = brewer.pal(7,"Set1")
plot(v_lambda,est_lasso[1,],col=colrs[1],type="l")
for(i in 2:7) lines(v_lambda,est_lasso[i,],col=colrs[i],lwd=2)
```
But it is very erratic, or unstable.

Using glmnet

Just to compare, with the R routines dedicated to the lasso, we get the following,
```r
library(glmnet)
glm_lasso = glmnet(X, y, alpha=1)
plot(glm_lasso,xvar="lambda",col=colrs,lwd=2)
```
```r
plot(glm_lasso,col=colrs,lwd=2)
```
If we look carefully at what’s in the output, we can see that there is variable selection, in the sense that some \(\widehat{\beta}_{j,\lambda}=0\), in the sense “really null”,
```r
glmnet(X, y, alpha=1,lambda=exp(-4))$beta
7x1 sparse Matrix of class "dgCMatrix"
               s0
FRCAR  .         
INCAR  0.11005070
INSYS  0.03231929
PRDIA  .         
PAPUL  .         
PVENT -0.03138089
REPUL -0.20962611
```

Of course, with our optimization routine, we cannot expect to have null values,

```r
opt_lasso(.2)
        FRCAR         INCAR         INSYS         PRDIA
 0.4810999782  0.0002813658  1.9117847987 -0.3873926427
        PAPUL         PVENT         REPUL
-0.0863050787 -0.4144139379 -1.3849264055
```

So clearly, it will be necessary to spend more time today, to understand how it works…

Orthogonal covariates

Before getting into the maths, observe that when covariates are orthogonal, there is some very clear variable selection process,

```r
library(factoextra)
pca = princomp(X)
pca_X = get_pca_ind(pca)$coord
glm_lasso = glmnet(pca_X, y, alpha=1)
plot(glm_lasso,xvar="lambda",col=colrs)
plot(glm_lasso,col=colrs)
```

Interior Point approach

The penalty is now expressed using the \(\ell_1\) norm, so intuitively, it should be possible to consider algorithms related to linear programming. That was actually suggested in Koh, Kim & Boyd (2007), with some implementation in matlab; see http://web.stanford.edu/~boyd/l1_logreg/. If I can find some time later on, maybe I will try to recode it. But actually, it is not the technique used in most R functions. Now, to be honest, we face a double challenge today: the first one is to understand how the lasso works for the “standard” (least squares) problem; the second one is to see how to adapt it to the logistic case.

Standard lasso (with weights)

If we get back to the original lasso approach, the goal was to solve\[\min\left\lbrace\frac{1}{2n}\sum_{i=1}^n\big(y_i-(\beta_0+\mathbf{x}_i^T\boldsymbol{\beta})\big)^2+\lambda\sum_{j=1}^p|\beta_j|\right\rbrace\](with standard notations, as in wikipedia or Jocelyn Chi’s post – most of the code in this section is inspired by Jocelyn’s great post). Observe that the intercept is not subject to the penalty. Since we cannot differentiate the absolute value (to find points where the gradient is \(\mathbf{0}\)), we check instead whether \(\mathbf{0}\) belongs to the subdifferential at the minimum, i.e. whether the KKT conditions are satisfied. A classical trick is to write each coefficient as the difference of its positive and negative parts, \(\beta_j=\beta_j^+-\beta_j^-\) with \(\beta_j^{\pm}=\max\{\pm\beta_j,0\}\), so that \(|\beta_j|=\beta_j^++\beta_j^-\). The lasso problem then becomes a smooth problem with the constraints \(\beta_j^+\geq 0\) and \(\beta_j^-\geq 0\). Let \(\mu_j^+\) and \(\mu_j^-\) denote the Lagrange multipliers for those constraints, respectively. To satisfy the stationarity condition, we take the gradient of the Lagrangian with respect to \(\boldsymbol{\beta}^+\) and set it to zero, and we do the same with respect to \(\boldsymbol{\beta}^-\). As discussed in Jocelyn Chi’s post, primal feasibility requires that the primal constraints be satisfied, so \(\beta_j^+\geq 0\) and \(\beta_j^-\geq 0\). Then dual feasibility requires non-negativity of the Lagrange multipliers, so \(\mu_j^+\geq 0\) and \(\mu_j^-\geq 0\).
And finally, complementary slackness requires \(\mu_j^+\beta_j^+=0\) and \(\mu_j^-\beta_j^-=0\). These conditions can be simplified into a simple set of rules for checking whether or not our solution is a minimum; working through the three cases (\(\beta_j>0\), \(\beta_j<0\) and \(\beta_j=0\)) yields a closed-form, coordinate-wise update based on the soft-thresholding operator\[S_\lambda(z)=\operatorname{sign}(z)\,(|z|-\lambda)_+=\begin{cases}z-\lambda & \text{if }z>\lambda\\ 0 & \text{if }|z|\leq\lambda\\ z+\lambda & \text{if }z<-\lambda\end{cases}\]namely, with \(\mathbf{r}_j=\mathbf{y}-\beta_0\mathbf{1}-\mathbf{X}_{-j}\boldsymbol{\beta}_{-j}\) denoting the partial residual,\[\widehat{\beta}_j=\frac{S_{n\lambda}\big(\mathbf{x}_j^T\mathbf{r}_j\big)}{\|\mathbf{x}_j\|^2}\]which is a coordinate-wise update. Now, if we consider a (slightly) more general problem, with weights \(\omega_i\) in the first part of the objective, the coordinate-wise update becomes\[\widehat{\beta}_j=\frac{S_{\lambda'}\big(\sum_i\omega_i x_{i,j}r_{i,j}\big)}{\sum_i\omega_i x_{i,j}^2}\](for a suitably rescaled threshold \(\lambda'\)). The code to compute this soft-thresholding operator is
```r
soft_thresholding = function(x,a){
  result = numeric(length(x))
  result[which(x > a)] = x[which(x > a)] - a
  result[which(x < -a)] = x[which(x < -a)] + a
  return(result)}
```
and the code for the coordinate descent is
```r
lasso_coord_desc = function(X,y,beta,lambda,tol=1e-6,maxiter=1000){
  beta = as.matrix(beta)
  X = as.matrix(X)
  omega = rep(1/length(y),length(y))
  obj = numeric(length=(maxiter+1))
  betalist = list(length(maxiter+1))
  betalist[[1]] = beta
  beta0list = numeric(length(maxiter+1))
  beta0 = sum(y-X%*%beta)/(length(y))
  beta0list[1] = beta0
  for (j in 1:maxiter){
    for (k in 1:length(beta)){
      r = y - X[,-k]%*%beta[-k] - beta0*rep(1,length(y))
      beta[k] = (1/sum(omega*X[,k]^2))*soft_thresholding(t(omega*r)%*%X[,k],length(y)*lambda)
    }
    beta0 = sum(y-X%*%beta)/(length(y))
    beta0list[j+1] = beta0
    betalist[[j+1]] = beta
    obj[j] = (1/2)*(1/length(y))*norm(omega*(y - X%*%beta - beta0*rep(1,length(y))),'F')^2 +
      lambda*sum(abs(beta))
    if (norm(rbind(beta0list[j],betalist[[j]]) - rbind(beta0,beta),'F') < tol) { break }
  }
  return(list(obj=obj[1:j],beta=beta,intercept=beta0)) }
```
Let’s keep that one warm, and get back to our initial problem.

The lasso logistic regression

The trick here is that the logistic problem can be formulated as a quadratic programming problem. Recall that the log-likelihood is here\[\log\mathcal{L}=\sum_{i=1}^n y_i(\beta_0+\mathbf{x}_i^T\boldsymbol{\beta})-\log\big(1+e^{\beta_0+\mathbf{x}_i^T\boldsymbol{\beta}}\big)\]which is a concave function of the parameters. Hence, one can use a quadratic approximation of the log-likelihood (using a Taylor expansion),\[\log\mathcal{L}\approx-\frac{1}{2n}\sum_{i=1}^n\omega_i\big(z_i-\beta_0-\mathbf{x}_i^T\boldsymbol{\beta}\big)^2\]where \(z_i\) is the working response\[z_i=\beta_0+\mathbf{x}_i^T\boldsymbol{\beta}+\frac{y_i-p_i}{p_i(1-p_i)}\]where \(p_i\) is the prediction, and where the \(\omega_i=p_i(1-p_i)\) are weights. Thus, we obtain a penalized least-squares problem. And we can use what was done previously,
```r
lasso_coord_desc = function(X,y,beta,lambda,tol=1e-6,maxiter=1000){
  beta = as.matrix(beta)
  X = as.matrix(X)
  obj = numeric(length=(maxiter+1))
  betalist = list(length(maxiter+1))
  betalist[[1]] = beta
  beta0 = sum(y-X%*%beta)/(length(y))
  p = exp(beta0*rep(1,length(y)) + X%*%beta)/(1+exp(beta0*rep(1,length(y)) + X%*%beta))
  z = beta0*rep(1,length(y)) + X%*%beta + (y-p)/(p*(1-p))
  omega = p*(1-p)/(sum((p*(1-p))))
  beta0list = numeric(length(maxiter+1))
  beta0 = sum(y-X%*%beta)/(length(y))
  beta0list[1] = beta0
  for (j in 1:maxiter){
    for (k in 1:length(beta)){
      r = z - X[,-k]%*%beta[-k] - beta0*rep(1,length(y))
      beta[k] = (1/sum(omega*X[,k]^2))*soft_thresholding(t(omega*r)%*%X[,k],length(y)*lambda)
    }
    beta0 = sum(y-X%*%beta)/(length(y))
    beta0list[j+1] = beta0
    betalist[[j+1]] = beta
    obj[j] = (1/2)*(1/length(y))*norm(omega*(z - X%*%beta - beta0*rep(1,length(y))),'F')^2 +
      lambda*sum(abs(beta))
    p = exp(beta0*rep(1,length(y)) + X%*%beta)/(1+exp(beta0*rep(1,length(y)) + X%*%beta))
    z = beta0*rep(1,length(y)) + X%*%beta + (y-p)/(p*(1-p))
    omega = p*(1-p)/(sum((p*(1-p))))
    if (norm(rbind(beta0list[j],betalist[[j]]) - rbind(beta0,beta),'F') < tol) { break }
  }
  return(list(obj=obj[1:j],beta=beta,intercept=beta0)) }
```
It looks like what we can get when
calling glmnet… and here, we do have null components for some large enough ! Really null… and that’s cool actually. Application on our second dataset Consider now the second dataset, with two covariates. The code to get lasso estimates is 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 df0 = df df0$y = as.numeric(df$y)-1 plot_lambda = function(lambda){ m = apply(df0,2,mean) s = apply(df0,2,sd) for(j in 1:2) df0[,j] &lt;- (df0[,j]-m[j])/s[j] reg = glmnet(cbind(df0$x1,df0$x2), df0$y==1, alpha=1,lambda=lambda) u = seq(0,1,length=101) p = function(x,y){ xt = (x-m[1])/s[1] yt = (y-m[2])/s[2] predict(reg,newx=cbind(x1=xt,x2=yt),type="response")} v = outer(u,u,p) image(u,u,v,col=clr10,breaks=(0:10)/10) points(df$x1,df$x2,pch=19,cex=1.5,col="white") points(df$x1,df$x2,pch=c(1,19)[1+z],cex=1.5) contour(u,u,v,levels = .5,add=TRUE)}

Consider some small values of lambda, so that we only have some sort of shrinkage of the parameters,

reg = glmnet(cbind(df0$x1,df0$x2), df0$y==1, alpha=1)
par(mfrow=c(1,2))
plot(reg,xvar="lambda",col=c("blue","red"),lwd=2)
abline(v=exp(-2.8))
plot_lambda(exp(-2.8))

But with a larger lambda, there is variable selection: here

reg = glmnet(cbind(df0$x1,df0$x2), df0$y==1, alpha=1)
par(mfrow=c(1,2))
plot(reg,xvar="lambda",col=c("blue","red"),lwd=2)
abline(v=exp(-2.1))
plot_lambda(exp(-2.1))
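The coordinate-wise soft-thresholding update behind these fits is not R-specific. As a cross-check, here is a minimal Python sketch (NumPy only; unweighted case, all function and variable names are mine, not from the post) of the same descent. It reproduces the qualitative behaviour above: a small penalty shrinks coefficients, a large one sets some exactly to zero.

```python
import numpy as np

def soft_thresholding(x, a):
    # S_a(x): shrink toward zero by a, exactly zero on [-a, a]
    return np.sign(x) * np.maximum(np.abs(x) - a, 0.0)

def lasso_coord_desc(X, y, lam, n_iter=200):
    # coordinate descent for (1/2n)||y - b0 - X b||^2 + lam * ||b||_1
    n, k = X.shape
    beta = np.zeros(k)
    beta0 = y.mean()
    for _ in range(n_iter):
        for j in range(k):
            # partial residual: leave coordinate j out
            r = y - beta0 - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_thresholding(X[:, j] @ r / n, lam) / (X[:, j] @ X[:, j] / n)
        beta0 = (y - X @ beta).mean()
    return beta0, beta
```

On simulated data with a single informative covariate, a moderate penalty drives the uninformative coefficients exactly to zero, which is the variable selection seen in the glmnet plots.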


To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Interactive RTutor Problemsets via RStudio Cloud

Mon, 06/04/2018 - 11:00

(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

I just learned about RStudio Cloud (see https://rstudio.cloud/), which lets you run RStudio instances from your browser. Moreover, you can set up RStudio projects that other users can simply copy and use themselves. RStudio Cloud is still in alpha, and currently one can register to use it for free.

I have set up a project that allows you to simply test and create an interactive RTutor problem set. Just click on the following link:

https://rstudio.cloud/project/39040

You are first asked to sign up to RStudio Cloud or to sign in via your Google or GitHub account. Then you see a running instance of RStudio in your browser. You can look at the README.md file or directly open the file Intro.Rmd from the file pane to solve an RMarkdown-based interactive problem set directly in RStudio.

This problem set is a simple introduction to R that is typically the first problem set in my courses at Ulm University before additional problem sets focus on the topics of the course.

The RMarkdown file Intro.Rmd contains several chunks, in which the student shall enter some code. One can check the solution via Addins -> Check Problemset and get hints for the current chunk by typing hint() in the R console. One also sees how many exercises have already been solved by typing stats() and can earn awards that are shown by awards().

Nowadays there are a lot of great ways to learn R in such an interactive fashion, like the web-based DataCamp courses or Swirl, which is mainly based on direct input into the R console. Yet, to the best of my knowledge, RTutor is still unique in allowing students to work interactively through RMarkdown files directly in RStudio. Hence, students can learn in the same environment that they will probably use later in their own data science projects.

The file Intro_sol.Rmd in the subdirectory ps_source contains the source code used to generate this problem set.

This is basically a sample solution of the problem set. RTutor is able to automatically generate hints and tests of the student’s input from such a sample solution. It is also possible to customize hints and specify options for how the student’s solution shall be checked. You can modify Intro_sol.Rmd to generate your own RTutor problem set, which you can host on your modified copy of my RStudio Cloud instance. This vignette offers more details on how to create RTutor problem sets.

RTutor also offers the opportunity to show and solve problem sets in shiny based format in the browser, as in the following screenshot:

You can use the shiny based version also from the RStudio Cloud instance by simply typing in the R console:

library(RTutor) show.ps("Intro", user.name="TestUser")

Several of my students at Ulm University have created very nice, shiny-based RTutor problem sets based on empirical research articles in economics. The problem sets are hosted on shinyapps.io and GitHub. Short descriptions of some of these problem sets are given on my blog
http://skranz.github.io and a list of these problem sets is on the RTutor GitHub page: https://github.com/skranz/RTutor

In my courses, I let students solve RTutor problem sets in the RMarkdown format, however. After having solved a problem set, a student can create a submission file with the command make.submission() and upload it to the Moodle page of my course. At the end of the course, I download a big ZIP file of all submissions and automatically grade the problem sets based on the R code here: https://github.com/skranz/RTutor/blob/master/grading/grade_sub.r.

Currently, in my courses all students install RStudio and RTutor locally on their computers. This works, but sometimes it can be a bit time consuming to get the software working on every computer, given heterogeneous operating systems and the fact that not every student is very IT savvy. RStudio Cloud looks like a great alternative if one wants to avoid a local installation for each student. Of course, whether that is a viable long-run option will depend on the pricing system RStudio chooses after the test phase and whether you have an available budget for it.

By the way, if you have generated your own RTutor problem set and would like to share it, you can send me an email (email address is on my website) with a link to your Github repository or RStudio Cloud project. It would be great to have a list of external RTutor problem sets.


To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts.

### Hello, Dorling! (Creating Dorling Cartograms from R Spatial Objects + Introducing Prism Skeleton)

Mon, 06/04/2018 - 04:42

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

NOTE: There is some iframed content in this post and you can bust out of it if you want to see the document in a full browser window.

Also, apologies for some lingering GitHub links. I’m waiting for all the repos to import into to other services and haven’t had time to setup my own self-hosted public instance of any community-usable git-ish environment yet.

And So It Begins

After seeing Fira Sans in action in presentations at eRum 2018 I felt compelled to add hrbrthemes support for it, so I made a firasans extension that uses the Fira Sans Condensed and Fira Code fonts for ggplot2 graphics.

But I really wanted to go the extra mile and make an R Markdown theme for it; I’m wary of both jQuery & Bootstrap, and prefer Prism over HighlightJS. So I started work on “Prism Skeleton”, an R Markdown template that has most of the features you would expect and some new ones, plus uses Prism and Fira Sans/Code. You can try it out on your own if you use markdowntemplates, but the “production” version is likely going to eventually go into the firasans package. I use markdowntemplates as a playground for R Markdown experiments.

The source for the iframe at the end of this document is here: https://rud.is/dl/hello-dorling.Rmd. There are some notable features (I’ll repeat a few from above):

• Fira Sans for headers and text
• Fira Code for all monospaced content (including source code)
• No jQuery
• No Bootstrap (it uses the ‘Skeleton’ CSS framework)
• No HighlightJS (it uses the ‘Prism’ highlighter)
• Extended YAML parameters (more on that in a bit)
• Defaults to fig.retina=2 and the use of optipng or pngquant for PNG compression (so it expects them to be installed — ref this post by Zev Ross for more info and additional image use tips)

Oh, yes. You can read the iframe or busted out document for that bit. It’s a small package to make it easier to create Dorling cartograms based on previous work by @datagistips.

“You said something about ‘extended YAML’?”

Aye. Here’s the YAML excerpt from the Dorling Rmd:

---
title: "Hello, Dorling! (Creating Dorling Cartograms from R Spatial Objects)"
author: "boB Rudis"
navlink: "[rud.is](https://rud.is/b/)"
og:
  type: "article"
  title: "Hello, Dorling! (Creating Dorling Cartograms from R Spatial Objects)"
  url: "https://github.com/hrbrmstr/spdorling"
footer:
  - content: '[GitLab](https://gitlab.com/hrbrmstr)'
  - content: 'This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.'
date: "`r Sys.Date()`"
output: markdowntemplates::prismskel
---

The title, author & date should be familiar fields but the author and date get some different placement since the goal is more of a flowing document than academic report.

If navlink is present (it’s not required) there will be a static bar at the top of the HTML document with a link on the right (any content, really, but a link is what’s in the example). Remove navlink and no bar will be there.

The og section is for open graph tags and you customize them how you like. Open graph tags make it easier to share posts on social media or even Slack since they’ll auto-expand various content bits.

There’s also a custom footer (exclude it if you don’t want one) that can take multiple content sub-elements.

The goal isn’t so much to give you a 100% usable R Markdown template but something you can clone and customize for your own use. Since this example shows how to use custom fonts and a different code highlighter (which meant using some custom knitr hooks), it should be easier to customize than some of the other ones in the template playground package. FWIW I plan on adapting this for a work template this week.

The other big customization is the use of Prism with a dark theme. Again, you can clone + customize this at-will but I may add config options for all Prism themes at some point (mostly if there is interest).

FIN

(Well, almost fin)

Kick the tyres on both the new template and the new package and drop suggestions here for the time being (until I get fully transitioned to a new git-hosting platform). One TODO for spdorling is to increase the point count for the circle polygons but I’m sure folks can come up with enhancement requests to the API after y’all have played with it for a while.

As noted a few times, the Rmd example with the Dorling cartograms is below.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

### Deep Catalan roots: playing with stringdist

Mon, 06/04/2018 - 02:00

(This article was first published on Ilya Kashnitsky, and kindly contributed to R-bloggers)

Preambule

This academic year I am participating in European Doctoral School of Demography. It is a unique one-year-long training for PhD students in demography. It keeps migrating across European research centers; this year Jim Vaupel’s research group in Odense hosts the program. Best demographers visit us to give us various one-week-long courses. Here we are, the EDSD cohort 2017/18:

Back in February (yes, I know, that was a quarter of a year ago; EDSD is quite dense), Francisco Villavicencio gave us a fascinating lecture on probabilistic string matching. Pancho used the approach in one of his PhD papers, in which he squeezed some demographic knowledge from scarce Catalan data on marriages that happened back in the 16-17 centuries. Each marriage record contains the name of the bride, the name and surname of the groom, and the same data for both pairs of their parents. Having data for a period spanning several adjacent generations, Pancho linked kids to their parents, thus getting an interval-censored dataset on demographic events of Renaissance Catalans. From this data he managed to estimate the life expectancy of that population! Just make sure you check the paper if you want more details:

Villavicencio, F., Jordà, J. P., & Pujadas-Mora, J. M. (2015). Reconstructing lifespans through historical marriage records of Barcelona from the sixteenth and seventeenth centuries. In Population reconstruction (pp. 199–216). Springer.

Our assignment was to play with a sample of the dataset and link daughters, who got married in 1597–1629, to their possible parents, who got married in 1573–1617, using string distance metrics. But that’s not what I am about to show you.

A side-walk solution

While doing the assignment, I decided to check if there lived “re-incarnations” of my group-mates in Barcelona back in the day. Instead of linking daughters to parents I linked us, 20 young demographers from various countries, to Catalans from 16-17 century. Here is how I handed in the assignment =)

So, let’s dive into string distances and probabilistic string matching.

Important note! For the sake of the post’s readability, this time I chose not to insert code chunks in the post. Instead, you are welcome to explore the github repo that replicates all the analyses presented here. Since I cannot openly publish a big chunk of The Barcelona Historical Marriage Database, the github repo only contains a minimal indicative data sample – the 10 best matching records for each of the groupmates’ names.

Exploring the methods

There are multiple ways to calculate string distances. The most popular are implemented in the stringdist R package. I’m not going to present the methods themselves; just check the stringdist documentation if interested. All of the methods are pretty costly computationally. Even when each particular comparison happens quite fast, the problem is that for each person for whom a match is being searched, we need to run the comparison with each of the records in the other dataset. Of course, when you want to use several methods and average the result of their “voting”, it takes even more time. Thus, a good idea is to narrow the list of candidates for matching as much as possible.

So, first I decided to compare the speed of these methods and choose the fastest for the first step comparison. Here is what I got calculating the distance between words “demography” and “democracy”.

As we see, there is quite some variation in the speed of the different methods, and sometimes an iteration takes much longer than usual. The fastest method is “jw”, the Jaro-Winkler distance. I use it for filtering out the pairs of candidate names that definitely have nothing in common. The choice of the cut-off threshold is, of course, arbitrary and based on empirical tests. The trade-off here is, on the one hand, to narrow down the list of candidates for a match but, on the other hand, not to throw away a possible match based on just one distance measure.
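The two-stage idea (a cheap filter first, a finer ranking afterwards) is easy to sketch outside R as well. Here is a minimal Python version using only the standard library’s difflib (not Jaro-Winkler, and the function names, threshold, and candidate names are mine, purely for illustration):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # cheap normalized similarity in [0, 1]; good enough for a first pass
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_matches(name, candidates, keep=3, floor=0.4):
    # stage 1: score everything and drop candidates below a crude threshold
    scored = [(c, similarity(name, c)) for c in candidates]
    scored = [sc for sc in scored if sc[1] >= floor]
    # stage 2: rank the survivors and keep only the top few
    return sorted(scored, key=lambda sc: -sc[1])[:keep]
```

In a real pipeline, stage 2 would recompute several more expensive distances on the survivors and average their “votes”, as described above.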

Finally, let’s find our Catalan “ancestors”

Instead of choosing some cut-off value I ranked the results of the “jw” distance and kept the 10 best-fitting results for each of my fellows’ names. For these pre-selected candidates I also calculated several other distance measures: “osa”, “lcs”, “qgram”, “cosine”, and “jaccard”. In the final tables I show only the 3 best-matching results, ranked by the geometric average of all 6 calculated distances taken with equal weights.

Here is the table for guys:

My personal favorite here is Hanbo Wo becoming Antonio Duc.

And a similar table for girls:

Here I like Elena Bastianelli turning to Elena Albanell.

Conclusions

There are 20 people in our EDSD group, 10 guys and 10 girls. We come from different countries: Germany, Italy, China, United States, Russia, Spain, Mexico, Bosnia, Poland, Hungary, Estonia, and Denmark. Our names are quite different from those of 16-17 century Catalans. Still, using string distance matching we can find similarly named persons in the historical dataset. This fun example just exposes the power of the formal text-mining approach.

Replication materials at github

To leave a comment for the author, please follow the link and comment on their blog: Ilya Kashnitsky.

### New R package xplain: Providing interactive interpretations and explanations of statistical results

Sun, 06/03/2018 - 23:10

(This article was first published on Topics in R, and kindly contributed to R-bloggers)

The package xplain is designed to help users interpret the results of their statistical analyses.

It does so not in an abstract way as textbooks do. Textbooks do not help the user of a statistical method understand his findings directly. What does a result of 3.14 actually mean? This is often hard to answer with a textbook alone because the book may provide its own examples but cannot refer to the specifics of the user’s case. However, as we all know, we understand things best when they are explained to us with reference to the actual problem we are working on. xplain is made to fill this gap that textbooks (and other learning materials) leave.

The basic idea behind xplain is simple: Package authors or other people interested in explaining statistics provide interpretation information for a statistical method (i.e. an R function) in the format of an XML file. With a simple syntax this interpretation information can reference the results of the user’s call of the explained R function. At runtime, xplain then provides the user with textual interpretation that really relates to his/her case.
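The mechanism described above (a template whose placeholders are filled with the user’s actual results at runtime) can be sketched in a few lines. This is a Python illustration of the idea only, not xplain’s actual XML syntax; the placeholder name and the wording are invented:

```python
from string import Template

# a stand-in for an xplain-style interpretation template: the $r
# placeholder will be filled with the user's concrete result
template = Template(
    "The correlation between your variables is $r. "
    "Values above 0.7 usually indicate a strong linear relationship."
)

def explain(results, template):
    # render the interpretation against this user's own numbers
    return template.substitute({k: round(v, 2) for k, v in results.items()})
```

Called with `explain({"r": 0.83}, template)`, this yields text that refers to the user’s specific case rather than a textbook example, which is precisely the gap xplain aims to fill.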

Providing xplain interpretation information can be interesting for:

• R package authors who implement a statistical method
• statisticians who develop statistical methods themselves
• college and university teachers who want to make their teaching content more accessible for their students
• everybody who enjoys teaching and explaining statistics and thinks he/she has something to contribute

xplain offers support for interpretation information in different languages and on different levels of difficulty.

Read the xplain web tutorial to learn everything about how to use xplain: http://www.zuckarelli.de/xplain/index.html.

More resources on xplain:


To leave a comment for the author, please follow the link and comment on their blog: Topics in R.

### Even Simpler SQL

Sun, 06/03/2018 - 01:00

(This article was first published on HighlandR, and kindly contributed to R-bloggers)

I’ve had some feedback on the last post, and rather than repeat the same thing multiple times, I’m going all @drob, and writing this instead..

When I tweeted out the link to my post I gave it the tag line “why I’d rather write dplyr than SQL”.

What I couldn’t fit in to the tweet was that this was based on the caveat that some of the SQL I have had to write has been incredibly complicated by the age / version / lack of functionality of the SQL database I was using, and the nature of the task at hand.

In those situations, being able to write dplyr to manipulate my data would have made my life a lot easier.

However, I am not against SQL.

Far from it, I love working with SQL and writing complex queries.

The more you learn, the more you understand what can be done with SQL, and it’s incredibly powerful.

But – there are definitely times when you think, “this would be a lot easier in R”.

TL / DR :

SQL is great, and you should definitely learn it

AND

Dplyr is great, and you should definitely learn it.

Then you can decide which is best for the situation you are currently facing.
In real life you wouldn’t need a hugely powerful database to wrangle 684 rows, and my main reason for using dplyr was that it was a small dataset and the resulting table was going to be passed to ggplot2 for plotting.

Less code, same results

I realised the SQL code I demonstrated for the final query was a bit convoluted, mainly because I wanted people who are new to it to be able to follow the code ( which hopefully they did).

However that final query could have been a lot more succinct.
Here was the first part:

SELECT *,
       ROW_NUMBER() OVER (PARTITION BY IN_OUT, Movement_Type, Staging_Post, Movement15
                          ORDER BY (MovementDateTime)) * [counter] AS Movement_15_SEQNO
FROM (
    SELECT [MovementDateTime],
           [FirstName],
           [LastName],
           [Ward_Dept],
           [Staging_Post],
           [Movement_Type],
           [IN_OUT],
           cast(round(floor(cast([MovementDateTime] AS float(53))*24*4)/(24*4),5) AS smalldatetime) AS Movement15,
           (CASE WHEN IN_OUT = 'IN' THEN 1 ELSE -1 END) AS [counter]
    FROM [SERVER].[dbo].[TABLENAME]) x
ORDER BY MovementDateTime

Having specified the necessary columns within that first query, we can simply do a

SELECT *,
       ROW_NUMBER() OVER (PARTITION BY IN_OUT, Movement_Type, Staging_Post, Movement15
                          ORDER BY (MovementDateTime)) * [counter] AS Movement_15_SEQNO

to add in the new column.

This time round, we are creating the row number column and multiplying it by the counter field in 1 step.

So this gives us a much shorter query, (as we are removing 1 level of nesting, and not specifying each column in the subsequent levels).
It runs to only 14 lines, compared to the 37 in the final query last time round.
Here’s the new, final version:

SELECT *,
       ROW_NUMBER() OVER (PARTITION BY IN_OUT, Movement_Type, Staging_Post, Movement15
                          ORDER BY (MovementDateTime)) * [counter] AS Movement_15_SEQNO
FROM (
    SELECT [MovementDateTime],
           [FirstName],
           [LastName],
           [Ward_Dept],
           [Staging_Post],
           [Movement_Type],
           [IN_OUT],
           cast(round(floor(cast([MovementDateTime] AS float(53))*24*4)/(24*4),5) AS smalldatetime) AS Movement15,
           (CASE WHEN IN_OUT = 'IN' THEN 1 ELSE -1 END) AS [counter]
    FROM [SERVER].[dbo].[TABLENAME]) x
ORDER BY MovementDateTime

But wait.

We can also make our dplyr code even simpler.

One of the comments on my last post suggested that we should use if_else(), instead of case_when() for creating the counter field.
And that is a great suggestion, because there are only 2 possible values that the IN_OUT column can have.

In addition, having created the counter field, we can make use of it straight away to create the sequence number within the same pipe.
So our final dplyr code, (which works), looks like this:

plot_data <- data %>%
  mutate(Movement15 = lubridate::floor_date(MovementDateTime, "15 minutes")) %>%
  group_by(IN_OUT, Movement_Type, Staging_Post, Movement15) %>%
  mutate(counter = if_else(IN_OUT == 'IN', 1, -1),
         Movement_15_SEQNO = cumsum(counter)) %>%
  ungroup()

6 lines, compared to 8 in the previous example.

It’s not really THAT big a deal for this example, but it’s as well to be aware that you could simplify further if you wanted to.
As always, get stuff working first, then optimise it as needs be.
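For readers coming from other stacks, the same floor-to-bin, map-to-counter, running-sum pattern translates directly to pandas. A minimal sketch (column names from the post; I have trimmed the grouping to two columns for brevity, and the helper name is mine):

```python
import pandas as pd

def add_seqno(df):
    # pandas analogue of the dplyr pipeline: floor timestamps to
    # 15-minute bins, turn IN/OUT into +1/-1, then a running sum per bin
    out = df.copy()
    out["Movement15"] = out["MovementDateTime"].dt.floor("15min")
    out["counter"] = out["IN_OUT"].map({"IN": 1, "OUT": -1})
    out["Movement_15_SEQNO"] = out.groupby(["IN_OUT", "Movement15"])["counter"].cumsum()
    return out
```

As with the dplyr version, the counter is created and consumed in one pass, and the cumulative sum restarts within each 15-minute bin.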

If you want really concise and powerful R code, which is even more ‘SQL like’, then you should look at data.table.

I haven’t used it a lot, but even with the short amount of time I devoted to it, I found I could write less code and see hugely impressive speed of execution, so if you get to the point where you really want to strip everything down then you will probably end up getting familiar with DT and its syntax.


To leave a comment for the author, please follow the link and comment on their blog: HighlandR.

### Ceteris Paribus Plots – a new DALEX companion

Fri, 06/01/2018 - 19:26

(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)

If you like magical incantations in Data Science, please welcome the Ceteris Paribus Plots. Otherwise feel free to call them What-If Plots.

Ceteris Paribus (Latin for “all else unchanged”) Plots explain complex Machine Learning models around a single observation. They supplement tools like breakDown, Shapley values, LIME or LIVE. In addition to feature importance/feature attribution, now we can see how the model response changes along a specific variable, keeping all other variables unchanged.

How do cancer risk scores change with age? How do credit scores change with salary? How do insurance costs change with age?

Well, use the ceterisParibus package to generate plots like the one below.
Here we have an explanation for a random forest model that predicts apartments prices. Presented profiles are prepared for a single observation marked with dashed lines (130m2 apartment on 3rd floor). From these profiles one can read how the model response is linked with particular variables.
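The profile construction itself is model-agnostic and only needs a predict function. A minimal Python sketch of the idea (the toy pricing model and all names are mine, purely for illustration; the ceterisParibus package does this, and much more, in R):

```python
def ceteris_paribus(predict, x, feature, grid):
    # vary `feature` over `grid` while holding every other input
    # at the values of the single observation x (a plain dict here)
    profile = []
    for v in grid:
        x_mod = dict(x)        # copy, so x itself stays unchanged
        x_mod[feature] = v
        profile.append(predict(x_mod))
    return profile

# toy "model": price grows with surface and floor (made-up formula)
price = lambda d: 2000 * d["surface"] + 5000 * d["floor"]
```

Plotting the returned profile against the grid gives exactly the kind of what-if curve shown above, with the observation’s own value marked on it.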

Instead of original values on the OX scale one can plot quantiles. This way one can put all variables in a single plot.

And once all variables are in the same scale, one can compare two or more models.

Yes, they are model agnostic and will work for any model!
Yes, they can be interactive (see plot_interactive function or examples below)!
And yes, you can use them with other DALEX explainers!
More examples with R code.


To leave a comment for the author, please follow the link and comment on their blog: English – SmarterPoland.pl.

### StatCheck the Game

Fri, 06/01/2018 - 18:29

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you don't get enough joy from publishing scientific papers in your day job, or simply want to experience what it's like to be in a publish-or-perish environment where the P-value is the only important part of a paper, you might want to try StatCheck: the board game where the object is to publish two papers before any of your opponents.

As the game progresses, players combine "Test", "Statistic" and "P-value" cards to form the statistical test featured in the paper (and of course, significant tests are worth more than non-significant ones). Opponents may then have the opportunity to play a "StatCheck" card to challenge the validity of the test, which can then be verified using a companion R package or online Shiny application. Other modifier cards include "Bayes Factor" (which can be used to boost the value of your own papers, or diminish the value of an opponent's), "Post-Hoc Theory" (improving the value of already-published papers), and "Behind the Paywall" (making it more difficult to challenge the validity of your statistics).
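The automated check at the heart of the game is simple to sketch: recompute the p-value from the reported statistic and compare it with the reported p-value. Here is a minimal Python version for the z-statistic case only (the companion R package handles many more test types; the function names and tolerance are mine):

```python
from math import erf, sqrt

def p_from_z(z):
    # two-sided p-value for a reported z statistic,
    # using the standard normal CDF via the error function
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def statcheck(z, reported_p, tol=0.01):
    # does the reported p-value match the recomputed one?
    return abs(p_from_z(z) - reported_p) < tol
```

A reported "z = 1.96, p = .05" passes the check; a reported "z = 1.96, p = .20" would be successfully challenged.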

StatCheck The Game was created by Sacha Epskamp and Adela Isvoranu, who provide all the code to create the cards as open source on GitHub, along with instructions to print and play with your own game materials. You can find everything you need (except the needed 8-sided die and some like-minded friends to play with) at the link below.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

### My book ‘Practical Machine Learning in R and Python: Second edition’ on Amazon

Fri, 06/01/2018 - 10:19

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

The second edition of my book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($10.99) and kindle ($7.99/Rs449) versions.  This second edition includes more content,  extensive comments and formatting for better readability.

In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code.
1. Practical Machine Learning with R and Python: Second Edition – Machine Learning in Stereo (Paperback – $10.99)
2. Practical Machine Learning with R and Python: Second Edition – Machine Learning in Stereo (Kindle – $7.99/Rs449)

This book is ideal both for beginners and for experts in R and/or Python. Those starting their journey into data science and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and deal with equivalent statements in the two languages. Those who are expert in either R or Python will find the equivalent code ideal for brushing up on the other language. And finally, those who are proficient in both languages can use the R and Python implementations to internalize the ML algorithms better.

Here is a look at the topics covered

Preface …………………………………………………………………………….4
Introduction ………………………………………………………………………6
1. Essential R ………………………………………………………………… 8
2. Essential Python for Datascience ……………………………………………57
3. R vs Python …………………………………………………………………81
4. Regression of a continuous variable ……………………………………….101
5. Classification and Cross Validation ………………………………………..121
6. Regression techniques and regularization ………………………………….146
7. SVMs, Decision Trees and Validation curves ………………………………191
8. Splines, GAMs, Random Forests and Boosting ……………………………222
9. PCA, K-Means and Hierarchical Clustering ………………………………258
References ……………………………………………………………………..269

Hope you have a great time learning as I did while implementing these algorithms!



### Praise you like I should: Shiny Appreciation Month

Fri, 06/01/2018 - 10:15

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Back in the summer of 2012 I was meant to be focusing on one thing: finishing my thesis. But, unfortunately for me, a friend and former colleague came back from a conference (JSM) and told me all about a new package that she had seen demoed.

“You should sign up for the beta testing and try it out,” she said.

So, I did.

That package was Shiny and after just a couple of hours of playing around I was hooked. I was desperate to find a way to incorporate it into my thesis, but never managed to; largely due to the fact it wasn’t available on CRAN until a few months after I had submitted and because, at the time, it was quite limited in its functionality. However, I could see the potential – I was really excited about the ways it could be used to make analytics more accessible to non-technical audiences. After joining Mango I quickly became a Shiny advocate, telling everyone who would listen about how great it was.

Six years on at Mango, not a moment goes by when somebody in the team isn’t using Shiny for something. From prototyping to large scale deployments, we live and breathe Shiny. And we are extremely grateful to the team at RStudio—led by Joe Cheng—for the continued effort that they are putting in to its development. It really is a hugely different tool to the package I beta tested so long ago.

As Shiny has developed and the community around it has grown, so too has the need to teach it: more people than ever are looking to become Shiny users. For a number of years we have been teaching the basics of Shiny to those who want to get started, and more serious development tools to those who want to deploy apps in production. But increasingly we have seen a demand for more, and as the Shiny team have added more and more functionality, it was time for a major update to our teaching materials.

Over the past six months we have had many long discussions over what functionality should be included. We have debated best practices, we have drawn on all of our combined experiences of both learning and deploying Shiny, and we eventually reached a consensus over what we felt was best for industry users of Shiny to learn.

We are now really pleased to announce an all new set of Shiny training courses.

Our courses cover everything from taking your first steps in building a Shiny application, to building production-ready applications and a whole host of topics in between. For those who want to take a private course we can tailor to your needs, and topics as diverse as getting the most from tables in DT to managing database access in apps can all be covered in just a few days.

For us, an important element of these courses is that they are all taught by data science consultants who have hands-on experience building and deploying apps for commercial use. These consultants are supported by platform experts who can advise on the best approaches for getting an app out to end users so that you can see the benefits of using Shiny as quickly as possible.

But, one blog post was never going to be enough for all of the Shiny enthusiasts at Mango to share their passion. We needed more time, more than one blog post and more ways to share with the community.

Therefore, Mango are declaring June to be Shiny Appreciation Month!

For the whole of June, we will be talking all things Shiny. Follow us on Twitter where we will be sharing tips, ideas and resources. To get involved, share your own with us and the Shiny community, using #ShinyAppreciation. On the blog we will be sharing, among other things, some of the ways we are using Shiny in industry and some of the technical challenges we have had to overcome.

Watch this space for updates but, for now, if you want to know more about the Shiny training that we offer, take a look at our training pages. If you are based in the UK we will be running public Shiny courses in London (see below for the currently scheduled dates). We will also be offering a snapshot of the materials for intermediate Shiny users at London EARL in September.

Public course dates:

Introduction to Shiny: 17th July
Intermediate Shiny: 18th July, 5th September



### Coloring Sudokus

Fri, 06/01/2018 - 09:00

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Someday you will find me
caught beneath the landslide
(Champagne Supernova, Oasis)

I recently read a book called Snowflake Seashell Star: Colouring Adventures in Numberland by Alex Bellos and Edmund Harris, which is full of mathematical patterns to be coloured. All the images are truly appealing and attract anyone who looks at them, regardless of age, gender, education or political orientation. This book demonstrates how maths can be an astonishing route to beauty.

One of my favourite patterns is the tridoku, a sophisticated coloured version of the sudoku. Coloring a sudoku is simple: once it is solved, it is enough to assign a color to each number (from 1 to 9). If you superimpose three colored sudokus, again using nine colors, such that no cells at the same position share the same color, the resulting image is a tridoku:

There is something attractive in a tridoku due to the balance of colors, but they also seem quite messy: they are charmingly unbalanced. I wrote a script to generalize the concept to n-dokus. The idea is the same: superimpose n sudokus with no cells sharing color and position (I call them disjoint sudokus), using just nine different colors. I didn’t prove it, but I think the maximum number of sudokus that can be superimposed under these constraints is 9. This is the complete series from 1-doku to 9-doku (click on any image to enlarge):

I am a big fan of the colourlovers package. These tridokus are colored with some of my favourite palettes from there:

Just two technical things to highlight:

• There is a package called sudoku that generates sudokus (of course!). I use it to obtain the first solved sudoku, which forms the base.
• Subsequent sudokus are obtained from this one by two operations: first interchanging groups of columns (there are three groups: columns 1 to 3, 4 to 6 and 7 to 9), and then interchanging columns within each group.
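Those two operations can be sketched in a few lines of base R. This is a minimal illustration of the idea only (the 9×9 matrix representation and the function names are mine, not from the original script); both operations preserve the validity of a solved sudoku, which is why the resulting grids can be superimposed:

```r
# Assumed representation: a solved sudoku as a 9x9 integer matrix.
# Permuting the three column groups (cols 1-3, 4-6, 7-9) keeps every row,
# column and 3x3 box valid, as does permuting columns inside one group.
swap_col_groups <- function(s, ord = sample(3)) {
  groups <- list(1:3, 4:6, 7:9)
  s[, unlist(groups[ord])]
}

swap_within_groups <- function(s) {
  idx <- unlist(lapply(list(1:3, 4:6, 7:9), function(g) g[sample(3)]))
  s[, idx]
}
```

Starting from one solved sudoku and repeatedly applying both functions yields further solved grids, to which the digit-to-color mapping can then be applied.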

You can find the code here: do your own colored n-dokus!



### Geocomputation with R: workshop at eRum

Thu, 05/31/2018 - 22:43

(This article was first published on R – jannesm, and kindly contributed to R-bloggers)

This is a post by Robin Lovelace, Jakub Nowosad and Jannes Muenchow. Together we’re writing an open source book called Geocomputation with R. The project aims to introduce people to R’s rapidly evolving geographic data capabilities and provide a foundation for developing scripts, functions and applications for geographic data science. We recently presented some contents of the in-progress book at the eRum conference, where Jannes ran a workshop on the topic. In this article we share teaching materials from eRum for the benefit of those who couldn’t be there in person and provide a ‘heads-up’ to the R-Spatial community about plans for the book. We’ll start with an overview of ‘geocomputation’ (and define what we mean by the term) and finish by describing how R can be used as a bridge to access dedicated GIS software.

Geocomp ‘base’ics

The first thing many people will be asking is “what is geocomputation anyway”? As Jannes mentioned in his talk, the choice of name was influenced by the fact that the term seems to have originated in Leeds, where one of the authors (Robin) is based. The first conference on the subject was in Leeds in 1996, with associated extended abstracts including old-school computer graphics still available here; and there was a 21 year home-coming anniversary in 2017 where Robin and Jakub presented. A more practical reason is that the term is unambiguous: it’s about using computing techniques to do new things with geographic data, as indicated in Section 1.1 of the book. Our approach differs in one way from the early conception of geocomputation, however:

Unlike early works in the field all the work presented in this book is reproducible using code and example data supplied alongside the book (using R, a great language for reproducible research).

Like many open source projects, R is evolving. Although ‘base R’ is conservative (as demonstrated in Roger Bivand’s keynote, in which he did a live demo using a version of R from 1997 that still runs!), the ‘ecosystem’ of packages that extends its capabilities changes fast (video here, slides at rsbivand/eRum18).

To clarify what we mean by ‘base R’, we can identify base packages with the following code (source: stackoverflow):

x = installed.packages()
row.names(x)[!is.na(x[, "Priority"])]
##  [1] "base"       "boot"       "class"      "cluster"    "codetools"
##  [6] "compiler"   "datasets"   "foreign"    "graphics"   "grDevices"
## [11] "grid"       "KernSmooth" "lattice"    "MASS"       "Matrix"
## [16] "methods"    "mgcv"       "nlme"       "nnet"       "parallel"
## [21] "rpart"      "spatial"    "splines"    "stats"      "stats4"
## [26] "survival"   "tcltk"      "tools"      "utils"

The output shows there are 29 packages that are currently part of the base distribution (R Core makes “base R”, as Martin Maechler put it during another keynote). These can be relied on to change very little in terms of their API, although bug fixes and performance improvements happen continuously.

The same cannot be said of contributed packages. Packages are created, die (or are abandoned) and change, sometimes dramatically. And this applies as much (or more) to r-spatial as to any other part of R’s ecosystem, as can be seen by looking at any one of R’s task views. At the time of writing, the Spatial task view alone listed 177 packages, many of them recently contributed and in development.

In this context it helps to have an understanding of the history (Bivand, Pebesma, and Gómez-Rubio 2013). Like in politics, knowing the past can help navigate the future. More specifically, knowing which packages are mature or up-and-coming can help decide which one to use!

For this reason, after a slide on set-up (which is described in detail in chapter 2 of the book), the workshop spent a decent amount of time talking about the history of spatial data in R, as illustrated in slide 20. A more detailed account of the history of R-spatial is provided in section 1.5 of the book.

The slides outlining the basics of Geocomputation with R (which is roughly a synonym for r-spatial) can be found here.

Vector data

Spatial vector data are best used for objects that represent discrete borders such as bus stops (points), streets (lines) and houses (polygons). For instance, we can represent ‘Budapest’ (the city where eRum 2018 was held) as a spatial point with the help of the sf package (Pebesma 2018) as follows:

budapest_df = data.frame(
  name = "Budapest",
  x = 19.0,
  y = 47.5
)
class(budapest_df)
## [1] "data.frame"
budapest_sf = sf::st_as_sf(budapest_df, coords = c("x", "y"))
class(budapest_sf)
## [1] "sf"         "data.frame"

Why bother creating a new class if both objects contain the same essential data? It’s what you can do with an object that’s important. The reason for using the sf class can be understood as follows: it gives budapest_sf spatial superpowers. We can, for example, now identify which country the point is in using a spatial function such as a spatial join, implemented in the function st_join() (spatial subsetting would also do the trick, as covered in section 4.2.1). First, we need to load the ‘world’ dataset and set the ‘CRS’ of the object:

# set-up:
library(spData)
sf::st_crs(budapest_sf) = 4326
# spatial join:
sf::st_join(budapest_sf, world)
## Simple feature collection with 1 feature and 11 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 19 ymin: 47.5 xmax: 19 ymax: 47.5
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##       name iso_a2 name_long continent region_un      subregion
## 1 Budapest     HU   Hungary    Europe    Europe Eastern Europe
##                type area_km2     pop  lifeExp gdpPercap        geometry
## 1 Sovereign country 92476.46 9866468 75.87317   24016.3 POINT (19 47.5)

The slides describing vector data in R can be found here.

Raster data

On the other hand, raster data represent continuous surfaces in the form of a regular grid. You can think of a raster as a matrix object containing information about its spatial location. It has rows and columns, each cell has a value (which could be NA), and its spatial properties are described by the cell resolution (res), outer borders (bounding box – xmn, xmx, ymn, ymx), and coordinate reference system (crs). In R, the raster package supports the spatial raster format (Hijmans 2017).

library(raster)
elev = raster(nrow = 6, ncol = 6, vals = 1:36,
              res = 0.5, xmn = -1.5, xmx = 1.5, ymn = -1.5, ymx = 1.5,
              crs = "+proj=longlat")

The data structure makes raster processing much more efficient and faster than vector data processing.

elev2 = elev^2
elev3 = elev / elev2
elev4 = (elev2 - elev3) * log(elev)
elev_stack = stack(elev, elev2, elev3, elev4)
plot(elev_stack)

Raster objects can be subsetted (by index, coordinates, or other spatial objects), transformed using local, focal, zonal and global operations, and summarized. Importantly, there are many tools allowing for interactions between the raster and vector data models, and for transformations between them.
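To make the idea of a focal operation concrete, here is a small base-R sketch (not the raster package API itself) that computes a 3×3 moving-window mean over a plain matrix; raster::focal() performs this kind of neighbourhood computation on raster layers:

```r
# Minimal illustration of a focal (moving-window) operation: each cell
# receives the mean of its 3x3 neighbourhood, with the window truncated
# at the matrix edges.
focal_mean <- function(m) {
  nr <- nrow(m); nc <- ncol(m)
  out <- matrix(NA_real_, nr, nc)
  for (i in 1:nr) for (j in 1:nc) {
    ri <- max(1, i - 1):min(nr, i + 1)  # neighbouring rows
    rj <- max(1, j - 1):min(nc, j + 1)  # neighbouring columns
    out[i, j] <- mean(m[ri, rj])
  }
  out
}
```

For example, for matrix(1:9, 3, 3) the centre cell of the result is the mean of all nine values, i.e. 5.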

The slides associated with the raster data part of the workshop can be found here.

Visualizing spatial data

The spatial powers mentioned previously have numerous advantages. One of the most attractive is that geographic data in an appropriate class can be visualized on a map, the topic of Chapter 9 of the book. The workshop was an opportunity to expand on the contents of that chapter and ask: what’s the purpose of maps in the first place? To answer that question we used an early data visualization / infographic created by Alexander von Humboldt, illustrated below. The point of this is that it’s not always the accuracy of a map that’s most important (although that is important): the meaning that you wish to convey and the target audience should be central to the design of the map (in Humboldt’s case the unity of nature to an audience of Enlightenment book readers!):

In the context of geographic data in R, it is easier than ever to create attractive maps to tell a story. The previously created point representing Budapest, for example, can be visualized using the tmap package as follows:

library(tmap)
budapest_df = data.frame(name = "Budapest", x = 19, y = 47.5)
class(budapest_df)
#> [1] "data.frame"
budapest_sf = sf::st_as_sf(budapest_df, coords = c("x", "y"))
class(budapest_sf)
#> [1] "sf" "data.frame"
tmap_mode("view")
#> tmap mode set to interactive viewing
m = tm_shape(budapest_sf) +
  tm_dots() +
  tm_view(basemaps = "OpenStreetMap", set.view = 9)
tmap_leaflet(m)

A range of mapping techniques were covered in the workshop including the plot() method from the sf package that generates multiple maps from a single object by default, such as this one representing the nz (short for New Zealand) object from the spData package:

More advanced maps were demonstrated, including this animated map of the United States (for information on how to make animated maps with R, see section 9.3) of Geocomputation with R.

The slides forming the basis of the visualization part of the tutorial can be found here.

Last but not least was a section on GIS bridges

Defining a Geographic Information System as a system for the analysis, manipulation and visualization of geographic data (Longley 2015), we can safely claim that R has already become a GIS. However, R also has its shortcomings when it comes to spatial data analysis. To name but a few: R is not particularly good at handling large geographic data, it is not a geodatabase, and it is missing literally hundreds of geoalgorithms readily available in GIS software packages and spatial libraries. Fortunately, R has been designed from the beginning as an interactive interface to other languages and software packages (Chambers 2016). Hence, as long as we can access the functionality of GIS software from within R, we can easily overcome these shortcomings. For instance, attaching the sf package automatically links R to GEOS, GDAL and proj.4; this means the sf package automatically gives the R user access to the functionality of these spatial libraries. Equally, there are a number of packages that provide access to the geoalgorithms of major open source desktop GIS software packages:

3. RQGIS provides access to QGIS. For more details and background information, please check out the corresponding R Journal publication.

Note that you must have installed the GIS software on your machine before you can use it through R.[1]

In the workshop we briefly presented how to use RQGIS. The corresponding slides can be found here. In the book we additionally demonstrate how to use RSAGA and rgrass7 in Chapter 10.
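As an aside, the linkage of sf to GEOS, GDAL and proj.4 mentioned earlier can be inspected directly: sf exposes the versions of the external libraries it was built against (the exact versions will of course vary by system):

```r
# Report the external geospatial libraries sf is linked to
# (requires the sf package to be installed).
library(sf)
sf_extSoftVersion()
# a named character vector with entries such as "GEOS", "GDAL", "proj.4"
```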

Background on the book

Geocomputation with R is a collaborative project. We joined forces because each of us has been teaching and contributing to R’s spatial ecosystem for years and we all had a similar vision of a book to disseminate R’s impressive geographic capabilities more widely.

As described in a previous article by Jakub, we’re making good progress towards finishing the book by the end of summer 2018, meaning Geocomputation with R will be published before the end of the year. The target audience is broad, but we think it will be especially useful to post- and undergraduate students, R users wanting to work with spatial data, and GIS users wanting to get to grips with command-line statistical modeling software. A reason for publishing the article here is that we have around 3 months (until the end of August) to gather as much feedback on the book as possible before it’s published. We plan to keep the code and prose up-to-date after that, but now is the ideal time to get involved. We welcome comments and suggestions on the issue tracker, especially from people with experience in the R-Spatial world, in relation to:

• Bugs: issues with lines of prose or code that could be improved.
• Future-proofing: will the code and advice given stand the test of time? If you think some parts will go out of date quickly, please let us know!
• Anything else: ideas for other topics to cover, for example.

We would like to thank the anonymous peer reviewers who have provided feedback so far. We’re still working on changes to respond to their excellent comments. If you’re interested in getting involved in this project, please see the project’s GitHub repo at github.com/Robinlovelace/geocompr and check-out the in-progress chapters at geocompr.robinlovelace.net.

References

Bivand, Roger S., Edzer Pebesma, and Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis with R. 2nd ed. New York: Springer.

Chambers, John M. 2016. Extending R. CRC Press.

Hijmans, Robert J. 2017. Raster: Geographic Data Analysis and Modeling.

Longley, Paul. 2015. Geographic Information Science & Systems. Fourth edition. Hoboken, NJ: Wiley.

Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal.

[1] Note also that RPyGeo provides access to ArcMap which is a commercial Desktop GIS software.



### New round of R Consortium grants announced

Thu, 05/31/2018 - 17:21

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The R Consortium has just announced its latest round of project grants. After reviewing the proposals submitted by the R community, the Infrastructure Steering Committee has elected to fund the following projects for the Spring 2018 season:

• Further updates to the DBI package, to provide a consistent interface between R and databases
• Updating the infrastructure in R for building binary packages on Windows and Mac
• A collaboration with Statisticians in the Pharmaceutical Industry to validate R packages to regulatory standards
• A project to validate and maintain some of the historic numerical algorithms used by R
• A working group to facilitate working with US census data in R
• Consolidation of tools and workflows for handling missing data in R
• Templates and tools for the development of teaching resources with R

This latest round of grants brings the total funding for projects proposed by R community members to more than $650,000. The R Consortium itself is funded by its members (including my own employer, Microsoft), so if you'd like to see more such projects, why not ask your own employer to become a member?

### Three ways of visualizing a graph on a map

Thu, 05/31/2018 - 14:32

(This article was first published on r-bloggers – WZB Data Science Blog, and kindly contributed to R-bloggers)

When visualizing a network with nodes that refer to a geographic place, it is often useful to put these nodes on a map and draw the connections (edges) between them. This way, we can directly see the geographic distribution of the nodes and their connections in our network. This is different from a traditional network plot, where the placement of the nodes depends on the layout algorithm that is used (which may, for example, form clusters of strongly interconnected nodes).

In this blog post, I’ll present three ways of visualizing network graphs on a map using R with the packages igraph, ggplot2 and optionally ggraph. Several properties of our graph should be visualized along with the positions on the map and the connections between them.
Specifically, the size of a node on the map should reflect its degree, the width of an edge between two nodes should represent the weight (strength) of this connection (since we can’t use proximity to illustrate the strength of a connection when we place the nodes on a map), and the color of an edge should illustrate the type of connection (some categorical variable, e.g. a type of treaty between two international partners).

Preparation

We’ll need to load the following libraries first:

library(assertthat)
library(dplyr)
library(purrr)
library(igraph)
library(ggplot2)
library(ggraph)
library(ggmap)

Now, let’s load some example nodes. I’ve picked some random countries with their geo-coordinates:

country_coords_txt <- "
 1     3.00000  28.00000       Algeria
 2    54.00000  24.00000           UAE
 3   139.75309  35.68536         Japan
 4    45.00000  25.00000 'Saudi Arabia'
 5     9.00000  34.00000       Tunisia
 6     5.75000  52.50000   Netherlands
 7   103.80000   1.36667     Singapore
 8   124.10000  -8.36667         Korea
 9    -2.69531  54.75844            UK
10    34.91155  39.05901        Turkey
11  -113.64258  60.10867        Canada
12    77.00000  20.00000         India
13    25.00000  46.00000       Romania
14   135.00000 -25.00000     Australia
15    10.00000  62.00000        Norway"

# nodes come from the above table and contain geo-coordinates for some
# randomly picked countries
nodes <- read.delim(text = country_coords_txt, header = FALSE,
                    quote = "'", sep = "",
                    col.names = c('id', 'lon', 'lat', 'name'))

So we now have 15 countries, each with an ID, geo-coordinates (lon and lat) and a name. These are our graph nodes.
We’ll now create some random connections (edges) between our nodes:

set.seed(123)  # set random generator state for the same output

N_EDGES_PER_NODE_MIN <- 1
N_EDGES_PER_NODE_MAX <- 4
N_CATEGORIES <- 4

# edges: create random connections between countries (nodes)
edges <- map_dfr(nodes$id, function(id) {
  n <- floor(runif(1, N_EDGES_PER_NODE_MIN, N_EDGES_PER_NODE_MAX + 1))
  to <- sample(1:max(nodes$id), n, replace = FALSE)
  to <- to[to != id]
  categories <- sample(1:N_CATEGORIES, length(to), replace = TRUE)
  weights <- runif(length(to))
  data_frame(from = id, to = to, weight = weights, category = categories)
})

edges <- edges %>% mutate(category = as.factor(category))

Each of these edges defines a connection via the node IDs in the from and to columns, and additionally we generated random connection categories and weights. Such properties are often used in graph analysis and will be visualized later, too.

Our nodes and edges fully describe a graph, so we can now generate a graph structure g with the igraph library. This is especially necessary for fast calculation of the degree or other properties of each node later.

g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

We now create some data structures that will be needed for all the plots that we will generate. At first, we create a data frame for plotting the edges. This data frame will be the same as the edges data frame but with four additional columns that define the start and end points for each edge (x, y and xend, yend):

edges_for_plot <- edges %>%
  inner_join(nodes %>% select(id, lon, lat), by = c('from' = 'id')) %>%
  rename(x = lon, y = lat) %>%
  inner_join(nodes %>% select(id, lon, lat), by = c('to' = 'id')) %>%
  rename(xend = lon, yend = lat)

assert_that(nrow(edges_for_plot) == nrow(edges))

Let’s give each node a weight and use the degree metric for this. This will be reflected by the node sizes on the map later.

nodes$weight = degree(g)
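As a quick sanity check on the degree metric, the same counts can be computed by hand from a plain from/to edge table; the tiny three-node edge list below is hypothetical, not the data generated above:

```r
# In an undirected graph, a node's degree is the number of edge endpoints
# touching it, i.e. how often its id appears across the from/to columns.
toy_edges <- data.frame(from = c(1, 1, 2), to = c(2, 3, 3))
toy_deg <- table(factor(c(toy_edges$from, toy_edges$to), levels = 1:3))
# here every node takes part in exactly two of the three edges
```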

Now we define a common ggplot2 theme that is suitable for displaying maps (sans axes and grids):

maptheme <- theme(panel.grid = element_blank()) +
  theme(axis.text = element_blank()) +
  theme(axis.ticks = element_blank()) +
  theme(axis.title = element_blank()) +
  theme(legend.position = "bottom") +
  theme(panel.background = element_rect(fill = "#596673")) +
  theme(plot.margin = unit(c(0, 0, 0.5, 0), 'cm'))

Not only the theme will be the same for all plots, but they will also share the same world map as “background” (using map_data('world')) and the same fixed ratio coordinate system that also specifies the limits of the longitude and latitude coordinates.

country_shapes <- geom_polygon(aes(x = long, y = lat, group = group),
                               data = map_data('world'),
                               fill = "#CECECE", color = "#515151",
                               size = 0.15)
mapcoords <- coord_fixed(xlim = c(-150, 180), ylim = c(-55, 80))

Plot 1: Pure ggplot2

Let’s start simple by using ggplot2. We’ll need three geometric objects (geoms) additional to the country polygons from the world map (country_shapes): Nodes can be drawn as points using geom_point and their labels with geom_text; edges between nodes can be realized as curves using geom_curve. For each geom we need to define aesthetic mappings that “describe how variables in the data are mapped to visual properties” in the plot. For the nodes we map the geo-coordinates to the x and y positions in the plot and make the node size dependent on its weight (aes(x = lon, y = lat, size = weight)). For the edges, we pass our edges_for_plot data frame and use the x, y and xend, yend as start and end points of the curves. Additionally, we make each edge’s color dependent on its category, and its “size” (which refers to its line width) dependent on the edges’ weights (we will see that the latter will fail). Note that the order of the geoms is important as it defines which object is drawn first and can be occluded by an object that is drawn later in the next geom layer. Hence we draw the edges first and then the node points and finally the labels on top:

ggplot(nodes) + country_shapes +
  geom_curve(aes(x = x, y = y, xend = xend, yend = yend,     # draw edges as arcs
                 color = category, size = weight),
             data = edges_for_plot, curvature = 0.33, alpha = 0.5) +
  scale_size_continuous(guide = FALSE, range = c(0.25, 2)) + # scale for edge widths
  geom_point(aes(x = lon, y = lat, size = weight),           # draw nodes
             shape = 21, fill = 'white', color = 'black', stroke = 0.5) +
  scale_size_continuous(guide = FALSE, range = c(1, 6)) +    # scale for node size
  geom_text(aes(x = lon, y = lat, label = name),             # draw text labels
            hjust = 0, nudge_x = 1, nudge_y = 4,
            size = 3, color = "white", fontface = "bold") +
  mapcoords + maptheme

A warning will be displayed in the console saying “Scale for ‘size’ is already present. Adding another scale for ‘size’, which will replace the existing scale.”. This is because we used the “size” aesthetic and its scale twice, once for the node size and once for the line width of the curves. Unfortunately, you cannot use two different scales for the same aesthetic, even when they are used for different geoms (here: “size” for both the node size and the edges’ line widths). There is also no alternative to “size” that I know of for controlling a line’s width in ggplot2.
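The clash is easy to reproduce with a few lines of toy data (the mini network below — nodes_toy, edges_toy and their values — is invented for illustration):

```r
library(ggplot2)

# Made-up mini network: two nodes and one edge, each with a weight
nodes_toy <- data.frame(x = c(0, 4), y = c(0, 2), w = c(1, 5))
edges_toy <- data.frame(x = 0, y = 0, xend = 4, yend = 2, w = 3)

# Both geoms map "size", but only the last size scale is kept, so the
# edge width ends up on the c(1, 6) range intended for the node points.
p <- ggplot(nodes_toy, aes(x, y)) +
  geom_curve(aes(x = x, y = y, xend = xend, yend = yend, size = w),
             data = edges_toy, curvature = 0.33) +
  scale_size_continuous(range = c(0.25, 2)) +  # meant for edge widths
  geom_point(aes(size = w)) +
  scale_size_continuous(range = c(1, 6))       # replaces the scale above

print(p)
```

Adding the second scale_size_continuous() triggers the “Scale for ‘size’ is already present” message described above.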

With ggplot2, we’re left with deciding which geom’s size we want to scale. Here, I go for a static node size and a dynamic line width for the edges:

ggplot(nodes) + country_shapes +
  geom_curve(aes(x = x, y = y, xend = xend, yend = yend,     # draw edges as arcs
                 color = category, size = weight),
             data = edges_for_plot, curvature = 0.33, alpha = 0.5) +
  scale_size_continuous(guide = FALSE, range = c(0.25, 2)) + # scale for edge widths
  geom_point(aes(x = lon, y = lat),                          # draw nodes
             shape = 21, size = 3,
             fill = 'white', color = 'black', stroke = 0.5) +
  geom_text(aes(x = lon, y = lat, label = name),             # draw text labels
            hjust = 0, nudge_x = 1, nudge_y = 4,
            size = 3, color = "white", fontface = "bold") +
  mapcoords + maptheme

Plot 2: ggplot2 + ggraph

Luckily, there is an extension to ggplot2 called ggraph with geoms and aesthetics added specifically for plotting network graphs. This allows us to use separate scales for the nodes and edges. By default, ggraph will place the nodes according to a layout algorithm that you can specify. However, we can also define our own custom layout using the geo-coordinates as node positions:

node_pos <- nodes %>%
  select(lon, lat) %>%
  rename(x = lon, y = lat)   # node positions must be called x, y

lay <- create_layout(g, 'manual', node.positions = node_pos)
assert_that(nrow(lay) == nrow(nodes))

# add node degree for scaling the node sizes
lay$weight <- degree(g)

We pass the layout lay and use ggraph’s geoms geom_edge_arc and geom_node_point for plotting:

ggraph(lay) + country_shapes +
  geom_edge_arc(aes(color = category, edge_width = weight,   # draw edges as arcs
                    circular = FALSE),
                data = edges_for_plot, curvature = 0.33, alpha = 0.5) +
  scale_edge_width_continuous(range = c(0.5, 2),             # scale for edge widths
                              guide = FALSE) +
  geom_node_point(aes(size = weight), shape = 21,            # draw nodes
                  fill = "white", color = "black", stroke = 0.5) +
  scale_size_continuous(range = c(1, 6), guide = FALSE) +    # scale for node sizes
  geom_node_text(aes(label = name), repel = TRUE,
                 size = 3, color = "white", fontface = "bold") +
  mapcoords + maptheme

The edges’ widths can be controlled with the edge_width aesthetic and its scale functions scale_edge_width_*. The nodes’ sizes are controlled with size as before. Another nice feature is that geom_node_text has an option to distribute node labels with repel = TRUE so that they do not occlude each other as much.

Note that the edges are drawn differently than in the ggplot2 graphics before. The connections are still the same; only the placement differs, due to the different layout algorithms used by ggraph. For example, the turquoise edge between Canada and Japan has moved from the very north to a path south across the center of Africa.

Plot 3: the hacky way (overlay several ggplot2 “plot grobs”)

I do not want to withhold another option, which may be considered a dirty hack: you can overlay several separately created plots (with transparent backgrounds) by annotating them as “grobs” (short for “graphical objects”).
This is probably not how grob annotations were meant to be used, but it can come in handy when you really need to overcome the aesthetics limitation of ggplot2 described above in plot 1. As explained, we will produce separate plots and “stack” them. The first plot will be the “background” that displays the world map as before. The second plot will be an overlay that only displays the edges. Finally, a third overlay shows only the node points and their labels. With this setup, we can control the edges’ line widths and the nodes’ point sizes separately because they are generated in separate plots.

The two overlays need a transparent background, so we define it with a theme:

theme_transp_overlay <- theme(
  panel.background = element_rect(fill = "transparent", color = NA),
  plot.background = element_rect(fill = "transparent", color = NA)
)

The base or “background” plot is easy to make and only shows the map:

p_base <- ggplot() + country_shapes + mapcoords + maptheme

Now we create the first overlay with the edges, whose line width is scaled according to the edges’ weights:

p_edges <- ggplot(edges_for_plot) +
  geom_curve(aes(x = x, y = y, xend = xend, yend = yend,    # draw edges as arcs
                 color = category, size = weight),
             curvature = 0.33, alpha = 0.5) +
  scale_size_continuous(guide = FALSE, range = c(0.5, 2)) + # scale for edge widths
  mapcoords + maptheme + theme_transp_overlay +
  theme(legend.position = c(0.5, -0.1), legend.direction = "horizontal")

The second overlay shows the node points and their labels:

p_nodes <- ggplot(nodes) +
  geom_point(aes(x = lon, y = lat, size = weight),          # draw nodes
             shape = 21, fill = "white", color = "black", stroke = 0.5) +
  scale_size_continuous(guide = FALSE, range = c(1, 6)) +   # scale for node size
  geom_text(aes(x = lon, y = lat, label = name),            # draw text labels
            hjust = 0, nudge_x = 1, nudge_y = 4,
            size = 3, color = "white", fontface = "bold") +
  mapcoords + maptheme + theme_transp_overlay

Finally we combine the overlays using
grob annotations. Note that proper positioning of the grobs can be tedious. I found that using ymin works quite well, but manual tweaking of the parameter seems necessary.

p <- p_base +
  annotation_custom(ggplotGrob(p_edges), ymin = -74) +
  annotation_custom(ggplotGrob(p_nodes), ymin = -74)

print(p)

As explained before, this is a hacky solution and should be used with care. Still, it can be useful in other circumstances, too: for example, when you need different scales for point sizes and line widths in line graphs, or different color scales in a single plot, this approach is an option to consider.

All in all, network graphs displayed on maps can be useful to show connections between the nodes in your graph on a geographic scale. A downside is that the result can look quite cluttered when you have many geographically close points and many overlapping connections. It can then help to show only certain details of a map or to add some jitter to the edges’ anchor points.

The full R script is available as a gist on github.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – WZB Data Science Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
### Defining Marketing with the Rvest and Tidytext Packages

Thu, 05/31/2018 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

I am preparing to facilitate another session of the marketing course for the La Trobe University MBA. The first lecture delves into the definition of marketing. Like most other social phenomena, marketing is tough to define. Definitions of social constructs often rely on the perspective taken by the person or group writing the definition. As such, definitions also change over time. While a few decades ago definitions of marketing revolved around sales and advertising, contemporary definitions are more holistic and reference creating value.

Heidi Cohen wrote a blog post where she collated 72 definitions of marketing. So rather than arguing over which definition is the best, why not use all definitions simultaneously? This article attempts to create a new definition of marketing using a data science approach. We can use the R language to scrape the 72 definitions from Heidi’s website and apply text analysis to extract the essence of marketing from this data set.

I have mentioned in a previous post about qualitative data science that automated text analysis is not always a useful method to extract meaning from a text. I decided to delve a little deeper into automated text analysis to see if we can find out anything useful about marketing using the rvest and tidytext packages.

The presentation below shows the slides I use in my introductory lecture on marketing. The code and analyses are shown below the slideshow. You can download the most recent version of the code from my GitHub Repository.

Scraping text with Rvest

Web scraping is a common technique to download data from websites where this data is not available as a clean data source. Web scraping starts with downloading the HTML code from the website and then filtering the wanted text from this file.
The rvest package makes this process very easy. The code for this article uses a pipe (%>%) with three rvest commands. The first step is to download the wanted HTML code from the web using the read_html function. The output of this function is piped to the html_nodes function, which does all the hard work. In this case, the code picks all lines of text that are embedded in an ordered-list (ol) container. You can use the SelectorGadget to target the text you would like to scrape. The last scraping step cleans the text by piping the output of the previous commands to the html_text function.

The result of the scraping process is converted to a tibble, which is a type of data frame used in the Tidyverse. The definition number is added to the data, and the tibble is converted to the format required by the Tidytext package. The resulting data frame is much longer than the 72 definitions because there are other lists on the page. Unfortunately, I could not find a way to filter only the 72 definitions.

library(tidyverse)
library(rvest)

definitions <- read_html("https://heidicohen.com/marketing-definition/") %>%
  html_nodes("ol li") %>%
  html_text() %>%
  as_data_frame() %>%
  mutate(No = 1:nrow(.)) %>%
  select(No, Definition = value)

Tidying the Text

The Tidytext package extends the tidy data philosophy to text. In this approach to text analysis, a corpus consists of a data frame where each word is a separate item. The code snippet below takes the first 72 rows, and the unnest_tokens function extracts each word from the 72 definitions. This function can also extract ngrams and other word groups from the text. The Tidytext package is an extremely versatile piece of software that goes far beyond the scope of this article. Julia Silge and David Robinson have written a book about text mining using this package, which provides a very clear introduction to the craft of analysing text.
The last section of the pipe removes the trailing “s” from each word to convert plurals into single words. The mutate function in the Tidyverse creates or recreates a variable in a data frame.

library(tidytext)

def_words <- definitions[1:72, ] %>%
  unnest_tokens(word, Definition) %>%
  mutate(word = gsub("s$", "", word))

This section creates a data frame with two variables. The No variable indicates the definition number (1–72), and the word variable is a word within the definition. The order of the words is preserved in the row names. To check the data frame, you can run unique(def_words$No[which(def_words$word == "marketing")]). This line finds all definition numbers containing the word “marketing”, which results, as expected, in the numbers 1 to 72.

Using Rvest and Tidytext to define marketing

We can now proceed to analyse the definitions scraped from the website with rvest and cleaned with Tidytext. The first step is to create a word cloud, a popular way to visualise word frequencies. This code counts each unique word, excluding the word “marketing” itself, and uses the wordcloud package to visualise the fifty most common words.

library(wordcloud)
library(RColorBrewer)

word_freq <- def_words %>%
  anti_join(stop_words) %>%
  count(word) %>%
  filter(word != "marketing")

word_freq %>%
  with(wordcloud(word, n, max.words = 50, rot.per = .5,
                 colors = rev(brewer.pal(5, "Dark2"))))

While a word cloud is certainly a pretty way to visualise the bag of words in a text, it is not the most useful way to help the reader understand the data. The words are jumbled, and the reader needs to search for meaning. A better way to visualise word frequencies is a bar chart. This code takes the data frame created in the previous snippet and determines the ten most frequent words. The mutate statement reorders the factor levels so that the words are plotted in order.

word_freq %>%
  top_n(10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col(fill = "dodgerblue4") +
  coord_flip() +
  theme(text = element_text(size = 20))

A first look at the word cloud and bar chart suggests that marketing is about customers and products and services. Marketing is a process that includes branding and communication; a simplistic but functional definition.

Topic Modeling using Tidytext

Word frequencies are a weak method to analyse text because they interpret each word as a solitary unit. Topic modelling is a more advanced method that analyses the relationships between words, i.e. the distance between them. The first step is to create a Document-Term Matrix, a matrix that indicates how often a word appears in each text. As each of the 72 texts is very short, I decided to treat the collection of definitions as one text about marketing. The cast_dtm function converts the data frame to a Document-Term Matrix.
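As a toy illustration of cast_dtm (the mini-corpus below is invented), a tidy data frame of per-document word counts becomes a Document-Term Matrix:

```r
library(dplyr)
library(tidytext)

# Invented word counts for two tiny "documents"
toy_counts <- tibble::tibble(
  doc  = c(1, 1, 2, 2, 2),
  word = c("value", "customer", "value", "brand", "customer"),
  n    = c(2, 1, 1, 3, 1)
)

# cast_dtm(data, document, term, value) builds the matrix:
# one row per document, one column per term, counts as cell values
toy_dtm <- toy_counts %>% cast_dtm(doc, word, n)
dim(toy_dtm)  # 2 documents by 3 terms
```

In the article's case there is only one "document", so the resulting matrix has a single row.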

The following pipe determines the top words in the topics. Just like k-means clustering, the analyst needs to choose the number of topics before analysing the text. In this case, I have opted for 4 topics. The code determines the contribution of each word to the four topics and selects the five most common words in each topic. The faceted bar chart shows each of the words in the four topics.

library(topicmodels)  # provides the LDA function

marketing_dtm <- word_freq %>%
  mutate(doc = 1) %>%
  cast_dtm(doc, word, n)

marketing_lda <- LDA(marketing_dtm, k = 4) %>%
  tidy(matrix = "beta") %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

marketing_lda %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip() +
  theme(text = element_text(size = 20))

This example also does not tell me much more about what marketing is, other than giving a slightly more sophisticated version of the word frequency charts. This chart shows me that marketing is about customers that enjoy a service and a product. Perhaps the original definitions are not distinctive enough to be separated from each other. The persistence of the word “president” is interesting as it seems to suggest that marketing is something that occurs at the highest levels in the business.

This article is only a weak summary of the great work by Julia Silge, who co-authored the Tidytext package. The video below provides a comprehensive introduction to topic modelling.

What have we learnt?

This excursion into text analysis using rvest and Tidytext shows that data science can help us to make some sense of an unread text. If I did not know what this page was about, then perhaps this analysis would enlighten me. This kind of analysis can assist us in wading through large amounts of text to select the ones we want to read. I am still not convinced that this type of analysis will provide any knowledge beyond what can be obtained from actually reading and engaging with a text.

Although I am a data scientist and want to maximise the use of code in analysing data, I am very much in favour of developing human intelligence before we worry about the artificial kind.

The post Defining Marketing with the Rvest and Tidytext Packages appeared first on The Devil is in the Data.


To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.

### Exploring R Packages with cranly

Thu, 05/31/2018 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

In a previous post, I showed a very simple example of using the R function tools::CRAN_package_db() to analyze information about CRAN packages. CRAN_package_db() extracts the metadata CRAN stores on all of its 12,000-plus packages and arranges it into a “database”: in reality, a complicated data frame in which some columns contain vectors or lists as entries.
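The “complicated data frame” aspect can be mimicked with a toy list-column (the package names and imports below are made up):

```r
# A data frame where one column holds a character vector per row,
# similar in structure to some columns of the CRAN package database
p_toy <- data.frame(Package = c("pkgA", "pkgB"), stringsAsFactors = FALSE)
p_toy$Imports <- list(c("methods", "stats"), "utils")

str(p_toy)
p_toy$Imports[[1]]  # imports of the first package, as a character vector
```

Columns like this cannot be handled with the usual vectorized operations, which is part of what makes the raw database awkward to work with.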

It’s simple to run the function, and it doesn’t take very long on my MacBook Air.

p_db <- tools::CRAN_package_db()

The following gives some insight into what’s contained in the data frame.

dim(p_db)
## [1] 12635    65

matrix(names(p_db), ncol = 2)
##       [,1]                      [,2]
##  [1,] "Package"                 "Collate.windows"
##  [2,] "Version"                 "Contact"
##  [3,] "Priority"                "Copyright"
##  [4,] "Depends"                 "Date"
##  [5,] "Imports"                 "Description"
##  [6,] "LinkingTo"               "Encoding"
##  [7,] "Suggests"                "KeepSource"
##  [8,] "Enhances"                "Language"
##  [9,] "License"                 "LazyData"
## [10,] "License_is_FOSS"         "LazyDataCompression"
## [11,] "License_restricts_use"   "LazyLoad"
## [12,] "OS_type"                 "MailingList"
## [13,] "Archs"                   "Maintainer"
## [14,] "MD5sum"                  "Note"
## [15,] "NeedsCompilation"        "Packaged"
## [16,] "Additional_repositories" "RdMacros"
## [17,] "Author"                  "SysDataCompression"
## [18,] "Authors@R"               "SystemRequirements"
## [19,] "Biarch"                  "Title"
## [20,] "BugReports"              "Type"
## [21,] "BuildKeepEmpty"          "URL"
## [22,] "BuildManual"             "VignetteBuilder"
## [23,] "BuildResaveData"         "ZipData"
## [24,] "BuildVignettes"          "Published"
## [25,] "Built"                   "Path"
## [26,] "ByteCompile"             "X-CRAN-Comment"
## [27,] "Classification/ACM"      "Reverse depends"
## [28,] "Classification/ACM-2012" "Reverse imports"
## [29,] "Classification/JEL"      "Reverse linking to"
## [30,] "Classification/MSC"      "Reverse suggests"
## [31,] "Classification/MSC-2010" "Reverse enhances"
## [32,] "Collate"                 "MD5sum"
## [33,] "Collate.unix"            "Package"

Looking at a few rows and columns gives a feel for how complicated its structure is.

p_db[1:10, c(1, 2, 4, 5)]
##        Package Version                                             Depends
## 1           A3   1.0.0                      R (>= 2.15.0), xtable, pbapply
## 2       abbyyR   0.5.4                                        R (>= 3.2.0)
## 3          abc     2.1 R (>= 2.10), abc.data, nnet, quantreg, MASS, locfit
## 4     abc.data     1.0                                         R (>= 2.10)
## 5      ABC.RAP   0.9.0                                        R (>= 3.1.0)
## 6  ABCanalysis   1.2.1                                         R (>= 2.10)
## 7     abcdeFBA     0.4              Rglpk,rgl,corrplot,lattice,R (>= 2.10)
## 8     ABCoptim  0.15.0
## 9        ABCp2     1.2                                                MASS
## 10       abcrf     1.7                                           R(>= 3.1)
##                                                           Imports
## 1
## 2                          httr, XML, curl, readr, plyr, progress
## 3
## 4
## 5                                          graphics, stats, utils
## 6                                                         plotrix
## 7
## 8                                     Rcpp, graphics, stats, utils
## 9
## 10 readr, MASS, matrixStats, ranger, parallel, stringr, Rcpp (>=
## 0.11.2)

So, having spent a little time learning how vexing working with this data can be, I was delighted when I discovered Ioannis Kosmidis’ cranly package during my March “Top 40” review. cranly is a very impressive package, built along tidy principles, that is helpful for learning about individual packages, analyzing the structure of package and author relationships, and searching for packages.

library(cranly)
library(tidyverse)
## ── Attaching packages ─────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.5
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

The first really impressive feature is a “one button” clean function that does an amazing job of getting the data in shape to work with. In my preliminary work, I struggled just to get the author data clean. In the approach that I took, getting rid of text like [aut, cre] to get a count of authors took more regular expression work than I wanted to deal with. But clean_CRAN_db does a good job of cleaning up the whole database. Note that the helper function clean_up_author has a considerable number of hard-coded text strings that must have taken hours to get right.
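To give a flavour of that regular-expression work — a simplified sketch with invented author strings, not cranly's actual clean_up_author code — stripping bracketed role codes such as [aut, cre] could look like:

```r
# Invented Author fields in typical DESCRIPTION style
authors <- c("Ann Author [aut, cre]", "Bob Builder [ctb]", "Some Org [cph, fnd]")

# Remove the bracketed role abbreviations, then trim leftover whitespace
cleaned <- trimws(gsub("\\[[^]]*\\]", "", authors))
cleaned
# "Ann Author" "Bob Builder" "Some Org"
```

Real Author fields are far messier (multiple authors per string, free-form notes, inconsistent separators), which is why the hard-coded special cases in clean_up_author are so valuable.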

package_db <- clean_CRAN_db(p_db)

Once you have the clean data, it is easy to run some pretty interesting analyses. This first example, straight out of the package vignette, builds the network of package relationships based on which packages import which, and then plots a summary for the top 20 most imported packages.

package_network <- build_network(package_db)
package_summaries <- summary(package_network)
plot(package_summaries, according_to = "n_imported_by", top = 20)

There is also a built-in function to compute the importance or relevance of a package using the page rank algorithm.

plot(package_summaries, according_to = "page_rank", top = 20)
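The page rank idea itself is easy to explore on a toy graph with the igraph package (the package names below are invented; this is a sketch of the concept, not cranly's internal code):

```r
library(igraph)

# Directed toy graph: an edge a -+ b means "package a imports package b"
g_toy <- graph_from_literal(pkgA -+ core, pkgB -+ core,
                            pkgC -+ core, pkgC -+ pkgB)

# Page rank rewards packages that are imported by many (important) packages
pr <- page_rank(g_toy, directed = TRUE)$vector
names(which.max(pr))  # "core": the most imported package ranks highest
```

Unlike a raw count of reverse imports, page rank also credits a package for being imported by other highly ranked packages.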

The build_network function also offers the opportunity to investigate the collaboration of package authors by building a network from the authors’ perspective.

author_network <- build_network(object = package_db, perspective = "author")

Here, we look at J.J. Allaire’s network. exact = FALSE means that the algorithm is not using exact matching.

plot(author_network, author = "JJ Allaire", exact = FALSE)

(The plot is an interactive network graph; its tooltip text, which lists each author’s collaborators and packages, is omitted here. For example, the tooltip for JJ Allaire reports 375 collaborators in 33 packages, including blogdown, flexdashboard, htmlwidgets, keras, knitr, reticulate, rmarkdown, shiny, sparklyr and tensorflow.)
blogdown, bookdown, cronR, DescTools
icd, packrat, Rcpp, Rcpp11
RcppParallel, RcppRoll, reticulate, rex
rmarkdown, rsnps, rstudioapi, sourcetools
sparklyr, tfdatasets, tfestimators, withr","Author: Leonardo Collado-Torres
10 collaborators in 1 packages:
blogdown","Author: Xianying Tan
19 collaborators in 2 packages:
blogdown, DT","Author: RStudio Inc
126 collaborators in 12 packages:
blogdown, bookdown, config, d3heatmap
DT, htmltools, learnr, qrage
rmarkdown, skimr, tinytex, tufte","Author: Albert Kim
47 collaborators in 2 packages:
bookdown, infer","Author: Alessandro Samuel-Rosa
60 collaborators in 4 packages:
bookdown, febr, pedometrics, spsann","Author: Andrzej Oles
57 collaborators in 3 packages:
bookdown, markdown, tufte","Author: Bastiaan Quast
68 collaborators in 11 packages:
bookdown, decompr, diagonals, gvc
learNN, rddapp, rddtools, rnn
rticles, sigmoid, wiod","Author: Chester Ismay
61 collaborators in 4 packages:
bookdown, fivethirtyeight, infer, moderndive","Author: Christophe Dervieux
125 collaborators in 3 packages:
bookdown, knitr, nomisr","Author: Clifton Franklund
38 collaborators in 1 packages:
bookdown","Author: Daniel Emaasit
38 collaborators in 1 packages:
bookdown","Author: David Shuman
38 collaborators in 1 packages:
bookdown","Author: Drew Tyre
38 collaborators in 1 packages:
bookdown","Author: Frans van Dunne
38 collaborators in 1 packages:
bookdown","Author: Jeff Allen
97 collaborators in 6 packages:
bookdown, plumber, rhandsontable, rmarkdown
shinyAce, shinyTree","Author: Jennifer Bryan
69 collaborators in 15 packages:
bookdown, cellranger, clipr, gapminder
salesforcer, searchConsoleR, usethis","Author: Jonathan McPhers
38 collaborators in 1 packages:
bookdown","Author: Junwen Huang
38 collaborators in 2 packages:
bookdown, somebm","Author: Kevin Cheung
38 collaborators in 1 packages:
bookdown","Author: Kim Seonghyun
38 collaborators in 4 packages:
bookdown, cbar, refnr, tropr","Author: Kirill Muller
41 collaborators in 2 packages:
bookdown, qdapTools","Author: Luciano Selzer
64 collaborators in 4 packages:
bookdown, broom, ggfortify, multcompView","Author: Matthew Lincoln
58 collaborators in 6 packages:
bookdown, broom, clipr, europop
fuzzr, hypothesisr","Author: Maximilian Held
39 collaborators in 2 packages:
bookdown, qmethod","Author: Michael Sachs
41 collaborators in 3 packages:
bookdown, cosinor, ggquickeda","Author: Peter Hickey
45 collaborators in 2 packages:
bookdown, matrixStats","Author: Sahir Bhatnagar
41 collaborators in 3 packages:
bookdown, casebase, manhattanly","Author: Steve Simpson
41 collaborators in 3 packages:
38 collaborators in 1 packages:
bookdown","Author: Zhuoer Dong
47 collaborators in 2 packages:
bookdown, yaml","Author: Bartek Szopka
47 collaborators in 2 packages:
bookdown, DT","Author: jQuery Foundation
197 collaborators in 17 packages:
bookdown, crosstalk, dygraphs, ggvis
lazyrmd, leaflet, metricsgraphics, profvis
QCA, qrage, qtlcharts, rhandsontable
rmarkdown, shiny, sparkline, tfruns
vdiffr","Author: FriendCode Inc
38 collaborators in 1 packages:
bookdown","Author: R Foundation
70 collaborators in 9 packages:
BoutrosLab.plotting.general, dispRity, expint, MCMCpack
multimode, ph2hetero, rticles, statip
xml2","Author: Duncan Murdoch
262 collaborators in 22 packages:
BrailleR, car, digest, ellipse
fortunes, gpclib, gsl, inline
knitr, manipulateWidget, nlsr, orientlib
patchDVI, Rcmdr, Rdpack, rgl
rglwidget, sciplot, spatialkernel, tables
tkrgl, vcdExtra","Author: Jeffrey Horner
64 collaborators in 15 packages:
brew, Cairo, datamap, FastRWeb
markdown, mime, network, networkDynamic
Rook, TimeWarp, yaml","Author: David Robinson
138 collaborators in 12 packages:
broom, dataCompareR, fuzzyjoin, gutenbergr
knitr, orcutt, reprex, rgeolocate
94 collaborators in 15 packages:
brotli, cld3, deepboost, odbc
PortfolioEffectHFT, Rdroolsjars, re2r, rmarkdown
s2, snappier, sparsepp, tensorflow
tfdatasets, tfestimators, timechange","Author: Michael Friendly
383 collaborators in 24 packages:
ca, candisc, car, DescTools
effects, fortunes, genridge, Guerry
heplots, HistData, installr, knitr
Lahman, logmult, maptools, matlib
mvinfluence, sem, statquotes, tableplot
vcd, vcdExtra, vegan, WordPools","Author: Mango Solutions
29 collaborators in 6 packages:
callr, mangoTraining, processx, remotes
rmdshower, sasMap","Author: Johannes Ranke
94 collaborators in 7 packages:
chemCal, drfit, kinfit, knitr
mkin, rlo, webchem","Author: Mike Bostock
81 collaborators in 15 packages:
ChemoSpec, collapsibleTree, D3partitionR, d3r
edgebundleR, exCon, ggiraph, ggvis
leaflet.extras, profvis, r2d3, scatterD3
sunburstR, tfruns, vegalite","Author: Wush Wu
117 collaborators in 9 packages:
ckanr, digest, FeatureHashing, knitr
Rcereal, RcppCNPy, supc, swirl
swirlify","Author: Ruben Arslan
30 collaborators in 2 packages:
codebook, rmarkdown","Author: Carl Boettiger
86 collaborators in 19 packages:
codemetar, dataone, drat, EcoNetGen
EML, knitcitations, pmc, rcrossref
rfishbase, rfisheries, RNeXML, rplos
rticles, taxize, treebase","Author: Noam Ross
128 collaborators in 9 packages:
codemetar, cowsay, data.tree, fasterize
knitr, opencage, randgeo, viridis
viridisLite","Author: Kaiyin Zhong
87 collaborators in 3 packages:
CollapsABEL, collUtils, knitr","Author: John MacFarlane
31 collaborators in 2 packages:
commonmark, rmarkdown","Author: Ramnath Vaidyanathan
118 collaborators in 7 packages:
concaveman, gistr, htmlwidgets, knitr
rticles, servr, sparkline","Author: Taiyun Wei
92 collaborators in 3 packages:
corrplot, fun, knitr","Author: Robert Krzyzanowski
109 collaborators in 4 packages:
covr, knitr, recombinator, rex","Author: Carson Sievert
61 collaborators in 10 packages:
cowsay, eechidna, etl, flexdashboard
LDAvis, pitchRx, plotly, repr
servr, XML2R","Author: Joe Cheng
245 collaborators in 18 packages:
crosstalk, d3heatmap, DT, flexdashboard
later, leaflet, markdown, miniUI
packrat, promises, raster, rmarkdown
santaR, shiny","Author: Mark Otto
97 collaborators in 5 packages:
crosstalk, rmarkdown, shiny, shinybootstrap2
shinyWidgets","Author: Jacob Thornton
97 collaborators in 5 packages:
crosstalk, rmarkdown, shiny, shinybootstrap2
shinyWidgets","Author: Bootstrap
97 collaborators in 5 packages:
crosstalk, rmarkdown, shiny, shinybootstrap2
97 collaborators in 5 packages:
crosstalk, rmarkdown, shiny, shinybootstrap2
shinyWidgets","Author: Brian Reavis
51 collaborators in 4 packages:
crosstalk, DT, shiny, shinybootstrap2","Author: Kristopher Michael Kowal
34 collaborators in 2 packages:
crosstalk, shiny","Author: Denis Ineshin
34 collaborators in 2 packages:
crosstalk, shiny","Author: Sami Samhuri
34 collaborators in 2 packages:
crosstalk, shiny","Author: Jonathan Keane
87 collaborators in 2 packages:
crunch, knitr","Author: Erin LeDell
19 collaborators in 7 packages:
cvAUC, h2o4gpu, Metrics, rHealthDataGov
rsparkling, subsemble, SuperLearner","Author: Joshua Kunst
33 collaborators in 5 packages:
d3plus, flexdashboard, ggthemes, highcharter
rchess","Author: Jonathan Sidi
91 collaborators in 10 packages:
d3Tree, ggedit, heatmaply, jsTree
knitr, lmmen, regexSelect, shinyHeatmaply
slickR, texPreview","Author: Kenton Russell
65 collaborators in 11 packages:
d3Tree, formattable, htmltidy, htmlwidgets
leaflet, mapedit, mapview, networkD3
rbokeh, rmapshaper, rpivotTable","Author: Michael Chirico
94 collaborators in 4 packages:
data.table, funchir, knitr, vcov","Author: Ryan Hafen
36 collaborators in 11 packages:
gqlr, housingData, lazyrmd, packagedocs
rbokeh, stlplus, trelliscope","Author: Frank E Harrell Jr
186 collaborators in 5 packages:
DescTools, greport, Hmisc, knitr
rms","Author: Nathan Russell
107 collaborators in 3 packages:
DescTools, hashmap, Rcpp","Author: Nacho Caballero
100 collaborators in 3 packages:
df2json, knitr, markdown","Author: Qiang Kou
26 collaborators in 5 packages:
digest, MultiCNVDetect, Rcpp, RcppDL
RcppMLPACK","Author: Maximilian Girlich
9 collaborators in 1 packages:
DT","Author: SpryMedia
41 collaborators in 3 packages:
DT, shiny, shinybootstrap2","Author: Leon Gersen
34 collaborators in 2 packages:
DT, shinyWidgets","Author: Dan Vanderkam
7 collaborators in 1 packages:
dygraphs","Author: Jonathan Owen
38 collaborators in 4 packages:
dygraphs, networkD3, rbokeh, rhandsontable","Author: Petr Shevtsov
7 collaborators in 1 packages:
dygraphs","Author: Ben Baumer
100 collaborators in 6 packages:
etl, infer, knitr, mdsr
nyctaxi, teamcolors","Author: Julien Barnier
91 collaborators in 5 packages:
explor, knitr, questionr, rmdformats
31 collaborators in 5 packages:
feather, hrbrthemes, keras, landscapetools
RClickhouse","Author: Barbara Borges
26 collaborators in 3 packages:
flexdashboard, learnr, pool","Author: Keen IO
17 collaborators in 1 packages:
flexdashboard","Author: Abdullah Almsaeed
17 collaborators in 1 packages:
flexdashboard","Author: Jonas Mosbech
17 collaborators in 1 packages:
flexdashboard","Author: Noel Bossart
17 collaborators in 1 packages:
flexdashboard","Author: Lea Verou
17 collaborators in 1 packages:
flexdashboard","Author: Dmitry Baranovskiy
23 collaborators in 2 packages:
flexdashboard, QCA","Author: Sencha Labs
17 collaborators in 1 packages:
flexdashboard","Author: Bojan Djuricic
17 collaborators in 1 packages:
flexdashboard","Author: Tomas Sardyha
17 collaborators in 1 packages:
flexdashboard","Author: Steve Matteson
11 collaborators in 2 packages:
fontLiberation, prettydoc","Author: Microsoft
10 collaborators in 3 packages:
foreach, glmnetUtils, RcppParallel","Author: Kohske Takahashi
104 collaborators in 5 packages:
formatR, ggpolypath, knitr, labeledLoop
markdown","Author: Brian Diggs
147 collaborators in 2 packages:
fortunes, knitr","Author: Alex Zvoleff
85 collaborators in 5 packages:
gfcanalysis, glcm, knitr, wrspathrow
wrspathrowData","Author: Joseph Larmarange
108 collaborators in 6 packages:
GGally, knitr, labelled, lubridate
prevR, questionr","Author: Kamil Slowikowski
95 collaborators in 3 packages:
ggpmisc, ggrepel, knitr","Author: Kevin Kuo
20 collaborators in 5 packages:
graphframes, mleap, networkD3, sparklyr
tfestimators","Author: H2O ai
10 collaborators in 3 packages:
h2o, h2o4gpu, rsparkling","Author: Navdeep Gill
9 collaborators in 2 packages:
h2o4gpu, rsparkling","Author: Sebastian Meyer
105 collaborators in 4 packages:
hhh4contacts, knitr, polyCub, surveillance","Author: Andre Simon
86 collaborators in 2 packages:
highlight, knitr","Author: Qiang Li
88 collaborators in 2 packages:
highr, knitr","Author: Dave Raggett
37 collaborators in 2 packages:
htmltidy, rmarkdown","Author: Ivan Sagalaev
77 collaborators in 5 packages:
htmltidy, profvis, rmarkdown, shiny
tfruns","Author: Drifty
31 collaborators in 2 packages:
ionicons, rmarkdown","Author: François Chollet
7 collaborators in 1 packages:
keras","Author: Daniel Falbel
13 collaborators in 5 packages:
keras, ptstem, ptwikiwords, rslp
tfestimators","Author: Wouter Van Der Bijl
7 collaborators in 1 packages:
keras","Author: Martin Studer
13 collaborators in 3 packages:
85 collaborators in 1 packages:
knitr","Author: Alastair Andrew
85 collaborators in 1 packages:
knitr","Author: Aron Atkins
119 collaborators in 3 packages:
knitr, packrat, rmarkdown","Author: Ashley Manton
85 collaborators in 1 packages:
knitr","Author: Cassio Pereira
85 collaborators in 1 packages:
knitr","Author: Donald Arseneau
85 collaborators in 1 packages:
knitr","Author: Doug Hemken
85 collaborators in 2 packages:
knitr, SASmarkdown","Author: Elio Campitelli
85 collaborators in 1 packages:
knitr","Author: Fabian Hirschmann
85 collaborators in 1 packages:
knitr","Author: Fitch Simeon
85 collaborators in 1 packages:
knitr","Author: Gregoire Detrez
85 collaborators in 1 packages:
knitr","Author: Heewon Jeon
86 collaborators in 4 packages:
knitr, KoNLP, Ruchardet, Sejong","Author: Hodges Daniel
85 collaborators in 1 packages:
85 collaborators in 1 packages:
knitr","Author: James Manton
89 collaborators in 5 packages:
knitr, morgenstemning, nat, nat.nblast
nat.templatebrains","Author: Jared Lander
85 collaborators in 2 packages:
knitr, resumer","Author: Jason Punyon
85 collaborators in 1 packages:
knitr","Author: Javier Luraschi
131 collaborators in 8 packages:
knitr, pixels, profvis, r2d3
rmarkdown, sparklyr, sparkwarc, swagger","Author: Jeff Arnold
85 collaborators in 1 packages:
knitr","Author: Jenny Bryan
89 collaborators in 2 packages:
knitr, tidyxl","Author: Jeremy Ashkenas
85 collaborators in 1 packages:
knitr","Author: Jeremy Stephens
99 collaborators in 3 packages:
knitr, redcapAPI, yaml","Author: John Honaker
85 collaborators in 1 packages:
knitr","Author: Johan Toloe
85 collaborators in 1 packages:
knitr","Author: Kevin K Smith
85 collaborators in 1 packages:
knitr","Author: Kirill Mueller
113 collaborators in 3 packages:
knitr, ProjectTemplate, rticles","Author: Martin Modrák
85 collaborators in 1 packages:
knitr","Author: Michel Kuhlmann
85 collaborators in 1 packages:
knitr","Author: Nick Salkowski
85 collaborators in 1 packages:
85 collaborators in 1 packages:
knitr","Author: Ruaridh Williamson
93 collaborators in 2 packages:
knitr, rio","Author: Scott Kostyshak
85 collaborators in 1 packages:
knitr","Author: Sietse Brouwer
85 collaborators in 1 packages:
knitr","Author: Simon de Bernard
85 collaborators in 1 packages:
knitr","Author: Sylvain Rousseau
85 collaborators in 1 packages:
knitr","Author: Thibaut Assus
85 collaborators in 1 packages:
85 collaborators in 1 packages:
knitr","Author: Tom Torsney-Weir
85 collaborators in 1 packages:
knitr","Author: Trevor Davis
93 collaborators in 2 packages:
knitr, rappdirs","Author: Viktoras Veitas
85 collaborators in 1 packages:
knitr","Author: Zachary Foster
104 collaborators in 5 packages:
knitr, metacoder, taxa, taxize
traits","Author: Marcus Geelnard
16 collaborators in 3 packages:
later, RcppParallel, reticulate","Author: Ajax org B V
12 collaborators in 2 packages:
learnr, shinyAce","Author: Zeno Rocha
8 collaborators in 1 packages:
learnr","Author: Nick Payne
8 collaborators in 1 packages:
learnr","Author: Julie Cameron
8 collaborators in 1 packages:
learnr","Author: Quicken Loans
8 collaborators in 1 packages:
learnr","Author: Mozilla
12 collaborators in 2 packages:
learnr, metricsgraphics","Author: Marion Praz
7 collaborators in 1 packages:
15 collaborators in 1 packages:
markdown","Author: Vicent Marti
15 collaborators in 1 packages:
markdown","Author: Natacha Porte
15 collaborators in 1 packages:
markdown","Author: Charlotte Wickham
18 collaborators in 3 packages:
munsell, repurrrsive, rticles","Author: Association Computing Machinery
44 collaborators in 2 packages:
OpenMx, rticles","Author: Jonathan McPherson
58 collaborators in 3 packages:
packrat, rmarkdown, shiny","Author: Jason Long
7 collaborators in 1 packages:
prettydoc","Author: Renyuan Zou
7 collaborators in 1 packages:
prettydoc","Author: Michael Rose
7 collaborators in 1 packages:
prettydoc","Author: Doug Ashton
15 collaborators in 2 packages:
9 collaborators in 2 packages:
Rcpp, Rcpp11","Author: Gregory Vandenbrouck
7 collaborators in 1 packages:
RcppParallel","Author: Intel
7 collaborators in 1 packages:
RcppParallel","Author: Hakim El Hattab
4 collaborators in 1 packages:
revealjs","Author: Asvin Goel
4 collaborators in 1 packages:
revealjs","Author: Greg Denehy
4 collaborators in 1 packages:
revealjs","Author: Roy Storey
30 collaborators in 1 packages:
rmarkdown","Author: Alexander Farkas
54 collaborators in 2 packages:
rmarkdown, shiny","Author: Scott Jehl
54 collaborators in 2 packages:
rmarkdown, shiny","Author: Greg Franko
30 collaborators in 1 packages:
rmarkdown","Author: W3C
30 collaborators in 1 packages:
rmarkdown","Author: Dave Gandy
55 collaborators in 3 packages:
rmarkdown, shiny, waffle","Author: Ben Sperry
30 collaborators in 1 packages:
rmarkdown","Author: Aidan Lister
30 collaborators in 1 packages:
11 collaborators in 1 packages:
rmdshower","Author: Oleg Jahson
11 collaborators in 1 packages:
rmdshower","Author: Slava Oliyanchuk
11 collaborators in 1 packages:
rmdshower","Author: Roman Komarov
11 collaborators in 1 packages:
rmdshower","Author: Artem Polikarpov
11 collaborators in 1 packages:
rmdshower","Author: Tony Ganch
11 collaborators in 1 packages:
rmdshower","Author: Denis Hananein
11 collaborators in 1 packages:
rmdshower","Author: Michal Malohlava
6 collaborators in 1 packages:
rsparkling","Author: Jakub Hava
6 collaborators in 1 packages:
rsparkling","Author: Gary Ritchie
4 collaborators in 1 packages:
rstudioapi","Author: Journal Statistical
17 collaborators in 1 packages:
rticles","Author: Elsevier
17 collaborators in 1 packages:
rticles","Author: Miao Yu
17 collaborators in 1 packages:
rticles","Author: Stefan Petre
24 collaborators in 1 packages:
shiny","Author: Andrew Rowls
24 collaborators in 1 packages:
shiny","Author: John Fraser
24 collaborators in 1 packages:
shiny","Author: John Gruber
24 collaborators in 1 packages:
shiny","Author: Masayuki Tanaka
11 collaborators in 1 packages:
tfruns","Author: Shaun Bowe
11 collaborators in 1 packages:
tfruns","Author: Materialize
11 collaborators in 1 packages:
tfruns","Author: Yuxi You
11 collaborators in 1 packages:
tfruns","Author: Kevin Decker
11 collaborators in 1 packages:
tfruns","Author: Rodrigo Fernandes
11 collaborators in 1 packages:
tfruns","Author: Yauheni Pakala
11 collaborators in 1 packages:
tfruns","Author: Dave Liepmann
4 collaborators in 1 packages:
CRAN database version
Thu, 31 May 2018, 13:42
Author names with
\"JJ Allaire\"
Package names with
\"Inf\"","style":"font-family:Georgia, Times New Roman, Times, serif;font-size:15px"},"submain":null,"footer":null,"background":"rgba(0, 0, 0, 0)","highlight":{"enabled":true,"hoverNearest":false,"degree":1,"algorithm":"all","hideColor":"rgba(200,200,200,0.5)","labelOnly":true},"collapse":{"enabled":false,"fit":false,"resetHighlight":true,"clusterOptions":null},"legend":{"width":0.2,"useGroups":false,"position":"left","ncol":1,"stepX":100,"stepY":100,"zoom":true,"nodes":{"label":["Authors matching query","Collaborators"],"color":["#4A6FE3","#ECEEFC"]},"nodesToDataframe":true},"tooltipStay":300,"tooltipStyle":"position: fixed;visibility:hidden;padding: 5px;white-space: nowrap;font-family: verdana;font-size:14px;font-color:#000000;background-color: #f5f4ed;-moz-border-radius: 3px;-webkit-border-radius: 3px;border-radius: 3px;border: 1px solid #808074;box-shadow: 3px 3px 10px rgba(0, 0, 0, 0.2);","export":{"type":"png","css":"float:right;","background":"#fff","name":"cranly_network-31-May-2018-JJ Allaire-Inf.png","label":"PNG snapshot"}},"evals":[],"jsHooks":[]}

It is also possible to study individual packages. Here, I plot the very simple dependency tree for the time series package xts. There is a very good argument to be made that the simpler the dependency tree, the more stable and reliable the package.

xts_tree <- build_dependence_tree(package_network, "xts")
plot(xts_tree)

[Interactive dependence-tree visualization (htmlwidget data omitted) for xts 0.10-2: xts (generation 0) depends on zoo 1.8-1 (generation -1), which in turn imports lattice 0.20-35 (generation -2); CRAN database version Thu, 31 May 2018, 13:42.]
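A quick cross-check of that dependency tree is possible without cranly: base R's tools package can list a package's recursive hard dependencies from live CRAN metadata. A minimal sketch, assuming network access (the mirror URL is just an example):

```r
# Recursive hard dependencies (Depends + Imports) of xts from CRAN
# metadata; requires network access to the chosen mirror.
db <- available.packages(repos = "https://cloud.r-project.org")
deps <- tools::package_dependencies("xts", db = db,
                                    which = c("Depends", "Imports"),
                                    recursive = TRUE)
deps$xts  # for xts, a short list: essentially zoo plus a few base packages
```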

As a final example, consider how the package_with() function might be used to search for Bayesian packages by looking for packages with “Bayes” or “MCMC” in their names. I don’t claim that this exhausts the possibilities of cranly, but it should be clear that the package is a very useful tool for looking into the mysteries of CRAN.

Bayesian_packages <- package_with(package_network, name = c("Bayes", "MCMC"))
plot(package_network, package = Bayesian_packages, legend = FALSE)


To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Harry Potter and rankings with comperank

Thu, 05/31/2018 - 02:00

(This article was first published on QuestionFlow , and kindly contributed to R-bloggers)

Ranking Harry Potter books with comperank package.

Prologue

Package comperank is on CRAN now. It offers consistent implementations of several ranking and rating methods. Originally, it was intended to be my first CRAN package when I started building it 13 months ago. Back then I was very curious to learn about the different ranking and rating methods used in sports. This led me to two conclusions:

• There is an amazing book “Who’s #1” by Langville and Meyer which describes several ideas in great detail.
• Although there are some CRAN packages dedicated specifically to ranking methods (for example, elo, mvglmmRank), I didn’t find them to be tidy enough.

These discoveries motivated me to write my first ever CRAN package. Things didn’t turn out the way I had planned, and comperank is now actually my fourth. After spending some time writing it, I realized that most of the package would be about storing and manipulating competition results in consistent ways. That is how comperes was born.

After diverging into creating this site and writing ruler in a pair with keyholder, a few months ago I returned to competition results and rankings. The experience I had gained helped me improve the functional API of both packages, which eventually resulted in submitting them to CRAN.

Overview

This post, like one of the previous ones, has two goals:

We will cover the following topics:

• Short notes about functionality of comperank.
• Exploration ranking, i.e. ranking based on mean book score. No comperank functionality is required for this.
• Rankings with fixed Head-to-Head structure. This will cover Massey and Colley ranking methods.
• Rankings with variable Head-to-Head structure. This will cover Keener, Markov and Offense-Defense ranking methods.
• Combined rankings in which average ranks will be computed using all described comperank methods.

Another very interesting set of ranking methods implemented in comperank are methods of an iterative nature. However, using them with the mentioned Harry Potter Books Survey dataset is meaningless, as the temporal ordering of games (acts of book scoring by one person) should carry meaning, which here it doesn’t.

The idea behind converting survey results into competition results is described in the aforementioned post. We will need the following setup:

library(dplyr)
library(purrr)
library(rlang)
# This will automatically load {comperes}
library(comperank)

# Create competition results from hp_survey
hp_cr <- hp_survey %>%
  transmute(
    game = person,
    player = book,
    score = as.integer(gsub("[^0-9].*$", "", score))
  ) %>%
  as_longcr()

Functionality of comperank

Rating is considered to be a list (in the ordinary sense) of numerical values, one for each player, or the numerical value itself. Its interpretation depends on the rating method: either a bigger value indicates better player performance, or the other way around.

Ranking is considered to be a rank-ordered list (in the ordinary sense) of players: rank 1 indicates the player with the best performance.
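The relation between the two notions can be sketched in a few lines of base R (a toy illustration only, not comperank code; the helper name rating_to_ranking() is made up, and round_rank() in comperank plays a similar role):

```r
# Convert a rating vector into a ranking, handling both directions:
# by default a bigger rating is better and therefore gets a smaller rank.
rating_to_ranking <- function(rating, bigger_is_better = TRUE) {
  if (bigger_is_better) rank(-rating, ties.method = "min")
  else rank(rating, ties.method = "min")
}

rating <- c(HP_1 = 3.91, HP_2 = 3.55, HP_3 = 4.19)
rating_to_ranking(rating)
#> HP_1 HP_2 HP_3
#>    2    3    1
```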

comperank leverages the tidyverse ecosystem of R packages. Among other things, it means that the main output format is tibble.

There are three sets of functions:

• rate_*() (* stands for the ranking method's short name). Its output is a tibble with columns player (player identifier) and at least one rating_* column (rating value). Names of rating columns depend on the rating method.
• rank_*(). Its default output is similar to the previous one, but with ranking_* columns instead of rating ones. It runs rate_*() and does the ranking in the correct direction. One can use the option keep_rating = TRUE to keep rating columns in the output.
• add_*_ratings(). These functions are present only for algorithms of an iterative nature and for competition results with games only between two players. They return a tibble with a row per game and extra columns indicating the ratings of players before and after the game.
Exploration ranking

Previously we established that “Harry Potter and the Prisoner of Azkaban” seems to be “the best” book and “Harry Potter and the Chamber of Secrets” comes last. This was evaluated by mean score:

hp_rank_explore <- hp_cr %>%
  summarise_player(rating_explore = mean(score)) %>%
  # round_rank() is a function from {comperank} package for doing ranking
  mutate(ranking_explore = round_rank(rating_explore))
hp_rank_explore
## # A tibble: 7 x 3
##   player rating_explore ranking_explore
## 1 HP_1             3.91               5
## 2 HP_2             3.55               7
## 3 HP_3             4.19               1
## 4 HP_4             4                  3
## 5 HP_5             3.90               6
## 6 HP_6             4.13               2
## 7 HP_7             3.96               4

As simple as it is, this approach might leave some available information unused. The survey was originally designed to obtain information not only about the books’ performance as separate objects, but also about possible pairwise relationships between them. Maybe some book is generally considered “not the best” but “outperforms” some other, “better” book. This was partially studied in “Harry Potter and competition results with comperes” by computing different Head-to-Head values and studying them manually.

Here we will attempt to summarise the books’ performance based on their Head-to-Head relationships.

Rankings with fixed H2H structure

In comperank there are two methods that operate on a fixed Head-to-Head structure: Massey and Colley. Both of them are designed for competitions where:

• Games are held only between two players.
• Score is assumed to be numeric, with higher values indicating better player performance in a game.

Being very upset for a moment, we realize that in the dataset under study there are games with different numbers of players. Fortunately, the comperes package comes to the rescue: it has the function to_pairgames() just for this situation. It takes competition results as input and returns a completely different (strictly speaking) set of competition results in which “crowded” games are split into smaller ones: games with one player are removed, and games with three or more players are converted to multiple games between all unordered pairs of players. The result is in wide format (as opposed to the long one of hp_cr):

hp_cr_paired <- to_pairgames(hp_cr)

# For example, second game was converted to a set of 10 games
hp_cr %>% filter(game == 2)
## # A longcr object:
## # A tibble: 5 x 3
##   game player score
## 1    2 HP_1       3
## 2    2 HP_4       5
## 3    2 HP_5       2
## 4    2 HP_6       4
## 5    2 HP_7       5

hp_cr_paired %>% slice(2:11)
## # A widecr object:
## # A tibble: 10 x 5
##    game player1 score1 player2 score2
##  1    2 HP_1         3 HP_4         5
##  2    3 HP_1         3 HP_5         2
##  3    4 HP_1         3 HP_6         4
##  4    5 HP_1         3 HP_7         5
##  5    6 HP_4         5 HP_5         2
##  6    7 HP_4         5 HP_6         4
##  7    8 HP_4         5 HP_7         5
##  8    9 HP_5         2 HP_6         4
##  9   10 HP_5         2 HP_7         5
## 10   11 HP_6         4 HP_7         5

Massey method
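The splitting rule itself is simple combinatorics. Here is a language-agnostic sketch of it in Python (an illustration of the rule described above, not the comperes implementation):

```python
from itertools import combinations

# Sketch of the to_pairgames() idea: drop one-player games and split every
# larger game into games between all unordered pairs of its players.
def to_pair_games(game):
    """game: list of (player, score) pairs -> list of two-player games."""
    if len(game) < 2:
        return []          # games with one player are removed
    return [((p1, s1), (p2, s2))
            for (p1, s1), (p2, s2) in combinations(game, 2)]

# The five-player game 2 from the output above
game2 = [("HP_1", 3), ("HP_4", 5), ("HP_5", 2), ("HP_6", 4), ("HP_7", 5)]
pairs = to_pair_games(game2)   # 5 players -> C(5, 2) = 10 pairwise games
```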

The idea of the Massey method is that the difference in ratings should be proportional to the score difference in direct confrontations. A bigger value indicates better player competition performance.

hp_cr_massey <- hp_cr_paired %>% rank_massey(keep_rating = TRUE)
hp_cr_massey
## # A tibble: 7 x 3
##   player rating_massey ranking_massey
## 1 HP_1        -0.00870              5
## 2 HP_2        -0.514                7
## 3 HP_3         0.293                1
## 4 HP_4         0.114                3
## 5 HP_5         0.00195              4
## 6 HP_6         0.124                2
## 7 HP_7        -0.00948              6

Colley method
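To make the “rating differences should match score differences” idea concrete, here is a from-scratch numeric sketch in Python of the standard Massey formulation (an assumed textbook version, not comperank’s code): build the Massey system from pairwise games and solve it with the usual sum-to-zero constraint.

```python
import numpy as np

# Sketch of the Massey method: for every pairwise game, the rating
# difference of the two players should approximate the score difference.
# The normal-equation form is M r = p, which is singular, so the last
# equation is replaced with the constraint sum(r) = 0.
def massey_ratings(games, players):
    idx = {p: k for k, p in enumerate(players)}
    n = len(players)
    M = np.zeros((n, n))
    p = np.zeros(n)
    for (p1, s1), (p2, s2) in games:
        i, j = idx[p1], idx[p2]
        M[i, i] += 1; M[j, j] += 1      # games played
        M[i, j] -= 1; M[j, i] -= 1      # games against each other
        p[i] += s1 - s2                  # cumulative point differential
        p[j] += s2 - s1
    M[-1, :] = 1.0                       # enforce sum(r) = 0
    p[-1] = 0.0
    return dict(zip(players, np.linalg.solve(M, p)))

# Hypothetical toy games: A beats B 3-1, B ties C 2-2, A beats C 1-0
games = [(("A", 3), ("B", 1)), (("B", 2), ("C", 2)), (("A", 1), ("C", 0))]
r = massey_ratings(games, ["A", "B", "C"])
# A won both of its games, so it ends up with the highest rating
```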

The idea of the Colley method is that ratings should be proportional to the share of games the player has won. A bigger value indicates better player performance.

hp_cr_colley <- hp_cr_paired %>% rank_colley(keep_rating = TRUE)
hp_cr_colley
## # A tibble: 7 x 3
##   player rating_colley ranking_colley
## 1 HP_1           0.497              5
## 2 HP_2           0.326              7
## 3 HP_3           0.599              1
## 4 HP_4           0.534              3
## 5 HP_5           0.505              4
## 6 HP_6           0.542              2
## 7 HP_7           0.497              6
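A numeric sketch of the standard Colley formulation may help here too (again an assumed textbook version, not comperank’s code): ratings come from the linear system C r = b, where wins and losses enter the right-hand side and every played game couples the two players’ ratings. A tie contributes nothing to b in this sketch.

```python
import numpy as np

# Sketch of the Colley method: C starts as 2*I (a Laplace-style prior of
# one win and one loss per player), each game adds to the diagonal and
# subtracts on the off-diagonal, and b_i = 1 + (wins_i - losses_i) / 2.
# Resulting ratings stay in (0, 1) and average to 1/2.
def colley_ratings(games, players):
    idx = {p: k for k, p in enumerate(players)}
    n = len(players)
    C = 2.0 * np.eye(n)
    b = np.ones(n)
    for (p1, s1), (p2, s2) in games:
        i, j = idx[p1], idx[p2]
        C[i, i] += 1; C[j, j] += 1
        C[i, j] -= 1; C[j, i] -= 1
        if s1 != s2:                       # a tie changes neither side's b
            w = 0.5 if s1 > s2 else -0.5
            b[i] += w
            b[j] -= w
    return dict(zip(players, np.linalg.solve(C, b)))

# Same hypothetical toy games as before
games = [(("A", 3), ("B", 1)), (("B", 2), ("C", 2)), (("A", 1), ("C", 0))]
r = colley_ratings(games, ["A", "B", "C"])
# A won twice -> rating above 1/2; B and C are symmetric here
```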

Both Massey and Colley give the same result, which differs from the Exploration ranking in treating “HP_5” (“Order of the Phoenix”) and “HP_7” (“Deathly Hallows”) differently: “HP_5” moved up from 6th to 4th place.

Rankings with variable H2H structure

All algorithms with a variable Head-to-Head structure depend on the user supplying a custom Head-to-Head expression for computing the quality of direct confrontations between all pairs of players of interest.

There is much freedom in choosing Head-to-Head structure appropriate for ranking. For example, it can be “number of wins plus half the number of ties” (implemented in h2h_funs[["num_wins2"]] from comperes) or “mean score difference from direct matchups” (h2h_funs[["mean_score_diff"]]). In this post we will use the latter one. Corresponding Head-to-Head matrix looks like this:

hp_h2h <- hp_cr %>%
  h2h_mat(!!! h2h_funs[["mean_score_diff"]]) %>%
  round(digits = 2)

# Value indicates mean score difference between "row-player" and
# "column-player". Positive - "row-player" is better.
hp_h2h
## # A matrix format of Head-to-Head values:
##       HP_1  HP_2  HP_3  HP_4  HP_5  HP_6  HP_7
## HP_1  0.00  0.50 -0.39  0.04  0.00 -0.14 -0.06
## HP_2 -0.50  0.00 -0.77 -0.58 -0.72 -0.62 -0.45
## HP_3  0.39  0.77  0.00  0.05  0.51  0.11  0.25
## HP_4 -0.04  0.58 -0.05  0.00 -0.04  0.09  0.20
## HP_5  0.00  0.72 -0.51  0.04  0.00 -0.17 -0.04
## HP_6  0.14  0.62 -0.11 -0.09  0.17  0.00  0.15
## HP_7  0.06  0.45 -0.25 -0.20  0.04 -0.15  0.00

Keener method

The Keener method is based on the idea of “relative strength” – the strength of a player relative to the strength of the players he/she has played against. It is computed from the provided Head-to-Head values with some flexible algorithmic adjustments that make the method more robust. A bigger value indicates better player performance.

hp_cr_keener <- hp_cr %>%
  rank_keener(!!! h2h_funs["mean_score_diff"], keep_rating = TRUE)
hp_cr_keener
## # A tibble: 7 x 3
##   player rating_keener ranking_keener
## 1 HP_1          0.147               5
## 2 HP_2          0.0816              7
## 3 HP_3          0.191               1
## 4 HP_4          0.150               4
## 5 HP_5          0.153               3
## 6 HP_6          0.155               2
## 7 HP_7          0.122               6
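The core of the “relative strength” idea can be sketched numerically. In the classical Keener formulation the rating vector is the dominant (Perron) eigenvector of a nonnegative Head-to-Head matrix; the sketch below finds it by power iteration and deliberately skips the smoothing and skewing adjustments that make the real method robust (so this is an illustration of the core idea, not comperank’s code).

```python
import numpy as np

# Sketch of the Keener core: each player's strength is a weighted sum of
# opponents' strengths, i.e. r is a fixed point of r ~ A r. Power
# iteration converges to the dominant eigenvector for a nonnegative,
# irreducible A; ratings are normalized to sum to 1.
def keener_ratings(A, n_iter=200):
    A = np.asarray(A, dtype=float)
    r = np.full(A.shape[0], 1.0 / A.shape[0])   # start from uniform strength
    for _ in range(n_iter):
        r = A @ r
        r = r / r.sum()
    return r

# Hypothetical nonnegative H2H matrix, e.g. "share of points scored
# against this opponent" (values here are made up for illustration)
A = np.array([[0.0, 0.8, 0.6],
              [0.2, 0.0, 0.5],
              [0.4, 0.5, 0.0]])
r = keener_ratings(A)
# the first player collects the largest share of "strength" from the others
```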

Results for the Keener method again move “HP_5” one step up, to third place.

Markov method

The main idea of the Markov method is that players “vote” for other players’ performance. Voting is done with Head-to-Head values: the bigger the value, the more “votes” player2 (“column-player”) gives to player1 (“row-player”). For example, if the Head-to-Head value is “number of wins”, then player2 “votes” for player1 proportionally to the number of times player1 won in a matchup with player2.

The actual “voting” is done in Markov chain fashion: Head-to-Head values are organized into a stochastic matrix whose vector of stationary probabilities is declared to be the output ratings. A bigger value indicates better player performance.

hp_cr_markov <- hp_cr %>%
  rank_markov(!!! h2h_funs["mean_score_diff"], keep_rating = TRUE)
hp_cr_markov
## # A tibble: 7 x 3
##   player rating_markov ranking_markov
## 1 HP_1          0.140               5
## 2 HP_2          0.0500              7
## 3 HP_3          0.196               1
## 4 HP_4          0.168               2
## 5 HP_5          0.135               6
## 6 HP_6          0.167               3
## 7 HP_7          0.143               4
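The stationary-probability machinery behind this can be sketched in a few lines. This is one common arrangement of the voting matrix (votes flow from rows to columns here; an assumed illustration, not comperank’s code, which also has to handle negative Head-to-Head values such as mean score differences):

```python
import numpy as np

# Sketch of the Markov-method machinery: nonnegative "votes" are
# normalized into a row-stochastic matrix S, and the stationary
# distribution pi (satisfying pi = pi S) is taken as the rating vector.
def stationary_ratings(votes, n_iter=500):
    V = np.asarray(votes, dtype=float)
    S = V / V.sum(axis=1, keepdims=True)      # each row sums to 1
    pi = np.full(S.shape[0], 1.0 / S.shape[0])
    for _ in range(n_iter):                   # power iteration on pi
        pi = pi @ S
    return pi / pi.sum()

# Hypothetical votes[i][j]: how strongly player i "votes" for player j
votes = [[1.0, 2.0, 4.0],
         [1.0, 1.0, 3.0],
         [1.0, 2.0, 2.0]]
pi = stationary_ratings(votes)
# player 3 receives the most votes, so it gets the largest stationary mass
```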

We can see that the Markov method put “HP_4” (“Goblet of Fire”) in second place. This is due to its reasonably good performance against the leader “HP_3” (“Prisoner of Azkaban”): the mean score difference is only 0.05 in “HP_3”’s favour. Doing well against the leader has a great impact on the output ranking of the Markov method, which somewhat resonates with common sense.

Offense-Defense method

The idea of Offense-Defense (OD) method is to account for different abilities of players by combining different ratings:

• A player who can achieve a high Head-to-Head value (even against a player with strong defense) is said to have a strong offense, which results in a high offensive rating.
• A player who can force opponents into achieving a low Head-to-Head value (even if they have a strong offense) is said to have a strong defense, which results in a low defensive rating.

Offensive and defensive ratings describe different skills of players. To rate players overall, OD ratings are computed as offensive ratings divided by defensive ones. The higher the OD rating, the better the player’s performance.

hp_cr_od <- hp_cr %>%
  rank_od(!!! h2h_funs["mean_score_diff"], keep_rating = TRUE)
print(hp_cr_od, width = Inf)
## # A tibble: 7 x 7
##   player rating_off rating_def rating_od ranking_off ranking_def ranking_od
## 1 HP_1         5.42      1.03      5.29            5           5          5
## 2 HP_2         1.45      1.88      0.771           7           7          7
## 3 HP_3         7.91      0.522    15.1             1           1          1
## 4 HP_4         6.51      0.869     7.49            3           3          3
## 5 HP_5         5.30      0.888     5.97            6           4          4
## 6 HP_6         6.59      0.809     8.14            2           2          2
## 7 HP_7         5.54      1.05      5.29            4           6          6
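The mutual dependence of the two ratings can be sketched as a fixed-point iteration (this follows the classical Offense-Defense formulation as I understand it; the matrix orientation and the plain alternating scheme below are assumptions, not comperank’s code):

```python
import numpy as np

# Sketch of the Offense-Defense iteration. A[i, j] holds the Head-to-Head
# value player j achieved against player i (an assumed orientation).
# Offense and defense ratings refine each other until they stabilize.
def od_ratings(A, n_iter=100):
    A = np.asarray(A, dtype=float)
    d = np.ones(A.shape[0])
    for _ in range(n_iter):
        o = A.T @ (1.0 / d)   # strong offense: scores even on good defenses
        d = A @ (1.0 / o)     # strong defense: a SMALL value is better
    return o / d              # overall OD rating: offense over defense

# Hypothetical toy H2H values (player 3 scores most, concedes least)
A = np.array([[0.0, 2.0, 3.0],
              [1.0, 0.0, 2.0],
              [1.0, 1.0, 0.0]])
r = od_ratings(A)
```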

All methods give almost identical results, again differing only in the ranks of “HP_5” and “HP_7”.

Combined rankings

To obtain averaged, and hopefully less “noisy”, rankings we will combine rankings produced with comperank by computing their mean.

list(hp_cr_massey, hp_cr_colley, hp_cr_keener, hp_cr_markov, hp_cr_od) %>%
  # Extract ranking columns
  map(. %>% select(., player, starts_with("ranking"))) %>%
  # Join all ranking data in one tibble
  reduce(left_join, by = "player") %>%
  # Compute mean ranking
  transmute(player, ranking_combined = rowMeans(select(., -player))) %>%
  # Join exploration rankings for easy comparison
  left_join(y = hp_rank_explore %>% select(-rating_explore), by = "player")
## # A tibble: 7 x 3
##   player ranking_combined ranking_explore
## 1 HP_1               5                 5
## 2 HP_2               7                 7
## 3 HP_3               1                 1
## 4 HP_4               3                 3
## 5 HP_5               4.43              6
## 6 HP_6               2.14              2
## 7 HP_7               5.43              4
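Underneath the dplyr pipeline, the combining step is just a per-player mean of ranks. A toy sketch of that arithmetic (players and rank values below are made up for illustration):

```python
# Combine several method rankings into one by averaging ranks per player.
# Hypothetical rankings from three methods over three players.
rankings = {
    "massey": {"A": 1, "B": 3, "C": 2},
    "colley": {"A": 1, "B": 2, "C": 3},
    "markov": {"A": 2, "B": 3, "C": 1},
}
players = ["A", "B", "C"]
combined = {
    p: sum(r[p] for r in rankings.values()) / len(rankings)
    for p in players
}
# A: (1 + 1 + 2) / 3, B: (3 + 2 + 3) / 3, C: (2 + 3 + 1) / 3
```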

As we can see, although different ranking methods treat books with “middle performance” differently, the combined rankings are only slightly different from the exploration ones. The only notable difference is the switched rankings of “Order of the Phoenix” and “Deathly Hallows”.

Conclusion
• “Harry Potter and the Prisoner of Azkaban” still seems to be considered “best” among R users. And yet “Harry Potter and the Chamber of Secrets” still suffers the opposite fate.
• Using different ranking methods is a powerful tool for analyzing Head-to-Head performance. This can be done in a very straightforward manner with a new addition to CRAN – the comperank package.

sessionInfo()

sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
##
## locale:
##  [1] LC_CTYPE=ru_UA.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=ru_UA.UTF-8        LC_COLLATE=ru_UA.UTF-8
##  [5] LC_MONETARY=ru_UA.UTF-8    LC_MESSAGES=ru_UA.UTF-8
##  [7] LC_PAPER=ru_UA.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base
##
## other attached packages:
## [1] bindrcpp_0.2.2  comperank_0.1.0 comperes_0.2.0  rlang_0.2.0
## [5] purrr_0.2.4     dplyr_0.7.5
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17     knitr_1.20       bindr_0.1.1      magrittr_1.5
##  [5] tidyselect_0.2.4 R6_2.2.2         stringr_1.3.1    tools_3.4.4
##  [9] xfun_0.1         utf8_1.1.3       cli_1.0.0        htmltools_0.3.6
## [13] yaml_2.1.19      rprojroot_1.3-2  digest_0.6.15    assertthat_0.2.0
## [17] tibble_1.4.2     crayon_1.3.4     bookdown_0.7     tidyr_0.8.1
## [21] glue_1.2.0       evaluate_0.10.1  rmarkdown_1.9    blogdown_0.6
## [25] stringi_1.2.2    compiler_3.4.4   pillar_1.2.2     backports_1.1.2
## [29] pkgconfig_2.0.1

To leave a comment for the author, please follow the link and comment on their blog: QuestionFlow.