### Using clusterlab to benchmark clustering algorithms

(This article was first published on ** R – intobioinformatics**, and kindly contributed to R-bloggers)

Clusterlab is a CRAN package (https://cran.r-project.org/web/packages/clusterlab/index.html) for the routine testing of clustering algorithms. It can simulate positive (data-sets with >1 clusters) and negative controls (data-sets with 1 cluster). Why test clustering algorithms? Because they often fail in identifying the true K in practice, published algorithms are not always well tested, and we need to know about ones that have strange behaviour. I’ve found in many own experiments on clustering algorithms that algorithms many people are using are not necessary ones that provide the most sensible results. I can give a good example below.

I was interested to see clusterExperiment, a relatively new clustering package on the Bioconductor, cannot detect the ground truth K in my testing so far. Instead yielding solutions with many more clusters than there are in reality. On the other hand, the package I developed with David Watson here at QMUL, does work rather well. It is a derivative of the Monti et al. (2003) consensus clustering algorithm, fitted with a Monte Carlo reference procedure to eliminate overfitting. We called this algorithm M3C.

library(clusterExperiment) library(clusterlab) library(M3C)**Experiment 1: Simulate a positive control dataset (K=5)**

**Experiment 2: Test RSEC (https://bioconductor.org/packages/release/bioc/html/clusterExperiment.html)
**

**Experiment 3: Test M3C (https://bioconductor.org/packages/release/bioc/html/M3C.html)
**

Interesting isn't it R readers.

Well all I can say is I recommend comparing different machine learning methods and if in doubt, run your own control data-sets (positive and negative controls) to test various methods. In the other post we showed a remarkable bias in optimism corrected bootstrapping compared with LOOCV under certain conditions, simply by simulating null data-sets and passing them into the method.

**If you are clustering omic’ data from a single platform (RNAseq, protein, methylation, etc) I have tested and recommend the following packages:**

CLEST: https://www.rdocumentation.org/packages/RSKC/versions/2.4.2/topics/Clest

M3C: https://bioconductor.org/packages/release/bioc/html/M3C.html

And to be honest, that is about it. I have also tested PINSplus, GAP-statistic, and SNF, but they did not provide satisfactory results in my experiments on single platform clustering (currently unpublished data). Multi-omic and single cell RNA-seq analysis is another story, there will be more on that to come in the future R readers.

Remember there is a darkness out there R readers, not just in Washington, but in the world of statistics. It is there because of the positive results bias in science, because of people not checking methods and comparing them with one another, and because of several other reasons I can’t even be bothered to mention.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – intobioinformatics**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Selecting ‘special’ photos on your phone

(This article was first published on ** R – Longhow Lam's Blog**, and kindly contributed to R-bloggers)

At the beginning of the new year I always want to clean up my photos on my phone. It just never happens.

So now (like so many others I think) I have a lot of photos on my phone from the last 3.5 years. The iPhone photos app helps you a bit to go through your photos. But which ones are really special and you definitely want to keep?

Well, just apply some machine learning.

- I run all my photos through a VGG16 deep learning model to generate high dimensional features per photo (on my laptop without GPU this takes about 15 minutes for 2000 photos).
- The dimension is 25.088, which is difficult to visualize. I apply a UMAP dimension reduction to bring it back to 3.
- In R you can create an interactive 3D plot with plotly where each point corresponds to a photo. Using crosstalk, you can link it to the corresponding image. The photo appears when you hover over the point.

Well a “special” outlying picture in the scatter plot are my two children with a dog during a holiday a few years ago. I had never found it that fast. There are some other notable things that I can see, but I won’t bother you with it here

Link GitHub repo with to two R scripts to cluster your own photos. Cheers, Longhow

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Longhow Lam's Blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Mango Solutions contributes to technology partners RStudio conference

(This article was first published on ** RBlog – Mango Solutions**, and kindly contributed to R-bloggers)

As leading advanced analytics partner for RStudio, Mango Solutions are delighted to be contributing to the upcoming rstudio::conf programme with a workshop and a talk.

Two of Mango’s senior consultants, Aimée Gott, Education Practice Lead and Mark Sellors, Head of Data Engineering will be sharing their R expertise with delegates.

Aimée Gott will be delivering the Intermediate course on Shiny to the RStudio delegates. Used to training the world over, Mango has successfully delivered R and Python training courses, upskilling data science teams giving them the essential skills to enable delivery of data-driven insight. Mark Sellors, Mango’s Head of Data Engineering will be delivering a presentation on ‘R in Production’, as part of the conference programme.

Mark Sellors has years of expertise in Data Engineering and works closely with RStudio to install and configure RStudio’s commercial products. Mango’s relationship with RStudio forms part of their professional services partnership network in data science with R, easing the development environment and ensuring an effective adoption by Data Science teams.

The strategic partnership between Mango and RStudio is based on supporting the functionality and use of the R language and has been in existence for over 10 years. Matt Aldridge, Mango Solutions CEO said of the relationship, ‘it has been pivotal in proving the effectiveness of R as a corporate analytic tool and we are proud at Mango to work alongside their talented and committed team ‘.

Dave Hurst, Director of Business Development at RStudio said ‘We are pleased to have both Aimée and Mark contributing at rstudio::conf (2019). Mango Solutions is an RStudio Full Service Certified Partner, so we know first-hand the depth of expertise and practical experience that Mango offers to our mutual customers who are looking to successfully scale R with RStudio products in production.‘

Mark’s talk on R in production will focus on the increasing trend for data science models and other analytic code and in particular those associated with the application of R. Mark will look at some of the misinformation around the idea of what ‘putting something into production’ actually means as well as provide tips on overcoming the obstacles.

Shiny is an open-source R package that provides an elegant and powerful web framework for building web applications using R. The Shiny training course as part of the conference, has been designed by Shiny author Joe Cheng has been aimed at the experienced Shiny developer. The workshop will improve delegates’ understanding of shiny’s foundations and help them make the most of reactive programming techniques for extending and improving UI and debugging and tools for modularising applications. By the end of the two days, delegates will be able to push the envelope of what a Data Scientist and organisations and can do with Shiny.

Together, RStudio and Mango work with some of the world’s largest companies and are committed to supporting the growth and adoption of R for advanced analytics.

We hope to see you there!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** RBlog – Mango Solutions**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Neural Text Modelling with R package ruimtehol

(This article was first published on ** bnosac :: open analytical helpers - bnosac :: open analytical helpers**, and kindly contributed to R-bloggers)

Last week the R package ruimtehol was released on CRAN (https://github.com/bnosac/ruimtehol) allowing R users to easily build and apply neural embedding models on text data.

It wraps the ‘StarSpace’ library ">https://github.com/facebookresearch/StarSpace allowing users to calculate** word, sentence, article, document, webpage, link and entity ’embeddings’. By using the ’embeddings’, you can perform text based multi-label classification, find similarities between texts and categories, do collaborative-filtering based recommendation as well as content-based recommendation, find out relations between entities, calculate graph ’embeddings’ as well as perform semi-supervised learning and multi-task learning on plain text**. The techniques are explained in detail in the paper: ‘StarSpace: Embed All The Things!’ by Wu et al. (2017), available at https://arxiv.org/abs/1709.03856.

You can get started with some common text analytical use cases by using the presentation we have built below. Enjoy!

{aridoc engine=”pdfjs” width=”100%” height=”550″}images/bnosac/blog/R_TextMining_Starspace.pdf{/aridoc}

If you like it, give it a star at https://github.com/bnosac/ruimtehol and if you need commercial support on text mining, get in touch.

**Upcoming training schedule**

Note also that you might be interested in the following courses held in Belgium

- 21-22/02/2018: Advanced R programming. Leuven (Belgium). Subscribe here
- 13-14/03/2018: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
- 15/03/2019: Image Recognition with R and Python: Subscribe here
**01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here**

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To **leave a comment** for the author, please follow the link and comment on their blog: ** bnosac :: open analytical helpers - bnosac :: open analytical helpers**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Understanding the Magic of Neural Networks

(This article was first published on ** R-Bloggers – Learning Machines**, and kindly contributed to R-bloggers)

Everything “neural” is (again) the latest craze in machine learning and artificial intelligence. Now what is the magic here?

Let us dive directly into a (supposedly little silly) example: we have three protagonists in the fairy tail little red riding hood, the wolf, the grandmother and the woodcutter. They all have certain qualities and little red riding hood reacts in certain ways towards them. For example the grandmother has big eyes, is kindly and wrinkled – little red riding hood will approach her, kiss her on the cheek and offer her food (the behavior “flirt with” towards the woodcutter is a little sexist but we kept it to reproduce the original example from Jones, W. & Hoskins, J.: Back-Propagation, Byte, 1987). We will build and train a neural network which gets the qualities as inputs and little red riding wood’s behaviour as output, i.e. we train it to learn the adequate behaviour for each quality.

Have a look at the following code and its output including the resulting plot:

library(neuralnet) library(NeuralNetTools) # code qualities and actions qualities <- matrix (c(1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1), byrow = TRUE, nrow = 3) colnames(qualities) <- c("big_ears", "big_eyes", "big_teeth", "kindly", "wrinkled", "handsome") rownames(qualities) <- c("wolf", "grannie", "woodcutter") qualities ## big_ears big_eyes big_teeth kindly wrinkled handsome ## wolf 1 1 1 0 0 0 ## grannie 0 1 0 1 1 0 ## woodcutter 1 0 0 1 0 1 actions <- matrix (c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1), byrow = TRUE, nrow = 3) colnames(actions) <- c("run_away", "scream", "look_for_woodcutter", "kiss_on_cheek", "approach", "offer_food", "flirt_with") rownames(actions) <- rownames(qualities) actions ## run_away scream look_for_woodcutter kiss_on_cheek approach offer_food flirt_with ## wolf 1 1 1 0 0 0 0 ## grannie 0 0 0 1 1 1 0 ## woodcutter 0 0 0 1 0 1 1 data <- cbind(qualities, actions) # train the neural network (NN) set.seed(123) # for reproducibility neuralnetwork <- neuralnet(run_away + scream+look_for_woodcutter + kiss_on_cheek + approach + offer_food + flirt_with ~ big_ears + big_eyes + big_teeth + kindly + wrinkled + handsome, data = data, hidden = 3, exclude = c(1, 8, 15, 22, 26, 30, 34, 38, 42, 46), lifesign = "minimal", linear.output = FALSE) ## hidden: 3 thresh: 0.01 rep: 1/1 steps: 48 error: 0.01319 time: 0.01 secs # plot the NN par_bkp <- par(mar = c(0, 0, 0, 0)) # set different margin to minimize cutoff text plotnet(neuralnetwork, bias = FALSE) par(par_bkp) # predict actions round(compute(neuralnetwork, qualities)$net.result) ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] ## wolf 1 1 1 0 0 0 0 ## grannie 0 0 0 1 1 1 0 ## woodcutter 0 0 0 1 0 1 1First the qualities and the actions are coded as binary variables in a data frame. After that the neural network is being trained with the qualities as input and the resulting behaviour as output (using the standard formula syntax). In the neuralnet function a few additional technical arguments are set which details won’t concern us here, they just simplify the process in this context). Then we plot the learned net and test it by providing it with the respective qualities: in all three cases it predicts the right actions. How did it learn those?

Let us look at the plot of the net. We see that there are two basic building blocks: neurons and weighted connections between them. We have one neuron for each quality and one neuron for each action. Between both layers we have a so called hidden layer with three neurons in this case. The learned strength between the neurons is shown by the thickness of the lines (whereby ‘black’ means positive and ‘grey’ negative weights). Please have a thorough look at those weights.

You might have noticed that although the net didn’t know anything about the three protagonists in our little story it nevertheless correctly built a representation of them: ‘H1’ (for Hidden 1) represents the wolf because its differentiating quality is ‘big teeth’ which leads to ‘run away’, ‘scream’ and ‘look for woodcutter’, by the same logic ‘H2’ is the woodcutter and ‘H3’ is the grandmother. So the net literally learned to connect the qualities with respective actions of little red riding hood by creating a representation of the three protagonists!

So an artificial neural network is obviously a network of neurons… so let us have a look at those neurons! Basically they are mathematical abstractions of real neurons in your brain. They consist of inputs and an output. The biologically inspired idea is that when the activation of the inputs surpasses a certain threshold the neuron fires. To be able to learn the neuron must, before summing up the inputs, adjust the inputs so that the output is not just arbitrary but matches some sensible result. What is ‘sensible’ you might ask. In a biological environment the answer is not always so clear cut but in our simple example here the neuron has just to match the output we provide it with (= supervised learning).

The following abstraction has all we need, inputs, weights, the sum function, a threshold after that and finally the output of the neuron:

Simple artificial neuronLet us talk a little bit about what is going on here intuitively. First every input is taken, multiplied by its weight and all of this is summed up. Some of you might recognize this mathematical operation as a scalar product (also called dot product). Another mathematical definition of a scalar product is the following:

That is we multiply the length of two vectors by the cosine of the angle of those two vectors. What has cosine to do with it? The cosine of an angle becomes one when both vectors point into the same direction, it becomes zero when they are orthogonal and minus one when both point into opposite directions. Does this make sense? Well, I give you a litte (albeit crude) parable. When growing up there are basically three stages: first you are totally dependent on your parents, then comes puberty and you are against whatever they say or think and after some years you are truly independent (some never reach that stage…). What does Independent mean here? It means that you agree with some of the things your parents say and think and you disagree with some other things. During puberty you are as dependent on your parents as during being a toddler – you just don’t realize that but in reality you, so to speak, only multiply everything your parents say or think times minus one!

What is the connection with cosine? Well, as a toddler both you and your parents tend to be aligned which gives one, during puberty both of you are aligned but in opposing directions which gives minus one and only as a grown up you are both independent which mathematically means that your vector in a way points in both directions at the same time which is only possible when it is orthogonal on the vector of your parents (you entered a new dimension, literally) – and that gives zero for the cosine.

So cosine is nothing but a measure of dependence – as is correlation by the way. So this setup ensures that the neuron learns the dependence (or correlation) structure between the inputs and the output! The step function is just a way to help it to decide on which side of the fence it wants to sit, to make the decision clearer whether to fire or not. To sum it up, an artificial neuron is a non-linear function (in this case a step function) on a scalar product of the inputs (fixed) and the weights (adaptable to be able to learn). By adapting the weights the neuron learns the dependence structure between inputs and output.

In R you code this idea of an artificial neuron as follows:

neuron <- function(input) ifelse(weights %*% input > 0, 1, 0)Now let us use this idea in R by training an artificial neuron to classify points in a plane. Have a look at the following table:

Input 1 Input 2 Output 1 0 0 0 0 1 1 1 0 0 1 1If you plot those points with the colour coded pattern you get the following picture:

The task for the neuron is to find a separating line and thereby classify the two groups. Have a look at the following code:

# inspired by Kubat: An Introduction to Machine Learning, p. 72 plot_line <- function(w, col = "blue", add = TRUE) curve(-w[1] / w[2] * x - w[3] / w[2], xlim = c(-0.5, 1.5), ylim = c(-0.5, 1.5), col = col, lwd = 3, xlab = "Input 1", ylab = "Input 2", add = add) neuron <- function(input) as.vector(ifelse(input %*% weights > 0, 1, 0)) # step function on scalar product of weights and input eta <- 0.5 # learning rate # examples input <- matrix(c(1, 0, 0, 0, 1, 1, 0, 1), ncol = 2, byrow = TRUE) input <- cbind(input, 1) # bias for intercept of line output <- c(0, 1, 0, 1) weights <- c(0.25, 0.2, 0.35) # random initial weights plot_line(weights, add = FALSE); grid() points(input[ , 1:2], pch = 16, col = (output + 2)) # training of weights of neuron for (example in 1:length(output)) { weights <- weights + eta * (output[example] - neuron(input[example, ])) * input[example, ] plot_line(weights) } plot_line(weights, col = "black") # test: applying neuron on input apply(input, 1, neuron) ## [1] 0 1 0 1As you can see the result matches the desired output, graphically the black line is the end result and as you can see it separates the green from the red points: the neuron has learned this simple classification task. The blue lines are where the neuron starts from and where it is during training – they are not able to classify the points correctly.

The training, i.e. adapting the weights, takes places in this line:

weights <- weights + eta * (output[example] - neuron(input[example, ])) * input[example, ]The idea is to compare the current output of the neuron with the wanted output, scale that by some learning factor (eta) and modify the weights accordingly. So if the output is too big make the weights smaller and vice versa. Do this for all examples (sometimes you need another loop to train the neuron with the examples several times) and that’s it. That is the core idea behind the ongoing revolution of neural networks!

Ok, so far we had a closer look at one part of neural networks, namely the neurons, let us now turn to the network structure (also called network topology). First, why do we need a whole network anyway when the neurons are already able to solve classification tasks? The answer is that they can do that only for very simple problems. For example the neuron above can only distinguish between linearly separable points, i.e. it can only draw lines. It fails in case of the simple problem of four points that are coloured green, red, red, green from top left to bottom right (try it yourself). We would need a non-linear function to separate the points. We have to combine several neurons to solve more complicated problems.

The biggest problem you have to overcome when you combine several neurons is how to adapt all the weights. You need a system how to attribute the error at the output layer to all the weights in the net. This had been a profound obstacle until an algorithm called backpropagation (also abbreviated backprop) was invented (or found). We won’t get into the details here but the general idea is to work backwards from the output layers through all of the hidden layers till one reaches the input layer and modify the weights according to their respective contribution to the resulting error. This is done several (sometimes millions of times) for all training examples until one achieves an acceptable error rate for the training data.

The result is that you get several layers of abstraction, so when you e.g. want to train a neural network to recognize certain faces you start with the raw input data in the form of pixels, these are automatically combined into abstract geometrical structures, after that the net detects certain elements of faces, like eyes and noses, and finally abstractions of certain faces are being rebuilt by the net. See the following picture (from nivdul.wordpress.com) for an illustration:

So far we have only coded very small examples of a neural networks. Real-world examples often have dozens of layers with thousands of neurons so that much more complicated patterns can be learned. The more layers there are the ‘deeper’ a net becomes… which is the reason why the current revolution in this field is called “deep learning” because there are so many hidden layers involved. Let us now look at a more realistic example: predicting whether a breast cell is malignant or benign.

Have a look at the following code:

library(OneR) data(breastcancer) data <- breastcancer colnames(data) <- make.names(colnames(data)) data$Class <- as.integer(as.numeric(data$Class) - 1) # for compatibility with neuralnet data <- na.omit(data) # Divide training (80%) and test set (20%) set.seed(12) # for reproducibility random <- sample(1:nrow(data), 0.8 * nrow(data)) data_train <- data[random, ] data_test <- data[-random, ] # Train NN on training set f <- reformulate(setdiff(colnames(data), "Class"), response = "Class") # for compatibility with neuralnet (instead of Class ~.) model_train <- neuralnet(f, data = data_train, hidden = c(9, 9), lifesign = "minimal") ## hidden: 9, 9 thresh: 0.01 rep: 1/1 steps: 3784 error: 0.00524 time: 3.13 secs # Plot net plot(model_train, rep = "best") # Use trained model to predict test set prediction <- round(compute(model_train, data_test[ , -10])$net.result) eval_model(prediction, data_test) ## ## Confusion matrix (absolute): ## Actual ## Prediction 0 1 Sum ## 0 93 2 95 ## 1 4 38 42 ## Sum 97 40 137 ## ## Confusion matrix (relative): ## Actual ## Prediction 0 1 Sum ## 0 0.68 0.01 0.69 ## 1 0.03 0.28 0.31 ## Sum 0.71 0.29 1.00 ## ## Accuracy: ## 0.9562 (131/137) ## ## Error rate: ## 0.0438 (6/137) ## ## Error rate reduction (vs. base rate): ## 0.85 (p-value = 0.0000000000001297637)So you see that a relatively simple net achieves an accuracy of about 95% out of sample. The code itself should be mostly self-explanatory. For the actual training the neuralnet function from the package with the same name is being used. The input method is the R formula interface with the twist that the normally used shortcut ~. (for all variables except the given dependent one) isn’t supported and a workaround has to used (which makes part of the code a little clumsy). Also, a little bit unconventional is the name of the predict function, it is called compute. And you have to make sure that the input only consists of the variables you used for building the model, otherwise you will get an error.

When you look at the net one thing might strike you as odd: there are three neurons at the top with a fixed value of 1. These are so called bias neurons and they serve a similar purpose as the intercept in a linear regression: they kind of shift the model as a whole in n-dimensional feature space just as a regression line is being shifted by the intercept. In case you were attentive we also smuggled in a bias neuron in the above example of a single neuron: it is the last column of the input matrix which contains only ones.

Another thing: as can even be seen in this simple example it is very hard to find out what a neural network has actually learned – the following well-known anecdote (urban legend?) shall serve as a warning: some time ago the military built a system which had the aim to distinguish military vehicles from civilian ones. They chose a neural network approach and trained the system with pictures of tanks, humvees and missile launchers on the one hand and normal cars, pickups and lorries on the other. After having reached a satisfactory accuracy they brought the system into the field (quite literally). It failed completely, performing no better than a coin toss. What had happened? No one knew, so they re-engineered the black box (no small feat in itself) and found that most of the military pics where taken at dusk or dawn and most civilian pics under brighter weather conditions. The neural net had learned the difference between light and dark!

Just for comparison the same example with the OneR package:

data(breastcancer) data <- breastcancer # Divide training (80%) and test set (20%) set.seed(12) # for reproducibility random <- sample(1:nrow(data), 0.8 * nrow(data)) data_train <- optbin(data[random, ], method = "infogain") ## Warning in optbin.data.frame(data[random, ], method = "infogain"): 12 ## instance(s) removed due to missing values data_test <- data[-random, ] # Train OneR model on training set model_train <- OneR(data_train, verbose = TRUE) ## ## Attribute Accuracy ## 1 * Uniformity of Cell Size 92.32% ## 2 Uniformity of Cell Shape 91.59% ## 3 Bare Nuclei 90.68% ## 4 Bland Chromatin 90.31% ## 5 Normal Nucleoli 90.13% ## 6 Single Epithelial Cell Size 89.4% ## 7 Marginal Adhesion 85.92% ## 8 Clump Thickness 84.28% ## 9 Mitoses 78.24% ## --- ## Chosen attribute due to accuracy ## and ties method (if applicable): '*' # Show model and diagnostics summary(model_train) ## ## Call: ## OneR.data.frame(x = data_train, verbose = TRUE) ## ## Rules: ## If Uniformity of Cell Size = (0.991,2] then Class = benign ## If Uniformity of Cell Size = (2,10] then Class = malignant ## ## Accuracy: ## 505 of 547 instances classified correctly (92.32%) ## ## Contingency table: ## Uniformity of Cell Size ## Class (0.991,2] (2,10] Sum ## benign * 318 30 348 ## malignant 12 * 187 199 ## Sum 330 217 547 ## --- ## Maximum in each column: '*' ## ## Pearson's Chi-squared test: ## X-squared = 381.78243, df = 1, p-value < 0.00000000000000022204 # Plot model diagnostics plot(model_train) # Use trained model to predict test set prediction <- predict(model_train, data_test) # Evaluate model performance on test set eval_model(prediction, data_test) ## ## Confusion matrix (absolute): ## Actual ## Prediction benign malignant Sum ## benign 92 0 92 ## malignant 8 40 48 ## Sum 100 40 140 ## ## Confusion matrix (relative): ## Actual ## Prediction benign malignant Sum ## benign 0.66 0.00 0.66 ## malignant 0.06 0.29 0.34 ## Sum 0.71 0.29 1.00 ## ## Accuracy: ## 0.9429 (132/140) ## ## Error rate: ## 0.0571 (8/140) ## ## Error rate reduction (vs. base rate): ## 0.8 (p-value = 0.000000000007992571)As you can see the accuracy is only slightly worse but you have full interpretability of the model… and you would only need to measure one value (“Uniformity of Cell Size”) instead of 9 to get a prediction!

On the other hand making neural networks interpretable is one of the big research challenges at the moment.

To end this rather long post: there is a real revolution going on at the moment with all kinds of powerful neural networks. Especially promising is a combination of reinforcement learning (the topic of an upcoming post) and neural networks, where the reinforcement learning algorithm uses a neural network as its memory. For example the revolutionary AlphaGo Zero is built this way: it just received the rules of Go, one of the most demanding strategy games humanity has ever invented, and grew superhuman strength after just three days! The highest human rank in Go has an ELO value of 2940 – AlphaGo Zero achieves 5185! Even the best players don’t stand a chance against this monster of a machine. The neural network technology that is used for AlphaGo Zero and many other deep neural networks is called Tensorflow, which can also easily be integrated into the R environment. To find out more go here: https://tensorflow.rstudio.com/

In this whole area there are many mind-blowing projects underway, so stay tuned!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R-Bloggers – Learning Machines**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Scaling H2O analytics with AWS and p(f)urrr (Part 2)

(This article was first published on ** Digital Age Economist on Digital Age Economist**, and kindly contributed to R-bloggers)

This is the second installment in a three part series on integrating H2O, AWS and p(f)urrr. In Part II, I will showcase how we can combine purrr and h2o to train and stack ML models. In the first post we looked at starting up an AMI on AWS which acts as the infrastructure upon which we will model.

Part one of the post can be found here

The packages we will use in this post:

- library(tidyverse)
- library(h2oEnsemble)
- library(caret)
- library(mlbench)

In this post I will be using the r3.4xlarge machine we started up in Part I of this series. The code below starts up a H2O instance that uses all available cores and limits the instance’s memory to 100GB.

h2o.init(nthreads = -1, max_mem_size = '100g')

We can see that once the server is up, we have 88.89 GB RAM and 16 Cores available. Also take note of the very handy logging file: /tmp/RtmpKLRj1u/h2o_rstudio_started_from_r.out.

We will use the Higgs data to experiment on:

Using simulated data with features characterizing events detected by ATLAS, your task is to classify events into “tau decay of a Higgs boson” versus “background.”

As per all machine learning application, we need to split our data. We will use the h2o.splitFrame function to split the data into training, validation and test sets. What is very important to notice in the function is the destination_frame parameter as the datasets are being transformed to .hex format which is the required format for the H2O server.

train <- read_csv("https://s3.amazonaws.com/erin-data/higgs/higgs_train_5k.csv") test <- read_csv("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv") higgs <- rbind(train, test) %>% mutate(response = as.factor(response)) set.seed(107) splits <- h2o.splitFrame( data = as.h2o(higgs, destination_frame = "higgs.hex"), ratios = c(0.6, 0.2), ## only need to specify 2 fractions, the 3rd is implied destination_frames = c("train.hex", "valid.hex", "test.hex"), seed = 1234 ) training <- splits[[1]] valid <- splits[[2]] testing <- splits[[3]]The split frames are now of the correct class H2OFrame. Next we need to specify the outcome and predictor columns:

Y <- 'response' X <- setdiff(names(training), Y)To test the basic functionality of H2O, lets train a gbm to predict signal against background noise for the Higgs dataset.

higgs.gbm <- h2o.gbm(y = Y, x = X, training_frame = training, ntrees = 5, max_depth = 3, min_rows = 2, learn_rate = 0.2, distribution = "AUTO") higgs.gbm

This very basic specification of the model, tuning = {ntrees = 5, max_depth = 3, learning_rate = 0.2}, gives us an AUC of ~77%. Lets see how the model does on the validation set:

h2o.performance(higgs.gbm, valid)

From the validation set we can see that the models don’t perform too differently. It correctly predicts signal vs background with about 35.5% prediction error and AUC of 73%. For the rest of this post we will not be concerning ourselves with models performance, but rather focus on how we can build nice pipelines using H2O and purrr.

Having a basic understanding of the H2O api, lets ramp it up to train multiple models using purrr. The first step is to create a function that contains all the H2O models we want to use in our modeling design. Our wrapper should contain basic parameters: training_frame, validation_frame, nfolds and Y/X variable names.

A very important feature in our new wrapper you should notice is fold_assignment = "Modulo" and keep_cross_validation_predictions = TRUE parameters. We need these as we want to stack the models at a later stage to check if it improves performance1.

h2o.models <- function(df, validation, nfolds, model, Y, X){ if(model == "gbm"){ return(h2o.gbm(y = Y, x = X, training_frame = df, validation_frame = validation, distribution = "bernoulli", nfolds = nfolds, fold_assignment = "Modulo", keep_cross_validation_predictions = TRUE)) } if(model == "rf"){ return(h2o.randomForest(y = Y, x = X, training_frame = df, validation_frame = validation, distribution = "bernoulli", nfolds = nfolds, fold_assignment = "Modulo", keep_cross_validation_predictions = TRUE)) } if(model == "xgboost"){ return(h2o.xgboost(y = Y, x = X, training_frame = df, validation_frame = validation, distribution = "bernoulli", nfolds = nfolds, fold_assignment = "Modulo", keep_cross_validation_predictions = TRUE)) } if(model == "deeplearning"){ return(h2o.deeplearning(y = Y, x = X, training_frame = df, validation_frame = validation, distribution = "bernoulli", nfolds = nfolds, fold_assignment = "Modulo", keep_cross_validation_predictions = TRUE)) } }This gives a nice wrapper for use with purrr’s map function. Lets create a tibble that has the necessary layout so we can apply our new function across the dataset. You do not necessarily need to add the training, validation and testing frames into the tibble, but it helps in the long-run, especially if they start differing somehow2. It also helps for auditing the models:

trained_ML <- tibble(Models = c("gbm", "rf", "deeplearning", "xgboost")) %>% mutate( training = rep(list(training), nrow(.)), valid = rep(list(valid), nrow(.)) ) trained_ML

Here we see that we have a special column, `Models’, specifically naming the algorithm assigned to a training and test set.

Lets now use the h2o.models function by training all the different models and evaluating their performance.

trained_ML <- tibble(Models = c("gbm", "rf", "deeplearning", "xgboost")) %>% mutate( training = rep(list(training), nrow(.)), valid = rep(list(valid), nrow(.)) ) %>% # pmap to apply the h2o functional across each row mutate(trained_models = pmap(list(training, valid, Models), ~h2o.models(..1, ..2, ..3, nfolds = 5, Y, X))) %>% # once trained, predict against the test set with respective model gbm etc mutate(predictions = map(trained_models, h2o.predict, testing)) %>% # create new column containing in-sample confusionMatrix mutate(confusion = map(trained_models, h2o.confusionMatrix)) %>% # create new column containing test set performance metrics mutate(performance = map2(trained_models, valid, ~h2o.performance(.x, .y))) trained_ML

As a last measure, we might not want to have the performance metrics nested in the tibble frame. So to lift the measures out of the performance list, we can build our own function to extract the performance metrics we might need in evaluating each of the model’s performance:

f_perf <- function(performance){ tibble(errors = tail(performance@metrics$cm$table$Error, 1), logloss = performance@metrics$logloss, rmse = performance@metrics$RMSE, r_sqaured = performance@metrics$r2, pr_auc = performance@metrics$pr_auc, gini = performance@metrics$Gini) } trained_ML <- trained_ML %>% bind_cols(map_df(.$performance, f_perf))

This new tibble gives us a very good idea on which model performs the best, if we base it off the logloss measure of fit: gbm. From here we could build a wrapper around different gbm models and use the same framework to train multiple different types of gbm specified models if we wanted to do so.

The last bit of performance improvement we attempt in any model process is stacking. Model stacking is the idea that using the predictions of the base learners (in our case the rf, gbm etc), and then using a super learner or meta learner that takes as input, the base learners predictions/CV results and runs a model across those results as the final prediction. There are some interesting papers which talk to this idea, one by Leo Breiman3 himself or another one by van der Laan, Polley, and Hubbard 2007 that statistically proves why these *Super Learners* work.

As my metalearner in this example I will use a glm:

models <- trained_ML %>% pull(trained_models) metalearner <- c("h2o.glm.wrapper") stack <- h2o.stack(models = models, response_frame = training[,Y], metalearner = metalearner, seed = 1, keep_levelone_data = TRUE) # Compute test set performance: perf <- h2o.ensemble_performance(stack, newdata = testing)$ensemble stack_df <- tibble(Models = "stack", trained_models = list(stack), performance = list(perf)) %>% bind_cols(map_df(.$performance, f_perf)) bind_rows(trained_ML,stack_df)

Our stack of the models do not seem to improve the logloss performance measure in this specific case. That being said, we haven’t spent a lot of time trying to tune any of the models.

The last bit of code we need to finish this pipeline is to shutdown the H2O server:

h2o.ls() h2o.removeAll() h2o.shutdown() ConclusionAs you can see, combining the Many-models idea with the engineering of H2O and AWS, gives us a very nice framework with which we can run a crazy amount of experiments at a low cost and within a short time period. In writing this blog, I spent a total of 2 hours and 40 minutes running the models multiple times, getting the stack to work and overall just reading some H2O documentation. While all this was happening our r3.4xlarge – 16 core, 100GB RAM machine was running, resulting in a total cost for this blog at around $0.42, give or take. Which I think isn’t too bad for the resources that we used.

In the last and final part of the series, we will explore how to initialize a H2O AWS server using furrr and a basic script. This means that you can essentially write the code on your home setup, use boto3 from python to spin up an EC2 instance, conduct the analysis and only return to your local environment the most important results.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** Digital Age Economist on Digital Age Economist**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### My presentations on ‘Elements of Neural Networks & Deep Learning’ -Parts 4,5

(This article was first published on ** R – Giga thoughts …**, and kindly contributed to R-bloggers)

This is the next set of presentations on “Elements of Neural Networks and Deep Learning”. In the 4th presentation I discuss and derive the generalized equations for a multi-unit, multi-layer Deep Learning network. The 5th presentation derives the equations for a Deep Learning network when performing multi-class classification along with the derivations for cross-entropy loss. The corresponding implementations are available in vectorized R, Python and Octave are available in my book ‘Deep Learning from first principles:Second edition- In vectorized Python, R and Octave‘

1. **Elements of Neural Network and Deep Learning – Part 4**

This presentation is a continuation of my 3rd presentation in which I derived the equations for a simple 3 layer Neural Network with 1 hidden layer. In this video presentation, I discuss step-by-step the derivations for a L-Layer, multi-unit Deep Learning Network, with any activation function g(z)

The implementations of L-Layer, multi-unit Deep Learning Network in vectorized R, Python and Octave are available in my post Deep Learning from first principles in Python, R and Octave – Part 3

2.** Elements of Neural Network and Deep Learning – Part 5**

This presentation discusses multi-class classification using the Softmax function. The detailed derivation for the Jacobian of the Softmax is discussed, and subsequently the derivative of cross-entropy loss is also discussed in detail. Finally the final set of equations for a Neural Network with multi-class classification is derived.

The corresponding implementations in vectorized R, Python and Octave are available in the following posts

a. Deep Learning from first principles in Python, R and Octave – Part 4

b. Deep Learning from first principles in Python, R and Octave – Part 5

To be continued. Watch this space!

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449).

Also see

1. My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon

2. Big Data-2: Move into the big league:Graduate from R to SparkR

3. Introducing QCSimulator: A 5-qubit quantum computing simulator in R

4. My TEDx talk on the “Internet of Things

5. Rock N’ Roll with Bluemix, Cloudant & NodeExpress

6. GooglyPlus: yorkr analyzes IPL players, teams, matches with plots and tables

7. Literacy in India – A deepR dive

8. Fun simulation of a Chain in Android

To see all posts click Index of Posts

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Giga thoughts …**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### splashr 0.6.0 Now Uses the CRAN-nascent stevedore Package for Docker Orchestration

(This article was first published on ** R – rud.is**, and kindly contributed to R-bloggers)

The splashr package [srht|GL|GH] — an alternative to Selenium for javascript-enabled/browser-emulated web scraping — is now at version 0.6.0 (still in dev-mode but on its way to CRAN in the next 14 days).

The major change from version 0.5.x (which never made it to CRAN) is a swap out of the reticulated docker package with the pure-R stevedore package which will make it loads more compatible across the landscape of R installs as it removes a somewhat heavy dependency on a working Python environment (something quite challenging to consistently achieve in that fragmented language ecosystem).

Another addition is a set of new user agents for Android, Kindle, Apple TV & Chromecast as an increasing number of sites are changing what type of HTML (et. al.) they send to those and other alternative glowing rectangles. A more efficient/sane user agent system will also be introduced prior to the CRAN. Now’s the time to vote on existing issues or file new ones if there is a burning desire for new or modified functionality.

Since the Travis tests now work (they were failing miserably because of they Python dependency) I’ve integrated the changes from the 0.6.0 to the master branch but you can follow the machinations of the 0.6.0 branch up until CRAN release.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – rud.is**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R Tip: Use Inline Operators For Legibility

(This article was first published on ** R – Win-Vector Blog**, and kindly contributed to R-bloggers)

R Tip: use inline operators for legibility.

A Python feature I miss when working in R is the convenience of Python‘s inline + operator. In Python, + does the right thing for some built in data types:

- It concatenates lists: [1,2] + [3] is [1, 2, 3].
- It concatenates strings: 'a' + 'b' is 'ab'.

And, of course, it adds numbers: 1 + 2 is 3.

The inline notation is very convenient and legible. In this note we will show how to use a related notation R.

To be clear: when working in a language it is important to learn to write code that is idiomatic for that language. Otherwise you are fighting the language, and writing code that may be hard for other users to work with (as it won’t match the learnable expectations of the language). The Python community has formalized this concept as “Pythonic”, which means Python Enhancement Proposal (PEP) 8‘s style recommendations plus a number of community conventions. The R situation is less formal, but “R-like” can include some important concepts such as: writing in a functional style, working vectorized, and a number of other concepts.

My note on Timing the Same Algorithm in R, Python, and C++ was a deliberate example of “writing C/C++ style code” in C++ (where that makes sense) plus R and Python (where that can be awkward). In fact I left the semi-colons in the C-style (scalar oriented) to R transliteration to emphasize how alien to R this code is (and later removed them in the more “R-like” vectorized translation).

However, if a good idea from one language works well in another language, then there is a good argument for implementing an analogue. The is no strong reason to leave one language less convenient than another.

For example: in Python range(a, b) returns an iterator that enumerates the integers from a through b-1 if b > a, and the empty iterator otherwise. This is the exact right iterator in a zero-indexed language (such as Python) for driving for-loops and list-comprehensions. R doesn’t have an operator so closely adapted to its indexing needs (the closest being seq_len() and seq_along()). So R is missing a bit of the convenience of this Python feature. However it is easy to add an R-version of such a feature, and this is found in wrapr::seqi(). Note wrapr::seqi() is not a direct translation of Python’s range(); wrapr::seqi(a, b) generates the range of integers a through b *inclusive* (if b >= a), as *this* is the convenient interval notation for a one-indexed language (such as R).

Now back to Python’s + features.

The wrapr package (available from CRAN) supplies some nice related inline operators including:

- %c%: c(1,2) %c% 3 is 1, 2, 3 (named after R’s c() function).
- %p%: "a" %p% "b" is "ab" (named after R’s paste0() function).

The above code is assuming you have the wrapr package attached via already having run library('wrapr').

Notice we picked R-related operator names. We stayed away from overloading the + operator, as the arithmetic operators are somewhat special in how they dispatch in R. The goal wasn’t to make R more like Python, but to adapt a good idea from Python to improve R.

The general purpose of wrapr package is to provide extensions to make working in R incrementally more convenient while preserving an “R-like” style. It *might* not seem worth it to bring in a whole package for one our two such features. However, wrapr is a very lightweight low-dependency package. And wrapr includes *many* useful extensions- all documented with examples (and many of which are covered in earlier tips).

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Win-Vector Blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### ggeffects 0.8.0 now on CRAN: marginal effects for regression models #rstats

(This article was first published on ** R – Strenge Jacke!**, and kindly contributed to R-bloggers)

I’m happy to announce that version 0.8.0 of my **ggeffects**-package is on CRAN now. The update has fixed some bugs from the previous version and comes along with many new features or improvements. One major part that was addressed in the latest version are fixed and improvements for mixed models, especially zero-inflated mixed models (fitted with the **glmmTMB**-package).

In this post, I want to demonstrate the different options to calculate and visualize marginal effects from mixed models.

Marginal effects for mixed effects modelsBasically, the type of predictions, i.e. whether to account for the uncertainty of random effects or not, can be set with the type-argument.

Marginal effects conditioned on fixed effectsThe default, type = "fe", means that predictions are on the *population-level* and do not account for the random effect variances.

When type = "re", the predicted values *are still on the population-level*. However, the random effect variances are taken into account, meaning that the prediction interval becomes larger. More technically speaking, type = "re" accounts for the uncertainty of the fixed effects *conditional on the estimates* of the random-effect variances and conditional modes (BLUPs).

The random-effect variance is the mean random-effect variance. Calculation is based on the proposal from *Johnson et al. 2014*, which is applicable for mixed models with more complex random effects structures.

As can be seen, compared to the previous example with type = "fe", predicted values are identical (both on the population-level). However, standard errors, and thus the resulting confidence (or prediction) intervals are much larger .

pr <- ggpredict(m, "Days", type = "re") pr #> #> # Predicted values of Reaction #> # x = Days #> #> x predicted std.error conf.low conf.high #> 0 251.405 41.769 169.539 333.271 #> 1 261.872 41.763 180.019 343.726 #> 2 272.340 41.814 190.386 354.293 #> 3 282.807 41.922 200.642 364.972 #> 5 303.742 42.307 220.822 386.661 #> 6 314.209 42.582 230.749 397.669 #> 7 324.676 42.912 240.571 408.781 #> 9 345.611 43.727 259.907 431.315 #> #> Adjusted for: #> * Subject = 0 (population-level) plot(pr)The reason why both type = "fe" and type = "re" return predictions at population-level is because ggpredict() returns predicted values of the response *at specific levels* of given model predictors, which are defined in the data frame that is passed to the newdata-argument (of predict()). The data frame requires data from all model terms, including random effect terms. This again requires to choose certain levels or values also for each random effect term, or to set those terms to zero or NA (for population-level). Since there is no general rule, which level(s) of random effect terms to choose in order to represent the random effects structure in the data, using the population-level seems the most clear and consistent approach.

To get predicted values for a specific level of the random effect term, simply define this level in the condition-argument.

ggpredict(m, "Days", type = "re", condition = c(Subject = 330)) #> #> # Predicted values of Reaction #> # x = Days #> #> x predicted std.error conf.low conf.high #> 0 275.096 41.769 193.230 356.961 #> 1 280.749 41.763 198.895 362.602 #> 2 286.402 41.814 204.448 368.355 #> 3 292.054 41.922 209.889 374.220 #> 5 303.360 42.307 220.440 386.280 #> 6 309.013 42.582 225.554 392.473 #> 7 314.666 42.912 230.561 398.772 #> 9 325.972 43.727 240.268 411.676 Marginal effects based on simulationsFinally, it is possible to obtain predicted values by simulating from the model, where predictions are based on simulate().

pr <- ggpredict(m, "Days", type = "sim") pr #> #> # Predicted values of Reaction #> # x = Days #> #> x predicted conf.low conf.high #> 0 251.440 200.838 301.996 #> 1 261.860 212.637 311.678 #> 2 272.157 221.595 321.667 #> 3 282.800 233.416 332.738 #> 5 303.770 252.720 353.472 #> 6 314.146 264.651 363.752 #> 7 324.606 273.460 374.462 #> 9 345.319 295.069 394.735 #> #> Adjusted for: #> * Subject = 0 (population-level) plot(pr) Marginal effects for zero-inflated mixed modelsFor zero-inflated mixed effects models, typically fitted with the **glmmTMB**-package, predicted values can be conditioned on

- the fixed effects of the conditional model only (type = "fe")
- the fixed effects and zero-inflation component (type = "fe.zi")
- the fixed effects of the conditional model only (population-level), taking the random-effect variances into account (type = "re")
- the fixed effects and zero-inflation component (population-level), taking the random-effect variances into account (type = "re.zi")
- all model parameters (type = "sim")

Similar to mixed models without zero-inflation component, type = "fe" and type = "re" for glmmTMB-models (with zero-inflation) both return predictions on the population-level, where the latter option accounts for the uncertainty of the random effects. In short, predict(..., type = "link") is called (however, predictions are finally back-transformed to the response scale).

pr <- ggpredict(m, "spp") pr #> #> # Predicted counts of count #> # x = spp #> #> x predicted std.error conf.low conf.high #> 1 0.935 0.206 0.624 1.400 #> 2 0.555 0.308 0.304 1.015 #> 3 1.171 0.192 0.804 1.704 #> 4 0.769 0.241 0.480 1.233 #> 5 1.786 0.182 1.250 2.550 #> 6 1.713 0.182 1.200 2.445 #> 7 0.979 0.196 0.667 1.437 #> #> Adjusted for: #> * mined = yes #> * site = NA (population-level) plot(pr)For models with log-link, it make sense to use a log-transformed y-axis as well, to get proportional confidence intervals for the plot. You can do this by using the log.y-argument:

plot(pr, log.y = TRUE) Marginal effects conditioned on the count model with random effects uncertainty ggpredict(m, "spp", type = "re") #> #> # Predicted counts of count #> # x = spp #> #> x predicted std.error conf.low conf.high #> 1 0.935 0.309 0.510 1.714 #> 2 0.555 0.384 0.261 1.180 #> 3 1.171 0.300 0.650 2.107 #> 4 0.769 0.333 0.400 1.478 #> 5 1.786 0.294 1.004 3.175 #> 6 1.713 0.294 0.964 3.045 #> 7 0.979 0.303 0.541 1.772 #> #> Adjusted for: #> * mined = yes #> * site = NA (population-level) Marginal effects conditioned on the count and zero-inflation modelFor type = "fe.zi", the predicted response value is the expected value mu*(1-p) *without conditioning* on random effects. Since the zero inflation and the conditional model are working in “opposite directions”, a higher expected value for the zero inflation means a lower response, but a higher value for the conditional model means a higher response. While it is possible to calculate predicted values with predict(..., type = "response"), standard errors and confidence intervals can not be derived directly from the predict()-function. Thus, confidence intervals for type = "fe.zi" are based on quantiles of simulated draws from a multivariate normal distribution (see also *Brooks et al. 2017, pp.391-392* for details).

For type = "re.zi", the predicted response value is the expected value mu*(1-p), accounting for the random-effect variances. Prediction intervals are calculated in the same way as for type = "fe.zi", except that the mean random effect variance is considered for the confidence intervals.

ggpredict(m, "spp", type = "re.zi") #> #> # Predicted counts of count #> # x = spp #> #> x predicted std.error conf.low conf.high #> 1 0.138 0.235 0.032 0.354 #> 2 0.017 0.231 0.000 0.054 #> 3 0.245 0.243 0.065 0.609 #> 4 0.042 0.231 0.002 0.126 #> 5 0.374 0.257 0.098 0.932 #> 6 0.433 0.263 0.122 1.060 #> 7 0.205 0.239 0.054 0.510 #> #> Adjusted for: #> * mined = yes #> * site = NA (population-level) Marginal effects simulated from zero-inflated modelsFinally, it is possible to obtain predicted values by simulating from the model, where predictions are based on simulate() (see *Brooks et al. 2017, pp.392-393* for details). To achieve this, use type = "sim".

- Brooks ME, Kristensen K, Benthem KJ van, Magnusson A, Berg CW, Nielsen A, et al. glmmTMB Balances Speed and Flexibility Among Packages for Zero-inflated Generalized Linear Mixed Modeling. The R Journal. 2017;9: 378–400.
- Johnson PC, O’Hara RB. 2014. Extension of Nakagawa & Schielzeth’s R2GLMM to random slopes models. Methods Ecol Evol, 5: 944-946. (doi: 10.1111/2041-210X.12225)

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Strenge Jacke!**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### pcLasso: a new method for sparse regression

(This article was first published on ** R – Statistical Odds & Ends**, and kindly contributed to R-bloggers)

I’m excited to announce that my first package has been accepted to CRAN! The package pcLasso implements * principal components lasso*, a new method for sparse regression which I’ve developed with Rob Tibshirani and Jerry Friedman. In this post, I will give a brief overview of the method and some starter code. (For an in-depth description and elaboration of the method, please see our arXiv preprint. For more details on how to use the package, please see the package’s vignette.)

Let’s say we are in the standard supervised learning setting, with design matrix and response . Let the singular value decomposition (SVD) of be , and let the diagonal entries of be . Principal components lasso solves the optimization problem

where and are non-negative hyperparameters, and is the diagonal matrix with entries . The predictions this model gives for new data are .

This optimization problem seems a little complicated so let me try to motivate it. Notice that if we replace with the identity matrix, since is orthogonal the optimization problem reduces to

which we recognize as the optimization problem that elastic net solves. So we are doing something similar to elastic net.

To be more specific: we can think of as the coordinates of the coefficient vector in the standard basis . Then would be the coordinates of this same coefficient vector, but where the basis comprises the principal component (PC) directions of the design matrix . Since we have the matrix , with entries increasing down its diagonal, instead of the identity matrix sandwiched between and in the quadratic penalty, it means that we are doing shrinkage in the principal components space in a way that (i) leaves the component along the first PC direction unchanged, and (ii) shrinks components along larger PC directions more to 0.

This method extends easily to groups (whether overlapping or non-overlapping). Assume that our features come in groups. For each , let represent the reduced design matrix corresponding to group , and let its SVD be . Let the diagonal entries of be , and let be the diagonal matrix with diagonal entries . Let be the reduced coefficient vector corresponding to the features in group . Then pcLasso solves the optimization problem

Now for some basic code. Let’s make some fake data:

set.seed(1) n <- 100; p <- 10 X <- matrix(rnorm(n * p), nrow = n) y <- rnorm(n)Just like glmnet in the glmnet package, the pcLasso function fits the model for a sequence of values which do not have to be user-specified. The user however, does have to specify the parameter:

library(pcLasso) fit <- pcLasso(X, y, theta = 10)We can use the generic predict function to obtain predictions this fit makes on new data. For example, the following code extracts the predictions that pcLasso makes on the 5th value for the first 3 rows of our training data:

predict(fit, X[1:3, ])[, 5] # [1] 0.002523773 0.004959471 -0.014095065The code above assumes that all our features belong to one big group. If our features come in groups, pcLasso can take advantage of that by specifying the groups option. groups should be a list of length , with groups[[k]] being a vector of column indices which belong to group . For example, if features 1-5 belong to one group and features 6-10 belong to another group:

> groups <- list(1:5, 6:10) > groups # [[1]] # [1] 1 2 3 4 5 # # [[2]] # [1] 6 7 8 9 10 fit <- pcLasso(X, y, theta = 10, groups = groups)The function cv.pcLasso fits pcLasso and picks optimal values via cross-validation. The output of the cv.pcLasso function can also be used to predict on new data:

fit <- cv.pcLasso(X, y, theta = 10) predict(fit, X[1:3, ], s = "lambda.min") # [1] -0.01031697 -0.01031697 -0.01031697The vignette contains significantly more detail on how to use this package. If you spot bugs, have questions, or have features that you would like to see implemented, get in touch with us!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Statistical Odds & Ends**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### rOpenSci’s new Code of Conduct

(This article was first published on ** rOpenSci - open tools for open science**, and kindly contributed to R-bloggers)

We are pleased to announce the release of our new Code of Conduct. rOpenSci’s community is our best asset and it’s important that we put strong mechanisms in place before we have to act on a report.

As before, our Code applies equally to members of the rOpenSci team and to anyone from the community at large participating in in-person or online activities.

**What’s new?**

- A Code of Conduct Committee: Stefanie Butland (rOpenSci Community Manager), Scott Chamberlain (rOpenSci Co-founder and Technical Lead) and Kara Woo (independent community member). We are responsible for receiving, investigating, deciding, enforcing and reporting on all reports of potential violations of our Code.
- Greater detail about acceptable and unacceptable behaviors
- Clear instructions on how to make a report
- Information on how reports will be handled
- A commitment to transparency with our community while upholding the privacy of victims

Our new Code of Conduct has been influenced by and adapted from many sources including the Open Source and Feelings talk by Audrey Eschright, the R Consortium Community Diversity and Inclusion Working Group’s draft Code of Conduct, the Geek Feminism anti-harassment policy, our own Community Call, How do I create a Code of Conduct for my event/lab/codebase?, incident reporting forms from NumFOCUS & Jupyter, and perhaps most importantly, by community members from whom we learn so much.

We welcome your feedback in the comments and we thank you for working with us to keep rOpenSci a safe, enjoyable, friendly, and enriching experience for everyone who participates.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** rOpenSci - open tools for open science**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### colorspace: New Tools for Colors and Palettes

(This article was first published on ** Achim Zeileis**, and kindly contributed to R-bloggers)

A major update (version 1.4.0) of the R package colorspace has been released to CRAN, enhancing many of the package’s capabilities, e.g., more refined palettes, named palettes, ggplot2 color scales, visualizations for assessing palettes, shiny and Tcl/Tk apps, color vision deficiency emulation, and much more.

OverviewThe *colorspace* package provides a broad toolbox for selecting individual colors or color palettes, manipulating these colors, and employing them in various kinds of visualizations. Version 1.4.0 has just been released on CRAN, containing many new features and contributions from new co-authors. A new web site presenting and documenting the package has been launched at http://colorspace.R-Forge.R-project.org/.

At the core of the package there are various utilities for computing with color spaces (as the name conveys). Thus, the package helps to map various three-dimensional representations of color to each other. A particularly important mapping is the one from the perceptually-based and device-independent color model HCL (Hue-Chroma-Luminance) to standard Red-Green-Blue (sRGB) which is the basis for color specifications in many systems based on the corresponding hex codes (e.g., in HTML but also in R). For completeness further standard color models are included as well in the package: polarLUV() (= HCL), LUV(), polarLAB(), LAB(), XYZ(), RGB(), sRGB(), HLS(), HSV().

The HCL space (= polar coordinates in CIELUV) is particularly useful for specifying individual colors and color palettes as its three axes match those of the human visual system very well: Hue (= type of color, dominant wavelength), chroma (= colorfulness), luminance (= brightness).

The *colorspace* package provides three types of palettes based on the HCL model:

*Qualitative:*Designed for coding categorical information, i.e., where no particular ordering of categories is available and every color should receive the same perceptual weight. Function: qualitative_hcl().*Sequential:*Designed for coding ordered/numeric information, i.e., where colors go from high to low (or vice versa). Function: sequential_hcl().*Diverging:*Designed for coding numeric information around a central neutral value, i.e., where colors diverge from neutral to two extremes. Function: diverging_hcl().

To aid choice and application of these palettes there are: scales for use with *ggplot2*; shiny (and tcltk) apps for interactive exploration; visualizations of palette properties; accompanying manipulation utilities (like desaturation, lighten/darken, and emulation of color vision deficiencies).

More detailed overviews and examples are provided in the articles:

- Color Spaces: S4 Classes and Utilities
- HCL-Based Color Palettes
- HCL-Based Color Scales for
*ggplot2* - Palette Visualization and Assessment
- Apps for Choosing Colors and Palettes Interactively
- Color Vision Deficiency Emulation
- Color Manipulation and Utilities
- Approximating Palettes from Other Packages

The stable release version of *colorspace* is hosted on the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=colorspace and can be installed via

The development version of *colorspace* is hosted on R-Forge at https://R-Forge.R-project.org/projects/colorspace/ in a Subversion (SVN) repository. It can be installed via

For Python users a beta re-implementation of the full *colorspace* package in Python 2/Python 3 is also available, see https://github.com/retostauffer/python-colorspace.

The *colorspace* package ships with a wide range of predefined color palettes, specified through suitable trajectories in the HCL (hue-chroma-luminance) color space. A quick overview can be gained easily with hcl_palettes():

Using the names from the plot above and a desired number of colors in the palette a suitable color vector can be easily computed, e.g.,

q4 <- qualitative_hcl(4, "Dark 3") q4 ## [1] "#E16A86" "#909800" "#00AD9A" "#9183E6"The functions sequential_hcl(), and diverging_hcl() work analogously. Additionally, their hue/chroma/luminance parameters can be modified, thus allowing to easily customize each palette. Moreover, the choose_palette()/hclwizard() app provide convenient user interfaces to do the customization interactively. Finally, even more flexible diverging HCL palettes are provided by divergingx_hcl().

Usage with base graphicsThe color vectors returned by the HCL palette functions can usually be passed directly to most base graphics function, typically through the col argument. Here, the q4 vector created above is used in a time series display:

plot(log(EuStockMarkets), plot.type = "single", col = q4, lwd = 2) legend("topleft", colnames(EuStockMarkets), col = q4, lwd = 3, bty = "n")As another example for a sequential palette a spine plot is created below, displaying the proportion of Titanic passengers that survived per class. The Purples 3 palette is used which is quite similar to the **ColorBrewer.org** palette Purples. Here, only two colors are employed, yielding a dark purple and light gray.

*ggplot2*

To plug the HCL color palettes into *ggplot2* graphics suitable discrete and/or continuous *gglot2* color scales are provided. The scales are called via the scheme scale___(), where is the name of the aesthetic (fill, color, colour), is the type of the variable plotted (discrete or continuous) and sets the type of the color scale used (qualitative, sequential, diverging, divergingx).

To illustrate their usage two simple examples are shown using the qualitative Dark 3 and sequential Purples 3 palettes that were also employed above. First, semi-transparent shaded densities of the sepal length from the iris data are shown, grouped by species.

library("ggplot2") ggplot(iris, aes(x = Sepal.Length, fill = Species)) + geom_density(alpha = 0.6) + scale_fill_discrete_qualitative(palette = "Dark 3")The sequential palette is used to code the cut levels in a scatter of price by carat in the diamonds data (or rather a small subsample thereof). The scale function first generates six colors but then drops the first color because the light gray is too light in this display. (Alternatively, the chroma and luminance parameters could also be tweaked.)

dsamp <- diamonds[1 + 1:1000 * 50, ] ggplot(dsamp, aes(carat, price, color = cut)) + geom_point() + scale_color_discrete_sequential(palette = "Purples 3", nmax = 6, order = 2:6) Palette visualization and assessmentThe *colorspace* package also provides a number of functions that aid visualization and assessment of its palettes.

- demoplot() can display a palette (with arbitrary number of colors) in a range of typical and somewhat simplified statistical graphics.
- hclplot() converts the colors of a palette to the corresponding hue/chroma/luminance coordinates and displays them in HCL space with one dimension collapsed. The collapsed dimension is the luminance for qualitative palettes and the hue for sequential/diverging palettes.
- specplot() also converts the colors to hue/chroma/luminance coordinates but draws the resulting spectrum in a line plot.

For the qualitative Dark 3 palette from above the following plots can be obtained.

demoplot(q4, "bar") hclplot(q4) specplot(q4, type = "o")A bar plot would be another typical application for a qualitative palette (instead of the time series and density plot used above). However, a lighter and less colorful palette might be preferable in this situation (e.g., Pastel 1 or Set 3).

The other two displays show that luminance is (almost) constant in the palette while the hue changes linearly along the color “wheel”. Ideally, chroma would have also been constant to completely balance the colors. However, at this luminance the maximum chroma differs across hues so that the palette is fixed up to use less chroma for the yellow and green elements.

Subsequently, the same assessment is carried out for the sequential Purples 3 palette as employed above.

s9 <- sequential_hcl(9, "Purples 3") demoplot(s9, "heatmap") hclplot(s9) specplot(s9, type = "o")Here, a heatmap (based on the well-known Maunga Whau volcano data) is used as a typical application for a sequential palette. The elevation of the volcano is brought out clearly, focusing with the dark colors on the higher elevations.

The other two displays show that hue is constant in the palette while luminance and chroma vary. Luminance increases monotonically from dark to light (as required for a proper sequential palette). Chroma is triangular-shaped which allows to better distinguish the middle colors in the palette (compared to a monotonic chroma trajectory).

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** Achim Zeileis**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Travis CI for R — Advanced guide

(This article was first published on ** Stories by Sebastian Wolf on Medium**, and kindly contributed to R-bloggers)

Travis CI is a common tool to build R packages. It is in my opinion the best platform to use R in continuous integration. Some of the most downloaded R packages built at this platform. These are for example *testthat*, *magick *or *covr*. I also built my package *RTest *at this platform. During the setup I ran into some trouble. The knowledge I gained I’m sharing with you in this guide.

- Basics from “Building an R project”
- Modifying R CMD build
- Multiple operating systems
- Run scripts with user interfaces
- Code coverage
- Build and deploy a pkgdown page to github pages
- ImageMagick and Travis CI
- Further reading

The article “Building an R Project” from Travis CI tells you about the basics. It allows setting up a build for an R-package or R project. The main take away comes with this .travis.yml file.

# Use R languagelanguage: r #Define multiple R-versions, one from bioconductor

r:

- oldrel

- release

- devel

- bioc-devel # Set one of you dependencies from github

r_github_packages: r-lib/testthat # Set one of your dependencies from CRAN

r_packages: RTest # set a Linux system dependency

apt_packages:

- libxml2-dev

The tutorial explains to you that you should setup your type language as **R. **You can use different **R-versions.** Those R-Versions are:

[oldrel, release, devel, bioc-devel, bioc-release]

Additionally you can load any package from github by r_github_packages . Or you can get any package from CRAN by r_packages . A list of multiple packages can be created using the standard yml format:

r_packages:- RTest

- testthat

In case you have a Linux dependency, it needs to be mentioned. The RTest package uses XML test cases. The XML Linux library needed is libxml2 . It can be added by:

apt_packages:- libxml2-dev

You are done with the basics. In case you have this .travis.yml file inside your repository, it will use R CMD build and R CMD check to check your project.

Modifying R CMD commandsTo build my project I wanted to build it like on CRAN. Therefore I needed to change the script of the package check. Therefore I added:

script:- R CMD build . --compact-vignettes=gs+qpdf

- R CMD check *tar.gz --as-cran

Inside this script you can changeR CMD build orR CMD check arguments. For a list of arguments to R CMD see this tutorial from RStudio.

To run vignette compression get gs+qpdf by:

addons:apt:

update: true

packages:

- libgs-dev

- qpdf

- ghostscript Multiple operating systems

Travis CI offers two different operating systems right now (Jan 2019). Those are macOS and Linux. The standard way of testing is Linux. For my project RTest I needed to test in macOS, too. To test in two operating systems use the matrix parameter of Travis CI.

The matrix parameter allows adjusting certain parameters for certain builds. To have the exact same build in Linux and macOS I used the following structure:

matrix:include:

- r: release

script:

- R CMD build . --compact-vignettes=gs+qpdf

- R CMD check *tar.gz --as-cran

- r: release

os: osx

osx_image: xcode7.3

before_install:

- sudo tlmgr install framed titling

script:

- R CMD build . --compact-vignettes=gs+qpdf

- R CMD check *tar.gz --as-cran

The matrix function splits the build into different operating systems. For macOS I used the image xcode7.3 as it is proposed by rOpenSCI. An extra point for this version is that it is close to the current CRAN macOS version. As you can see you should install the Latex packages framed and titling to create vignettes.

Run scripts with User interfacesMy package RTest uses Tcl/Tk user interfaces. To test such user interfaces you need enable user interfaces in Linux and macOS separately. Travis CI provides the xvfb package for Linux. For macOS you need to reinstall xquartz and tcl-tk with homebrew .

User interfaces for LinuxTo enable user interfaces in Linux install xvfb .

addons:apt:

update: true

packages:

- x11proto-xf86vidmode-dev

- xvfb

- libxxf86vm-dev

You can run all R scripts with a user interface using the xvfb-run command in front of the R command.

script:- R CMD build . --compact-vignettes=gs+qpdf

- xvfb-run R CMD check *tar.gz --as-cran User interfaces for macOS

For macOS the installation of a user interface is more difficult. You need to add xquart and tcl-tk to the image provided in xcode7.3 .

before_install:- brew update

- brew cask reinstall xquartz

- brew install tcl-tk --with-tk

- brew link --overwrite --force tcl-tk; brew unlink tcl-tk

To use xquartz there is no xvfb-run command under macOS. In a github issue I found a solution that still makes user interfaces work with xquartz .

before_script:- "export DISPLAY=:99.0"

- if [ "${TRAVIS_OS_NAME}" = "osx" ]; then ( sudo Xvfb :99 -ac -screen 0 1024x768x8; echo ok ) & fi

You create a display before running any R script that can be used by the R session. It is important to export the DISPLAY variable. This variable is read by the tcktk R package.

In macOS you do not need to change the script

script:- R CMD build . --compact-vignettes=gs+qpdf

- R CMD check *tar.gz --as-cran Addon

For more information on user interfaces you can read these two github issues:

Code coverageFor code coverage I would suggest to use one specific version of your builds. I decided for Linux + r-release to test the code coverage. First of all I added the *covr* package to my build script:

- r-lib/covr

Secondly I wanted to test my package using *covr*. This can be done in Travis using the after_success step. To use *covr *inside this step you need to define how your package tarball will be named. You can write this directly into your script. A better way to do it is to write it into the env part of you .travis.yml file. The name of your tarball will always be *PackageName + “_” + PackageVersion + “.tar.gz”. *Inside your DESCRIPTION file you defined *PackageName* and *PackageVersion. *I used CODECOV to store the results of my coverage tests.

- PKG_TARBALL=RTest_1.2.3.1000.tar.gz after_success:

- tar -C .. -xf $PKG_TARBALL

- xvfb-run Rscript -e 'covr::codecov(type=c("tests", "vignettes", "examples"))'

The setup I’m using for my package includes code coverage for all my examples, vignettes and tests. To deploy the results of the code coverage you must define the global variable CODECOV_TOKEN . The token can be found under https://codecov.io/gh///settings . You can insert it secretly into your Travis CI build. Add tokens inside https://travis-ci.org///settings . The section environment variables stores variables secretly for you.

To use COVERALLS instead of CODECOV use the covr::coveralls function and define a COVERALLS_TOKEN inside your environment.

Build and deploy a pkgdown page to github pagesBuilding a *pkgdown *page can be really useful to document your code. On my github repository I also host the *pkgdown *page of my package RTest. You can find the page here: https://zappingseb.github.io/RTest/index.html

To allow deployment to github pages I activated this feature at: https://github.com///settings. You have to use the *gh-pages* branch. If you do not have such a branch you need to create it.

Inside the .travis.yml you start by installing *pkgdown.*

- r-lib/pkgdown

You will have to build the page from your package tarball. The name of the package tarball has to be defined. Please see the section code coverage for how this is done. After unpacking the tarball you should delete any leftovers from checking the package by rm -rf .Rcheck .

after_success:- tar -C .. -xf $PKG_TARBALL

- rm -rf RTest.Rcheck

- Rscript -e 'pkgdown::build_site()'

The Rscript will produce the website inside adocs folder. This folder must be deployed on github pages.

First go to to https://github.com/settings/tokens when you’re logged into github. There you have to create a token with public_repo or repo scope. Now store this token inside your Travis CI build. Therefore go to https://travis-ci.org///settings and store it as a global variable named GITHUB_TOKEN . The website will now be deployed on every successful build using this script:

deploy:- provider: pages

skip-cleanup: true

github-token: $GITHUB_TOKEN

keep-history: false

local-dir: docs

on:

branch: master

for more info on deploying pages you can check the Travis CI guide on pages.

ImageMagick and Travis CIInside the travis-ci-community there was a question on how to install the magick package on Travis-CI. The answer is simple. You need to have all system dependencies of ImageMagick. Install these for Linux by:

addons:apt:

update: true

sources:

- sourceline: 'ppa:opencpu/imagemagick'

- sourceline: 'ppa:ubuntugis/ppa'

packages:

- libmagick++-dev

- librsvg2-dev

- libwebp-dev

- libpoppler-cpp-dev

- libtesseract-dev

- libleptonica-dev

- tesseract-ocr-eng

- r-cran-rgdal

- libfftw3-dev

- cargo

This also worked for macOS for me.

*Dear Reader: It’s always a pleasure to write about my work on continuous integrations. I thank you for reading until the end of this article. If you liked the article, you can clap for it on **Medium **or star the repository on **github**. In case of any comment, leave it **here **or on my **LinkedIn **profile **http://linkedin.com/in/zappingseb.*

- Building R projects on Travis CI (Basics)
- Travis CI community for R (Forum)
- RTest package .travis.yml file
- The RTest package — 99% code coverage with Travis-CI

Travis CI for R — Advanced guide was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** Stories by Sebastian Wolf on Medium**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Showing a difference in means between two groups

(This article was first published on ** R – On unicorns and genes**, and kindly contributed to R-bloggers)

Visualising a difference in mean between two groups isn’t as straightforward as it should. After all, it’s probably the most common quantitative analysis in science. There are two obvious options: we can either plot the data from the two groups separately, or we can show the estimate of the difference with an interval around it.

A swarm of dots is good because it shows the data, but it obscures the difference, and has no easy way to show the uncertainty in the difference. And, unfortunately, the uncertainty of the means within groups is not the same thing as the uncertainty of the difference between means.

An interval around the difference is good because it makes the plausible range of the difference very clear, but it obscures the range and distribution of the data.

Let’s simulate some fake data and look at these plots:

library(broom) library(egg) library(ggplot2) data <- data.frame(group = rep(0:1, 20)) data$response <- 4 + data$group * 2 + rnorm(20)We start by making two clouds of dots. Then we estimate the difference with a simple linear model, and plot the difference surrounded by an approximate confidence interval. We can plot them separately or the *egg* package to put them together in two neat panels:

Here it is:

But I had another idea. I am not sure whether it’s a good idea or not, but here it is: We put in the dots, and then we put in two lines that represent the smallest and the greatest difference from the approximate confidence interval:

offset <- (2 * result$estimate[1] + result$estimate[2])/2 shortest <- result$estimate[2] - 2 * result$std.error[2] longest <- result$estimate[2] + 2 * result$std.error[2] plot_both <- plot_points + geom_linerange(aes(ymin = offset - shortest/2, ymax= offset + shortest/2, x = 1.25)) + geom_linerange(aes(ymin = offset - longest/2, ymax= offset + longest/2, x = 1.75)) + theme_bw()I think it looks pretty good, but it’s not self-explanatory, and I’m not sure whether it is misleading in any way.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – On unicorns and genes**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Medium + r-bloggers — How to integrate?

(This article was first published on ** Stories by Sebastian Wolf on Medium**, and kindly contributed to R-bloggers)

I started my blog about R on Medium. Medium is a wonderful platform with a great user interface. The idea to use it came by reading a blog post from Joe Cheng. After I wrote about two or three articles about **R** I wanted to publish them on r-bloggers.com. r-bloggers.com requires an RSS feed of your blog. This is a simple feature if you are using WordPress or Bloggers. In Medium I noticed my feed contains everything that I’m writing. It also includes short comments of mine (https://medium.com/feed/@zappingseb):

https://medium.com/media/5d891965d505b3207a479c7ab6caf61e/href

The maintainers of r-bloggers.com did not want such comments inside their website. So I started reading on filtering my articles by certain tags. The only way I found to realize this was inside a PHP script.

Read RSS to PHPTo read my RSS feed into a PHP variable I used the built-in XML reader:

header('Content-Type: text/xml'); $xml = new DOMDocument;$url = "https://medium.com/feed/@zappingseb";

$xml->load($url); Iterate over the RSS file

The next step was to get every single item (blog entry) out of the XML. To start an iteration I first needed to get the items into an array. Therefore I made the XML a book. The documentElement function will create such a book. From this book, I extracted all items inside the channel tag.

$book = $xml->documentElement; // we retrieve the chapter and remove it from the book$channel= $book->getElementsByTagName('channel')->item(0);

$items = $channel->getElementsByTagName('item');

To store all filtered items, I created an empty array.

$domArray = array(); //set up an array to catch all our nodeI found a wonderful stackoverflow entry on how to filter keywords from RSS. It provided me with a function to lookup keywords. So I looped over all items of my Medium RSS feed. For each item, I got the categories inside the tag category . Iterating over the categories I checked if the category “r” was inside. If so I set the variable $found to true and stop the iteration on categories. If the $found variable was never set to true, I added the article to the domArray . All items in the domArray must be filtered out.

//For each article you can derive whether it contains a keyword in//it’s tags or not

foreach ($items as $i){

$categories = $i->getElementsByTagName('category');

$found = false; foreach($categories as $category){

if($category->nodeValue == "r"){

$found = true;

break;

}

}

if(!$found){

$domArray[] = $i;

}

} Filtering out articles

It took me a while to understand how PHP handles XML files internally. After some time I found out I can just delete all items within one loop. It was not possible to do it in my items loop. So I inserted this loop at the end of my file:

foreach($domArray as $i){$channel->removeChild($i);

}

It takes the $channel variable where all items are stored. Afterwards, it deletes the ones that did not contain the keyword r .

Printing out the feedA difference between PHP and R is that $channel was a copy by reference from $xml . This means the code did not only change the $channel variable, but also the $xml variable. This is the reason I can immediately write out the $xml variable. This command produces the filtered RSS feed.

echo $xml->saveXML(); Final learningFor me, it was a hard task to get to r-bloggers.com. I simply had to pull out all my old PHP skills and google a lot. Finally, I made it. I was really happy to see my post. I hope more people can benefit from this blog entry. It would be a pleasure to see more Medium content on r-bloggers.com.

*Dear Reader: It’s always a pleasure to write about my work on **R programming**. I thank you for reading until the end of this article. If you liked the article, you can clap for it on **Medium **or star the repository on **github**. In case of any comment, leave it **here **or on my **LinkedIn **profile **http://linkedin.com/in/zappingseb.*

I’m hosting the script of this tutorial on my PHP server. You can find it in the gist below. It is hosted at: http://mail-wolf.de/documents/medium/

https://medium.com/media/7632c20b97c0b9a85fc20559e98a9bff/hrefhttps://medium.com/media/0707f5c806284d01a4a13c7b13a91ce3/href

Medium + r-bloggers — How to integrate? was originally published in Data Driven Investor on Medium, where people are continuing the conversation by highlighting and responding to this story.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** Stories by Sebastian Wolf on Medium**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### XmR Chart | Step-by-Step Guide by Hand and with R

(This article was first published on ** R – R-BAR**, and kindly contributed to R-bloggers)

Is your process in control? The XmR chart is a great statistical process control (SPC) tool that can help you answer this question, reduce waste, and increase productivity. We'll cover the concepts behind XmR charting and explain the XmR control constant with some super simple R code. Lastly, we'll cover how to make the XmR plot by hand, with base R code, and the ggQC package.

**Objectives:**When you’re done with this post, you should:

- Grasp the XmR chart concept and utility
- Understand the math necessary to create the XmR chart by hand, and why moving range should be used instead of standard deviation to determine control limits.
- Be able to create an XmR chart from scratch
- Be comfortable using a piece of software to help you make one.

Think about something you do for yourself regularly. Do you do it in a very particular way? And, when you do it just *that* way is the result amazzzzing? Maybe it's how you brew that pot of coffee in the morning, how you style your hair, or how you spice your favorite dish. Whatever it is, you have a desired outcome and you know just how to make it happen – time and time again. You have a process, a method of making or doing something with a repeatable outcome.

Now, suppose you decide to start a business sharing that special outcome with the world. You'll need to train others to execute your process. But, how will you know your workers are successful? You'll need feedback from your process, a measurable to make sure your process yields the desired outcome. By monitoring this outcome, you'll know your process is in control. This is what an XmR control chart allows us to do.

XmR Core ConceptsTo illustrate how an XmR chart helps us understand our process, let's use a simple example. For the moment, pretend you own and operate the world famous ACME Hammer and Nail Company. Your claim to fame is making the world's finest nails, and right now you are starting a new **process** for 0.75 inch nails. To monitor the process over time, you measure nail length **(the process measure)** to insure the mean nail length **(the process center)** produced is indeed 0.75 inches plus or minus some natural wiggle in the data **(the process variation)**. Using an XmR chart, as shown below, you can bring all these process terms together. The XmR chart becomes the *voice* of your process.

Starting with the graph title, we see that this XmR chart is about the 0.75 inch nail **process** and that the data was collected form product line 1 between 9:00am – 9:05am. During that 5 minute period, 20 nails were produced and each nail was given a sequential ID. The order that data appears in the plot, must be sequential. Thus, the sequential ID given to each nail defines the x-axis. The y-axis is the actual measured nail length **(the process measure)** in inches. We see that the data is centered around the blue center line in the graph, **(the process center)**. The chart also shows that the data is bouncing randomly around the blue center line due to inherent **process variation**. Finally, we see two red lines labeled lower control limit (LCL) and upper control limit (UCL). If your process is in statistical control, ~99% of the nails produced will measure within these control limits.

So, the bare minimum you need to make an XmR chart is:

- a process (that ideally isn’t changing)
- a sequential output from the process (products)
- a relevant measurement or key performance indicator for your process that tracks an important quality.
- and some math to put the chart together

Let's dig into the math.

XmR Control LimitsIn the XmR chart, the center line represents the mean of your data. Nothing special here, just the mean of your data. But, there is also the red lines. These define the upper and lower control limits. To determine these, you need to determine your random process variation. In the XmR context, taking the standard deviation of your data does not necessary yield the random process variation. There is a better way.

So how do you determine the random process variation and control limits?

Good question. To answer it, we need a second piece of information from our process data – the mean moving range (mR). Notice that the abbreviation “mR” is part of the XmR chart title. The X stands for the individual data points and the mR is how we determine the variability. From this variability metric, we determine the process's lower and upper control limits.

Mean Moving Range and Control LimitsThere are 3 steps to determining XmR Control Limits

- Determine the mean(mR)
- Convert the mean(mR) to a
*sequential deviation* - Use the sequential deviation to calculate the control limits.

The first step is to determine your process's mean moving range, mean(mR). To determining the mean(mR), find the absolute difference between sequential pairwise measurements. Then take the mean of ranges you've calculated. The calculation of mean(mR) is demonstrated below on five data points.

For the second step, we need to convert the mean(mR) to a sequential deviation. If you're just looking for a quick answer: divide the mean(mR) by the control constant 1.128 to calculate the sequential deviation and move on to the calculating XmR control limits section. If you need more than divide and trust me, read on. Whatever you do, **Resist the temptation to just use the standard deviation of your data points to determine your control limits**. “Why?”, you ask. Because, the moving range is **sequence sensitive** and demphasizes systematic variation, allowing us to more clearly measure the inherent random process variation. The standard deviation is a measure of total variation (ie., systematic variation & random variation). To see more on this important but slightly advanced topic check out my article: XmR Control Limits | Why Moving Range, not Standard Deviation.

If you're like me, you get a little uncomfortable when someone tells you, “just divide these numbers and trust me.” That said, you probably want to know more about that 1.128 number I used to convert the mean(mR) to a sequential deviation in the previous section. To explain the 1.128, we are going to do a little math and a super simple simulation in R. When we're are done, we'll return to the task of calculating the XmR control limits. Let's get started.

Simulation SetupOur goal is to develop a sequence sensitive deviation based on the mean(mR) of our data. I call it sequential deviation. Determining this value will allow us to better quantify the random variation in our process and minimize systematic effects. We start with the relationship that the mean(mR) is proportional to the standard deviation of individual data points sampled from a standard normal distribution with mean = 0 and standard deviation = 1 (no systematic variation, all random variation). Next, we state that so long as the data are distributed normally the Standard Deviation is equal to the Sequential Deviation. This is because all variation is random; none is systematic. The relationships are expressed in Equations 1 and 1.1 below.

From here, we simplify things by substituting the standard deviation with sequential deviation in equation 1. We also get rid of the proportionality symbol by inserting a proportionality constant, d, between the mean(mR) and the sequential deviation of individual samples. The new expression is shown in equation 2:

SimulationNow, we just need to determine the constant, d. To do this we are going to run a very simple R simulation. The simulation will do the following:

- Draw 10 million samples from the standard normal distribution (mean=0, sd=1),
- Determine the absolute difference between every consecutive pair of data points, yielding a series of ranges
- Take the average of all the ranges, yielding the mean(mR)
- From EQ 1.1, we know the sequential deviation = standard deviation = 1 under these conditions.
- thus from Equation 2, the mean(mR) determined in step 3 is the value of d (the constant we are looking for…)

OK here is a very simple R code to run our simulation.

1 2 3 4 5 6 7 8 9 10 11 require(dplyr) # to get the easy to read pipe operator %>% set.seed(5555) # So you can reproduce my sampling of the normal distribution rnorm(n=1e7, mean=0, sd=1) %>% # 10 million samples form normal distribution diff() %>% # Difference between each point abs() %>% # Absolute value of each difference (moving range...) mean(na.rm=T) -> # The mean of the moving ranges d # assigning to the constant d d # printing d to screen #>> [1] 1.128346 ResultSo from our simulation, we have determined that:

OK, I told you *d* would be 1.128 earlier, but now you know why. If you have some previous experience with making control charts or have looked at a table of control chart constants recently, the number 1.128 may be familiar. It is the value for d2 when n = 2. If you've never heard of control chart constants before. Don't worry, 1.128 is all you need for XmR. Other constants, become more import when you're working with XbarR or XbarS charts. For now though, we are going to keep our focus on XmR.

Using our new found knowledge of d and rearranging equation 2, we can easily determine our sequential deviation from the mean(mR) as shown in equation 3.

Using this relationship, if we have a nail production process that yields a mean(mR) of 0.125 inches, equation 3 tells us we have a sequential deviation of 0.111 inches.

Great, we understand where the 1.128 came from. Now, we need to use it to calculate the XmR control limits. To determine our upper and lower control limits in the XmR chart, we want to use ± 3 sequential deviations around our process mean. Why ± 3 sequential deviations? Because, if our process is in statistical control, ~99.7 of our product will measure within the control limits.

Here are the expressions for the upper and lower control limits in terms of mean(mR):

continuing with our 0.75 inch nail scenario, if our mean(mR) is 0.125 inches then

- The process mean should be 0.75" (hopefully).
- The upper control limit will be 0.75 + 3 * (0.125/1.128)
- the lower control limit will be 0.75 – 3 * (0.125/1.128)

Awesome. here's what we've covered:

- The parts of an XmR plot,
- How to determine moving range.
- How to convert moving range to a sequential deviation.
- Why you should use sequential deviation and not standard deviation.
- How to determine the control limits from the process mean and the sequential deviation.

Let's use what you've learned in an example.

Step Wise XmR ExampleMake an XmR chart for the following data involving 3 inch screw lengths.

ScrewID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Length 2.92 2.96 2.86 3.04 3.07 2.85 3.00 2.92 2.97 2.97 3.09 3.07 2.99 3.06 3.05 3.02 3.07 2.91 3.07 3.20 mR NA 0.04 0.10 0.18 0.03 0.22 0.15 0.08 0.05 0.00 0.12 0.02 0.08 0.07 0.01 0.03 0.05 0.16 0.16 0.13- Determine the mean, which is: 3.0045
- Determine the mean moving range, which is: 0.0884211
- Calculate the sequential deviation, which is: 0.0884211 / 1.128 = 0.0783875
- Calculate the upper and lower control limits which are:
- Lower Control Limit = 3.0045 – 3 * 0.0783875 = 2.7693376
- Upper Control Limit = 3.0045 + 3 * 0.0783875 = 3.2396624

- Create the Plot

Below is some temperature data in Fahrenheit taken from a coffee brewing setup.

CupID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Temp 198 199 196 201 202 196 200 198 199 199 202 202 200 201 201 200 202 198 202 205 mR NA 1 3 5 1 6 4 2 1 0 3 0 2 1 0 1 2 4 4 3Calculate the following quantities: (To check your answers, mouse-over the word “answer” or click)

- mean – Answer
- mean moving range – Answer
- the sequential deviation – Answer
- the upper and lower control limits – Answer
- Make a plot in your favorite plotting software

For those who like to see all the work of specifying the lines and plot details, here is an example in R's base plotting system.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 #This code was used to generate the example data in the example above... set.seed(5555) example.data <- data.frame( ScrewID = as.factor(1:20), Length = round(rnorm(n = 20, mean = 3, sd = 0.1),2) ) #Caluate the mean and limits the_Mean <- mean(example.data$Length) the_mR <- mean(abs(diff(example.data$Length))) the_Sigma <- the_mR/1.128 the_LCL <- the_Mean - 3 * the_Sigma the_UCL <- the_Mean + 3 * the_Sigma #Make with Base Plotting System plot(x = 1:20, y = example.data$Length, #give plot the x,y data type="b", pch=20, # specify point and line chart with solid points ylim=c(the_LCL*.9,the_UCL*1.1), # Set the chart's y-limits slightly larger than control limits xlab="ScrewID", ylab="Screw Length (inches)" # Label the axis ) abline(h = c(the_LCL,the_Mean,the_UCL), col=c("red", "blue", "red")) #Draw The Lines text(2, the_UCL+.05, labels = paste0("UCL: ", round(the_UCL,4))) #Draw the line labels text(2, the_LCL-.05, labels = paste0("LCL: ", round(the_LCL,4))) #Draw the line labels Make an XmR Plot with R’s ggQC PackageTypically our job is to make decisions based on an XmR plot and not to get hung up on creating them. ggQC is one of several quality control packages in R that make creating XmR charts simpler.

Below is quick example of building an XmR chart with ggQC. Notice that all the calculations are done for you.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 #This code was used to generate the example data in the nail example above... set.seed(5555) example.data <- data.frame(ScrewID = as.factor(1:20), Length = round(rnorm(n = 20, mean = 3, sd = 0.1),2) ) #Load the Necessary Packages require(ggQC) require(ggplot2) #Make the Plot ggplot(example.data, aes(x=ScrewID, y=Length, group=1)) + geom_point() + geom_line() + stat_QC(method="XmR", auto.label = T, label.digits = 4) + ylab("Screw Length (inches)") + scale_x_discrete(expand = expand_scale(mult = .15))Visit rcontrolcharts.com to learn more about ggQC and its capabilities.

SummaryXmR charts are one of the simplest control charts to prepare, but you need to know a thing or two about them before implementing. The key inputs for an XmR chart are a process, a detail you want to control, and a measurable on that detail you can track sequentially. Once you have these pieces, you can apply everything you've learned here regarding the XmR chart and the math need to make one. Before we end, some food for thought on the term “process”. A Process is a way of making or doing a set of actions that yield a desired outcome. With this definition, a process could be **making** nails or it could be **measuring** nail lengths. The XmR chart can provide information about the act of making as well as the act of measuring. Happy charting.

- XmR Control Limits | Why Moving Range, not Standard Deviation
- Control Chart Constants | Tables and Brief Explanation
- ggQC | ggplot Quality Control Charts

The post XmR Chart | Step-by-Step Guide by Hand and with R appeared first on R-BAR.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – R-BAR**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Generating Synthetic Data Sets with ‘synthpop’ in R

(This article was first published on ** R – Daniel Oehm | Gradient Descending**, and kindly contributed to R-bloggers)

Synthpop – A great music genre and an aptly named R package for synthesising population data. I recently came across this package while looking for an easy way to synthesise unit record data sets for public release. The goal is to generate a data set which contains no real units, therefore safe for public release and retains the structure of the data. From which, any inference returns the same conclusion as the original. This will be a quick look into synthesising data, some challenges that can arise from common data structures and some things to watch out for.

Sythesising dataThis example will use the same data set as in the synthpop documentation and will cover similar ground, but perhaps an abridged version with a few other things that weren’t mentioned. The SD2011 contains 5000 observations and 35 variables on social characteristics of Poland. A subset of 12 of these variables are considered.

suppressPackageStartupMessages(library(synthpop)) suppressPackageStartupMessages(library(tidyverse)) suppressPackageStartupMessages(library(sampling)) suppressPackageStartupMessages(library(partykit)) mycols <- c("darkmagenta", "turquoise") options(xtable.floating = FALSE) options(xtable.timestamp = "") myseed <- 20190110 # filtering the dataset original.df <- SD2011 %>% dplyr::select(sex, age, socprof, income, marital, depress, sport, nofriend, smoke, nociga, alcabuse, bmi) head(original.df) ## sex age socprof income marital depress sport nofriend smoke ## 1 FEMALE 57 RETIRED 800 MARRIED 6 NO 6 NO ## 2 MALE 20 PUPIL OR STUDENT 350 SINGLE 0 NO 4 NO ## 3 FEMALE 18 PUPIL OR STUDENT NA SINGLE 0 NO 20 NO ## 4 FEMALE 78 RETIRED 900 WIDOWED 16 YES 0 NO ## 5 FEMALE 54 SELF-EMPLOYED 1500 MARRIED 4 YES 6 YES ## 6 MALE 20 PUPIL OR STUDENT -8 SINGLE 5 NO 10 NO ## nociga alcabuse bmi ## 1 -8 NO 30.79585 ## 2 -8 NO 23.44934 ## 3 -8 NO 18.36547 ## 4 -8 NO 30.46875 ## 5 20 NO 20.02884 ## 6 -8 NO 23.87511The objective of synthesising data is to generate a data set which resembles the original as closely as possible, warts and all, meaning also preserving the missing value structure. There are two ways to deal with missing values 1) impute/treat missing values before synthesis 2) synthesise the missing values and deal with the missings later. The second option is generally better since the purpose the data is supporting may influence how the missing values are treated.

Missing values can be simply NA or some numeric code specified by the collection. A useful inclusion is the syn function allows for different NA types, for example income, nofriend and nociga features -8 as a missing value. A list is passed to the function in the following form.

# setting continuous variable NA list cont.na.list <- list(income = c(NA, -8), nofriend = c(NA, -8), nociga = c(NA, -8))By not including this the -8’s will be treated as a numeric value and may distort the synthesis.

After synthesis, there is often a need to post process the data to ensure it is logically consistent. For example, anyone who is married must be over 18 and anyone who doesn’t smoke shouldn’t have a value recorded for ‘number of cigarettes consumed’. These rules can be applied during synthesis rather than needing adhoc post processing.

# apply rules to ensure consistency rules.list <- list( marital = "age < 18", nociga = "smoke == 'NO'") rules.value.list <- list( marital = "SINGLE", nociga = -8)The variables in the condition need to be synthesised before applying the rule otherwise the function will throw an error. In this case age should be synthesised before marital and smoke should be synthesised before nociga.

There is one person with a bmi of 450.

SD2011[which.max(SD2011$bmi),] ## sex age agegr placesize region edu ## 3953 FEMALE 58 45-59 URBAN 20,000-100,000 Lubelskie SECONDARY ## eduspec socprof unempdur income ## 3953 economy and administration LONG-TERM SICK/DISABLED -8 1300 ## marital mmarr ymarr msepdiv ysepdiv ls depress ## 3953 MARRIED 4 1982 NA NA MOSTLY SATISFIED 1 ## trust trustfam trustneigh sport nofriend smoke ## 3953 ONE CAN`T BE TOO CAREFUL YES YES YES 2 NO ## nociga alcabuse alcsol workab wkabdur wkabint wkabintdur emcc englang ## 3953 -8 NO NO NO -8 NO NONE ## height weight bmi ## 3953 149 NA 449.9797Their weight is missing from the data set and would need to be for this to be accurate. I don’t believe this is correct! So, any bmi over 75 (which is still very high) will be considered a missing value and corrected before synthesis.

# getting around the error: synthesis needs to occur before the rules are applied original.df$bmi <- ifelse(original.df$bmi > 75, NA, original.df$bmi)Consider a data set with variables. In a nutshell, synthesis follows these steps:

- Take a simple random sample of and set as
- Fit model and draw from
- Fit model and draw from
- And so on, until

The data can now be synthesised using the following code.

# synthesise data synth.obj <- syn(original.df, cont.na = cont.na.list, rules = rules.list, rvalues = rules.value.list, seed = myseed) ## ## Unexpected values (not obeying the rules) found for variable(s): nociga. ## Rules have been applied but make sure they are correct. ## Synthesis ## ----------- ## sex age socprof income marital depress sport nofriend smoke nociga ## alcabuse bmi synth.obj ## Call: ## ($call) syn(data = original.df, rules = rules.list, rvalues = rules.value.list, ## cont.na = cont.na.list, seed = myseed) ## ## Number of synthesised data sets: ## ($m) 1 ## ## First rows of synthesised data set: ## ($syn) ## sex age socprof income marital depress sport ## 1 FEMALE 45 EMPLOYED IN PUBLIC SECTOR 1500 SINGLE 5 YES ## 2 MALE 65 OTHER ECONOMICALLY INACTIVE 500 SINGLE 5 YES ## 3 FEMALE 17 PUPIL OR STUDENT NA SINGLE 3 NO ## 4 FEMALE 48 EMPLOYED IN PRIVATE SECTOR 1000 MARRIED 0 NO ## 5 MALE 50 EMPLOYED IN PRIVATE SECTOR 1300 MARRIED 0 NO ## 6 FEMALE 65 RETIRED 1200 WIDOWED 6 NO ## nofriend smoke nociga alcabuse bmi ## 1 3 NO -8 NO 26.12245 ## 2 30 NO -8 NO 29.32099 ## 3 7 NO -8 NO 22.40588 ## 4 10 NO -8 NO 25.81663 ## 5 12 YES 20 YES 27.17063 ## 6 15 NO -8 NO 27.51338 ## ... ## ## Synthesising methods: ## ($method) ## sex age socprof income marital depress sport nofriend ## "sample" "cart" "cart" "cart" "cart" "cart" "cart" "cart" ## smoke nociga alcabuse bmi ## "cart" "cart" "cart" "cart" ## ## Order of synthesis: ## ($visit.sequence) ## sex age socprof income marital depress sport nofriend ## 1 2 3 4 5 6 7 8 ## smoke nociga alcabuse bmi ## 9 10 11 12 ## ## Matrix of predictors: ## ($predictor.matrix) ## sex age socprof income marital depress sport nofriend smoke ## sex 0 0 0 0 0 0 0 0 0 ## age 1 0 0 0 0 0 0 0 0 ## socprof 1 1 0 0 0 0 0 0 0 ## income 1 1 1 0 0 0 0 0 0 ## marital 1 1 1 1 0 0 0 0 0 ## depress 1 1 1 1 1 0 0 0 0 ## sport 1 1 1 1 1 1 0 0 0 ## nofriend 1 1 1 1 1 1 1 0 0 ## smoke 1 1 1 1 1 1 1 1 0 ## nociga 1 1 1 1 1 1 1 1 1 ## alcabuse 1 1 1 1 1 1 1 1 1 ## bmi 1 1 1 1 1 1 1 1 1 ## nociga alcabuse bmi ## sex 0 0 0 ## age 0 0 0 ## socprof 0 0 0 ## income 0 0 0 ## marital 0 0 0 ## depress 0 0 0 ## sport 0 0 0 ## nofriend 0 0 0 ## smoke 0 0 0 ## nociga 0 0 0 ## alcabuse 1 0 0 ## bmi 1 1 0The compare function allows for easy checking of the sythesised data.

# compare the synthetic and original data frames compare(synth.obj, original.df, nrow = 3, ncol = 4, cols = mycols)$plotSolid. The distributions are very well preserved. Did the rules work on the smoking variable?

# checking rules worked table(synth.obj$syn[,c("smoke", "nociga")]) ## nociga ## smoke -8 1 2 3 4 5 6 7 8 10 12 14 15 ## YES 13 13 13 32 5 49 17 12 28 308 26 3 135 ## NO 3777 0 0 0 0 0 0 0 0 0 0 0 0 ## nociga ## smoke 16 18 20 21 22 24 25 26 29 30 35 40 50 ## YES 5 7 418 2 1 2 31 2 2 51 1 33 1 ## NO 0 0 0 0 0 0 0 0 0 0 0 0 0 ## nociga ## smoke 60 ## YES 1 ## NO 0They did. All non-smokers have missing values for the number of cigarettes consumed.

compare can also be used for model output checking. A logistic regression model will be fit to find the important predictors of depression. The depression variable ranges from 0-21. This will be converted to

- 0-7 – no evidence of depression (0)
- 8-21 – evidence of depression (1)

This split leaves 3822 (0)’s and 1089 (1)’s for modelling.

# ------------ MODEL COMPARISON glm1 <- glm.synds(ifelse(depress > 7, 1, 0) ~ sex + age + log(income) + sport + nofriend + smoke + alcabuse + bmi, data = synth.obj, family = "binomial") ## Warning in log(income): NaNs produced summary(glm1) ## Warning: Note that all these results depend on the synthesis model being correct. ## ## Fit to synthetic data set with a single synthesis. ## Inference to coefficients and standard errors that ## would be obtained from the observed data. ## ## Call: ## glm.synds(formula = ifelse(depress > 7, 1, 0) ~ sex + age + log(income) + ## sport + nofriend + smoke + alcabuse + bmi, family = "binomial", ## data = synth.obj) ## ## Combined estimates: ## xpct(Beta) xpct(se.Beta) xpct(z) Pr(>|xpct(z)|) ## (Intercept) -1.00819811 0.71605764 -1.4080 0.1591356 ## sexFEMALE 0.35681507 0.10010909 3.5643 0.0003649 *** ## age 0.09004527 0.00384758 23.4031 < 2.2e-16 *** ## log(income) -0.68041355 0.08829602 -7.7060 1.298e-14 *** ## sportNO -0.66446597 0.11880451 -5.5929 2.233e-08 *** ## nofriend 0.00028325 0.00679104 0.0417 0.9667305 ## smokeNO 0.08544511 0.11894243 0.7184 0.4725269 ## alcabuseNO -0.72124437 0.21444108 -3.3634 0.0007700 *** ## bmi 0.00644972 0.01036016 0.6226 0.5335801 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # compare to the original data set compare(glm1, original.df, lcol = mycols) ## Warning in log(income): NaNs produced ## ## Call used to fit models to the data: ## glm.synds(formula = ifelse(depress > 7, 1, 0) ~ sex + age + log(income) + ## sport + nofriend + smoke + alcabuse + bmi, family = "binomial", ## data = synth.obj) ## ## Differences between results based on synthetic and observed data: ## Std. coef diff p value CI overlap ## (Intercept) -1.26517500 0.206 0.6772453 ## sexFEMALE 0.27373709 0.784 0.9301678 ## age 0.85530291 0.392 0.7818065 ## log(income) 0.98568572 0.324 0.7485449 ## sportNO 0.16485706 0.869 0.9579439 ## nofriend 1.31926470 0.187 0.6634467 ## smokeNO -0.07386025 0.941 0.9811578 ## alcabuseNO 0.02501199 0.980 0.9936193 ## bmi -0.17971185 0.857 0.9541543 ## ## Measures for one synthesis and 9 coefficients ## Mean confidence interval overlap: 0.8542318 ## Mean absolute std. coef diff: 0.5714007 ## Lack-of-fit: 5.49732; p-value 0.789 for test that synthesis model is compatible ## with a chi-squared test with 9 degrees of freedom ## ## Confidence interval plot:While the model needs more work, the same conclusions would be made from both the original and synthetic data set as can be seen from the confidence interavals. Occaisonally there may be contradicting conclusions made about a variable, accepting it in the observed data but not in the synthetic data for example. This scenario could be corrected by using different synthesis methods (see documentation) or altering the visit sequence.

Preserving count dataReleased population data are often counts of people in geographical areas by demographic variables (age, sex, etc). Some cells in the table can be very small e.g. <5. For privacy reasons these cells are suppressed to protect peoples identity. With a synthetic data, suppression is not required given it contains no real people, assuming there is enough uncertainty in how the records are synthesised.

The existence of small cell counts opens a few questions,

- If very few records exist in a particular grouping (1-4 records in an area) can they be accurately simulated by synthpop?
- Is the structure of the count data preserved?

To test this 200 areas will be simulated to replicate possible real world scenarios. Area size will be randomly allocated ensuring a good mix of large and small population sizes. Population sizes are randomly drawn from a Poisson distribution with mean . If large, is drawn from a uniform distribution on the interval [20, 40]. If small, is set to 1.

# ---------- AREA # add area flag to the data frame area.label <- paste0("area", 1:200) a <- sample(0:1, 200, replace = TRUE, prob = c(0.5, 0.5)) lambda <- runif(200, 20, 40)*(1-a) + a prob.dist <- rpois(200, lambda) area <- sample(area.label, 5000, replace = TRUE, prob = prob.dist) # attached to original data frame original.df <- SD2011 %>% dplyr::select(sex, age, socprof, income, marital, depress, sport, nofriend, smoke, nociga, alcabuse, bmi) original.df$bmi <- ifelse(original.df$bmi > 75, NA, original.df$bmi) original.df <- cbind(original.df, area) %>% arrange(area)The sequence of synthesising variables and the choice of predictors is important when there are rare events or low sample areas. If Synthesised very early in the procedure and used as a predictor for following variables, it’s likely the subsequent models will over-fit. Synthetic data sets require a level of uncertainty to reduce the risk of statistical disclosure, so this is not ideal.

Fortunately syn allows for modification of the predictor matrix. To avoid over-fitting, ‘area’ is the last variable to by synthesised and will only use sex and age as predictors. This is reasonable to capture the key population characteristics. Additionally, syn throws an error unless maxfaclevels is changed to the number of areas (the default is 60). This is to prevent poorly synthesised data for this reason and a warning message suggest to check the results, which is good practice.

# synthesise data # m is set to 0 as a hack to set the synds object and the predictor matrix synth.obj.b <- syn(original.df, cont.na = cont.na.list, rules = rules.list, rvalues = rules.value.list, maxfaclevels = 200, seed = myseed, m = 0) ## ## Unexpected values (not obeying the rules) found for variable(s): nociga. ## Rules have been applied but make sure they are correct. # changing the predictor matrix to predict area with only age and sex new.pred.mat <- synth.obj.b$predictor.matrix new.pred.mat["area",] <- 0 new.pred.mat["area",c("age", "sex")] <- 1 new.pred.mat ## sex age socprof income marital depress sport nofriend smoke ## sex 0 0 0 0 0 0 0 0 0 ## age 1 0 0 0 0 0 0 0 0 ## socprof 1 1 0 0 0 0 0 0 0 ## income 1 1 1 0 0 0 0 0 0 ## marital 1 1 1 1 0 0 0 0 0 ## depress 1 1 1 1 1 0 0 0 0 ## sport 1 1 1 1 1 1 0 0 0 ## nofriend 1 1 1 1 1 1 1 0 0 ## smoke 1 1 1 1 1 1 1 1 0 ## nociga 1 1 1 1 1 1 1 1 1 ## alcabuse 1 1 1 1 1 1 1 1 1 ## bmi 1 1 1 1 1 1 1 1 1 ## area 1 1 0 0 0 0 0 0 0 ## nociga alcabuse bmi area ## sex 0 0 0 0 ## age 0 0 0 0 ## socprof 0 0 0 0 ## income 0 0 0 0 ## marital 0 0 0 0 ## depress 0 0 0 0 ## sport 0 0 0 0 ## nofriend 0 0 0 0 ## smoke 0 0 0 0 ## nociga 0 0 0 0 ## alcabuse 1 0 0 0 ## bmi 1 1 0 0 ## area 0 0 0 0 # synthesising with new predictor synth.obj.b <- syn(original.df, cont.na = cont.na.list, rules = rules.list, rvalues = rules.value.list, maxfaclevels = 200, seed = myseed, proper = TRUE, predictor.matrix = new.pred.mat) ## ## Unexpected values (not obeying the rules) found for variable(s): nociga. ## Rules have been applied but make sure they are correct. ## Synthesis ## ----------- ## sex age socprof income marital depress sport nofriend smoke nociga ## alcabuse bmi area # compare the synthetic and original data frames compare(synth.obj.b, original.df, vars = "area", nrow = 1, ncol = 1, cols = c("darkmagenta", "turquoise"), stat = "counts")$plotThe area variable is simulated fairly well on simply age and sex. It captures the large and small areas, however the large areas are relatively more variable. This could use some fine tuning, but will stick with this for now.

tab.syn <- synth.obj.b$syn %>% dplyr::select(area, sex) %>% table() tab.orig <- original.df %>% dplyr::select(area, sex) %>% table() ## ## synthetic ## sex ## area MALE FEMALE ## area1 2 0 ## area10 15 19 ## area101 0 6 ## area103 35 22 ## area105 3 0 ## area106 30 31 ## area107 0 3 ## area108 28 11 ## area110 17 37 ## area112 23 24 ## area113 0 2 ## area114 21 52 ## area115 30 28 ## ## original ## sex ## area MALE FEMALE ## area1 1 0 ## area10 19 27 ## area101 1 3 ## area103 29 26 ## area105 1 0 ## area106 22 33 ## area107 0 4 ## area108 28 18 ## area110 18 33 ## area112 23 25 ## area113 1 2 ## area114 28 39 ## area115 23 44 d <- data.frame(difference = as.numeric(tab.syn - tab.orig), sex = c(rep("Male", 150), rep("Female", 150))) ggplot(d, aes(x = difference, fill = sex)) + geom_histogram() + facet_grid(sex ~ .) + scale_fill_manual(values = mycols) ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.The method does a good job at preserving the structure for the areas. How much variability is acceptable is up to the user and intended purpose. Using more predictors may provide a better fit. The errors are distributed around zero, a good sign no bias has leaked into the data from the synthesis.

tab.syn <- synth.obj.b$syn %>% dplyr::select(marital, sex) %>% table() tab.orig <- original.df %>% dplyr::select(marital, sex) %>% table() ## ## synthetic ## sex ## marital MALE FEMALE ## SINGLE 667 565 ## MARRIED 1352 1644 ## WIDOWED 76 398 ## DIVORCED 81 169 ## LEGALLY SEPARATED 2 0 ## DE FACTO SEPARATED 6 36 ## ## original ## sex ## marital MALE FEMALE ## SINGLE 657 596 ## MARRIED 1382 1597 ## WIDOWED 66 465 ## DIVORCED 62 137 ## LEGALLY SEPARATED 6 1 ## DE FACTO SEPARATED 4 18At higher levels of aggregation the structure of tables is more maintained.

Build your own methods‘synthpop’ is built with a similar function to the ‘mice’ package where user defined methods can be specified and passed to the syn function using the form syn.newmethod. To demonstrate this we’ll build our own neural net method.

As a minimum the function takes as input

- y – observed variable to be synthesised
- x – observed predictors
- xp – synthesised predictors

Set the method vector to apply the new neural net method for the factors, ctree for the others and pass to syn.

# methods vector meth.vec <- c("sample", ifelse(sapply(original.df[,-1], is.factor), "nn", "ctree")) meth.vec[13] <- "ctree" # synthesise synth.obj.c <- syn(original.df, method = meth.vec, cont.na = cont.na.list, rules = rules.list, rvalues = rules.value.list, maxfaclevels = 200, seed = myseed, predictor.matrix = new.pred.mat) ## ## Unexpected values (not obeying the rules) found for variable(s): nociga. ## Rules have been applied but make sure they are correct. ## Synthesis ## ----------- ## sex age socprof income marital depress sport nofriend smoke nociga ## alcabuse bmi area # compare the synthetic and original data frames compare(synth.obj.c, original.df, vars = colnames(original.df)[-13], nrow = 3, ncol = 4, cols = c("darkmagenta", "turquoise"), stat = "counts")$plotThe results are very similar to above with the exception of ‘alcabuse’, but this demonstrates how new methods can be applied.

TakeawaysThe ‘synthpop’ package is great for synthesising data for statistical disclosure control or creating training data for model development. Other things to note,

- Synthesising a single table is fast and simple.
- Watch out for over-fitting particularly with factors with many levels. Ensure the visit sequence is reasonable.
- You are not constrained by only the supported methods, you can build your own!

Following posts tackle complications that arise when there are multiple tables at different grains that are to be synthesised. Further complications arise when their relationships in the database also need to be preserved. Ideally the data is synthesised and stored alongside the original enabling any report or analysis to be conducted on either the original or synthesised data. This will require some trickery to get synthpop to do the right thing, but is possible.

The post Generating Synthetic Data Sets with ‘synthpop’ in R appeared first on Daniel Oehm | Gradient Descending.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Daniel Oehm | Gradient Descending**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Making sense of the METS and ALTO XML standards

(This article was first published on ** Econometrics and Free Software**, and kindly contributed to R-bloggers)

Last week I wrote a blog post where I analyzed

one year of newspapers ads from 19th century newspapers. The data is made available by the

national library of Luxembourg.

In this blog post, which is part 1 of a 2 part series, I extract data from the 257gb archive, which

contains 10 years of publications of the *L’Union*, another 19th century Luxembourguish newspaper

written in French. As I explained in the previous post, to make life easier to data scientists,

the national library also included ALTO and METS files (which are a XML files used to

describe the layout and contents of physical text sources, such as pages of a book or newspaper)

which can be easily parsed by R.

This is how a ALTO file looks like:

Each page of the newspaper of a given day has one ALTO file.

This is how a METS file looks like:

For each daily issue of the newspaper, there is a METS file. So 1 METS file for 4 ALTO files.

In my last blog post, I only extracted the words from the ALTO file (red rectangles of the first

screenshot) and did not touch the METS file.

The problem of doing this is that I get all the words for each page, without knowing which

come from the same article. If I want to know which words come from the same article, I need to use

the info from the METS file. From the METS file I have the ID of the article, and some other

metadata, such as the title of the article and the type of the article (which can be *article*,

*advertisement*, etc). The information highlighted with the green rectangles in the METS file

can be linked to the green rectangles from the ALTO files. My goal is to get the following data

frame from the METS file:

and this data frame from the ALTO files:

As you can see, by combining both data frames I can know which words come from the same article,

which will be helpful for further analysis.

A lot of things happened in the 1860s.

I am really curious to see if and how these events where reported in a Luxembourguish newspaper.

I am particularly curious about how long it took to report certain news from far away, such as the

assassination of Abraham Lincoln. But before that I need to extract the data!

I will only focus on the METS file. The logic for the ALTO file is the same. All the source code

will be in the appendix of this blog post.

First, let’s take a look at a METS file:

library(tidyverse) mets <- read_file("1533660_newspaper_lunion_1860-11-14/1533660_newspaper_lunion_1860-11-14-mets.xml")This is how it looks like:

"<?xml version='1.0' encoding='utf-8'?>\r\n\r\n \r\n \r\n CCS docWORKS/METAe Version 6.4-3\r\n docWORKS-ID: 101636\r\n \r\n \r\n \r\n \r\n \r\n \r\n lunion\r\n \r\n L'UNION.\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n Chemins de fer. — Service d'hiver.\r\n \r\n \r\n fr\r\n ...."As usual when you import text files like this, it’s always a good idea to split the file. I will

split at the "DMDID" character. Take a look back at the second screenshot. The very first tag,

first row, first word after div is "DMDID". By splitting at this level, I will get back a list,

where each element is the content of this div DMDID block. This is exactly what I need, since

this block contains the information from the green rectangles.

So let’s split the mets variable at this level:

Let’s take a look at mets_articles:

str(mets_articles) chr [1:25] "<?xml version='1.0' encoding='utf-8'?>\r\nDoesn’t seem to be very helpful, but actually it is. We can see that mets_articles is a now a list

of 25 elements.

This means that for each element of mets_articles, I need to get the identifier, the label, the type

(the red rectangles from the screenshot), but also the information from the "BEGIN" element (the green

rectangle).

To do this, I’ll be using regular expressions. In general, I start by experimenting in the console,

and then when things start looking good, I write a function. Here is this function:

This function may seem complicated, but it simply encapsulates some pretty standard steps to get

the data I need. I had to consider two cases. The first case is when I need to extract all the

elements with str_extract_all(), or only the first occurrence, with str_extract().

Let’s test it on the first article of the mets_articles list:

Let’s see what happens with all = TRUE:

extractor(mets_articles_1, "ID", all = TRUE) ## [1] "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" ## [15] "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" ## [29] "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" ## [43] "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" ## [57] "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" ## [71] "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" ## [85] "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" ## [99] "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" ## [113] "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" "ID" ## [127] "ID" "ID" "ID" "ID" "ID"This seems to work as intended. Since I need to call this function several times, I’ll be writing

another function that extracts all I need:

This function uses complex regular expressions to extract the strings I need, and then puts

the result into a data frame, with the tibble() function. I then use unnest(), because label,

type, begins and id are not the same length. label, type and id are of length 1, while

begins is longer. This means that when I put them into a data frame it looks like this:

With unnest(), I get a nice data frame:

tribble(~a, ~b, "a", rep("b", 4)) %>% unnest() ## # A tibble: 4 x 2 ## a b ## ## 1 a b ## 2 a b ## 3 a b ## 4 a bNow, I simply need to map this function to all the files and that’s it! For this, I will write yet

another helper function:

This function takes the path to a METS file as input, and processes it using the steps I explained

above. The only difference is that I add a column containing the name of the file that was processed,

and write the resulting data frame directly to disk as a data frame. Finally, I can map this function to all the METS

files:

I use {furrr} to extract the data from all the files in parallel, by putting 8 cores of my CPU to

work. This took around 3 minutes and 20 seconds to finish.

That’s it for now, stay tuned for part 2 where I will analyze this fresh data!

Hope you enjoyed! If you found this blog post useful, you might want to follow

me on twitter for blog post updates and

buy me an espresso or paypal.me.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Econometrics and Free Software**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Practical Data Science with R, 2nd Edition discount!

(This article was first published on ** R – Win-Vector Blog**, and kindly contributed to R-bloggers)

Please help share our news and this discount.

The second edition of our best-selling book *Practical Data Science with R2*, Zumel, Mount is featured as deal of the day at Manning.

The second edition isn’t finished yet, but chapters 1 through 4 are available in the Manning Early Access Program (MEAP), and we have finished chapters 5 and 6 which are now in production at Manning (so they should be available soon). The authors are hard at work on chapters 7 and 8 right now.

The discount gets you half off. Also the 2nd edition comes with a free e-copy the first edition (so you can jump ahead).

Here are the details in Tweetable form:

Deal of the Day January 13: Half off Practical Data Science with R, Second Edition. Use code dotd011319au at http://bit.ly/2SKAxe9.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Win-Vector Blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...