R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Image Classification on Small Datasets with Keras

Thu, 12/14/2017 - 01:00

(This article was first published on TensorFlow for R, and kindly contributed to R-bloggers)

Deep Learning with R

This post is an excerpt from Chapter 5 of François Chollet’s and J.J. Allaire’s book, Deep Learning with R (Manning Publications). Use the code fccallaire for a 42% discount on the book.

Training a convnet with a small dataset

Having to train an image-classification model using very little data is a common situation, which you’ll likely encounter in practice if you ever do computer vision in a professional context. A “few” samples can mean anywhere from a few hundred to a few tens of thousands of images. As a practical example, we’ll focus on classifying images as dogs or cats, in a dataset containing 4,000 pictures of cats and dogs (2,000 cats, 2,000 dogs). We’ll use 2,000 pictures for training, 1,000 for validation, and 1,000 for testing.

In Chapter 5 of the Deep Learning with R book we review three techniques for tackling this problem. The first of these is training a small model from scratch on what little data you have (which achieves an accuracy of 82%). Subsequently we use feature extraction with a pretrained network (resulting in an accuracy of 90% to 96%) and fine-tuning a pretrained network (with a final accuracy of 97%). In this post we’ll cover only the second and third techniques.

The relevance of deep learning for small-data problems

You’ll sometimes hear that deep learning only works when lots of data is available. This is valid in part: one fundamental characteristic of deep learning is that it can find interesting features in the training data on its own, without any need for manual feature engineering, and this can only be achieved when lots of training examples are available. This is especially true for problems where the input samples are very high-dimensional, like images.

But what constitutes lots of samples is relative – relative to the size and depth of the network you’re trying to train, for starters. It isn’t possible to train a convnet to solve a complex problem with just a few tens of samples, but a few hundred can potentially suffice if the model is small and well regularized and the task is simple. Because convnets learn local, translation-invariant features, they’re highly data efficient on perceptual problems. Training a convnet from scratch on a very small image dataset will still yield reasonable results despite a relative lack of data, without the need for any custom feature engineering. You’ll see this in action in this section.

What’s more, deep-learning models are by nature highly repurposable: you can take, say, an image-classification or speech-to-text model trained on a large-scale dataset and reuse it on a significantly different problem with only minor changes. Specifically, in the case of computer vision, many pretrained models (usually trained on the ImageNet dataset) are now publicly available for download and can be used to bootstrap powerful vision models out of very little data. That’s what you’ll do in the next section. Let’s start by getting your hands on the data.

Downloading the data

The Dogs vs. Cats dataset that you’ll use isn’t packaged with Keras. It was made available by Kaggle as part of a computer-vision competition in late 2013, back when convnets weren’t mainstream. You can download the original dataset from Kaggle (you’ll need to create a Kaggle account if you don’t already have one – don’t worry, the process is painless).

The pictures are medium-resolution color JPEGs. Here are some examples:

Unsurprisingly, the dogs-versus-cats Kaggle competition in 2013 was won by entrants who used convnets. The best entries achieved up to 95% accuracy. Below you’ll end up with a 97% accuracy, even though you’ll train your models on less than 10% of the data that was available to the competitors.

This dataset contains 25,000 images of dogs and cats (12,500 from each class) and is 543 MB (compressed). After downloading and uncompressing it, you’ll create a new dataset containing three subsets: a training set with 1,000 samples of each class, a validation set with 500 samples of each class, and a test set with 500 samples of each class.

Following is the code to do this:

original_dataset_dir <- "~/Downloads/kaggle_original_data"

base_dir <- "~/Downloads/cats_and_dogs_small"
dir.create(base_dir)

train_dir <- file.path(base_dir, "train")
dir.create(train_dir)
validation_dir <- file.path(base_dir, "validation")
dir.create(validation_dir)
test_dir <- file.path(base_dir, "test")
dir.create(test_dir)

train_cats_dir <- file.path(train_dir, "cats")
dir.create(train_cats_dir)
train_dogs_dir <- file.path(train_dir, "dogs")
dir.create(train_dogs_dir)
validation_cats_dir <- file.path(validation_dir, "cats")
dir.create(validation_cats_dir)
validation_dogs_dir <- file.path(validation_dir, "dogs")
dir.create(validation_dogs_dir)
test_cats_dir <- file.path(test_dir, "cats")
dir.create(test_cats_dir)
test_dogs_dir <- file.path(test_dir, "dogs")
dir.create(test_dogs_dir)

fnames <- paste0("cat.", 1:1000, ".jpg")
file.copy(file.path(original_dataset_dir, fnames),
          file.path(train_cats_dir))
fnames <- paste0("cat.", 1001:1500, ".jpg")
file.copy(file.path(original_dataset_dir, fnames),
          file.path(validation_cats_dir))
fnames <- paste0("cat.", 1501:2000, ".jpg")
file.copy(file.path(original_dataset_dir, fnames),
          file.path(test_cats_dir))
fnames <- paste0("dog.", 1:1000, ".jpg")
file.copy(file.path(original_dataset_dir, fnames),
          file.path(train_dogs_dir))
fnames <- paste0("dog.", 1001:1500, ".jpg")
file.copy(file.path(original_dataset_dir, fnames),
          file.path(validation_dogs_dir))
fnames <- paste0("dog.", 1501:2000, ".jpg")
file.copy(file.path(original_dataset_dir, fnames),
          file.path(test_dogs_dir))

Using a pretrained convnet
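The repetitive copying above can also be condensed into a pair of loops. This is an equivalent sketch (not code from the book), assuming the same kaggle_original_data layout, where files are named cat.1.jpg, dog.1.jpg, and so on:

```r
# Equivalent, loop-based version of the dataset split above.
# Assumes Kaggle-style file names: cat.1.jpg ... dog.2000.jpg.
make_subset <- function(original_dir, target_dir, animal, indices) {
  dir.create(target_dir, recursive = TRUE, showWarnings = FALSE)
  fnames <- paste0(animal, ".", indices, ".jpg")
  file.copy(file.path(original_dir, fnames), target_dir)
}

split_dataset <- function(original_dir, base_dir) {
  ranges <- list(train = 1:1000, validation = 1001:1500, test = 1501:2000)
  for (subset in names(ranges)) {
    for (animal in c("cat", "dog")) {
      make_subset(original_dir,
                  file.path(base_dir, subset, paste0(animal, "s")),
                  animal, ranges[[subset]])
    }
  }
}
```

The directory names produced (train/cats, validation/dogs, etc.) match the explicit version above, so either variant can feed the generators used later.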

A common and highly effective approach to deep learning on small image datasets is to use a pretrained network. A pretrained network is a saved network that was previously trained on a large dataset, typically on a large-scale image-classification task. If this original dataset is large enough and general enough, then the spatial hierarchy of features learned by the pretrained network can effectively act as a generic model of the visual world, and hence its features can prove useful for many different computer-vision problems, even though these new problems may involve completely different classes than those of the original task. For instance, you might train a network on ImageNet (where classes are mostly animals and everyday objects) and then repurpose this trained network for something as remote as identifying furniture items in images. Such portability of learned features across different problems is a key advantage of deep learning compared to many older, shallow-learning approaches, and it makes deep learning very effective for small-data problems.

In this case, let’s consider a large convnet trained on the ImageNet dataset (1.4 million labeled images and 1,000 different classes). ImageNet contains many animal classes, including different species of cats and dogs, and you can thus expect to perform well on the dogs-versus-cats classification problem.

You’ll use the VGG16 architecture, developed by Karen Simonyan and Andrew Zisserman in 2014; it’s a simple and widely used convnet architecture for ImageNet. Although it’s an older model, far from the current state of the art and somewhat heavier than many other recent models, I chose it because its architecture is similar to what you’re already familiar with and is easy to understand without introducing any new concepts. This may be your first encounter with one of these cutesy model names – VGG, ResNet, Inception, Inception-ResNet, Xception, and so on; you’ll get used to them, because they will come up frequently if you keep doing deep learning for computer vision.

There are two ways to use a pretrained network: feature extraction and fine-tuning. We’ll cover both of them. Let’s start with feature extraction.

Feature extraction

Feature extraction consists of using the representations learned by a previous network to extract interesting features from new samples. These features are then run through a new classifier, which is trained from scratch.

As you saw previously, convnets used for image classification comprise two parts: they start with a series of pooling and convolution layers, and they end with a densely connected classifier. The first part is called the convolutional base of the model. In the case of convnets, feature extraction consists of taking the convolutional base of a previously trained network, running the new data through it, and training a new classifier on top of the output.

Why only reuse the convolutional base? Could you reuse the densely connected classifier as well? In general, doing so should be avoided. The reason is that the representations learned by the convolutional base are likely to be more generic and therefore more reusable: the feature maps of a convnet are presence maps of generic concepts over a picture, which is likely to be useful regardless of the computer-vision problem at hand. But the representations learned by the classifier will necessarily be specific to the set of classes on which the model was trained – they will only contain information about the presence probability of this or that class in the entire picture. Additionally, representations found in densely connected layers no longer contain any information about where objects are located in the input image: these layers get rid of the notion of space, whereas the object location is still described by convolutional feature maps. For problems where object location matters, densely connected features are largely useless.

Note that the level of generality (and therefore reusability) of the representations extracted by specific convolution layers depends on the depth of the layer in the model. Layers that come earlier in the model extract local, highly generic feature maps (such as visual edges, colors, and textures), whereas layers that are higher up extract more-abstract concepts (such as “cat ear” or “dog eye”). So if your new dataset differs a lot from the dataset on which the original model was trained, you may be better off using only the first few layers of the model to do feature extraction, rather than using the entire convolutional base.

In this case, because the ImageNet class set contains multiple dog and cat classes, it’s likely to be beneficial to reuse the information contained in the densely connected layers of the original model. But we’ll choose not to, in order to cover the more general case where the class set of the new problem doesn’t overlap the class set of the original model.

Let’s put this in practice by using the convolutional base of the VGG16 network, trained on ImageNet, to extract interesting features from cat and dog images, and then train a dogs-versus-cats classifier on top of these features.

The VGG16 model, among others, comes prepackaged with Keras. Here’s the list of image-classification models (all pretrained on the ImageNet dataset) that are available as part of Keras:

  • Xception
  • Inception V3
  • ResNet50
  • VGG16
  • VGG19
  • MobileNet

Let’s instantiate the VGG16 model.

library(keras)

conv_base <- application_vgg16(
  weights = "imagenet",
  include_top = FALSE,
  input_shape = c(150, 150, 3)
)

You pass three arguments to the function:

  • weights specifies the weight checkpoint from which to initialize the model.
  • include_top refers to including (or not) the densely connected classifier on top of the network. By default, this densely connected classifier corresponds to the 1,000 classes from ImageNet. Because you intend to use your own densely connected classifier (with only two classes: cat and dog), you don’t need to include it.
  • input_shape is the shape of the image tensors that you’ll feed to the network. This argument is purely optional: if you don’t pass it, the network will be able to process inputs of any size.

Here’s the detail of the architecture of the VGG16 convolutional base. It’s similar to the simple convnets you’re already familiar with:

summary(conv_base)

Layer (type)                     Output Shape          Param #
================================================================
input_1 (InputLayer)             (None, 150, 150, 3)   0
________________________________________________________________
block1_conv1 (Convolution2D)     (None, 150, 150, 64)  1792
________________________________________________________________
block1_conv2 (Convolution2D)     (None, 150, 150, 64)  36928
________________________________________________________________
block1_pool (MaxPooling2D)       (None, 75, 75, 64)    0
________________________________________________________________
block2_conv1 (Convolution2D)     (None, 75, 75, 128)   73856
________________________________________________________________
block2_conv2 (Convolution2D)     (None, 75, 75, 128)   147584
________________________________________________________________
block2_pool (MaxPooling2D)       (None, 37, 37, 128)   0
________________________________________________________________
block3_conv1 (Convolution2D)     (None, 37, 37, 256)   295168
________________________________________________________________
block3_conv2 (Convolution2D)     (None, 37, 37, 256)   590080
________________________________________________________________
block3_conv3 (Convolution2D)     (None, 37, 37, 256)   590080
________________________________________________________________
block3_pool (MaxPooling2D)       (None, 18, 18, 256)   0
________________________________________________________________
block4_conv1 (Convolution2D)     (None, 18, 18, 512)   1180160
________________________________________________________________
block4_conv2 (Convolution2D)     (None, 18, 18, 512)   2359808
________________________________________________________________
block4_conv3 (Convolution2D)     (None, 18, 18, 512)   2359808
________________________________________________________________
block4_pool (MaxPooling2D)       (None, 9, 9, 512)     0
________________________________________________________________
block5_conv1 (Convolution2D)     (None, 9, 9, 512)     2359808
________________________________________________________________
block5_conv2 (Convolution2D)     (None, 9, 9, 512)     2359808
________________________________________________________________
block5_conv3 (Convolution2D)     (None, 9, 9, 512)     2359808
________________________________________________________________
block5_pool (MaxPooling2D)       (None, 4, 4, 512)     0
================================================================
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0
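The spatial sizes in this summary follow directly from VGG16's five 2 × 2 max-pooling layers, each of which halves the feature-map size, truncating fractions (75/2 becomes 37, 37/2 becomes 18, and so on). A quick base-R check of the progression:

```r
# Each of VGG16's five max-pooling layers halves the spatial size,
# with integer (floor) division, just like the summary above shows.
size <- 150
sizes <- size
for (block in 1:5) {
  size <- size %/% 2   # 2x2 pooling halves height and width
  sizes <- c(sizes, size)
}
sizes
# 150 75 37 18 9 4
```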

The final feature map has shape (4, 4, 512). That’s the feature map on top of which you’ll stick a densely connected classifier.

At this point, there are two ways you could proceed:

  • Running the convolutional base over your dataset, recording its output to an array on disk, and then using this data as input to a standalone, densely connected classifier similar to those you saw in part 1 of this book. This solution is fast and cheap to run, because it only requires running the convolutional base once for every input image, and the convolutional base is by far the most expensive part of the pipeline. But for the same reason, this technique won’t allow you to use data augmentation.

  • Extending the model you have (conv_base) by adding dense layers on top, and running the whole thing end to end on the input data. This will allow you to use data augmentation, because every input image goes through the convolutional base every time it’s seen by the model. But for the same reason, this technique is far more expensive than the first.

In this post we’ll cover the second technique in detail (in the book we cover both). Note that this technique is so expensive that you should only attempt it if you have access to a GPU – it’s absolutely intractable on a CPU.

Feature extraction with data augmentation

Because models behave just like layers, you can add a model (like conv_base) to a sequential model just like you would add a layer.

model <- keras_model_sequential() %>%
  conv_base %>%
  layer_flatten() %>%
  layer_dense(units = 256, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

This is what the model looks like now:

summary(model)

Layer (type)                     Output Shape          Param #
================================================================
vgg16 (Model)                    (None, 4, 4, 512)     14714688
________________________________________________________________
flatten_1 (Flatten)              (None, 8192)          0
________________________________________________________________
dense_1 (Dense)                  (None, 256)           2097408
________________________________________________________________
dense_2 (Dense)                  (None, 1)             257
================================================================
Total params: 16,812,353
Trainable params: 16,812,353
Non-trainable params: 0

As you can see, the convolutional base of VGG16 has 14,714,688 parameters, which is very large. The classifier you’re adding on top has 2 million parameters.
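These counts are easy to verify by hand: a dense layer has (inputs × units) + units parameters (weights plus biases), and the flattened VGG16 output is 4 × 4 × 512 = 8,192 values. A base-R check of the numbers in the summary:

```r
# Verify the parameter counts reported by summary(model).
flat <- 4 * 4 * 512                  # flattened conv output: 8192 values

dense_params <- function(inputs, units) inputs * units + units  # weights + biases

dense_1 <- dense_params(flat, 256)   # 2097408, matches dense_1 above
dense_2 <- dense_params(256, 1)      # 257, matches dense_2 above
total   <- 14714688 + dense_1 + dense_2
total
# 16812353
```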

Before you compile and train the model, it’s very important to freeze the convolutional base. Freezing a layer or set of layers means preventing their weights from being updated during training. If you don’t do this, then the representations that were previously learned by the convolutional base will be modified during training. Because the dense layers on top are randomly initialized, very large weight updates would be propagated through the network, effectively destroying the representations previously learned.

In Keras, you freeze a network using the freeze_weights() function:

length(model$trainable_weights)
[1] 30

freeze_weights(conv_base)

length(model$trainable_weights)
[1] 4

With this setup, only the weights from the two dense layers that you added will be trained. That’s a total of four weight tensors: two per layer (the main weight matrix and the bias vector). Note that in order for these changes to take effect, you must first compile the model. If you ever modify weight trainability after compilation, you should then recompile the model, or these changes will be ignored.

Using data augmentation

Overfitting is caused by having too few samples to learn from, rendering you unable to train a model that can generalize to new data. Given infinite data, your model would be exposed to every possible aspect of the data distribution at hand: you would never overfit. Data augmentation takes the approach of generating more training data from existing training samples, by augmenting the samples via a number of random transformations that yield believable-looking images. The goal is that at training time, your model will never see the exact same picture twice. This helps expose the model to more aspects of the data and generalize better.

In Keras, this can be done by configuring a number of random transformations to be performed on the images read by an image_data_generator(). For example:

train_datagen <- image_data_generator(
  rescale = 1/255,
  rotation_range = 40,
  width_shift_range = 0.2,
  height_shift_range = 0.2,
  shear_range = 0.2,
  zoom_range = 0.2,
  horizontal_flip = TRUE,
  fill_mode = "nearest"
)

These are just a few of the options available (for more, see the Keras documentation). Let’s quickly go over this code:

  • rotation_range is a value in degrees (0–180), a range within which to randomly rotate pictures.
  • width_shift_range and height_shift_range are ranges (as a fraction of total width or height) within which to randomly translate pictures horizontally or vertically.
  • shear_range is for randomly applying shearing transformations.
  • zoom_range is for randomly zooming inside pictures.
  • horizontal_flip is for randomly flipping half the images horizontally – relevant when there are no assumptions of horizontal asymmetry (for example, real-world pictures).
  • fill_mode is the strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift.
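To make the “range” semantics concrete, here is a base-R sketch of how one set of augmentation parameters might be drawn per image. The sample_augmentation() helper is hypothetical, purely for illustration – the real sampling happens inside image_data_generator():

```r
# Hypothetical illustration of per-image augmentation sampling;
# each "range" argument bounds a uniform random draw.
sample_augmentation <- function(rotation_range = 40,
                                width_shift_range = 0.2,
                                height_shift_range = 0.2) {
  list(
    rotation     = runif(1, -rotation_range, rotation_range),         # degrees
    width_shift  = runif(1, -width_shift_range, width_shift_range),   # fraction of width
    height_shift = runif(1, -height_shift_range, height_shift_range)  # fraction of height
  )
}

set.seed(42)
params <- sample_augmentation()
```

Because a fresh draw happens for every image at every epoch, the model effectively never sees the exact same picture twice.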

Now we can train our model using the image data generator:

# Note that the validation data shouldn't be augmented!
test_datagen <- image_data_generator(rescale = 1/255)

train_generator <- flow_images_from_directory(
  train_dir,                  # Target directory
  train_datagen,              # Data generator
  target_size = c(150, 150),  # Resizes all images to 150 × 150
  batch_size = 20,
  class_mode = "binary"       # binary_crossentropy loss for binary labels
)

validation_generator <- flow_images_from_directory(
  validation_dir,
  test_datagen,
  target_size = c(150, 150),
  batch_size = 20,
  class_mode = "binary"
)

model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_rmsprop(lr = 2e-5),
  metrics = c("accuracy")
)

history <- model %>% fit_generator(
  train_generator,
  steps_per_epoch = 100,
  epochs = 30,
  validation_data = validation_generator,
  validation_steps = 50
)

Let’s plot the results. As you can see, you reach a validation accuracy of about 96%.


Fine-tuning

Another widely used technique for model reuse, complementary to feature extraction, is fine-tuning. Fine-tuning consists of unfreezing a few of the top layers of a frozen model base used for feature extraction, and jointly training both the newly added part of the model (in this case, the fully connected classifier) and these top layers. This is called fine-tuning because it slightly adjusts the more abstract representations of the model being reused, in order to make them more relevant for the problem at hand.

I stated earlier that it’s necessary to freeze the convolution base of VGG16 in order to be able to train a randomly initialized classifier on top. For the same reason, it’s only possible to fine-tune the top layers of the convolutional base once the classifier on top has already been trained. If the classifier isn’t already trained, then the error signal propagating through the network during training will be too large, and the representations previously learned by the layers being fine-tuned will be destroyed. Thus the steps for fine-tuning a network are as follows:

  • Add your custom network on top of an already-trained base network.
  • Freeze the base network.
  • Train the part you added.
  • Unfreeze some layers in the base network.
  • Jointly train both these layers and the part you added.

You already completed the first three steps when doing feature extraction. Let’s proceed with step 4: you’ll unfreeze your conv_base and then freeze individual layers inside it.

As a reminder, this is what your convolutional base looks like:

summary(conv_base)

Layer (type)                     Output Shape          Param #
================================================================
input_1 (InputLayer)             (None, 150, 150, 3)   0
________________________________________________________________
block1_conv1 (Convolution2D)     (None, 150, 150, 64)  1792
________________________________________________________________
block1_conv2 (Convolution2D)     (None, 150, 150, 64)  36928
________________________________________________________________
block1_pool (MaxPooling2D)       (None, 75, 75, 64)    0
________________________________________________________________
block2_conv1 (Convolution2D)     (None, 75, 75, 128)   73856
________________________________________________________________
block2_conv2 (Convolution2D)     (None, 75, 75, 128)   147584
________________________________________________________________
block2_pool (MaxPooling2D)       (None, 37, 37, 128)   0
________________________________________________________________
block3_conv1 (Convolution2D)     (None, 37, 37, 256)   295168
________________________________________________________________
block3_conv2 (Convolution2D)     (None, 37, 37, 256)   590080
________________________________________________________________
block3_conv3 (Convolution2D)     (None, 37, 37, 256)   590080
________________________________________________________________
block3_pool (MaxPooling2D)       (None, 18, 18, 256)   0
________________________________________________________________
block4_conv1 (Convolution2D)     (None, 18, 18, 512)   1180160
________________________________________________________________
block4_conv2 (Convolution2D)     (None, 18, 18, 512)   2359808
________________________________________________________________
block4_conv3 (Convolution2D)     (None, 18, 18, 512)   2359808
________________________________________________________________
block4_pool (MaxPooling2D)       (None, 9, 9, 512)     0
________________________________________________________________
block5_conv1 (Convolution2D)     (None, 9, 9, 512)     2359808
________________________________________________________________
block5_conv2 (Convolution2D)     (None, 9, 9, 512)     2359808
________________________________________________________________
block5_conv3 (Convolution2D)     (None, 9, 9, 512)     2359808
________________________________________________________________
block5_pool (MaxPooling2D)       (None, 4, 4, 512)     0
================================================================
Total params: 14,714,688

You’ll fine-tune the last three convolutional layers, which means all layers up to block4_pool should be frozen, and the layers block5_conv1, block5_conv2, and block5_conv3 should be trainable.

Why not fine-tune more layers? Why not fine-tune the entire convolutional base? You could. But you need to consider the following:

  • Earlier layers in the convolutional base encode more-generic, reusable features, whereas layers higher up encode more-specialized features. It’s more useful to fine-tune the more specialized features, because these are the ones that need to be repurposed on your new problem. There would be fast-decreasing returns in fine-tuning lower layers.
  • The more parameters you’re training, the more you’re at risk of overfitting. The convolutional base has 15 million parameters, so it would be risky to attempt to train it on your small dataset.

Thus, in this situation, it’s a good strategy to fine-tune only the top two or three layers in the convolutional base. Let’s set this up, starting from where you left off in the previous example.

unfreeze_weights(conv_base, from = "block5_conv1")

Now you can begin fine-tuning the network. You’ll do this with the RMSProp optimizer, using a very low learning rate. The reason for using a low learning rate is that you want to limit the magnitude of the modifications you make to the representations of the three layers you’re fine-tuning. Updates that are too large may harm these representations.

model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_rmsprop(lr = 1e-5),
  metrics = c("accuracy")
)

history <- model %>% fit_generator(
  train_generator,
  steps_per_epoch = 100,
  epochs = 100,
  validation_data = validation_generator,
  validation_steps = 50
)

Let’s plot our results:

You’re seeing a nice 1% absolute improvement in accuracy, from about 96% to above 97%.

Note that the loss curve doesn’t show any real improvement (in fact, it’s deteriorating). You may wonder, how could accuracy stay stable or improve if the loss isn’t decreasing? The answer is simple: what you display is an average of pointwise loss values; but what matters for accuracy is the distribution of the loss values, not their average, because accuracy is the result of a binary thresholding of the class probability predicted by the model. The model may still be improving even if this isn’t reflected in the average loss.
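A small numeric illustration of this point (toy numbers, not outputs from the model above): the average binary cross-entropy can go up while thresholded accuracy stays the same, because one already-misclassified sample becoming confidently wrong can outweigh several correct samples improving.

```r
# Toy demonstration: average cross-entropy rises, accuracy is unchanged.
bce <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))
acc <- function(y, p) mean((p > 0.5) == (y == 1))

y <- c(1, 1, 1, 1)
p_epoch_a <- c(0.60, 0.60, 0.60, 0.45)  # 3 right, 1 slightly wrong
p_epoch_b <- c(0.90, 0.90, 0.90, 0.05)  # still 3 right, 1 confidently wrong

acc(y, p_epoch_a); acc(y, p_epoch_b)    # both 0.75
bce(y, p_epoch_a) < bce(y, p_epoch_b)   # TRUE: average loss went up
```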

You can now finally evaluate this model on the test data:

test_generator <- flow_images_from_directory(
  test_dir,
  test_datagen,
  target_size = c(150, 150),
  batch_size = 20,
  class_mode = "binary"
)

model %>% evaluate_generator(test_generator, steps = 50)
$loss
[1] 0.2840393

$acc
[1] 0.972

Here you get a test accuracy of 97.2%. In the original Kaggle competition around this dataset, this would have been one of the top results. But using modern deep-learning techniques, you managed to reach this result using only a small fraction of the training data available (about 10%). There is a huge difference between being able to train on 20,000 samples compared to 2,000 samples!

Take-aways: using convnets with small datasets

Here’s what you should take away from the exercises in the past two sections:

  • Convnets are the best type of machine-learning models for computer-vision tasks. It’s possible to train one from scratch even on a very small dataset, with decent results.
  • On a small dataset, overfitting will be the main issue. Data augmentation is a powerful way to fight overfitting when you’re working with image data.
  • It’s easy to reuse an existing convnet on a new dataset via feature extraction. This is a valuable technique for working with small image datasets.
  • As a complement to feature extraction, you can use fine-tuning, which adapts to a new problem some of the representations previously learned by an existing model. This pushes performance a bit further.

Now you have a solid set of tools for dealing with image-classification problems – in particular with small datasets.



To leave a comment for the author, please follow the link and comment on their blog: TensorFlow for R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R in the Windows Subsystem for Linux

Wed, 12/13/2017 - 23:42

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R has been available for Windows since the very beginning, but if you have a Windows machine and want to use R within a Linux ecosystem, that's easy to do with the new Fall Creators Update (version 1709). If you need access to the gcc toolchain for building R packages, or simply prefer the bash environment, it's straightforward to get things up and running.

Once you have things set up, you can launch a bash shell and run R at the terminal like you would in any Linux system. And that's because this is a Linux system: the Windows Subsystem for Linux is a complete Linux distribution running within Windows. This page provides the details on installing Linux on Windows, but here are the basic steps you need and how to get the latest version of R up and running within it.

First, enable the "Windows Subsystem for Linux" feature. Go to Control Panel > Programs > Turn Windows Features on or off (or just type "Windows Features" into the search box), and select the "Windows Subsystem for Linux" option. You'll need to reboot, just this once.

Next, you'll need to install your preferred distribution of Linux from the Microsoft Store. If you search for "Linux" in the store, you'll find an entry "Run Linux on Windows" which will provide you with the available distributions. I'm using "Ubuntu", which as of this writing is Ubuntu 16.04 (Xenial Xerus).

Once that's installed you can launch Ubuntu from the Start menu (just like any other app) to open a new bash shell window. The first time you launch, it will take a few minutes to install various components, and you'll also need to create a username and password. This is your Linux username, different from your Windows username. You'll automatically log in when you launch new Ubuntu sessions, but make sure you remember the password — you'll need it later.

From here you can go ahead and install R, but if you use the default Ubuntu repository you'll get an old version of R (3.2.3, from 2015). You probably want the latest version, so add CRAN as a new package repository for Ubuntu. You'll need to run these three commands with sudo, so enter the password you created above if requested:

echo "deb xenial/" | sudo tee -a /etc/apt/sources.list

sudo apt-key adv --keyserver --recv-keys E084DAB9

sudo apt-get update

(Don't be surprised by the message key E084DAB9: public key "Michael Rutter " imported. That's how Ubuntu signs the R packages.)

Now you're all set to install the latest version of R, which can be done with:

sudo apt-get install r-base

And that's it! (Once all the dependencies install, anyway, which can take a while the first time.) Now you're all ready to run R from the Linux command line:

Note that you can access files on your Windows system from R: you'll find them at /mnt/c/Users/. This FAQ on the WSL provides other useful tips, and for complete details refer to the Windows Subsystem for Linux documentation.
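As a quick illustration, here's how a Windows path looks from an R session inside WSL (the username and folder below are placeholders, not paths from this post):

```r
# Windows drives are mounted under /mnt inside WSL, so a Windows
# Documents folder can be addressed like this ("yourname" is a placeholder)
win_docs <- file.path("/mnt/c/Users", "yourname", "Documents")
win_docs

# dir(win_docs) would then list that folder's contents, and files there
# can be read directly, e.g. with read.csv()
```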


To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Color palettes derived from the Dutch masters

Wed, 12/13/2017 - 19:33

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

Among tulip fields, canals and cheese tastings, the museums of the Netherlands
are among its biggest tourist attractions. And for very good reason! During the
seventeenth century, known as the Dutch Golden Age, there was an abundance of
talented painters. If you ever have the chance to visit the Rijksmuseum, you
will be in awe of the landscapes, households and portraits, painted in
incredible detail and with beautiful colors.

The dutchmasters color palette

Rembrandt van Rijn and Johannes Vermeer are the most famous of the seventeenth
century Dutch masters. Both are renowned for their use of light and color,
which makes encounters with their subjects feel as if you are in the room with
them. Recently, during the OzUnconference, the beautiful ochRe package was
developed. This package contains color palettes of the wonderful Australian
landscape (which my wife got to witness during our honeymoon last year),
drawing its colors from both works of art and photographs of Australia. I
shamelessly stole both the idea and the package structure of ochRe to create
dutchmasters. This package contains six color palettes derived from my six
favourite paintings by Van Rijn and Vermeer.

Vermeer’s paintings are renowned for their vivid and light colors. The package
offers palettes from Vermeer’s The Milkmaid,
Girl with a Pearl Earring,
View of Delft, and
The Little Street.
Rembrandt’s paintings on the other hand are more atmospheric, with a lot of dark
colors and subjects that stare you right into the soul. From him you find
palettes derived from the The Anatomy Lesson of Dr. Nicolaes Tulp
and The “Staalmeesters”.

Like ochRe, the package comprises a list with the color palettes. I have added
names to each of the color codes, reflecting the color it represents and moreover
where on the painting the color was taken from. As mentioned, I shamelessly
converted the functions from ochRe into dutchmasters functions. Why invent
something that smarter people already figured out?
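To see the basic idea, here is a minimal sketch of how a stored palette can be hooked into ggplot2 by hand; the hex codes below are made up for illustration and are not the actual dutchmasters colors:

```r
library(ggplot2)

# A made-up five-color palette standing in for one from the package
demo_palette <- c("#264653", "#2A9D8F", "#E9C46A", "#F4A261", "#E76F51")

# Interpolate the palette out to however many levels the data has,
# then hand the colors to scale_fill_manual()
n_levels <- nlevels(diamonds$clarity)
ggplot(diamonds, aes(color, fill = clarity)) +
  geom_bar() +
  scale_fill_manual(values = colorRampPalette(demo_palette)(n_levels))
```

The package's own `scale_fill_dutchmasters()` wraps this kind of plumbing up so you only pass a palette name.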

Using the package

Grab the package from github with devtools::install_github("EdwinTh/dutchmasters").
Make sure to install ochRe right away for the handy viz_palette function,
with devtools::install_github("ropenscilabs/ochRe/").

This is what the palette “milkmaid” looks like.

library(dutchmasters)
ochRe::viz_palette(dutchmasters[["milkmaid"]])

Now to put those beautiful Vermeer shades into action on a ggplot.

library(ggplot2)
ggplot(diamonds, aes(color, fill = clarity)) +
  geom_bar() +
  scale_fill_dutchmasters(palette = "milkmaid")

Or maybe the dark “staalmeesters” colors?

ggplot(diamonds, aes(color, fill = clarity)) +
  geom_bar() +
  scale_fill_dutchmasters(palette = "staalmeesters")

I leave the other four palettes for you to explore. In the future I will
certainly add more paintings, as the well of the Dutch masters seems bottomless.

A big thanks to the ochRe team, for inspiration and groundwork. I hope that
this package motivates you to further explore the wonderful art of the Dutch
Golden Age.


To leave a comment for the author, please follow the link and comment on their blog: That’s so Random.

Mapping oil production by country in R

Wed, 12/13/2017 - 15:10

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

This week, here’s another map.

As I’ve mentioned, projects like this – a few hundred lines of code, with a variety of functions from the tidyverse – are excellent practice for intermediate data scientists.

But if you’re a beginner, don’t be intimidated. You can actually learn these tools much faster than you think.

To get to a point where you really know how this works, you'll need some familiarity with ggplot2, dplyr and the rest of the tidyverse. I recommend that you break the code down, line by line. Do you know all of the functions? Do you know how it all works? If you don't, you might have some more practice to do.

Ultimately, to be “fluent” in data science in R, you should be able to write this code (or something like it) from scratch, more or less from memory.

In any case, there’s a lot going on in this code.

In a separate post, I will explain a few details about how this works, and expand on it somewhat.

In the meantime, if you want to learn to do projects like this, go through the code below. Break it down into small pieces that you can study and practice.

And if you want to know more about data visualization and data science in R, sign up for our email list.

Sign up now, and discover how to rapidly master data science

To master data visualization and data science, you need to master the essential tools.

Moreover, to make rapid progress, you need to know what to learn, what not to learn, and you need to know how to practice what you learn.

Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible.

Sign up now for our email list, and you’ll receive regular tutorials and lessons.

You’ll learn:

  • How to do data visualization in R
  • How to practice data science
  • How to apply data visualization to more advanced topics (like machine learning)
  • … and more

If you sign up for our email list right now, you’ll also get access to our “Data Science Crash Course” for free.



Code: mapping oil production by country

#==============
# LOAD PACKAGES
#==============

library(tidyverse)
library(sf)
library(rvest)
library(stringr)
library(scales)
#library(viridis)

#============
# SCRAPE DATA
#============

df.oil <- read_html("") %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()

#====================
# CHANGE COLUMN NAMES
#====================

colnames(df.oil) <- c('rank', 'country', 'oil_bbl_per_day')

#=============================
# WRANGLE VARIABLES INTO SHAPE
#=============================

#----------------------------------
# COERCE 'rank' VARIABLE TO INTEGER
#----------------------------------

df.oil <- df.oil %>% mutate(rank = as.integer(rank))

df.oil %>% glimpse()

#---------------------------------------------------
# WRANGLE FROM CHARACTER TO NUMERIC: oil_bbl_per_day
#---------------------------------------------------

df.oil <- df.oil %>%
  mutate(oil_bbl_per_day = oil_bbl_per_day %>%
           str_replace_all(',', '') %>%
           as.integer())

# inspect
df.oil %>% glimpse()

#===========================
# CREATE VARIABLE: 'opec_ind'
#===========================

df.oil <- df.oil %>%
  mutate(opec_ind = if_else(str_detect(country, 'OPEC'), 1, 0))

#=========================================================
# CLEAN UP 'country'
# - some country names are tagged as being OPEC countries
#   and this information is in the country name
# - we will strip this information out
#=========================================================

df.oil <- df.oil %>%
  mutate(country = country %>%
           str_replace(' \\(OPEC\\)', '') %>%
           str_replace('\\s{2,}', ' '))

# inspect
df.oil %>% glimpse()

#------------------------------------------
# EXAMINE OPEC COUNTRIES
# - here, we'll just visually inspect
#   to make sure that the names are correct
#------------------------------------------

df.oil %>%
  filter(opec_ind == 1) %>%
  select(country)

#==================
# REORDER VARIABLES
#==================

df.oil <- df.oil %>%
  select(rank, country, opec_ind, oil_bbl_per_day)

df.oil %>% glimpse()

#========
# GET MAP
#========

map.world <- map_data('world')

#==========================
# CHECK FOR JOIN MISMATCHES
#==========================

anti_join(df.oil, map.world, by = c('country' = 'region'))
#   rank                           country opec_ind oil_bbl_per_day
# 1   67 Congo, Democratic Republic of the        0          20,000
# 2   47               Trinidad and Tobago        0          60,090
# 3   34             Sudan and South Sudan        0         255,000
# 4   30            Congo, Republic of the        0         308,363
# 5   20                    United Kingdom        0         939,760
# 6    3                     United States        0       8,875,817

#=====================
# RECODE COUNTRY NAMES
#=====================

map.world %>%
  group_by(region) %>%
  summarise() %>%
  print(n = Inf)

# UK
# USA
# Democratic Republic of the Congo
# Trinidad
# Sudan
# South Sudan

df.oil <- df.oil %>%
  mutate(country = recode(country,
                          `United States` = 'USA'
                          ,`United Kingdom` = 'UK'
                          ,`Congo, Democratic Republic of the` = 'Democratic Republic of the Congo'
                          ,`Trinidad and Tobago` = 'Trinidad'
                          ,`Sudan and South Sudan` = 'Sudan'
                          #,`Sudan and South Sudan` = 'South Sudan'
                          ,`Congo, Republic of the` = 'Republic of Congo'
                          ))

#-----------------------
# JOIN DATASETS TOGETHER
#-----------------------

map.oil <- left_join(map.world, df.oil, by = c('region' = 'country'))

#=====
# PLOT
#=====

# BASIC (this is a first draft)

ggplot(map.oil, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = oil_bbl_per_day))

#=======================
# FINAL, FORMATTED DRAFT
#=======================

df.oil %>% filter(oil_bbl_per_day > 822675) %>% summarise(mean(oil_bbl_per_day))
# 3190373

df.oil %>% filter(oil_bbl_per_day < 822675) %>% summarise(mean(oil_bbl_per_day))
# 96581.08

ggplot(map.oil, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = oil_bbl_per_day)) +
  scale_fill_gradientn(colours = c('#461863','#404E88','#2A8A8C','#7FD157','#F9E53F')
                       ,values = scales::rescale(c(100, 96581, 822675, 3190373, 10000000))
                       ,labels = comma
                       ,breaks = c(100, 96581, 822675, 3190373, 10000000)
                       ) +
  guides(fill = guide_legend(reverse = T)) +
  labs(fill = 'bbl/day'
       ,title = 'Oil Production by Country'
       ,subtitle = 'Barrels per day, 2016'
       ,x = NULL
       ,y = NULL) +
  theme(text = element_text(family = 'Gill Sans', color = '#EEEEEE')
        ,plot.title = element_text(size = 28)
        ,plot.subtitle = element_text(size = 14)
        ,axis.ticks = element_blank()
        ,axis.text = element_blank()
        ,panel.grid = element_blank()
        ,panel.background = element_rect(fill = '#333333')
        ,plot.background = element_rect(fill = '#333333')
        ,legend.position = c(.18, .36)
        ,legend.background = element_blank()
        ,legend.key = element_blank()
        ) +
  annotate(geom = 'text'
           ,label = 'Source: U.S. Energy Information Administration\n'
           ,x = 18, y = -55
           ,size = 3
           ,family = 'Gill Sans'
           ,color = '#CCCCCC'
           ,hjust = 'left'
           )
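The `anti_join()` step above generalizes well: it is a quick way to list the rows that will fail to match before committing to a join. A tiny self-contained illustration (with made-up data, not the oil dataset):

```r
library(dplyr)

sales   <- data.frame(country = c("USA", "United Kingdom", "France"))
regions <- data.frame(region  = c("USA", "UK", "France"))

# Rows in `sales` with no counterpart in `regions` -- these are the
# names that need recoding before a left_join will work
anti_join(sales, regions, by = c("country" = "region"))
```

Only "United Kingdom" comes back, telling you exactly which name to recode.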

And here’s the finalized map that the code produces:

To get more tutorials that explain how this code works, sign up for our email list.


The post Mapping oil production by country in R appeared first on SHARP SIGHT LABS.


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS.

Dummy Variable for Examining Structural Instability in Regression: An Alternative to Chow Test

Wed, 12/13/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

One of the fastest growing economies in the era of globalization is the Ethiopian economy. Among the lower-income group of countries, it has emerged as one of the rare countries to achieve a double-digit growth rate in Gross Domestic Product (GDP). However, there is a great deal of debate regarding the double-digit growth rate, especially during the recent global recession period. So, it becomes a question of empirical research whether there is a structural change in the relationship between the GDP of Ethiopia and the regressor (time).

How do we find out whether a structural change has in fact occurred? To answer this question, we consider the GDP of Ethiopia (measured in constant 2010 US$) over the period 1981 to 2015. Like many other countries in the world, Ethiopia adopted a policy of regulated globalization during the early nineties of the last century. So, our aim is to examine whether the GDP of Ethiopia has undergone any structural change following the major policy shift due to the adoption of the globalization policy. To answer this question, we have two options in statistical and econometric research. The most important classes of tests for structural change are the tests from the generalized fluctuation test framework (Kuan and Hornik, 1995) on the one hand and tests based on F statistics (Hansen, 1992; Andrews, 1993; Andrews and Ploberger, 1994) on the other. The first class includes in particular the CUSUM and MOSUM tests and the fluctuation test, while the Chow and the supF tests belong to the latter. A topic that has gained interest rather recently is monitoring structural change, i.e., starting after a history phase (without structural changes) to analyze new observations and to detect a structural change as soon after its occurrence as possible.

Let us divide the whole study period in two sub-periods

    1. pre-globalization (1981 – 1991)
    2. post-globalization (1992 – 2015)

$$ \text {Pre-globalization period (1981 – 1991):} ~~~ lnGDP=\beta_{01} +\beta_{11} t +u_1 \\ \text {Post-globalization period (1992 – 2015):}~~~ lnGDP=\beta_{02} +\beta_{12} t+u_2 \\ \text {Whole period (1981 – 2015):} ~~~ lnGDP=\beta_0 +\beta_1 t +u $$

The regression for the whole period assumes that there is no difference between the two time periods and, therefore, estimates the GDP growth rate for the entire time period. In other words, this regression assumes that the intercept, as well as the slope coefficient, remains the same over the entire period; that is, there is no structural change. If this is, in fact, the situation, then
\( \beta_{01} =\beta_{02}=\beta_0 ~~~~and~~~~ \beta_{11} =\beta_{12} =\beta_1 \)

The first two regression lines assume that the regression parameters are different in two sub-period periods, that is, there is structural instability either because of changes in slope parameters or intercept parameters.

Solution by Chow

To apply the Chow test to examine the structural changes empirically, the following assumptions are required:

    i. The error terms of the two sub-period regressions are normally distributed with the same variance;
    ii. The two error terms are independently distributed;
    iii. The break-point at which the structural stability is to be examined should be known a priori.

The Chow test examines the following set of hypotheses:
$$ H_0 : \text {There is no structural change} \\ \text {against} \\ H_1 : \text {There is a structural change} $$

Steps in Chow test

Step 1: Estimate the third regression assuming that there is no parameter instability, and obtain the Residual Sum of Squares \( RSS_R ~~ with ~~~df =(n_1 +n_2−k ) \\ where~~k=\text {No. of parameters estimated}\\ n_1= \text {No. of observations in the first sub-period}\\n_2=\text {No. of observations in the second sub-period} \)
So, in the present case \(k = 2,\) as there are two parameters (intercept term and slope coefficient).

We call this Restricted Residual Sum Squares as it assumes the restriction of structural stability, that is, \( \beta_{01}=\beta_{02}=\beta_{0} ~~~and~~~\beta_{11}=\beta_{12}=\beta_1 \)

Step 2: Estimate the first and second regressions assuming that there is parameter instability and obtain the respective Residual Sum Squares as
\( RSS_1 = \text {Residual Sum Squares for the first sub-period regression with}~~ df =n_1−k \\RSS_2 = \text {Residual Sum Squares for the second sub-period regression with}~~ df =n_2−k \)

Step 3: As the two sets of samples are supposed to be independent, we can add \(RSS_1\) and \(RSS_2\), and the resultant Residual Sum of Squares may be called the Unrestricted Residual Sum of Squares, that is, \( RSS_{UR}=RSS_1 + RSS_2 ~~~with ~~~df = (n_1 + n_2 -2k) \)
The test statistic to be used is given as:
\( F= \frac {\left (RSS_R – RSS_{UR} \right)/k}{\left (RSS_{UR}\right)/(n_1 +n_2 -2k)} \)

Under the assumption that the null hypothesis is true, this statistic follows the F-distribution with \(k\) and \(n_1+n_2-2k\) degrees of freedom.
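The three steps above can be sketched in a few lines of R. This is an illustrative computation on simulated data with a known break at observation 12 (mirroring the 1981–2015 setup); the numbers are made up, not the Ethiopian GDP series:

```r
set.seed(1)
# Simulated log-GDP-like series: slope changes at t = 12
t <- 1:35
y <- ifelse(t < 12, 10 + 0.02 * t, 9.5 + 0.07 * t) + rnorm(35, sd = 0.05)

k  <- 2            # parameters per regression: intercept + slope
n1 <- 11           # observations in the first sub-period
n2 <- 24           # observations in the second sub-period

rss_r  <- sum(resid(lm(y ~ t))^2)                 # restricted: whole period
rss_1  <- sum(resid(lm(y[1:11]  ~ t[1:11]))^2)    # first sub-period
rss_2  <- sum(resid(lm(y[12:35] ~ t[12:35]))^2)   # second sub-period
rss_ur <- rss_1 + rss_2                           # unrestricted

F_chow <- ((rss_r - rss_ur) / k) / (rss_ur / (n1 + n2 - 2 * k))
p_val  <- pf(F_chow, k, n1 + n2 - 2 * k, lower.tail = FALSE)
```

With a break this pronounced, the F statistic is large and the p-value tiny, so the no-change null is rejected.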

Now if this F-test is significant, we can reject the null hypothesis of no structural instability and conclude that the fluctuation in the GDP is large enough to believe that it leads to structural instability in the GDP growth path.

Importing the Data into R

Importing the data saved in csv format gdp_ethiopia

ethio_gdp <- read.csv(file.choose(), header = TRUE, sep = ",", skip = 0)

# To set the data as a time series
gdp_ethio <- ts(ethio_gdp$GDP, frequency = 1, start = c(1981), end = c(2015))

Let us have a look at the plot of GDP against time, using the ggplot2 package:

library(ggplot2)
ggplot(ethio_gdp, aes(YEAR, GDP)) +
  geom_line(col = "red") +
  labs(title = "GDP of Ethiopia over 1981 to 2015")

Gives this plot:

As observed from the figure, the growth of GDP gains momentum after the adoption of the new policy regime, especially after 2000. Until 1992-93, the economy looked stagnant, as the GDP curve was more or less parallel to the x-axis. It is only after 1993 that the GDP started to grow, and the growth became more prominent after 2000. To confirm this, we will apply the Chow test by considering the break-point at the year 1992, when the new economic policy was adopted by the Ethiopian government.
To fit the model of the GDP growth path under the assumption that there is no structural instability, we first need to create the time variable using the following code:

ethio_gdp$t <- seq(1:35)
For estimating the growth of GDP for the whole period

model11 <- lm(log(GDP) ~ t, data = ethio_gdp)
summary(model11)

Call:
lm(formula = log(GDP) ~ t, data = ethio_gdp)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.28379 -0.17673  0.01783  0.16134  0.34789 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 22.495383   0.066477   338.4   <2e-16 ***
t            0.050230   0.003221    15.6   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1924 on 33 degrees of freedom
Multiple R-squared:  0.8805,	Adjusted R-squared:  0.8769
F-statistic: 243.2 on 1 and 33 DF,  p-value: < 2.2e-16

Thus, we see that over the entire period, the GDP is growing at an annual growth rate of 5 per cent.
We’ll use this object ‘model11’ to apply the Chow test in R. For this, we need to load the package called ‘strucchange’ by using the following code:

library(strucchange)
The break-point occurs in 1992, which corresponds to 12 in the sequence 1 to 35 for the years 1981 to 2015. The code and results are shown below:

sctest(log(gdp_ethio) ~ t, type = "Chow", point = 12)

	Chow test

data:  log(gdp_ethio) ~ t
F = 51.331, p-value = 1.455e-10

The above result shows that the F-value is 51.331 and the corresponding p-value is much less than the level of significance (0.05), implying that the test is significant. This confirms that there is structural instability in the GDP growth path of Ethiopia at the break-point 1992.
However, we cannot tell whether the difference between the two regressions is because of differences in the intercept terms, the slope coefficients, or both. Very often this knowledge itself is very useful. For analyzing such situations, we have the following four possibilities:
I. Coincident Regression where both the intercept and the slope coefficients are the same in the two sub-period regressions;
II. Parallel Regression where only the intercepts in the two regressions are different but the slopes are the same;
III. Concurrent Regression where the intercepts in the two regressions are the same, but the slopes are different; and
IV. Dis-similar Regression where both the intercepts and slopes in the two regressions are different.
Let us examine case by case.

II. Parallel Regression where only the intercepts in the two regressions are different but the slopes are the same

Suppose the economic reforms do not influence the slope parameter but instead simply affect the intercept term.
Following the economic reforms in 1992, we have the following regression models in two sub-periods – pre-globalization (1981 – 1991) and post-globalization (1992-2015).
\( \begin {aligned} \text {Pre-globalization (1981 – 1991):}~~~& lnGDP=\beta_{01} +\beta_{11} t +u_1\\ \text {Post-globalization period (1992 – 2015):}~~~& lnGDP=\beta_{02} +\beta_{12} t+u_2\\ \text {Whole period (1981 – 2015):}~~~& lnGDP=\beta_0 +\beta_1 t +u \end {aligned} \)

If there is no structural change:

\( \beta_{01}=\beta_{02}=\beta_0 \\and \\ \beta_{11}=\beta_{12}=\beta_1 \)

If there is structural change and that affects only the intercept terms, not the slope coefficients, we have:

\( \beta_{01}\ne\beta_{02}\\and \\ \beta_{11}=\beta_{12} \)

To capture this effect, the dummy variable \(D\) is included in the following manner:

\( \begin {aligned} lGDP&=\beta_0+\beta_1 t+\beta_2 D+u\\where~~D&=1~~\text {for the post-reform period (sub-period 2)} \\ &=0~~ \text {for the pre-reform period (sub-period 1)} \end {aligned} \)

The difference in the logarithm of GDP between two sub periods is given by the coefficient \(\beta_2\).

If \(D = 1\), then for the post-reform period, \( \widehat {lGDP} = \hat\beta_0+\hat\beta_1 t+\hat\beta_2 \)

If \(D=0,\) then for the pre-reform period, \( \widehat {lGDP} = \hat\beta_0+\hat\beta_1 t\)

Using the ‘gdp_ethio.csv’ data-file, this model can be estimated. The R code along with the explanation is produced below:

attach(gdp)
gdp$t <- seq(1:35)
gdp$lGDP <- log(GDP)
gdp$D <- ifelse(YEAR >= 1992, 1, 0)
attach(gdp)
gdp$D <- factor(D, levels = c(0, 1), labels = c("pre", "post"))
attach(gdp)
model5 <- lm(lGDP ~ t + D, data = gdp)
summary(model5)

Call:
lm(formula = lGDP ~ t + D, data = gdp)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.309808 -0.080889  0.006459  0.087419  0.257020 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 22.505310   0.046417 484.847  < 2e-16 ***
t            0.068431   0.003783  18.089  < 2e-16 ***
Dpost       -0.492244   0.082302  -5.981 1.15e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1343 on 32 degrees of freedom
Multiple R-squared:  0.9436,	Adjusted R-squared:  0.9401
F-statistic: 267.6 on 2 and 32 DF,  p-value: < 2.2e-16

So, the estimated regression model is given by
\( \widehat {lGDP}= 22.505+0.068t-0.492D \)

This would imply that
For the post-reform period: $$ \widehat {lGDP}= 22.013+0.068t $$

For the pre-reform period: $$ \widehat {lGDP}= 22.505+0.068t $$
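Because the dependent variable is in logs, the dummy coefficient is not itself the percentage gap between the two regimes; a back-of-the-envelope sketch (using the estimate above) converts it into a proportional difference in the level of GDP:

```r
beta2 <- -0.492   # estimated coefficient on the dummy D from the model above

# In a log-level model, the exact proportional shift implied by a dummy
# coefficient is exp(beta2) - 1, not beta2 itself
pct_diff <- (exp(beta2) - 1) * 100
round(pct_diff, 1)   # roughly -38.9 (per cent)
```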

To graphically visualize this, use the following code to produce the following graphs:

library(ggplot2)
gdp$pred <- predict(model5)   # fitted values, needed for the regression lines
ggplot(data = gdp, mapping = aes(x = t, y = lGDP, color = D)) +
  geom_point() +
  geom_line(mapping = aes(y = pred))

Gives this plot:

III. Concurrent Regression where the intercepts in the two regressions are the same, but the slopes are different

Now we assume that the economic reforms do not influence the intercept term \(\beta_0\), but simply affect the slope parameter. That is, in terms of our present example, we can say that
\( \beta_{01} =\beta_{02} \\and\\ \beta_{11} \ne \beta_{12} \)

To capture the effects of such changes in slope parameters, it is necessary to add the product of time variable \(t\) and the dummy variable \(D\). The new variable \(tD\) is called the interaction variable or slope dummy variable, since it allows for a change in the slope of the relationship.

The modified growth estimation model is:

\( lGDP=\beta_0 +\beta_1t+\beta_2 tD+u \)
Obviously, when \(D = 1,\) then \( \widehat{lGDP}=\hat\beta_0 +\hat\beta_1t+\hat\beta_2 t \\\implies \widehat{lGDP}=\hat\beta_0 +(\hat\beta_1+\hat\beta_2)t \)

when \(D=0\), then \( \widehat{lGDP}=\hat\beta_0+\hat\beta_1t \)

The necessary R codes and the results are produced below:

attach(gdp)
gdp$D1 <- ifelse(YEAR >= 1992, 1, 0)
attach(gdp)
gdp$tD <- t * D1
attach(gdp)
model6 <- lm(lGDP ~ t + tD, data = gdp)
summary(model6)

Call:
lm(formula = lGDP ~ t + tD, data = gdp)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.26778 -0.14726 -0.04489  0.13919  0.35627 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 22.40377    0.09125 245.521  < 2e-16 ***
t            0.07072    0.01458   4.851 3.06e-05 ***
tD          -0.01721    0.01195  -1.440     0.16    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1894 on 32 degrees of freedom
Multiple R-squared:  0.8878,	Adjusted R-squared:  0.8808
F-statistic: 126.6 on 2 and 32 DF,  p-value: 6.305e-16

In the above results, it is evident that the interaction term (tD) is not statistically significant as the corresponding p-value is much higher than the level of significance (5% or 0.05). This would imply that the economic reforms do not affect the slope term. However, for visualizing the results graphically we proceed as follows:
The estimated regression model is:
\( \widehat {lGDP} = 22.404+0.071t-0.017tD \)
This would imply the following growth functions for both the periods.
\( \widehat {lGDP} = 22.404+0.054t~~~~~~\text {for the post-reform period}\\ \widehat {lGDP} = 22.404+0.071t~~~~~~\text {for the pre-reform period} \)

Notice that the slope, rather than the intercept, of the growth function changed. This can be shown graphically:

curve(22.404 + 0.054 * x, xlab = "t", ylab = "", xlim = c(1, 35), col = "red",
      main = "Fig.-12.2: Concurrent Regression Using Dummy Variable")
curve(22.404 + 0.071 * x, xlab = "t", ylab = "", xlim = c(1, 35), col = "blue", add = TRUE)
legend("bottomright", legend = c("post", "pre"), fill = c("red", "blue"))

Gives this plot:

IV. Dis-similar Regression (both the intercepts and slopes are different)

In this case, both the intercept and the slope of the growth function change simultaneously. The statistical model to capture such changes can be represented by the following equation.

\( lGDP=\beta_0 +\beta_1 t +\beta_2 D+\beta_3 tD+u \)

Obviously, when \(D=1\), then for the post-reform period: \( \widehat{lGDP}=\hat\beta_0 +\hat\beta_1 t+\hat\beta_2 +\hat\beta_3 t
\\=(\hat\beta_0 +\hat\beta_2 )+(\hat\beta_1 + \hat\beta_3) t \)

when \(D=0\), then for the pre-reform period: \( \widehat{lGDP}= \hat\beta_0 + \hat\beta_1 t \)
The model is estimated using the following code, and the results are produced below:

fm3 <- lm(lGDP ~ t + D + t * D, data = gdp)
summary(fm3)

Call:
lm(formula = lGDP ~ t + D + t * D, data = gdp)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.21768 -0.05914  0.01033  0.07033  0.13777 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 22.807568   0.060484 377.084  < 2e-16 ***
t            0.018055   0.008918   2.025   0.0516 .  
Dpost       -0.907739   0.090685 -10.010 3.13e-11 ***
t:Dpost      0.055195   0.009335   5.913 1.57e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09353 on 31 degrees of freedom
Multiple R-squared:  0.9735,	Adjusted R-squared:  0.9709
F-statistic: 379.4 on 3 and 31 DF,  p-value: < 2.2e-16

Thus, our estimated regression model is written as:
\( \widehat {lGDP} = 22.808 + 0.018 t -0.908 D +0.0552 tD \)

\( \widehat {lGDP} = 21.9+0.0732t~~~~~~\text {for the post-reform period}\\ \widehat {lGDP} = 22.808+0.018t~~~~~~\text {for the pre-reform period} \)

To visualize the matter, the following codes are used to produce the figure below.

curve(21.9 + 0.0732 * x, xlab = "t", ylab = "", xlim = c(1, 35), col = "blue")
curve(22.808 + 0.018 * x, xlab = "t", ylab = "", xlim = c(1, 35), col = "red", add = TRUE)
legend("topleft", legend = c("post", "pre"), fill = c("blue", "red"))

Gives this plot:

That's all. If you have questions, post comments below.

    Related Post

    1. Outliers Detection and Intervention Analysis
    2. Spark DataFrames: Exploring Chicago Crimes
    3. Image Processing and Manipulation with magick in R
    4. Analyzing the Bible and the Quran using Spark
    5. Predict Customer Churn – Logistic Regression, Decision Tree and Random Forest

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

Serving shiny apps on the internet with your own server

Wed, 12/13/2017 - 09:00

(This article was first published on R and Finance, and kindly contributed to R-bloggers)

In this post I’ll share my experience in setting up my own virtual
server for hosting shiny applications at Digital Ocean. First, some
context. I’m working on an academic project where we built a package
for accessing financial data and corporate events directly from B3,
the Brazilian financial exchange. The objective is to set a
reproducible standard and facilitate the acquisition of a large, and
very interesting, dataset. The result is
Since many researchers and students in Brazil are not knowledgeable in
R, we needed to make it easier for people to use the software. A shiny
app hosted on the internet is perfect for that. The app is available at

You can host your own shiny app for free, but that comes with some
usage limitations. While searching for alternatives, I found this
great post by Dean Attali that clearly explains the steps for setting
up a web server in a virtual machine from Digital Ocean. Despite being
a 2015 post, it works perfectly. The best thing is that it only costs
$5 per month, with the first two months free.

Once the server is up and running, I can control it using ssh
(terminal), send and retrieve files with GitHub and Dropbox, and run
code with RStudio Server, which is basically an RStudio session in a
browser. Now I have my own corner of the internet, where I can serve
all my shiny apps with full control. I’m not only using the server for
hosting web applications, but also for running CRON jobs that
periodically gather data for another project, which has to run an R
script every day. I no longer have to worry about remembering to turn
on my computer every day. I’m sure I’ll find many more uses for it in
the future.

    I’m very happy I chose the longer, more difficult path to publishing
    a shiny app on the internet. I learned a lot along the way. At first it
    felt overwhelming to configure every aspect of the server. But if you
    know a bit of Linux, setting up your own web server is not that
    difficult. I recommend that everyone give it a try.


    To leave a comment for the author, please follow the link and comment on their blog: R and Finance.

    Introduction to Skewness

    Wed, 12/13/2017 - 01:00

    (This article was first published on R Views, and kindly contributed to R-bloggers)

    In previous posts here, here, and here, we spent quite a bit of time on portfolio volatility, using the standard deviation of returns as a proxy for volatility. Today we will begin a two-part series on additional statistics that aid our understanding of return dispersion: skewness and kurtosis. Beyond being fancy words and required vocabulary for CFA level 1, these two concepts are both important and fascinating for lovers of returns distributions. For today, we will focus on skewness.

    Skewness is the degree to which returns are asymmetric around the mean. Since a normal distribution is symmetric around the mean, skewness can be taken as one measure of how returns are not distributed normally. Why does skewness matter? If portfolio returns are right, or positively, skewed, it implies numerous small negative returns and a few large positive returns. If portfolio returns are left, or negatively, skewed, it implies numerous small positive returns and a few large negative returns. The phrase “large negative returns” should trigger Pavlovian sweating for investors, even if it’s preceded by a diminutive modifier like “just a few”. For a portfolio manager, a negatively skewed distribution of returns implies a portfolio at risk of rare but large losses. This makes us nervous and is a bit like saying, “I’m healthy, except for my occasional massive heart attack.”

    Let’s get to it.

    First, have a look at one equation for skewness:

    \[Skew=\frac{\frac{1}{n}\sum_{i=1}^n (x_i-\overline{x})^3}{\left(\frac{1}{n}\sum_{i=1}^n (x_i-\overline{x})^2\right)^{3/2}}\]
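    To make the formula concrete, here is a minimal base-R sketch; the helper name and the toy vector are illustrative, not from the post:

    ```r
    # Sample skewness by hand: third central moment divided by the
    # second central moment raised to the 3/2 power
    skew_byhand <- function(x) {
      n  <- length(x)
      m3 <- sum((x - mean(x))^3) / n
      m2 <- sum((x - mean(x))^2) / n
      m3 / m2^(3/2)
    }

    # Many small gains plus one large loss: a left-skewed toy return series
    x <- c(0.01, 0.02, 0.015, 0.01, 0.02, -0.10)
    skew_byhand(x) < 0  # TRUE: negative (left) skew
    ```

    The single large loss dominates the cubed deviations in the numerator, which is what drives the statistic negative.
    
    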

    Skew has important substantive implications for risk, and is also a concept that lends itself to data visualization. In fact, I find the visualizations of skewness more illuminating than the numbers themselves (though the numbers are what matter in the end). In this section, we will cover how to calculate skewness using xts and tidyverse methods, how to calculate rolling skewness, and how to create several data visualizations as pedagogical aids. We will be working with our usual portfolio consisting of:

    + SPY (S&P 500 fund) weighted 25%
    + EFA (a non-US equities fund) weighted 25%
    + IJS (a small-cap value fund) weighted 20%
    + EEM (an emerging-markets fund) weighted 20%
    + AGG (a bond fund) weighted 10%

    Before we can calculate the skewness, we need to find portfolio monthly returns, which was covered in this post.

    Building off that previous work, we will be working with two objects of portfolio returns:

    + portfolio_returns_xts_rebalanced_monthly (an xts object of monthly returns)
    + portfolio_returns_tq_rebalanced_monthly (a tibble of monthly returns)

    Let’s begin in the xts world and make use of the skewness() function from PerformanceAnalytics.

    library(PerformanceAnalytics)
    skew_xts <- skewness(portfolio_returns_xts_rebalanced_monthly$returns)
    skew_xts
    ## [1] -0.1710568

    Our portfolio is relatively balanced, and a slight negative skewness of -0.1710568 is unsurprising and unworrisome. However, that final number could be omitting important information and we will resist the temptation to stop there. For example, is that slight negative skew being caused by one very large negative monthly return? If so, what happened? Or is it caused by several medium-sized negative returns? What caused those? Were they consecutive? Are they seasonal? We need to investigate further.

    Before doing so and having fun with data visualization, let’s explore the tidyverse methods and confirm consistent results.

    We will make use of the same skewness() function, but because we are using a tibble, we use summarise() as well and call summarise(skew = skewness(returns)). It’s not necessary, but we are also going to run this calculation by hand, the same as we have done with standard deviation. Feel free to delete the by-hand section from your code should this be ported to enterprise scripts, but keep in mind that there is a benefit to forcing ourselves and loved ones to write out equations: it emphasizes what those nice built-in functions are doing under the hood. If a client, customer or risk officer were ever to drill into our skewness calculations, it would be nice to have a firm grasp on the equation.

    library(tidyverse)
    library(tidyquant)
    skew_tidy <-
      portfolio_returns_tq_rebalanced_monthly %>%
      summarise(skew_builtin = skewness(returns),
                skew_byhand =
                  (sum((returns - mean(returns))^3)/length(returns))/
                  ((sum((returns - mean(returns))^2)/length(returns)))^(3/2)) %>%
      select(skew_builtin, skew_byhand)

    Let’s confirm that we have consistent calculations.

    skew_xts
    ## [1] -0.1710568
    skew_tidy$skew_builtin
    ## [1] -0.1710568
    skew_tidy$skew_byhand
    ## [1] -0.1710568

    The results are consistent using xts and our tidyverse, by-hand methods. Again, though, that singular number -0.1710568 does not fully illuminate the riskiness or distribution of this portfolio. To dig deeper, let’s first visualize the density of returns with stat_density from ggplot2.

    portfolio_density_plot <-
      portfolio_returns_tq_rebalanced_monthly %>%
      ggplot(aes(x = returns)) +
      stat_density(geom = "line", alpha = 1, colour = "cornflowerblue")
    portfolio_density_plot

    The slight negative skew is a bit more evident here. It would be nice to shade the area that falls below some threshold; let’s go with the mean return. To do that, we create an object called shaded_area_data using ggplot_build(portfolio_density_plot)$data[[1]] %>% filter(x < mean(portfolio_returns_tq_rebalanced_monthly$returns)). That snippet takes our original ggplot object and creates a new object filtered for x values less than the mean return. Then we use geom_area to add the shaded area to portfolio_density_plot.

    shaded_area_data <-
      ggplot_build(portfolio_density_plot)$data[[1]] %>%
      filter(x < mean(portfolio_returns_tq_rebalanced_monthly$returns))
    portfolio_density_plot_shaded <-
      portfolio_density_plot +
      geom_area(data = shaded_area_data,
                aes(x = x, y = y),
                fill = "pink", alpha = 0.5)
    portfolio_density_plot_shaded

    The shaded area highlights the mass of returns that fall below the mean. Let’s add a vertical line at the mean and median, and some explanatory labels. This will help to emphasize that negative skew indicates a mean less than the median.

    First, create variables for mean and median so that we can add a vertical line.

    median <- median(portfolio_returns_tq_rebalanced_monthly$returns)
    mean <- mean(portfolio_returns_tq_rebalanced_monthly$returns)

    We want the vertical lines to just touch the density plot so we once again use a call to ggplot_build(portfolio_density_plot)$data[[1]].

    median_line_data <-
      ggplot_build(portfolio_density_plot)$data[[1]] %>%
      filter(x <= median)

    Now we can start adding aesthetics to the latest iteration of our graph, which is stored in the object portfolio_density_plot_shaded.

    portfolio_density_plot_shaded +
      geom_segment(aes(x = 0, y = 1.9, xend = -.045, yend = 1.9),
                   arrow = arrow(length = unit(0.5, "cm")), size = .05) +
      annotate(geom = "text", x = -.02, y = .1, label = "returns <= mean",
               fontface = "plain", alpha = .8, vjust = -1) +
      geom_segment(data = shaded_area_data,
                   aes(x = mean, y = 0, xend = mean, yend = density),
                   color = "red", linetype = "dotted") +
      annotate(geom = "text", x = mean, y = 5, label = "mean", color = "red",
               fontface = "plain", angle = 90, alpha = .8, vjust = -1.75) +
      geom_segment(data = median_line_data,
                   aes(x = median, y = 0, xend = median, yend = density),
                   color = "black", linetype = "dotted") +
      annotate(geom = "text", x = median, y = 5, label = "median",
               fontface = "plain", angle = 90, alpha = .8, vjust = 1.75) +
      ggtitle("Density Plot Illustrating Skewness")

    We added quite a bit to the chart, possibly too much, but it’s better to be over-inclusive now to test different variants. We can delete any of those features when using this chart later, or refer back to these lines of code should we ever want to reuse some of the aesthetics.

    At this point, we have calculated the skewness of this portfolio throughout its history, and done so using three methods. We have also created an explanatory visualization.

    Similar to the portfolio standard deviation, though, our work is not complete until we look at rolling skewness. Perhaps the first two years of the portfolio were positively skewed and the last two negatively skewed, while the overall skewness is slightly negative. We would like to understand how the skewness has changed over time, and in different economic and market regimes. To do so, we calculate and visualize the rolling skewness over time.

    In the xts world, calculating rolling skewness is almost identical to calculating rolling standard deviation, except we call the skewness() function instead of StdDev(). Since this is a rolling calculation, we need a window of time for each skewness; here, we will use a six-month window.

    window <- 6
    rolling_skew_xts <-
      na.omit(rollapply(portfolio_returns_xts_rebalanced_monthly,
                        window,
                        function(x) skewness(x)))
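    Under the hood, rollapply() simply slides the window along the series and applies skewness() to each window. A plain-R sketch of the same idea (the helper and the simulated series are illustrative, not from the post):

    ```r
    # Rolling skewness "by hand": slide a fixed-width window along the
    # series and apply the skewness formula to each window
    roll_skew <- function(returns, window) {
      skew <- function(x) {
        m3 <- sum((x - mean(x))^3) / length(x)
        m2 <- sum((x - mean(x))^2) / length(x)
        m3 / m2^(3/2)
      }
      out <- rep(NA_real_, length(returns))
      for (i in window:length(returns)) {
        out[i] <- skew(returns[(i - window + 1):i])
      }
      out
    }

    set.seed(42)
    fake_returns <- rnorm(24, mean = 0.005, sd = 0.03)  # two years of fake monthly returns
    rolling <- roll_skew(fake_returns, window = 6)
    sum(  # 5: the first (window - 1) observations have no value
    ```

    The leading NAs are exactly what na.omit() strips from the xts result above.
    
    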

    Now we pop that xts object into highcharter for a visualization. Let’s make sure our y-axis range is large enough to capture the nature of the rolling skewness fluctuations by setting the range to between 3 and -3 with hc_yAxis(..., max = 3, min = -3). I find that if we keep the range from 1 to -1, it makes most rolling skews look like a roller coaster.

    library(highcharter)
    highchart(type = "stock") %>%
      hc_title(text = "Rolling") %>%
      hc_add_series(rolling_skew_xts, name = "Rolling skewness",
                    color = "cornflowerblue") %>%
      hc_yAxis(title = list(text = "skewness"), opposite = FALSE,
               max = 3, min = -3) %>%
      hc_navigator(enabled = FALSE) %>%
      hc_scrollbar(enabled = FALSE)

    {"x":{"hc_opts":{"title":{"text":"Rolling"},"yAxis":{"title":{"text":"skewness"},"opposite":false,"max":3,"min":-3},"credits":{"enabled":false},"exporting":{"enabled":false},"plotOptions":{"series":{"turboThreshold":0},"treemap":{"layoutAlgorithm":"squarified"},"bubble":{"minSize":5,"maxSize":25}},"annotationsOptions":{"enabledButtons":false},"tooltip":{"delayForDisplay":10},"series":[{"data":[[1375228800000,0.0511810666724179],[1377820800000,0.0965061660700991],[1380499200000,0.163579005279668],[1383177600000,-0.0111271137345324],[1385683200000,-0.329175858151934],[1388448000000,-0.682693636723609],[1391126400000,-0.299180325964799],[1393545600000,-1.05714468932288],[1396224000000,-1.12678024437719],[1398816000000,-0.907478344540586],[1401408000000,-0.923291364103329],[1404086400000,-0.994115098616124],[1406764800000,-0.106737545935232],[1409270400000,-0.963528231829294],[1412035200000,-0.750713278430844],[1414713600000,-0.912281062412703],[1417132800000,-0.767038118453993],[1419984000000,-0.252362341390629],[1422576000000,-0.281193932085033],[1424995200000,0.201889248675472],[1427760000000,0.758665508271901],[1430352000000,0.809337297292621],[1432857600000,0.96467960406329],[1435622400000,0.894828618685645],[1438300800000,0.917549882978621],[1440979200000,-0.859694106578632],[1443571200000,-0.416692593249242],[1446163200000,0.664519170731158],[1448841600000,0.6350949021651],[1451520000000,0.71546907245901],[1454025600000,0.99556931274623],[1456704000000,1.03166733380321],[1459382400000,0.36259225564619],[1461888000000,0.804397321645972],[1464652800000,0.81695627686931],[1467244800000,0.56636688270716],[1469750400000,0.966848135339037],[1472601600000,1.05049406444632],[1475193600000,1.28796539782198],[1477872000000,0.478246331580933],[1480464000000,-0.0334553106079278],[1483056000000,-0.314054096729488],[1485820800000,-1.18653628983962],[1488240000000,-1.52309795733004],[1490918400000,-1.65501284162553],[1493337600000,0.332506398750578],[1496188800000,0.6337132
96827805],[1498780800000,0.534386548695677],[1501459200000,0.719857327874752],[1504137600000,-0.0682564673333931],[1506643200000,-0.122507709915633],[1509408000000,-0.340860682326799],[1512000000000,-0.444064183403134],[1513036800000,-0.394434127262236]],"name":"Rolling skewness","color":"cornflowerblue"}],"navigator":{"enabled":false},"scrollbar":{"enabled":false}},"theme":{"chart":{"backgroundColor":"transparent"}},"conf_opts":{"global":{"Date":null,"VMLRadialGradientURL":"http =//","canvasToolsURL":"http =//","getTimezoneOffset":null,"timezoneOffset":0,"useUTC":true},"lang":{"contextButtonTitle":"Chart context menu","decimalPoint":".","downloadJPEG":"Download JPEG image","downloadPDF":"Download PDF document","downloadPNG":"Download PNG image","downloadSVG":"Download SVG vector image","drillUpText":"Back to {}","invalidDate":null,"loading":"Loading...","months":["January","February","March","April","May","June","July","August","September","October","November","December"],"noData":"No data to display","numericSymbols":["k","M","G","T","P","E"],"printChart":"Print chart","resetZoom":"Reset zoom","resetZoomTitle":"Reset zoom level 1:1","shortMonths":["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"thousandsSep":" ","weekdays":["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]}},"type":"stock","fonts":[],"debug":false},"evals":[],"jsHooks":[]}

    For completeness of methods, we can calculate rolling skewness in a tibble and then use ggplot.

    We will make use of rollapply() from within tq_mutate in tidyquant.

    rolling_skew_tidy <-
      portfolio_returns_tq_rebalanced_monthly %>%
      tq_mutate(select = returns,
                mutate_fun = rollapply,
                width = window,
                FUN = skewness,
                col_rename = "skew")

    rolling_skew_tidy is ready for ggplot. ggplot is not purpose-built for time series plotting, but we can set aes(x = date, y = skew) to make the x-axis our date values.

    library(scales)
    theme_update(plot.title = element_text(hjust = 0.5))
    rolling_skew_tidy %>%
      ggplot(aes(x = date, y = skew)) +
      geom_line(color = "cornflowerblue") +
      ggtitle("Rolling Skew with ggplot") +
      ylab(paste("Rolling", window, "month skewness", sep = " ")) +
      scale_y_continuous(limits = c(-3, 3), breaks = pretty_breaks(n = 8)) +
      scale_x_date(breaks = pretty_breaks(n = 8))

    The rolling charts are quite illuminating and show that the six-month-interval skewness has been positive for about half the lifetime of this portfolio. Today, the overall skewness is negative, but the rolling skewness in mid-2016 was positive and greater than 1. It took a huge plunge starting at the end of 2016, and the lowest reading was -1.65 in March of 2017, most likely caused by one or two very large negative returns when the market was worried about the US election. We can see those worries start to abate as the rolling skewness becomes more positive throughout 2017.

    That’s all for today. Thanks for reading and see you next time when we tackle kurtosis.



    To leave a comment for the author, please follow the link and comment on their blog: R Views.

    A chart of Bechdel Test scores

    Tue, 12/12/2017 - 23:40

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    A movie is said to satisfy the Bechdel Test if it satisfies the following three criteria:

    1. The movie has at least two named female characters
    2. … who have a conversation with each other
    3. … about something other than a man

    The website scores movies accordingly, granting one point for each of the criteria above for a maximum of three points. The recent Wonder Woman movie scores the full three points (Diana and her mother discuss war, for example), while Dunkirk, which features no named female characters, gets zero. (It was still a great film, however.)
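    The cumulative scoring rule is easy to express as code; a hypothetical R helper (not the site's actual API) that awards one point per criterion, where each criterion presupposes the previous one:

    ```r
    # Hypothetical Bechdel scorer: 0-3 points, each criterion
    # only counts if the previous one is already satisfied
    bechdel_score <- function(two_named_women, they_talk, not_about_a_man) {
      if (!two_named_women) return(0)
      if (!they_talk)       return(1)
      if (!not_about_a_man) return(2)
      3
    }

    bechdel_score(TRUE, TRUE, TRUE)    # Wonder Woman: 3
    bechdel_score(FALSE, FALSE, FALSE) # Dunkirk: 0
    ```
    
    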

    The website also offers an API, which enabled data and analytics director Austin Wehrwein to create this time series chart of Bechdel scores for movies listed on

    This chart only includes ratings for that subset of movies listed on, so it's not clear whether this is representative of movies as a whole. Austin suggests combining these data with the broader data from IMDb, so maybe someone wants to give that a try. Austin's R code for generating the above chart and several others is available at the link below, so click through for the full analysis.

    Austin Wehrwein: A quick look at Bechdel test data (& an awtools update) (via Mara Averick)


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    Leveraging pipelines in Spark through Scala and sparklyr

    Tue, 12/12/2017 - 22:59

    (This article was first published on, and kindly contributed to R-bloggers)


    The pipeline concept is definitely not new to the software world; the Unix pipe operator (|) links two tasks, putting the output of one as the input of the other. In machine learning solutions it is quite common to apply several transformations and manipulations to datasets, or to different portions or samples of the same dataset (from classic test/train splits to samples taken for a cross-validation procedure). In these cases, pipelines are definitely useful for defining a sequence of tasks chained together into a complete process, which in turn simplifies the operation of the ML solution. In addition, in a big-data environment, it is possible to apply "laziness" of execution to the entire process in order to make it more scalable and robust. It is therefore no surprise to see pipelines implemented in Spark's machine learning library, with an R API available through the sparklyr package to leverage the construct. Pipeline components in Spark are basically of two types:

    • transformer: since dataframes usually need to undergo several kinds of changes column-wise, row-wise or even single-value-wise, transformers are the components meant to deliver these transformations. Typically a transformer takes a dataframe as input and delivers a dataframe as output. Spark, through MLlib, provides a set of feature transformers meant to address the most common transformations needed;
    • estimator: an estimator is the component which delivers a model by fitting an algorithm to training data. Indeed, fit() is the key method for an estimator and produces, as said, a model, which is itself a transformer. Leveraging the parallel processing provided by Spark, it is possible to run several estimators in parallel on different training datasets in order to find the best solution (or even to avoid overfitting). ML algorithms are basically a set of estimators; they make up the rich set of common machine learning (ML) algorithms available from MLlib, a library of algorithms meant to be scalable and run in a parallel environment. Starting from the 2.0 release of Spark, the RDD-based library is in maintenance mode (the RDD-based APIs are expected to be removed in the 3.0 release) and mainstream development is focused on supporting dataframes. In MLlib, features are expressed as labeled points, which means numeric vectors for features and predictors. Pipelines of transformers are, for this reason too, extremely useful for operationalizing Spark-based ML solutions. For additional details on MLlib refer to the Apache Spark documentation.
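    To make the chaining idea concrete outside Spark, here is a toy plain-R sketch (all names illustrative, not Spark or sparklyr code) of a pipeline as an ordered list of stage functions threaded through the data in sequence:

    ```r
    # A "pipeline" as an ordered list of stages; applying it threads the
    # data frame through each stage in turn, like chained transformers
    recode_marital <- function(d) {
      d$new_marital <- ifelse(
        d$marital %in% c("Married-civ-spouse", "Married-AF-spouse", "Married-spouse-absent"),
        "Married", "Nonmarried")
      d
    }
    index_marital <- function(d) {
      # map the recoded strings to 0-based integer indices
      d$maritalIdx <- as.integer(factor(d$new_marital)) - 1L
      d
    }

    pipeline <- list(recode_marital, index_marital)
    apply_pipeline <- function(data, stages) Reduce(function(d, f) f(d), stages, init = data)

    df <- data.frame(marital = c("Divorced", "Married-civ-spouse", "Widowed"),
                     stringsAsFactors = FALSE)
    result <- apply_pipeline(df, pipeline)
    result$new_marital  # "Nonmarried" "Married" "Nonmarried"
    ```

    Spark's Pipeline does the same threading, with the added benefits of lazy, distributed execution and a shared fit/transform contract across stages.
    
    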

    In this post we’ll see a simple example of pipeline usage, in two versions: the first one using Scala (a kind of “native” language for the Spark environment); afterward we’ll see how to implement the same example in R, leveraging the sparklyr package, in order to show how powerful and complete it is.

    The dataset

    For this example, the dataset comes from the UCI Machine Learning Repository (Irvine, CA: University of California, School of Information and Computer Science). The “Adult Income” dataset reports individuals’ annual income as the result of various factors. The annual income will be our label (it is divided into two classes: <=50K and >50K) and there are 14 features per individual that we can leverage to explore the possibility of predicting income level. For additional detail on the “adult” dataset, refer to the UCI machine learning repository.

    Scala code

    As said, we’ll show how we can use the Scala API to access pipelines in MLlib; therefore we need to include references to the classes we’re planning to use in our example and to start a Spark session:

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql._
    import{VectorAssembler, StringIndexer, VectorIndexer}
    import org.apache.spark.sql.SparkSession

    Then we’ll read the dataset and start to manipulate the data in order to prepare it for the pipeline. In our example, we’ll get data out of a local repository (instead of referring to, e.g., an HDFS or data lake repository; there are APIs, for both Scala and R, which allow access to these repositories as well). We’ll leverage this upload activity also to rename some columns; in particular, we’ll rename the “income” column as “label” since we’ll use this as the label column in our logistic regression algorithm.

    //load data source from local repository
    val csv ="inferSchema", "true")
      .option("header", "true").csv("...\yyyy\xxxx\adult.csv")

    val data_start =$"workclass"), ($"gender"), ($"education"), ($"age"),
      ($"marital-status").alias("marital"), ($"occupation"), ($"relationship"),
      ($"race"), ($"hours-per-week").alias("hr_per_week"),
      ($"native-country").alias("country"),
      ($"income").alias("label"), ($"capital-gain").alias("capitalgain"),
      ($"capital-loss").alias("capitalloss"),
      ($"educational-num").alias("eduyears")).toDF

    We’ll do some data cleanup, basically recoding the “workclass” and “marital” columns in order to reduce the number of codes, and we’ll get rid of rows for which “occupation”, “workclass” (even recoded) and “capital gain” are not available. For the first two columns the dataset has the “?” value instead of “NA”; for capital gain there is the 99999 value, which will be filtered out. To recode the “workclass” and “marital” columns we’ll use UDFs, which in turn are wrappers around the actual recoding functions. To add a new column to the (new) dataframe we’ll use the “withColumn” method, which will add “new_marital” and “new_workclass” to the startingdata dataframe. Afterwards, we’ll filter out all missing values and we’ll be ready to build the pipeline.

    // recoding marital status and working class, adding a new column
    def newcol_marital(str1: String): String = {
      var nw_str = "noVal"
      if ((str1 == "Married-spouse-absent") | (str1 == "Married-AF-spouse") |
          (str1 == "Married-civ-spouse")) { nw_str = "Married" }
      else if ((str1 == "Divorced") | (str1 == "Never-married") |
               (str1 == "Separated") | (str1 == "Widowed")) { nw_str = "Nonmarried" }
      else { nw_str = str1 }
      return nw_str
    }
    val udfnewcol_marital = udf(newcol_marital _)
    val recodeddata = data_start.withColumn("new_marital", udfnewcol_marital(col("marital")))

    def newcol_wkclass(str1: String): String = {
      var nw_str = "noVal"
      if ((str1 == "Local-gov") | (str1 == "Federal-gov") | (str1 == "State-gov")) { nw_str = "Gov" }
      else if ((str1 == "Self-emp-not-inc") | (str1 == "Self-emp-inc")) { nw_str = "Selfemployed" }
      else if ((str1 == "Never-worked") | (str1 == "Without-pay")) { nw_str = "Notemployed" }
      else { nw_str = str1 }
      return nw_str
    }
    val udfnewcol_wkclass = udf(newcol_wkclass _)
    val startingdata = recodeddata.withColumn("new_workclass", udfnewcol_wkclass(col("workclass")))

    // remove missing data
    val df_work01 ="any")
    val df_work = startingdata.filter("occupation <> '?' and capitalgain < 99999 and new_workclass <> '?' and country <> '?' ")

    In our example, we’re going to use 12 features: 7 categorical variables and 5 numeric variables. The feature array we’ll use to fit the model will be the result of merging two arrays, one for categorical variables and one for numeric variables. Before building the categorical array, we need to transform the categories to indices using transformers provided by; even the label must be transformed to an index. Our pipeline will therefore include 7 pipeline stages to transform the categorical variables, 1 stage to transform the label, 2 stages to build the categorical and numeric arrays, and a final stage to fit the logistic model: an 11-stage pipeline.
    To transform categorical variables into indices we’ll use the StringIndexer transformer. StringIndexer encodes a column of strings into a column of non-negative indices, ranging from 0 to the number of distinct values minus one. The indices are ordered by label frequency, so the most frequent value gets index 0. For each variable, we need to define the input column and the output column, which we’ll use as input for other transformers or evaluators. Finally, it is possible to define the strategy for handling unseen labels (possible when you use the StringIndexer to fit a model and run the fitted model against a test dataframe) through the setHandleInvalid method; in our case we simply pass “skip” to tell StringIndexer we want to skip unseen labels (additional details are available in the MLlib documentation).

    // define stages
    val new_workclassIdx = new StringIndexer().setInputCol("new_workclass")
      .setOutputCol("new_workclassIdx").setHandleInvalid("skip")
    val genderIdx = new StringIndexer().setInputCol("gender")
      .setOutputCol("genderIdx").setHandleInvalid("skip")
    val maritalIdx = new StringIndexer().setInputCol("new_marital")
      .setOutputCol("maritalIdx").setHandleInvalid("skip")
    val occupationIdx = new StringIndexer().setInputCol("occupation")
      .setOutputCol("occupationIdx").setHandleInvalid("skip")
    val relationshipIdx = new StringIndexer().setInputCol("relationship")
      .setOutputCol("relationshipIdx").setHandleInvalid("skip")
    val raceIdx = new StringIndexer().setInputCol("race")
      .setOutputCol("raceIdx").setHandleInvalid("skip")
    val countryIdx = new StringIndexer().setInputCol("country")
      .setOutputCol("countryIdx").setHandleInvalid("skip")
    val labelIdx = new StringIndexer().setInputCol("label")
      .setOutputCol("labelIdx").setHandleInvalid("skip")

    In addition to the Transformer and Estimator classes provided by the package, it is possible to define custom estimators and transformers. As an example, we’ll see how to define a custom transformer aimed at recoding “marital status” in our dataset (basically the same task we have already seen, implemented as a custom transformer; for additional details on implementing custom estimators and transformers see the nice article by H. Karau). To define a custom transformer, we define a new Scala class, columnRecoder, which extends the Transformer class; we override the transformSchema method to map the correct type of the new column we’re going to add with the transformation, and we implement the transform method, which actually does the recoding in our dataset. A possible implementation is:

    import{Param, ParamMap}
    import org.apache.spark.sql.{DataFrame, Dataset}
    import org.apache.spark.sql.functions.udf
    import org.apache.spark.sql.types._

    class columnRecoder(override val uid: String) extends Transformer {
      final val inputCol = new Param[String](this, "inputCol", "input column")
      final val outputCol = new Param[String](this, "outputCol", "output column")

      def setInputCol(value: String): this.type = set(inputCol, value)
      def setOutputCol(value: String): this.type = set(outputCol, value)
      def this() = this(Identifiable.randomUID("columnRecoder"))
      def copy(existingParam: ParamMap): columnRecoder = { defaultCopy(existingParam) }

      override def transformSchema(schema: StructType): StructType = {
        // Check inputCol type
        val idx = schema.fieldIndex($(inputCol))
        val field = schema.fields(idx)
        if (field.dataType != StringType) {
          throw new Exception(s"Input type ${field.dataType} type mismatch: String expected")
        }
        // The return field
        schema.add(StructField($(outputCol), StringType, false))
      }

      private def newcol_recode(str1: String): String = {
        var nw_str = "noVal"
        if ((str1 == "Married-spouse-absent") | (str1 == "Married-AF-spouse") |
            (str1 == "Married-civ-spouse")) { nw_str = "Married" }
        else if ((str1 == "Divorced") | (str1 == "Never-married") |
                 (str1 == "Separated") | (str1 == "Widowed")) { nw_str = "Nonmarried" }
        else { nw_str = str1 }
        nw_str
      }

      private def udfnewcol = udf(newcol_recode _)

      def transform(df: Dataset[_]): DataFrame = {
        df.withColumn($(outputCol), udfnewcol(df($(inputCol))))
      }
    }

    Once defined as a transformer, we can use it in our pipeline as the first stage.

    // define stages
    val new_marital = new columnRecoder().setInputCol("marital")
      .setOutputCol("new_marital")
    val new_workclassIdx = new StringIndexer().setInputCol("new_workclass")
      .setOutputCol("new_workclassIdx").setHandleInvalid("skip")
    val genderIdx = new StringIndexer().setInputCol("gender")
      .setOutputCol("genderIdx").setHandleInvalid("skip")
    val maritalIdx = new StringIndexer().setInputCol("new_marital")
      .setOutputCol("maritalIdx").setHandleInvalid("skip")
    .......

    The second step in building our pipeline is to assemble the categorical indices into a single vector: the categorical indices are put together through the VectorAssembler transformer. This VectorAssembler delivers a single feature column which is, in turn, transformed to indices by the VectorIndexer transformer, delivering the indices within the “catFeatures” column:

    // cat vector for categorical variables
    val catVect = new VectorAssembler()
      .setInputCols(Array("new_workclassIdx", "genderIdx", "maritalIdx",
        "occupationIdx", "relationshipIdx", "raceIdx", "countryIdx"))
      .setOutputCol("cat01Features")
    val catIdx = new VectorIndexer()
      .setInputCol(catVect.getOutputCol)
      .setOutputCol("catFeatures")

    For numeric variables we just need to assemble the columns with VectorAssembler; then we’re ready to put these two vectors (one for categorical variables, the other for numeric variables) together into a single vector.

    // numeric vector for numeric variables
    val numVect = new VectorAssembler()
      .setInputCols(Array("age", "hr_per_week", "capitalgain", "capitalloss", "eduyears"))
      .setOutputCol("numFeatures")
    val featVect = new VectorAssembler()
      .setInputCols(Array("catFeatures", "numFeatures"))
      .setOutputCol("features")

We now have the label and features ready for building the logistic regression model, which is the final component of our pipeline. We can also set some parameters for the model: in particular, we can define the threshold (0.5 by default) used to decide between label values, as well as the maximum number of iterations for the algorithm and a regularization parameter.
When all stages of the pipeline are ready, we just need to define the pipeline component itself, passing as an input parameter an array with all the defined stages:

val lr = new LogisticRegression()
  .setLabelCol("labelIdx")
  .setFeaturesCol("features")
  .setThreshold(0.33)
  .setMaxIter(10)
  .setRegParam(0.2)
val pipeline = new Pipeline().setStages(Array(new_marital, new_workclassIdx,
  labelIdx, maritalIdx, occupationIdx, relationshipIdx, raceIdx, genderIdx,
  countryIdx, catVect, catIdx, numVect, featVect, lr))

Now the pipeline component, which encompasses a number of transformations as well as the classification algorithm, is ready. To actually use it we supply a training dataset to fit the model and then a test dataset to evaluate the fitted model. Since we have defined a pipeline, we can be sure that both the training and test datasets undergo the same transformations, so we don't have to replicate the process twice.
We now need to define the train and test datasets. In our dataset, label values are unbalanced: the "more than 50K USD per year" value accounts for around 25% of the total. In order to preserve the same proportion between label values, we'll subset the original dataset based on label value, obtaining a low-income dataset and a high-income dataset. We'll split both datasets into train (70%) and test (30%), merge back the two "train" and the two "test" datasets, and use the resulting "train" dataset as input for our pipeline:

// split between train and test
val df_low_income = df_work.filter("label == '<=50K'")
val df_high_income = df_work.filter("label == '>50K'")
val splits_LI = df_low_income.randomSplit(Array(0.7, 0.3), seed = 123)
val splits_HI = df_high_income.randomSplit(Array(0.7, 0.3), seed = 123)
val df_work_train = splits_LI(0).unionAll(splits_HI(0))
val df_work_test = splits_LI(1).unionAll(splits_HI(1))
// fitting the pipeline
val data_model =

Once the pipeline is trained, we can use data_model to make predictions on the test dataset, calculate the confusion matrix, and evaluate the classifier metrics:

// generate predictions
val data_prediction = data_model.transform(df_work_test)
val data_predicted ="features", "prediction", "label", "labelIdx")

// confusion matrix
val tp = data_predicted.filter("prediction == 1 AND labelIdx == 1").count().toFloat
val fp = data_predicted.filter("prediction == 1 AND labelIdx == 0").count().toFloat
val tn = data_predicted.filter("prediction == 0 AND labelIdx == 0").count().toFloat
val fn = data_predicted.filter("prediction == 0 AND labelIdx == 1").count().toFloat

val metrics = spark.createDataFrame(Seq(
  ("TP", tp), ("FP", fp), ("TN", tn), ("FN", fn),
  ("Precision", tp / (tp + fp)),
  ("Recall", tp / (tp + fn))
)).toDF("metric", "value")

R code and SparklyR

Now we'll try to replicate the same example in R, more precisely, working with the SparklyR package.
We'll use the developer version of SparklyR (as you may know, there's an interesting debate about the best API for accessing Apache Spark resources from R).
We need to install it from GitHub before connecting with the Spark environment.
In our case Spark is a standalone instance running version 2.2.0. As reported in the official SparklyR documentation, configuration parameters can be set through the spark_config() function; in particular, spark_config() provides the basic configuration used by default for a Spark connection. To change parameters it's possible to get the default configuration via spark_config(), then change parameters as needed (here's the link for additional details on running SparklyR on a YARN cluster).

devtools::install_github("rstudio/sparklyr")
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local",
                    spark_home = "...\\Local\\spark",
                    version = "2.2.0")

After connecting with Spark, we'll read the dataset into a Spark dataframe and select the fields (with column renaming, where needed) we're going to use for our classification example. It is worth remembering that dplyr (and therefore SparklyR) uses lazy evaluation when accessing and manipulating data, which means that "the data is only fetched at the last moment when it's needed" (Efficient R Programming, C. Gillespie and R. Lovelace). For this reason, later on we'll see that we force statement execution by calling action functions (like collect()).
As we did in the Scala script, we'll recode "marital status" and "workclass" in order to simplify the analysis. In renaming dataframe columns we'll use the "select" function from the dplyr package; indeed one of SparklyR's aims is to allow the manipulation of Spark dataframes/tables through dplyr functions. This is an example of how functions like "select" or "filter" can also be used for Spark dataframes.
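The laziness is easy to see in a short sketch; the connection `sc` and the table name below are hypothetical, so the query-building lines are shown as comments rather than as a runnable script:

library(sparklyr)
library(dplyr)
# Assuming an open connection `sc` and a registered table "income_table":
# q <- tbl(sc, "income_table") %>%
#   filter(age > 30) %>%
#   summarise(n = n())   # builds SQL only; Spark does no work yet
# collect(q)             # the query is executed only here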

income_table <- spark_read_csv(sc, "income_table", "...\\adultincome\\adult.csv")
income_table <- select(income_table, "workclass", "gender",
                       "eduyears" = "educationalnum", "age",
                       "marital" = "maritalstatus", "occupation", "relationship",
                       "race", "hr_per_week" = "hoursperweek",
                       "country" = "nativecountry", "label" = "income",
                       "capitalgain", "capitalloss")
# recoding marital status and workclass
income_table <- income_table %>%
  mutate(marital = ifelse(marital == "Married-spouse-absent" |
                          marital == "Married-AF-spouse" |
                          marital == "Married-civ-spouse",
                          "married", "nonMarried"))
income_table <- income_table %>%
  mutate(workclass = ifelse(workclass == "Local-gov" |
                            workclass == "Federal-gov" |
                            workclass == "State-gov", "Gov",
                     ifelse(workclass == "Self-emp-inc" |
                            workclass == "Self-emp-not-inc", "Selfemployed",
                     ifelse(workclass == "Never-worked" |
                            workclass == "Without-pay", "Notemployed",
                            workclass))))

SparklyR provides functions which are bound to the Spark package, so it is possible to build machine learning solutions putting together the power of the dplyr grammar with Spark ML algorithms. To link its functions to the functions, SparklyR uses specific prefixes to identify the function groups:

    • functions prefixed with sdf_ generally access the Scala Spark DataFrame API directly (as opposed to the dplyr interface which uses Spark SQL) to manipulate dataframes;
    • functions prefixed with ft_ are functions to manipulate and transform features. Pipeline transformers and estimators belong to this group of functions;
• functions prefixed with ml_ implement algorithms to build a machine learning workflow. Even the pipeline instance itself is provided by ml_pipeline(), which belongs to this group of functions.
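A rough sketch of the three families follows; the Spark tbl `tbl` and connection `sc` are placeholders, so the calls are commented to show their shape rather than to run:

library(sparklyr)
# sdf_ : Scala Spark DataFrame API, e.g. dimensions of a Spark data frame
# sdf_dim(tbl)
# ft_  : feature transformers, e.g. a string-indexer stage
# ft_string_indexer(tbl, input_col = "gender", output_col = "genderIdx")
# ml_  : ML algorithms and the pipeline itself
# p <- ml_pipeline(sc)
# p <- ml_logistic_regression(p, features_col = "features", label_col = "labelIdx")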

We can then proceed with the pipeline, stage, and feature-array definitions. The names of the ft_-prefixed functions recall the functions they are bound to:

income_pipe <- ml_pipeline(sc, uid = "income_pipe")
income_pipe <- ft_string_indexer(income_pipe, input_col = "workclass", output_col = "workclassIdx")
income_pipe <- ft_string_indexer(income_pipe, input_col = "gender", output_col = "genderIdx")
income_pipe <- ft_string_indexer(income_pipe, input_col = "marital", output_col = "maritalIdx")
income_pipe <- ft_string_indexer(income_pipe, input_col = "occupation", output_col = "occupationIdx")
income_pipe <- ft_string_indexer(income_pipe, input_col = "race", output_col = "raceIdx")
income_pipe <- ft_string_indexer(income_pipe, input_col = "country", output_col = "countryIdx")
income_pipe <- ft_string_indexer(income_pipe, input_col = "label", output_col = "labelIdx")

array_features <- c("workclassIdx", "genderIdx", "maritalIdx",
                    "occupationIdx", "raceIdx", "countryIdx", "eduyears",
                    "age", "hr_per_week", "capitalgain", "capitalloss")

income_pipe <- ft_vector_assembler(income_pipe,
                                   input_col = array_features,
                                   output_col = "features")

In our example we'll use ml_logistic_regression() to implement the classification solution aimed at predicting income level. Since it is bound to the LogisticRegression() function in the package, all parameters are (as expected) available for specific settings, so we can "mirror" the same call we made in the Scala code: e.g. we can manage the "decision" threshold for our binary classifier (setting the value to 0.33).

# putting the logistic regression estimator in the pipeline
income_pipe <- ml_logistic_regression(income_pipe,
                                      features_col = "features",
                                      label_col = "labelIdx",
                                      family = "binomial",
                                      threshold = 0.33,
                                      reg_param = 0.2,
                                      max_iter = 10L)
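The effect of the threshold can be seen with plain R on a toy vector of class-1 probabilities; this is just the decision rule itself, not a Spark call:

p1 <- c(0.20, 0.40, 0.70)   # predicted probabilities of class 1
as.integer(p1 > 0.5)        # default threshold 0.5  -> 0 0 1
as.integer(p1 > 0.33)       # threshold 0.33 -> 0 1 1 (favours class 1)

Lowering the threshold trades precision for recall on the positive (high-income) class, which is a common choice when that class is the rarer one.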

The final steps in our example are to split the data into train and test subsets, fit the pipeline, and evaluate the classifier. As we did in the Scala code, we'll manage the imbalance in label values by splitting the dataframe in a way that preserves the relative percentage of each label value. Afterwards, we'll fit the pipeline and get the predictions for the test dataframe.

# data split, dealing with label imbalance
df_low_income  <- filter(income_table, income_table$label == "<=50K")
df_high_income <- filter(income_table, income_table$label == ">50K")
splits_LI <- sdf_partition(df_low_income,  test = 0.3, train = 0.7, seed = 7711)
splits_HI <- sdf_partition(df_high_income, test = 0.3, train = 0.7, seed = 7711)
df_test  <- sdf_bind_rows(splits_LI[[1]], splits_HI[[1]])
df_train <- sdf_bind_rows(splits_LI[[2]], splits_HI[[2]])
df_model <- ml_fit(income_pipe, df_train)
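The idea behind the stratified split can be checked with a small plain-R sketch on a local data frame; the column name and 75/25 proportions below are made up for illustration:

set.seed(123)
df <- data.frame(label = rep(c("<=50K", ">50K"), times = c(750, 250)))
# sample 70% of the row indices within each label value
idx <- unlist(lapply(split(seq_len(nrow(df)), df$label),
                     function(i) sample(i, floor(0.7 * length(i)))))
train <- df[idx, , drop = FALSE]
test  <- df[-idx, , drop = FALSE]
prop.table(table(train$label))   # 0.75 / 0.25, same proportions as the full data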

Once fitted, the model exposes, for each pipeline stage, all parameters, as well as the logistic regression coefficients (the logistic regression being the last element in the stages list).

    df_model$stages df_model$stages[[9]]$coefficients

We can then process the test dataset, putting it through the pipeline to get predictions, and evaluate the fitted model.

    df_prediction <- ml_predict(df_model,df_test) df_results <- select(df_prediction,"prediction","labelIdx","probability")

    We can then proceed to evaluate the confusion matrix and get main metrics to evaluate the model (from precision to AUC).

# calculating confusion matrix
df_tp <- filter(df_results, prediction == 1 & labelIdx == 1)
df_fp <- filter(df_results, prediction == 1 & labelIdx == 0)
df_tn <- filter(df_results, prediction == 0 & labelIdx == 0)
df_fn <- filter(df_results, prediction == 0 & labelIdx == 1)

tp <- df_tp %>% tally() %>% collect() %>% as.integer()
fp <- df_fp %>% tally() %>% collect() %>% as.integer()
tn <- df_tn %>% tally() %>% collect() %>% as.integer()
fn <- df_fn %>% tally() %>% collect() %>% as.integer()

df_precision <- tp / (tp + fp)
df_recall    <- tp / (tp + fn)
df_accuracy  <- (tp + tn) / (tp + tn + fp + fn)

c_AUC <- ml_binary_classification_eval(df_prediction,
                                       label_col = "labelIdx",
                                       prediction_col = "prediction",
                                       metric_name = "areaUnderROC")

Conclusion

As we have seen, pipelines are a useful mechanism to assemble and serialize transformations in order to make them repeatable for different sets of data. They offer a simple way to fit and evaluate models through train/test datasets, and are also suitable for running the same sequence of transformers/estimators in parallel over different nodes of the Spark cluster (e.g. to find the best parameter set). Moreover, a powerful API for dealing with pipelines in R is available through the SparklyR package, which in addition provides a comprehensive set of functions to leverage the ML Spark package (here's a link for a guide to deploying SparklyR in different cluster environments). Finally, support for running R code distributed across a Spark cluster has been added to SparklyR with the spark_apply() function, which makes the possibility of leveraging pipelines in R in a distributed environment for analytical solutions even more interesting.

    To leave a comment for the author, please follow the link and comment on their blog: offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    New R Course: Introduction to the Tidyverse!

    Tue, 12/12/2017 - 21:56

    (This article was first published on, and kindly contributed to R-bloggers)

    Hi! Big announcement today as we just launched Introduction to the Tidyverse R course by David Robinson!

    This is an introduction to the programming language R, focused on a powerful set of tools known as the “tidyverse”. In the course you’ll learn the intertwined processes of data manipulation and visualization through the tools dplyr and ggplot2. You’ll learn to manipulate data by filtering, sorting and summarizing a real dataset of historical country data in order to answer exploratory questions. You’ll then learn to turn this processed data into informative line plots, bar plots, histograms, and more with the ggplot2 package. This gives a taste both of the value of exploratory data analysis and the power of tidyverse tools. This is a suitable introduction for people who have no previous experience in R and are interested in learning to perform data analysis.

    Take me to chapter 1!

    Introduction to the Tidyverse features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a Tidyverse expert!

    What you’ll learn

    1. Data wrangling

    In this chapter, you’ll learn to do three things with a table: filter for particular observations, arrange the observations in a desired order, and mutate to add or change a column. You’ll see how each of these steps lets you answer questions about your data.
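A minimal taste of the three verbs, using the built-in mtcars data rather than the course's country dataset:

library(dplyr)
mtcars %>%
  filter(cyl == 4) %>%            # keep particular observations
  arrange(desc(mpg)) %>%          # order them
  mutate(kpl = mpg * 0.425) %>%   # add a derived column
  head(3)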

    2. Data visualization

    You’ve already been able to answer some questions about the data through dplyr, but you’ve engaged with them just as a table (such as one showing the life expectancy in the US each year). Often a better way to understand and present such data is as a graph. Here you’ll learn the essential skill of data visualization, using the ggplot2 package. Visualization and manipulation are often intertwined, so you’ll see how the dplyr and ggplot2 packages work closely together to create informative graphs.

    3. Grouping and summarizing

So far you've been answering questions about individual country-year pairs, but we may be interested in aggregations of the data, such as the average life expectancy of all countries within each year. Here you'll learn to use the group_by and summarize verbs, which collapse large datasets into manageable summaries.
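In dplyr terms (again on mtcars rather than the course data), the two verbs look like:

library(dplyr)
mtcars %>%
  group_by(cyl) %>%                 # one group per cylinder count
  summarize(mean_mpg = mean(mpg),   # collapse each group to one summary row
            n = n())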

    4. Types of visualizations

    You’ve learned to create scatter plots with ggplot2. In this chapter you’ll learn to create line plots, bar plots, histograms, and boxplots. You’ll see how each plot needs different kinds of data manipulation to prepare for it, and understand the different roles of each of these plot types in data analysis.
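For instance, a histogram and a boxplot of the same variable need no reshaping at all in ggplot2 (shown here on mtcars):

library(ggplot2)
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)        # distribution of one variable
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()                   # distribution within groups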

    Master the Tidyverse with our course Introduction to the Tidyverse



    Point Pattern Analysis using Ecological Methods in R

    Tue, 12/12/2017 - 12:45

    (This article was first published on R –, and kindly contributed to R-bloggers)

Here is a quick example of how to get started with some of the more sophisticated point pattern analysis tools that have been developed for ecologists – principally the adehabitatHR package – but that are very useful for human data. Ecologists deploy point pattern analysis to establish the "home range" of a particular animal based on the known locations where it has been sighted (either directly or remotely via camera traps). Essentially it is where the animal spends most of its time. In the case of human datasets the analogy can be extended to identify areas where most crimes are committed – hotspots – or to identify the activity spaces of individuals or the catchment areas of services such as schools and hospitals.

    This tutorial offers a rough analysis of crime data in London so the maps should not be taken as definitive – I’ve just used them as a starting point here.

Police crime data have been taken from here. This tutorial uses data from London's Metropolitan Police (September 2017).

    For the files used below click here.

# Load in the libraries and the csv file.
library("rgdal")
library(raster)
library(adehabitatHR)
input <- read.csv("2017-09-metropolitan-street.csv")
input <- input[, 1:10]                   # We only need the first 10 columns
input <- input[complete.cases(input), ]  # Remove rows with NA values in the data

    At the moment input is a basic data frame. We need to convert the data frame into a spatial object. Note we have first specified our epsg code as 4326 since the coordinates are in WGS84. We then use spTransform to reproject the data into British National Grid – so the coordinate values are in meters.

Crime.Spatial <- SpatialPointsDataFrame(input[, 5:6], input,
                                        proj4string = CRS("+init=epsg:4326"))
# Reproject from WGS84 to British National Grid
Crime.Spatial <- spTransform(Crime.Spatial, CRS("+init=epsg:27700"))
plot(Crime.Spatial)  # Plot the data

    The plot reveals that we have crimes across the UK, not just in London. So we need an outline of London to help limit the view. Here we load in a shapefile of the Greater London Authority Boundary.

    London<- readOGR(".", layer="GLA_outline")

    Thinking ahead we may wish to compare a number of density estimates, so they need to be performed across a consistently sized grid. Here we create an empty grid in advance to feed into the kernelUD function.

Extent <- extent(London)  # the geographic extent of the grid, based on the London object
# Specify the size of each grid cell in metres (since those are the units our data are projected in).
resolution <- 500
# Create the empty grid
x <- seq(Extent[1], Extent[2], by = resolution)  # where resolution is the pixel size you desire
y <- seq(Extent[3], Extent[4], by = resolution)
xy <- expand.grid(x = x, y = y)
coordinates(xy) <- ~x + y
gridded(xy) <- TRUE
# You can see the grid here (it may appear solid black if the cells are small)
plot(xy)
plot(London, border = "red", add = T)

OK, now run the density estimation; note we use grid = xy to utilise the grid we just created. This is for all crime in London.

# Note we are running two functions here: first kernelUD, then converting the result to a raster object.
all <- raster(kernelUD(Crime.Spatial, h = "href", grid = xy))
# First results
plot(all)
plot(London, border = "red", add = T)

Unsurprisingly we have a hotspot over the centre of London. Are there differences for specific crime types? We may, for example, wish to look at the density of burglaries.

plot(Crime.Spatial[Crime.Spatial$Crime.type == "Burglary", ])  # quick plot of burglary points
burglary <- raster(kernelUD(Crime.Spatial[Crime.Spatial$Crime.type == "Burglary", ],
                            h = "href", grid = xy))
plot(burglary)
plot(London, border = "red", add = T)

    There’s a slight difference but still it’s tricky to see if there are areas where burglaries concentrate more compared to the distribution of all crimes. A very rough way to do this is to divide one density grid by the other.

both <- burglary/all
plot(both)
plot(London, border = "red", add = T)

This hasn't worked particularly well, since there are edge effects on the density grid that are causing issues due to a few stray points at the edge of the grid. We can solve this by capping the values we map – in this case we are only showing values between 0 and 1. Some more interesting structures emerge, with burglary occurring in more residential areas, as expected.

both2 <- both
both2[both <= 0] <- NA
both2[both >= 1] <- NA
# Now we can see the hotspots much more clearly.
plot(both2)
plot(London, add = T)

There are many more sophisticated approaches to this kind of analysis – I'd encourage you to look at the adehabitatHR vignette on the package page.



    A Not So Simple Bar Plot Example Using ggplot2

    Tue, 12/12/2017 - 11:00

    (This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers)

This is a reproduction, using ggplot2, of the (simple) bar plot from chapter 6.1.1 of Datendesign mit R. To download the data you can use the following lines:

dir.create("data")
writeLines("*", "data/.gitignore")
download.file("", "data/")
unzip("data/", exdir = "data")

    And to download the original script and the base R version of this plot:

download.file("", "data/")
unzip("data/", exdir = "data")

    After downloading check out the original pdf version of this plot in data/beispielcode/pdf/balkendiagramm_einfach.pdf.

    Preparing Data

    Here are some steps to modify the data such that it can be easily used with ggplot2.

# remember to adjust the path
ipsos <- openxlsx::read.xlsx("../data/alle_daten/ipsos.xlsx")
ipsos <- ipsos[order(ipsos$Wert), ]
ipsos$Land <- ordered(ipsos$Land, ipsos$Land)
ipsos$textFamily <- ifelse(ipsos$Land %in% c("Deutschland", "Brasilien"),
                           "Lato Black", "Lato Light")
ipsos$labels <- paste0(ipsos$Land,
                       ifelse(ipsos$Wert < 10, " ", " "),
                       ipsos$Wert)
rect <- data.frame(
  ymin = seq(0, 80, 20),
  ymax = seq(20, 100, 20),
  xmin = 0.5,
  xmax = 16.5,
  colour = rep(c(grDevices::rgb(191, 239, 255, 80, maxColorValue = 255),
                 grDevices::rgb(191, 239, 255, 120, maxColorValue = 255)),
               length.out = 5))

The Basic Plot

    First we add the geoms, then modifications to the scales and flip of the coordinate system. The remaining code is just modifying the appearance.

library("ggplot2")
ggBar <- ggplot(ipsos) +
  geom_bar(aes(x = Land, y = Wert), stat = "identity", fill = "grey") +
  geom_bar(aes(x = Land,
               y = ifelse(Land %in% c("Brasilien", "Deutschland"), Wert, NA)),
           stat = "identity", fill = rgb(255, 0, 210, maxColorValue = 255)) +
  geom_rect(data = rect,
            mapping = aes(ymin = ymin, ymax = ymax, xmin = xmin, xmax = xmax),
            fill = rect$colour) +
  geom_hline(aes(yintercept = 45), colour = "skyblue3") +
  scale_y_continuous(breaks = seq(0, 100, 20), limits = c(0, 100),
                     expand = c(0, 0)) +
  scale_x_discrete(labels = ipsos$labels) +
  coord_flip() +
  labs(y = NULL, x = NULL, title = NULL) +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        axis.ticks = element_blank(),
        axis.text.y = element_text(family = ipsos$textFamily),
        text = element_text(family = "Lato Light"))
ggBar

    Annotations and Layout

    Of course you can simply add the title and text annotations to the plot using ggplot2, but I didn’t find a way to do the exact placement comparable to the original version without the package grid.

library("grid")
vp_make <- function(x, y, w, h)
  viewport(x = x, y = y, width = w, height = h, just = c("left", "bottom"))

main        <- vp_make(0.05, 0.05, 0.9, 0.8)
title       <- vp_make(0, 0.9, 0.6, 0.1)
subtitle    <- vp_make(0, 0.85, 0.4, 0.05)
footnote    <- vp_make(0.55, 0, 0.4, 0.05)
annotation1 <- vp_make(0.7, 0.85, 0.225, 0.05)
annotation2 <- vp_make(0.4, 0.85, 0.13, 0.05)

# To see which space these viewports will use:
grid.rect(gp = gpar(lty = "dashed"))
grid.rect(gp = gpar(col = "grey"), vp = main)
grid.rect(gp = gpar(col = "grey"), vp = title)
grid.rect(gp = gpar(col = "grey"), vp = subtitle)
grid.rect(gp = gpar(col = "grey"), vp = footnote)
grid.rect(gp = gpar(col = "grey"), vp = annotation1)
grid.rect(gp = gpar(col = "grey"), vp = annotation2)

    And now we can add the final annotations to the plot:

# pdf_datei <- "balkendiagramme_einfach.pdf"
# cairo_pdf(bg = "grey98", pdf_datei, width = 9, height = 6.5)
grid.newpage()
print(ggBar, vp = main)
grid.text("'Ich glaube fest an Gott oder ein höheres Wesen'",
          gp = gpar(fontfamily = "Lato Black", fontsize = 14),
          just = "left", x = 0.05, vp = title)
grid.text("...sagten 2010 in:",
          gp = gpar(fontfamily = "Lato Light", fontsize = 12),
          just = "left", x = 0.05, vp = subtitle)
grid.text("Quelle:, Design: Stefan Fichtel, ixtract",
          gp = gpar(fontfamily = "Lato Light", fontsize = 9),
          just = "right", x = 0.95, vp = footnote)
grid.text("Alle Angaben in Prozent",
          gp = gpar(fontfamily = "Lato Light", fontsize = 9),
          just = "right", x = 1, y = 0.55, vp = annotation1)
grid.text("Durchschnitt: 45",
          gp = gpar(fontfamily = "Lato Light", fontsize = 9),
          just = "right", x = 0.95, y = 0.55, vp = annotation2)



    Stock-Recruitment Graphing Questions

    Tue, 12/12/2017 - 07:00

    (This article was first published on fishR Blog, and kindly contributed to R-bloggers)

    A fishR user recently asked me

    In the book that you published, I frequently use the stock-recruit curve code. The interface that shows both the Ricker/Beverton-Holt figure with the recruit per spawner to spawner figure (i.e., the dynamic plot for srStarts()) has not been working for quite some time. Additionally, I can get the recruits versus spawner plot for the Beverton-Holt or Ricker curve with confidence bounds around the curve, but how do you do the same for the recruit per spawner to spawner curve?

    Below I answer both questions. First, however, I load required packages …

    library(FSA) # srStarts, srFuns
    library(FSAdata) # PSalmonAK data
    library(dplyr) # %>%, filter, mutate, select
    library(nlstools) # nlsBoot

    … and obtain the same data (in PSalmonAK.csv available from here) used in the Stock-Recruitment chapter of my Introductory Fisheries Analyses with R (IFAR) book. Note that I created two new variables here: retperesc is the “recruits per spawner” and logretperesc is the natural log of the recruits per spawner variable.

pinks <- read.csv("data/PSalmonAK.csv") %>%
  filter(!, ! %>%
  mutate(retperesc = ret/esc,
         logretperesc = log(retperesc))
headtail(pinks)
##    year   esc    ret   logret retperesc logretperesc
## 1  1960 1.418  2.446 0.894454  1.724965    0.5452066
## 2  1961 2.835 14.934 2.703640  5.267725    1.6615986
## 3  1962 1.957 10.031 2.305680  5.125703    1.6342676
## 28 1987 4.289 18.215 2.902245  4.246911    1.4461918
## 29 1988 2.892  9.461 2.247178  3.271438    1.1852298
## 30 1989 4.577 23.359 3.150982  5.103561    1.6299386

Dynamic Plot Issue

The first question, about the dynamic plot, is due to a change in functionality in the FSA package since IFAR was published. The dynamicPlot= argument was removed because the code for that argument relied on the tcltk package, which I found difficult to reliably support. A similar, though more manual, approach is accomplished with the new fixed= and plot= arguments. For example, using plot=TRUE (without fixed=) produces a plot of "recruits" versus "stock" with the chosen stock-recruitment model, evaluated at the automatically chosen parameter starting values, superimposed.

    svR <- srStarts(ret~esc,data=pinks,type="Ricker",plot=TRUE)

    The user, however, can show the stock-recruitment model evaluated at manually chosen parameter starting values by including those starting values in a list that is supplied to fixed=. These values can be iteratively changed in subsequent calls to srStarts() to manually find starting values that provide a model that reasonably fits (by eye) the stock-recruit data.

svR <- srStarts(ret~esc,data=pinks,type="Ricker",plot=TRUE,
                fixed=list(a=...,b=...))  # replace ... with manually chosen starting values

Note, however, that srStarts() no longer supports the simultaneous plotting of recruits versus spawners and recruits per spawner versus spawners.

    Plot of Recruits per Spawner versus Spawners

    The first way that I can imagine plotting recruits per spawners versus spawners with the fitted curve and confidence bands is to first follow the code for fitting the stock-recruit function to the stock and recruit data as described in IFAR. In this case, the stock-recruit function is fit on the log scale to adjust for a multiplicative error structure (as described in IFAR).
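The reason for fitting on the log scale can be sketched with simulated data: a multiplicative lognormal error becomes additive (and roughly normal, with constant spread) after taking logs.

set.seed(1)
err <- exp(rnorm(1000, mean = 0, sd = 0.3))  # multiplicative lognormal errors
R <- 10 * err                                # "recruits" = mean level times error
c(mean(log(R)), sd(log(R)))                  # close to log(10) = 2.303 and 0.3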

    rckr <- srFuns("Ricker")
    srR <- nls(logret~log(rckr(esc,a,b)),data=pinks,start=svR)
    bootR <- nlsBoot(srR)
cbind(estimates=coef(srR),confint(bootR))
##     estimates     95% LCI   95% UCI
## a  2.84924206  1.74768271 4.8435727
## b  0.05516673 -0.08443862 0.2158164

    As described in the book, the plot of spawners versus recruits is made by (i) constructing a sequence of “x” values that span the range of observed numbers of spawners, (ii) predicting the number of recruits at each spawner value using the best-fit stock-recruitment model, (iii) constructing lower and upper confidence bounds for the predicted number of recruits at each spawner value with the bootstrap results, (iv) making a schematic plot on which to put (v) a polygon for the confidence band, (vi) the raw data points, and (vii) the best-fit curve. The code below follows these steps and reproduces Figure 12.4 in the book.

x <- seq(0,9,length.out=199)        # many S for prediction
pR <- rckr(x,a=coef(srR))           # predicted mean R
LCI <- UCI <- numeric(length(x))

for(i in 1:length(x)) {             # CIs for mean R @ each S
  tmp <- apply(bootR$coefboot,MARGIN=1,FUN=rckr,S=x[i])
  LCI[i] <- quantile(tmp,0.025)
  UCI[i] <- quantile(tmp,0.975)
}

ylmts <- range(c(pR,LCI,UCI,pinks$ret))
xlmts <- range(c(x,pinks$esc))

plot(ret~esc,data=pinks,xlim=xlmts,ylim=ylmts,col="white",
     ylab="Returners (millions)",
     xlab="Escapement (millions)")
polygon(c(x,rev(x)),c(LCI,rev(UCI)),col="gray80",border=NA)  # confidence band
points(ret~esc,data=pinks,pch=19,col=rgb(0,0,0,1/2))         # raw data
lines(pR~x,lwd=2)                                            # best-fit curve

    These results can be modified to plot recruits per spawner versus spawners by replacing the “recruits” in the code above with “recruits per spawner.” This is simple for the actual data as ret is simply replaced with retperesc. However, the predicted number of recruits (in pR) and the confidence bounds (in LCI and UCI) from above must be divided by the number of spawners (in x). As the / symbol has a special meaning in R formulas, this division must be contained within I() as when it appears in a formula (see the lines() code below). Of course, the y-axis scale range must also be adjusted. Thus, a plot of recruits per spawner versus spawners is produced from the previous results with the following code.
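The special meaning of / (and the role of I()) is easy to demonstrate outside of plotting: in a formula, x/z denotes nesting, while I(x/z) forces arithmetic division. A toy fit (made-up numbers):

x <- c(1, 2, 3, 4); z <- c(1, 1, 2, 2); y <- c(2, 4, 3, 5)
# Inside I() the / is ordinary division, so the model uses the ratio x/z;
# without I() the formula parser would treat / as the nesting operator.
fit <- lm(y ~ I(x/z))
names(coef(fit))   # "(Intercept)" "I(x/z)"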

    ylmts <- c(0.7,7)
    ## NOTE: this plot call was also partially lost in extraction; it is
    ## reconstructed with recruits divided by spawners for the y-axis
    plot(retperesc~esc,data=pinks,type="n",xlim=xlmts,ylim=ylmts,
         ylab="Returners/Escapement",
         xlab="Escapement (millions)")
    polygon(c(x,rev(x)),c(LCI/x,rev(UCI/x)),col="gray80",border=NA)
    points(retperesc~esc,data=pinks,pch=19)
    lines(I(pR/x)~x,lwd=2)

    Alternatively, the Ricker stock-recruitment model could be reparameterized by dividing each side of the function by “spawners” such that the right-hand side becomes “recruits per spawner” (this is a fairly typical reparameterization of the model). This model can be put into an R function, with model parameters then estimated with nonlinear regression as above. The results below show that the parameter point estimates are identical and the bootstrapped confidence intervals are similar to what was obtained above.

    rckr2 <- function(S,a,b=NULL) {
      if (length(a)>1) { b <- a[[2]]; a <- a[[1]] }
      a*exp(-b*S)  # recruits per spawner (return line reconstructed from the Ricker model)
    }
    srR2 <- nls(logretperesc~log(rckr2(esc,a,b)),data=pinks,start=svR)
    bootR2 <- nlsBoot(srR2)
    cbind(estimates=coef(srR2),confint(bootR2)) ## estimates 95% LCI 95% UCI
    ## a 2.84924202 1.67734916 4.8613811
    ## b 0.05516673 -0.08776123 0.2040402

    With this, a second method for plotting recruits per spawner versus spawners follows the same construction as the main plot from the book, modified to use the results from this “new” model.

    x <- seq(0,9,length.out=199) # many S for prediction
    pRperS <- rckr2(x,a=coef(srR2)) # predicted mean RperS
    LCI2 <- UCI2 <- numeric(length(x))

    for(i in 1:length(x)) { # CIs for mean RperS @ each S
      tmp <- apply(bootR2$coefboot,MARGIN=1,FUN=rckr2,S=x[i])
      LCI2[i] <- quantile(tmp,0.025)
      UCI2[i] <- quantile(tmp,0.975)
    }
    ylmts <- range(c(pRperS,LCI2,UCI2,pinks$retperesc))
    xlmts <- range(c(x,pinks$esc))

    ## NOTE: this plot call was partially lost in extraction and is
    ## reconstructed along the same steps as the earlier plots
    plot(retperesc~esc,data=pinks,type="n",xlim=xlmts,ylim=ylmts,
         ylab="Returners/Escapement",
         xlab="Escapement (millions)")
    polygon(c(x,rev(x)),c(LCI2,rev(UCI2)),col="gray80",border=NA)
    points(retperesc~esc,data=pinks,pch=19)
    lines(pRperS~x,lwd=2)

    The two methods described above for plotting recruits per spawner versus spawners are identical for the best-fit curve and nearly identical for the confidence bounds (slight differences are likely due to the randomness inherent in bootstrapping). Thus, the two methods produce nearly the same visual.



    To leave a comment for the author, please follow the link and comment on their blog: fishR Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

    Explaining Predictions of Machine Learning Models with LIME – Münster Data Science Meetup

    Tue, 12/12/2017 - 01:00

    (This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

    Slides from Münster Data Science Meetup

    These are my slides from the Münster Data Science Meetup on December 12th, 2017.

    My sketchnotes were collected from these two podcasts:

    Sketchnotes: TWiML Talk #7 with Carlos Guestrin – Explaining the Predictions of Machine Learning Models & Data Skeptic Podcast – Trusting Machine Learning Models with Lime

    Example Code
    • the following libraries were loaded:
    library(tidyverse)  # for tidy data analysis
    library(farff)      # for reading arff files
    library(missForest) # for imputing missing values
    library(dummies)    # for creating dummy variables
    library(caret)      # for modeling
    library(lime)       # for explaining predictions

    Data

    The Chronic Kidney Disease dataset was downloaded from UC Irvine’s Machine Learning repository:

    data_file <- file.path("path/to/chronic_kidney_disease_full.arff")
    • load data with the farff package
    data <- readARFF(data_file)

    Features
    • age – age
    • bp – blood pressure
    • sg – specific gravity
    • al – albumin
    • su – sugar
    • rbc – red blood cells
    • pc – pus cell
    • pcc – pus cell clumps
    • ba – bacteria
    • bgr – blood glucose random
    • bu – blood urea
    • sc – serum creatinine
    • sod – sodium
    • pot – potassium
    • hemo – hemoglobin
    • pcv – packed cell volume
    • wc – white blood cell count
    • rc – red blood cell count
    • htn – hypertension
    • dm – diabetes mellitus
    • cad – coronary artery disease
    • appet – appetite
    • pe – pedal edema
    • ane – anemia
    • class – class
    Missing data
    • impute missing data with Nonparametric Missing Value Imputation using Random Forest (missForest package)
    data_imp <- missForest(data)

    One-hot encoding
    • create dummy variables
    • scale and center
    data_imp_final <- data_imp$ximp
    # function name partly lost in extraction; dummies::dummy.data.frame assumed
    data_dummy <- dummy.data.frame(dplyr::select(data_imp_final, -class), sep = "_")
    data <- cbind(dplyr::select(data_imp_final, class),
                  scale(data_dummy,
                        center = apply(data_dummy, 2, min),
                        scale = apply(data_dummy, 2, max)))

    Modeling

    # training and test set
    set.seed(42)
    index <- createDataPartition(data$class, p = 0.9, list = FALSE)
    train_data <- data[index, ]
    test_data <- data[-index, ]

    # modeling
    model_rf <- caret::train(class ~ .,
                             data = train_data,
                             method = "rf", # random forest
                             trControl = trainControl(method = "repeatedcv",
                                                      number = 10,
                                                      repeats = 5,
                                                      verboseIter = FALSE))
    model_rf
    ## Random Forest
    ##
    ## 360 samples
    ##  48 predictor
    ##   2 classes: 'ckd', 'notckd'
    ##
    ## No pre-processing
    ## Resampling: Cross-Validated (10 fold, repeated 5 times)
    ## Summary of sample sizes: 324, 324, 324, 324, 325, 324, ...
    ## Resampling results across tuning parameters:
    ##
    ##   mtry  Accuracy   Kappa
    ##    2    0.9922647  0.9838466
    ##   25    0.9917392  0.9826070
    ##   48    0.9872930  0.9729881
    ##
    ## Accuracy was used to select the optimal model using the largest value.
    ## The final value used for the model was mtry = 2.

    # predictions
    pred <- data.frame(sample_id = 1:nrow(test_data),
                       predict(model_rf, test_data, type = "prob"),
                       actual = test_data$class) %>%
      mutate(prediction = colnames(.)[2:3][apply(.[, 2:3], 1, which.max)],
             correct = ifelse(actual == prediction, "correct", "wrong"))

    confusionMatrix(pred$actual, pred$prediction)
    ## Confusion Matrix and Statistics
    ##
    ##           Reference
    ## Prediction ckd notckd
    ##     ckd     23      2
    ##     notckd   0     15
    ##
    ##                Accuracy : 0.95
    ##                  95% CI : (0.8308, 0.9939)
    ##     No Information Rate : 0.575
    ##     P-Value [Acc > NIR] : 1.113e-07
    ##
    ##                   Kappa : 0.8961
    ##  Mcnemar's Test P-Value : 0.4795
    ##
    ##             Sensitivity : 1.0000
    ##             Specificity : 0.8824
    ##          Pos Pred Value : 0.9200
    ##          Neg Pred Value : 1.0000
    ##              Prevalence : 0.5750
    ##          Detection Rate : 0.5750
    ##    Detection Prevalence : 0.6250
    ##       Balanced Accuracy : 0.9412
    ##
    ##        'Positive' Class : ckd

    LIME
    • LIME needs data without response variable
    train_x <- dplyr::select(train_data, -class)
    test_x <- dplyr::select(test_data, -class)
    train_y <- dplyr::select(train_data, class)
    test_y <- dplyr::select(test_data, class)
    • build explainer
    explainer <- lime(train_x, model_rf, n_bins = 5, quantile_bins = TRUE)
    • run explain() function
    explanation_df <- lime::explain(test_x, explainer, n_labels = 1, n_features = 8, n_permutations = 1000, feature_select = "forward_selection")
    • model reliability
    explanation_df %>% ggplot(aes(x = model_r2, fill = label)) + geom_density(alpha = 0.5)

    • plot explanations
    plot_features(explanation_df[1:24, ], ncol = 1)

    Session Info ## Session info ------------------------------------------------------------- ## setting value ## version R version 3.4.2 (2017-09-28) ## system x86_64, darwin15.6.0 ## ui X11 ## language (EN) ## collate de_DE.UTF-8 ## tz ## date 2017-12-12 ## Packages ----------------------------------------------------------------- ## package * version date source ## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0) ## backports 1.1.1 2017-09-25 CRAN (R 3.4.2) ## base * 3.4.2 2017-10-04 local ## BBmisc 1.11 2017-03-10 CRAN (R 3.4.0) ## bindr 0.1 2016-11-13 CRAN (R 3.4.0) ## bindrcpp * 0.2 2017-06-17 CRAN (R 3.4.0) ## blogdown 0.3 2017-11-13 CRAN (R 3.4.2) ## bookdown 0.5 2017-08-20 CRAN (R 3.4.1) ## broom 0.4.3 2017-11-20 CRAN (R 3.4.2) ## caret * 6.0-77 2017-09-07 CRAN (R 3.4.1) ## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.0) ## checkmate 1.8.5 2017-10-24 CRAN (R 3.4.2) ## class 7.3-14 2015-08-30 CRAN (R 3.4.2) ## cli 1.0.0 2017-11-05 CRAN (R 3.4.2) ## codetools 0.2-15 2016-10-05 CRAN (R 3.4.2) ## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0) ## compiler 3.4.2 2017-10-04 local ## crayon 1.3.4 2017-09-16 cran (@1.3.4) ## CVST 0.2-1 2013-12-10 CRAN (R 3.4.0) ## datasets * 3.4.2 2017-10-04 local ## ddalpha 1.3.1 2017-09-27 CRAN (R 3.4.2) ## DEoptimR 1.0-8 2016-11-19 CRAN (R 3.4.0) ## devtools 1.13.4 2017-11-09 CRAN (R 3.4.2) ## digest 0.6.12 2017-01-27 CRAN (R 3.4.0) ## dimRed 0.1.0 2017-05-04 CRAN (R 3.4.0) ## dplyr * 0.7.4 2017-09-28 CRAN (R 3.4.2) ## DRR 0.0.2 2016-09-15 CRAN (R 3.4.0) ## dummies * 1.5.6 2012-06-14 CRAN (R 3.4.0) ## e1071 1.6-8 2017-02-02 CRAN (R 3.4.0) ## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.0) ## farff * 1.0 2016-09-11 CRAN (R 3.4.0) ## forcats * 0.2.0 2017-01-23 CRAN (R 3.4.0) ## foreach * 1.4.3 2015-10-13 CRAN (R 3.4.0) ## foreign 0.8-69 2017-06-22 CRAN (R 3.4.1) ## ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.0) ## glmnet 2.0-13 2017-09-22 CRAN (R 3.4.2) ## glue 1.2.0 2017-10-29 CRAN (R 3.4.2) ## gower 0.1.2 2017-02-23 CRAN (R 3.4.0) ## graphics * 3.4.2 
2017-10-04 local ## grDevices * 3.4.2 2017-10-04 local ## grid 3.4.2 2017-10-04 local ## gtable 0.2.0 2016-02-26 CRAN (R 3.4.0) ## haven 1.1.0 2017-07-09 CRAN (R 3.4.0) ## hms 0.4.0 2017-11-23 CRAN (R 3.4.3) ## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0) ## htmlwidgets 0.9 2017-07-10 CRAN (R 3.4.1) ## httpuv 1.3.5 2017-07-04 CRAN (R 3.4.1) ## httr 1.3.1 2017-08-20 CRAN (R 3.4.1) ## ipred 0.9-6 2017-03-01 CRAN (R 3.4.0) ## iterators * 1.0.8 2015-10-13 CRAN (R 3.4.0) ## itertools * 0.1-3 2014-03-12 CRAN (R 3.4.0) ## jsonlite 1.5 2017-06-01 CRAN (R 3.4.0) ## kernlab 0.9-25 2016-10-03 CRAN (R 3.4.0) ## knitr 1.17 2017-08-10 CRAN (R 3.4.1) ## labeling 0.3 2014-08-23 CRAN (R 3.4.0) ## lattice * 0.20-35 2017-03-25 CRAN (R 3.4.2) ## lava 1.5.1 2017-09-27 CRAN (R 3.4.1) ## lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2) ## lime * 0.3.1 2017-11-24 CRAN (R 3.4.3) ## lubridate 1.7.1 2017-11-03 CRAN (R 3.4.2) ## magrittr 1.5 2014-11-22 CRAN (R 3.4.0) ## MASS 7.3-47 2017-02-26 CRAN (R 3.4.2) ## Matrix 1.2-12 2017-11-15 CRAN (R 3.4.2) ## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0) ## methods * 3.4.2 2017-10-04 local ## mime 0.5 2016-07-07 CRAN (R 3.4.0) ## missForest * 1.4 2013-12-31 CRAN (R 3.4.0) ## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.0) ## ModelMetrics 1.1.0 2016-08-26 CRAN (R 3.4.0) ## modelr 0.1.1 2017-07-24 CRAN (R 3.4.1) ## munsell 0.4.3 2016-02-13 CRAN (R 3.4.0) ## nlme 3.1-131 2017-02-06 CRAN (R 3.4.2) ## nnet 7.3-12 2016-02-02 CRAN (R 3.4.2) ## parallel 3.4.2 2017-10-04 local ## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.0) ## plyr 1.8.4 2016-06-08 CRAN (R 3.4.0) ## prodlim 1.6.1 2017-03-06 CRAN (R 3.4.0) ## psych 1.7.8 2017-09-09 CRAN (R 3.4.1) ## purrr * 0.2.4 2017-10-18 CRAN (R 3.4.2) ## R6 2.2.2 2017-06-17 CRAN (R 3.4.0) ## randomForest * 4.6-12 2015-10-07 CRAN (R 3.4.0) ## Rcpp 0.12.14 2017-11-23 CRAN (R 3.4.3) ## RcppRoll 0.2.2 2015-04-05 CRAN (R 3.4.0) ## readr * 1.1.1 2017-05-16 CRAN (R 3.4.0) ## readxl 1.0.0 2017-04-18 CRAN (R 3.4.0) ## recipes 0.1.1 2017-11-20 CRAN (R 3.4.3) 
## reshape2 1.4.2 2016-10-22 CRAN (R 3.4.0) ## rlang 0.1.4 2017-11-05 CRAN (R 3.4.2) ## rmarkdown 1.8 2017-11-17 CRAN (R 3.4.2) ## robustbase 0.92-8 2017-11-01 CRAN (R 3.4.2) ## rpart 4.1-11 2017-03-13 CRAN (R 3.4.2) ## rprojroot 1.2 2017-01-16 CRAN (R 3.4.0) ## rstudioapi 0.7 2017-09-07 CRAN (R 3.4.1) ## rvest 0.3.2 2016-06-17 CRAN (R 3.4.0) ## scales 0.5.0 2017-08-24 CRAN (R 3.4.1) ## sfsmisc 1.1-1 2017-06-08 CRAN (R 3.4.0) ## shiny 1.0.5 2017-08-23 CRAN (R 3.4.1) ## shinythemes 1.1.1 2016-10-12 CRAN (R 3.4.0) ## splines 3.4.2 2017-10-04 local ## stats * 3.4.2 2017-10-04 local ## stats4 3.4.2 2017-10-04 local ## stringdist 2017-07-31 CRAN (R 3.4.1) ## stringi 1.1.6 2017-11-17 CRAN (R 3.4.2) ## stringr * 1.2.0 2017-02-18 CRAN (R 3.4.0) ## survival 2.41-3 2017-04-04 CRAN (R 3.4.0) ## tibble * 1.3.4 2017-08-22 CRAN (R 3.4.1) ## tidyr * 0.7.2 2017-10-16 CRAN (R 3.4.2) ## tidyselect 0.2.3 2017-11-06 CRAN (R 3.4.2) ## tidyverse * 1.2.1 2017-11-14 CRAN (R 3.4.2) ## timeDate 3042.101 2017-11-16 CRAN (R 3.4.2) ## tools 3.4.2 2017-10-04 local ## utils * 3.4.2 2017-10-04 local ## withr 2.1.0 2017-11-01 CRAN (R 3.4.2) ## xml2 1.1.1 2017-01-24 CRAN (R 3.4.0) ## xtable 1.8-2 2016-02-05 CRAN (R 3.4.0) ## yaml 2.1.15 2017-12-01 CRAN (R 3.4.3)

    To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

    Upcoming R conferences (2018)

    Tue, 12/12/2017 - 01:00

    (This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers)

    It’s that time of year when we need to start thinking about what R Conferences we would like to (and can!) attend. To help plan your (ahem) work trips, we thought it would be useful to list the upcoming main attractions.

    We maintain a list of upcoming rstats conferences. To keep up to date, just follow our twitter bot.

    rstudio::conf (San Diego, USA)

    rstudio::conf is about all things R and RStudio

    The next RStudio conference will be held in San Diego, and boasts top-quality speakers from around the R world. With excellent tutorials and great talks, this is certainly one of the top events. The estimated cost is around $200 per day, so it is not the cheapest option, but it is worthwhile.

    SatRday (Cape Town, South Africa)

    An opportunity to hear from and network with top Researchers, Data Scientists and Developers from the R community in South Africa and beyond.

    This workshop combines an amazing location with great speakers at an amazing price (only $70 per day). The keynote speakers are Maëlle Salmon and Stephanie Kovalchik. This year's SatRday has two tutorials, one on package building and the other on sports modelling.

    • March 17th: SatRday. Cape Town, South Africa.

    The European R Users Meeting, eRum, is an international conference that aims at integrating users of the R language living in Europe. This conference is similar to useR!.

    R/Finance 2018 (Chicago, USA)

    Since the inaugural conference in 2009, the annual R/Finance conference in Chicago has become the primary meeting for academics and practitioners interested in using R in finance. Participants from academia and industry mingle for two days to exchange ideas about current research, best practices and applications. A single-track program permits continued focus on a series of refereed submissions. We hear a lively social program rounds out the event.

    useR! 2018 (Brisbane, Australia)

    With useR! 2017 spanning 5 days and boasting some of the biggest names in data science, the next installment of the useR! series is sure to be even better. Last year the program was packed with talks and tutorials from industry and academia. Each day typically features numerous tutorials and ends with a keynote speech followed by dinner(!). Registration opens in January 2018.

    Noreast’R Conference (North East USA)

    Born on Twitter, the Noreast’R Conference is a grass roots effort to organize a regional #rstats conference in the Northeastern United States. Work is currently under way and further details will follow on the Noreast’R website and twitter page (linked below).

    Image credit

    To leave a comment for the author, please follow the link and comment on their blog: R on The Jumping Rivers Blog.

    Random walking

    Tue, 12/12/2017 - 00:43

    (This article was first published on R – Decision Science News, and kindly contributed to R-bloggers)


    I was sitting in a bagel shop on Saturday with my 9 year old daughter. We had brought along hexagonal graph paper and a six sided die. We decided that we would choose a hexagon in the middle of the page and then roll the die to determine a direction:

    1 up (North)
    2 diagonal to the upper right (Northeast)
    3 diagonal to the lower right (Southeast)
    4 down (South)
    5 diagonal to the lower left (Southwest)
    6 diagonal to the upper left (Northwest)
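    The die-to-direction mapping above lends itself to a small lookup table. Here is a minimal R sketch (the coordinate convention for the hexagonal grid is my assumption, not taken from the post):

    ```r
    # Map die faces 1-6 to displacement vectors on a hexagonal grid
    # (offset coordinates assumed for illustration)
    dirs <- matrix(c( 0,  1,    # 1: North
                      1,  0.5,  # 2: Northeast
                      1, -0.5,  # 3: Southeast
                      0, -1,    # 4: South
                     -1, -0.5,  # 5: Southwest
                     -1,  0.5), # 6: Northwest
                   ncol = 2, byrow = TRUE)

    set.seed(1)
    rolls <- sample(1:6, 50, replace = TRUE)  # ~50 rolls, as in the bagel shop
    path <- apply(dirs[rolls, , drop = FALSE], 2, cumsum)  # cumulative position
    ```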

    Our first roll was a six so we drew a line to the hexagon northwest of where we started. That was the first “step.”

    After a few rolls we found ourselves coming back along a path we had gone down before. We decided to draw a second line close to the first in those cases.

    We did this about 50 times. The results are pictured above, along with kid hands for scale.

    I sent the picture to my friend and serial co-author Jake Hofman because he likes a good kid's science project and has a mental association for everything in applied math. He wrote “time for some Brownian motion?” and sent a talk he'd given a decade ago at a high school, which taught me all kinds of stuff I didn't realize, connecting random walks to Brownian motion, Einstein, the existence of atoms, and binomial pricing trees in finance. (I was especially embarrassed not to have mentally connected random walks to binomial pricing because I had a slide on that in my job talk years ago and because it is the method we used in an early distribution builder paper.)

    Back at home Jake did some simulations on random walks in one dimension (in which you just go forward or backward with equal probability) and sent them to me. Next, I did the same with hexagonal random walks (code at the end of this post). Here’s an image of one random walk on a hexagonal field.

    I simulated 5000 random walks of 250 steps, starting at the point 0,0. The average X and Y position is 0 at each step, as shown here.

    This might seem strange at first. But think about many walks of just one step. The number of one-step journeys in which your X position is increased a certain amount will be matched, in expectation, by an equal number of one-step journeys in which your X position is decreased by the same amount. Your average X position is thus 0 at the first step. Same is true for Y. The logic scales when you take two or more steps and that’s why we see the flat lines we do.
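    A quick simulation (my own sketch, using X displacements implied by a hexagonal die mapping, which is an assumption) confirms the zero-mean behaviour:

    ```r
    # Average final X position over many hexagonal walks should be near 0
    set.seed(42)
    n_walks <- 5000; n_steps <- 250
    dx <- c(0, 1, 1, 0, -1, -1)  # assumed X displacement for die faces 1-6
    x_final <- replicate(n_walks,
                         sum(sample(dx, n_steps, replace = TRUE)))
    mean(x_final)  # close to 0
    ```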

    If you think about this wrongheadedly you’d think you weren’t getting anywhere. But of course you are. Let’s look at your average distance from the starting point at each step (below).

    The longer you walk, the more distant from the starting point you tend to be. Because distances are positive, the average of those distances is positive. We say you “tend to” move away from the origin at each step, because that is what happens on average over many trips. At any given step on any given trip, you could move towards or away from the starting point with equal probability. This is deep stuff.

    Speaking of deep stuff, you might notice that the relationship is pretty. Let’s zoom in.

    The dashed line is the square root of the number of steps. It’s interesting to note that this square root relationship happens in a one-dimensional random walk as well. There’s a good explanation of it in this document. As Jake put it, it’s as if the average walk is covered by a circular plate whose area grows linearly with the number of steps. (Why linearly? Because area of a circle is proportional to its radius squared. Since the radius grows as the square root of the number of steps, the radius squared is linear in the number of steps)

    (*) As a sidenote, I was at first seeing something that grew more slowly than the square root and couldn't figure out what the relationship was. It turns out that the square root relationship holds for the root mean squared distance (the square root of the mean of the squared distances) and I had been looking at the mean Euclidean distance. It's a useful reminder that the term “average” has quite a few definitions. “Average” is a useful term for getting the gist across, but can lead to some confusion.
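    A one-dimensional sketch of my own (not from the post) makes the distinction concrete: the root mean squared distance tracks the square root of the number of steps, while the mean absolute distance is smaller by a factor of roughly sqrt(2/pi):

    ```r
    # Compare RMS distance with mean absolute distance for 1-D random walks
    set.seed(7)
    n_walks <- 2000; n_steps <- 400
    finals <- replicate(n_walks,
                        sum(sample(c(-1, 1), n_steps, replace = TRUE)))
    rms <- sqrt(mean(finals^2))  # approximately sqrt(400) = 20
    mabs <- mean(abs(finals))    # approximately 20 * sqrt(2/pi)
    ```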

    Speaking of gists, here’s the R code. Thanks to @hadleywickham for creating the tidyverse and making everything awesome.


    The post Random walking appeared first on Decision Science News.


    To leave a comment for the author, please follow the link and comment on their blog: R – Decision Science News.

    Personal note on joining the Microsoft Cloud Advocates team

    Mon, 12/11/2017 - 23:29

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    A quick personal note: today is my first day as a member of the Cloud Developer Advocates team at Microsoft! I'll still be blogging and attending events related to R, and supporting the R community, but now I'll be doing it as a member of a team dedicated to community outreach.

    As a bit of background, when I joined Microsoft back in 2015 via the acquisition of Revolution Analytics, I was thrilled to be able to continue my role supporting the R community. Since then, Microsoft as a whole has continued to ramp up its support of open source projects and to interact directly with developers of all stripes (including data scientists!) through various initiatives across the company. (Aside: I knew Microsoft was a big company before I joined, but even then it took me a while to appreciate the scale of the different divisions, groups, and geographies. For me, it was a bit like moving to a new city in the sense that it takes a while to learn what the neighborhoods are and how to find your way around.)

    I learned of the Cloud Developer Advocates group after reading a RedMonk blog post about how some prominent techies (many of whom I've long admired and respected on Twitter) had been recruited into Microsoft to engage directly with developers. So when I learned that there was an entire group at Microsoft devoted to community outreach, and with such a fantastic roster already on board (and still growing!), I knew I had to be a part of it. I'll be working with a dedicated team (including Paige, Seth and Vadim) focused on data science, machine learning, and AI. As I mentioned above, I'll still be covering R, but will also be branching out into some other areas as well. 

    Stay tuned for more, and please hit me up on Twitter or via email if you'd like to chat, want to connect at an event, or just let me know how I can help. Let's keep the conversation going!


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    Using the tuber package to analyse a YouTube channel

    Mon, 12/11/2017 - 22:02

    (This article was first published on R – insightR, and kindly contributed to R-bloggers)

    By Gabriel Vasconcelos

    So I decided to have a quick look at the tuber package to extract YouTube data in R. My cousin is a singer (a hell of a good one) and he has a YouTube channel (dan vasc), which I strongly recommend, where he posts his covers. I will focus my analysis on his channel. The tuber package is very friendly and it downloads YouTube statistics on comments, views, likes and more straight to R using the YouTube API.

    First let us look at some general information on the channel (code for replication at the end of the text). The table below shows the number of followers, views, videos, etc. at the moment I downloaded the data (2017-12-12, 11:20pm). If you run the code on your computer the results may be different because the channel will have more activity. Dan's channel is getting close to 1 million views and he has 58 times more likes than dislikes. The channel averages around 13,000 views per video.


    Channel   Subscriptions  Views   Videos  Likes  Dislikes  Comments
    Dan Vasc           5127  743322      57   9008       155      1993


    We can also see some of the same statistics for each video. I selected only videos published after January 2016, which is when the channel became more active. The list has 29 videos. You can see that the channel became even more active in 2017. In the last month it moved to weekly publications.


    date        viewCount  likeCount  dislikeCount  commentCount  title
    2016-03-09      95288       1968            53           371  “Heart Of Steel” – MANOWAR cover
    2016-05-09      13959        556             6            85  “The Sound Of Silence” – SIMON & GARFUNKEL / DISTURBED cover
    2016-07-04       9390        375             6            70  One Man Choir – Handel’s Hallelujah
    2016-08-16      19146        598            12            98  “Carry On” – MANOWAR cover
    2016-09-12       2524        142             0            21  “You Are Loved (Don’t Give Up)” – JOSH GROBAN cover
    2016-09-26       6584        310             4            58  “Hearts On Fire” – HAMMERFALL cover
    2016-10-26      10335        354             5            69  “Dawn Of Victory” – RHAPSODY OF FIRE cover
    2017-04-28       9560        396             5            89  “I Don’t Wanna Miss A Thing” – AEROSMITH cover
    2017-05-09        906         99             1            40  State of affairs
    2017-05-26       2862        160             4            39  “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (Japanese)
    2017-05-26       3026        235             3            62  “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (Português)
    2017-05-26       2682        108             2            14  “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (English)
    2017-06-14        559         44             1            19  HOW TO BE A YOUTUBE SINGER | ASKDANVASC 01
    2017-06-17        206         16             0             2  Promotional Live || Q&A and video games
    2017-07-18       3368        303             2            47  “The Bard’s Song” – BLIND GUARDIAN cover (SPYGLASS INN project)
    2017-07-23       6717        350            14            51  “Numb” – LINKIN PARK cover (R.I.P. CHESTER)
    2017-07-27        305         29             0            11  THE PERFECT TAKE and HOW TO MIX VOCALS | ASKDANVASC 02
    2017-08-04        515         69             1            23  4000 Subscribers and Second Channel
    2017-08-10       6518        365             2           120  “Hello” – ADELE cover [ROCK VERSION]
    2017-08-27       2174        133             5            28  “The Rains Of Castamere” (The Lannister Song) – GAME OF THRONES cover
    2017-08-31      18251        642            10           172  “Africa” – TOTO cover
    2017-09-24       2562        236             6            56  “Chop Suey!” – SYSTEM OF A DOWN cover
    2017-10-09       1348        168             1            48  “An American Trilogy” – ELVIS PRESLEY cover
    2017-11-08       2270        192             2            59  “Beauty And The Beast” – Main Theme cover | Feat. Alina Lesnik
    2017-11-16       2589        339             3            95  “Bohemian Rhapsody” – QUEEN cover
    2017-11-23       1857        209             2            42  “The Phantom Of The Opera” – NIGHTWISH/ANDREW LLOYD WEBBER cover | Feat. Dragica
    2017-11-24       2202        207             2            56  “Back In Black” – AC/DC cover (RIP MALCOLM YOUNG)
    2017-11-30       3002        204             1            62  “Immigrant Song” – LED ZEPPELIN cover
    2017-12-07       1317        201             2            86  “Sweet Child O’ Mine” – GUNS N’ ROSES cover


    Now that we have seen the data, let's explore it to check for structure and information. The plots below show how likes, dislikes and comments are related to views. The positive relation is obvious. However, there is some degree of nonlinearity in likes and comments: the increment in likes and comments becomes smaller as views increase. The dislikes look more linear in the views, but the number of dislikes is too small to be sure.

    Another interesting piece of information is how comments are distributed over time for each video. I selected the four most recent videos and plotted the comment time-series below. All videos have a lot of activity in the first days, but it decreases fast a few days later. Followers and subscribers probably see the videos first, and they must be responsible for the intense activity at the beginning of each plot.

    The most important information might be how the channel grows over time. Dan's channel had two important moments in 2017: it became much more active in April, and it started having weekly publications in November. We can clearly see that both strategies worked in the plot below. I put two dashed lines to mark these two events. In April the number of comments increased a lot, and they increased even more in November.

    Finally, let’s have a look at what is in the comments using a WordCloud (wordcloud package). I removed words that are not informative such as “you, was, is, were” for English and Portuguese. The result is just below.


    Before using the tuber package you need an ID and a password from Google Developer Console. Click here for more information. If you are interested, the package tubern has some other tools to work with YouTube data such as generating reports.

    library(tuber)
    library(tidyverse)
    library(lubridate)
    library(stringi)
    library(wordcloud)
    library(gridExtra)

    httr::set_config(config(ssl_verifypeer = 0L)) # = Fixes some certificate problems on linux = #

    # = Authentication = #
    yt_oauth("ID", "PASS", token = "")

    # = Download and prepare data = #

    # = Channel stats = #
    chstat = get_channel_stats("UCbZRdTukTCjFan4onn04sDA")

    # = Videos = #
    videos = yt_search(term = "", type = "video", channel_id = "UCbZRdTukTCjFan4onn04sDA")
    videos = videos %>%
      mutate(date = as.Date(publishedAt)) %>%
      filter(date > "2016-01-01") %>%
      arrange(date)

    # = Comments = #
    comments = lapply(as.character(videos$video_id), function(x){
      get_comment_threads(c(video_id = x), max_results = 1000)
    })

    # = Prep the data = #

    # = Video Stat Table = #
    videostats = lapply(as.character(videos$video_id), function(x){
      get_stats(video_id = x)
    })
    # function name partly lost in extraction; is assumed
    videostats =, videostats)
    videostats$title = videos$title
    videostats$date = videos$date
    videostats = select(videostats, date, title, viewCount, likeCount,
                        dislikeCount, commentCount) %>%
      as.tibble() %>%
      mutate(viewCount = as.numeric(as.character(viewCount)),
             likeCount = as.numeric(as.character(likeCount)),
             dislikeCount = as.numeric(as.character(dislikeCount)),
             commentCount = as.numeric(as.character(commentCount)))

    # = General Stat Table = #
    genstat = data.frame(Channel = "Dan Vasc",
                         Subcriptions = chstat$statistics$subscriberCount,
                         Views = chstat$statistics$viewCount,
                         Videos = chstat$statistics$videoCount,
                         Likes = sum(videostats$likeCount),
                         Dislikes = sum(videostats$dislikeCount),
                         Comments = sum(videostats$commentCount))

    # = videostats Plot = #
    p1 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = likeCount))
    p2 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = dislikeCount))
    p3 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = commentCount))
    grid.arrange(p1, p2, p3, ncol = 2)

    # = Comments TS = #
    comments_ts = lapply(comments, function(x){
      as.Date(x$publishedAt)
    })
    comments_ts = tibble(date = as.Date(Reduce(c, comments_ts))) %>%
      group_by(date) %>%
      count()
    ggplot(data = comments_ts) +
      geom_line(aes(x = date, y = n)) +
      geom_smooth(aes(x = date, y = n), se = FALSE) +
      ggtitle("Comments by day") +
      geom_vline(xintercept = as.numeric(as.Date("2017-11-08")), linetype = 2, color = "red") +
      geom_vline(xintercept = as.numeric(as.Date("2017-04-28")), linetype = 2, color = "red")

    # = comments by video = #
    selected = (nrow(videostats) - 3):nrow(videostats)
    top4 = videostats$title[selected]
    top4comments = comments[selected]
    p = list()
    for(i in 1:4){
      df = top4comments[[i]]
      df$date = as.Date(df$publishedAt)
      df = df %>% arrange(date) %>%
        group_by(year(date), month(date), day(date)) %>%
        count()
      df$date = make_date(df$`year(date)`, df$`month(date)`, df$`day(date)`)
      p[[i]] = ggplot(data = df) + geom_line(aes(x = date, y = n)) + ggtitle(top4[i])
    }
    # function name partly lost in extraction; is assumed, p)

    ## = WordClouds = ##
    comments_text = lapply(comments, function(x){
      as.character(x$textOriginal)
    })
    comments_text = tibble(text = Reduce(c, comments_text)) %>%
      mutate(text = stri_trans_general(tolower(text), "Latin-ASCII"))
    remove = c("you","the","que","and","your","muito","this","that","are","for","cara",
               "from","very","like","have","voce","man","one","nao","com","with","mais",
               "was","can","uma","but","ficou","meu","really","seu","would","sua","more",
               "it's","it","is","all","i'm","mas","como","just","make","what","esse","how",
               "por","favor","sempre","time","esta","every","para","i've","tem","will",
               "you're","essa","not","faz","pelo","than","about","acho","isso",
               "way","also","aqui","been","out","say","should","when","did","mesmo",
               "minha","next","cha","pra","sei","sure","too","das","fazer","made",
               "quando","ver","cada","here","need","ter","don't","este","has","tambem",
               "una","want","ate","can't","could","dia","fiquei","num","seus","tinha","vez",
               "ainda","any","dos","even","get","must","other","sem","vai","agora","desde",
               "dessa","fez","many","most","tao","then","tudo","vou","ficaria","foi","pela",
               "see","teu","those","were")
    words = tibble(word = Reduce(c, stri_extract_all_words(comments_text$text))) %>%
      group_by(word) %>%
      count() %>%
      arrange(desc(n)) %>%
      filter(nchar(word) >= 3) %>%
      filter(n > 10 & word %in% remove == FALSE)
    set.seed(3)
    wordcloud(words$word, words$n, random.order = FALSE, random.color = TRUE,
              rot.per = 0.3, colors = 1:nrow(words))


To leave a comment for the author, please follow the link and comment on their blog: R – insightR. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

    Linear mixed-effect models in R

    Mon, 12/11/2017 - 20:01

    (This article was first published on poissonisfish, and kindly contributed to R-bloggers)

    Arabidopsis thaliana rosette. From: Wikimedia Commons

    Statistical models generally assume that

    1. All observations are independent from each other
2. The distribution of the residuals is normal with mean zero and constant variance, irrespective of the values taken by the dependent variable y

    When any of the two is not observed, more sophisticated modelling approaches are necessary. Let’s consider two hypothetical problems that violate the two respective assumptions, where y denotes the dependent variable:

A. Suppose you want to study the relationship between average income (y) and the educational level in the population of a town comprising four fully segregated blocks. You will sample 1,000 individuals irrespective of their blocks. If you model income as such, you neglect dependencies among observations – individuals from the same block are not independent, yielding residuals that correlate within block.

B. Suppose you want to study the relationship between anxiety (y) and the levels of triglycerides and uric acid in blood samples from 1,000 people, measured 10 times in the course of 24 hours. If you model anxiety as such, you will likely find that the variance of the residuals changes over time – this is an example of heteroscedasticity, a phenomenon characterized by heterogeneity in the variance of the residuals.

    In A. we have a problem of dependency caused by spatial correlation, whereas in B. we have a problem of heterogeneous variance. As a result, classic linear models cannot help in these hypothetical problems, but both can be addressed using linear mixed-effect models (LMMs). In rigour though, you do not need LMMs to address the second problem.

    LMMs are extraordinarily powerful, yet their complexity undermines the appreciation from a broader community. LMMs dissect hierarchical and / or longitudinal (i.e. time course) data by separating the variance due to random sampling from the main effects. In essence, on top of the fixed effects normally used in classic linear models, LMMs resolve i) correlated residuals by introducing random effects that account for differences among random samples, and ii) heterogeneous variance using specific variance functions, thereby improving the estimation accuracy and interpretation of fixed effects in one go.

    I personally reckon that most relevant textbooks and papers are hard to grasp for non-mathematicians. Therefore, following the brief reference in my last post on GWAS I will dedicate the present tutorial to LMMs. For further reading I highly recommend the ecology-oriented Zuur et al. (2009) and the R-intensive Gałecki et al. (2013) books, and this simple tutorial from Bodo Winter. For agronomic applications, H.-P. Piepho et al. (2003) is an excellent theoretical introduction.

    Here, we will build LMMs using the Arabidopsis dataset from the package lme4, from a study published by Banta et al. (2010). These data summarize variation in total fruit set per plant in Arabidopsis thaliana plants conditioned to fertilization and simulated herbivory. Our goal is to understand the effect of fertilization and simulated herbivory adjusted to experimental differences across groups of plants.

    Mixed-effect linear models

Whereas the classic linear model with n observational units and p predictors has the vectorized form

y = Xβ + ε

with the n × (p + 1) predictor matrix X, the vector β of p + 1 coefficient estimates and the n-long vectors of the response y and the residuals ε, LMMs additionally accommodate separate variance components modelled with a set of random effects u,

y = Xβ + Zu + ε

where X and Z are design matrices that jointly represent the set of predictors. Random effects models include only an intercept as the fixed effect and a defined set of random effects. Random effects comprise random intercepts and / or random slopes. Also, random effects might be crossed and nested. In terms of estimation, the classic linear model can be easily solved using the least-squares method. For the LMM, however, we need methods that rather than estimating β predict u, such as maximum likelihood (ML) and restricted maximum likelihood (REML). Bear in mind that unlike ML, REML assumes that the fixed effects are not known, hence it is comparatively unbiased (see Chapter 5 in Zuur et al. (2009) for more details). Unfortunately, LMMs too have underlying assumptions – both residuals and random effects should be normally distributed. Residuals in particular should also have a uniform variance over different values of the dependent variable, exactly as assumed in a classic linear model.
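As a quick illustration of the two design matrices, here is a minimal base-R sketch of the LMM data-generating process; all dimensions, effect sizes and group labels below are made up for the example and do not come from the Arabidopsis data.

```r
set.seed(42)

n_groups <- 5
n_per_group <- 20
n <- n_groups * n_per_group

group <- factor(rep(1:n_groups, each = n_per_group))
x <- runif(n)

beta  <- c(2, 1.5)              # fixed effects: intercept and slope
u     <- rnorm(n_groups, sd = 0.8) # random intercepts, one per group
sigma <- 0.3                    # residual standard deviation

# X and Z are the fixed- and random-effect design matrices
X <- model.matrix(~ x)          # n x (p + 1)
Z <- model.matrix(~ group - 1)  # n x n_groups indicator matrix

# The LMM data-generating process: y = X beta + Z u + residuals
y <- X %*% beta + Z %*% u + rnorm(n, sd = sigma)
```

Fitting such data with a plain lm would pool the five groups and leave block-wise correlated residuals, which is exactly what the random term Zu absorbs.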

One of the most common doubts concerning LMMs is determining whether a variable is random or fixed. First of all, an effect might be fixed, random or even both simultaneously – it largely depends on how you approach a given problem. Generally, you should consider all factors that qualify as sampling from a population as random effects (e.g. individuals in repeated measurements, cities within countries, field trials, plots, blocks, batches) and everything else as fixed. As a rule of thumb, i) factors with fewer than 5 levels should be considered fixed and conversely ii) factors with numerous levels should be considered random effects in order to increase the accuracy in the estimation of variance. Both points relate to the LMM assumption of having normally distributed random effects.

    Best linear unbiased estimators (BLUEs) and predictors (BLUPs) correspond to the values of fixed and random effects, respectively. The usage of the so-called genomic BLUPs (GBLUPs), for instance, elucidates the genetic merit of animal or plant genotypes that are regarded as random effects when trial conditions, e.g. location and year of trials are considered fixed. However, many studies sought the opposite, i.e. using breeding values as fixed effects and trial conditions as random, when the levels of the latter outnumber the former, chiefly because of point ii) outlined above. In GWAS, LMMs aid in teasing out population structure from the phenotypic measures.

    Let’s get started with R

We will follow a structure similar to the 10-step protocol outlined in Zuur et al. (2009): i) fit a full ordinary least squares model and run the diagnostics in order to understand if and what is faulty about its fit; ii) fit an identical generalized linear model (GLM) estimated with ML, to serve as a reference for subsequent LMMs; iii) deploy the first LMM by introducing random effects and compare to the GLM, optimize the random structure in subsequent LMMs; iv) optimize the fixed structure by determining the significance of fixed effects, always using ML estimation; finally, v) use REML estimation on the optimal model and interpret the results.

You need to have nlme and lme4 installed to proceed. We will first examine the structure of the Arabidopsis dataset.

# Install (if necessary) and load nlme and lme4
library(nlme)
library(lme4)
# Load dataset, inspect size and additional info
data(Arabidopsis)
dim(Arabidopsis) # 625 observations, 8 variables
?Arabidopsis
attach(Arabidopsis)

The Arabidopsis dataset describes 625 plants with respect to the following 8 variables (transcript from R):

    region: a factor with 3 levels NL (Netherlands), SP (Spain), SW (Sweden)

    population: a factor with the form n.R representing a population in region R

    genotype: a factor with 24 (numeric-valued) levels

rack: a nuisance factor with 2 levels, one for each of two greenhouse racks

nutrient: fertilization treatment/nutrient level (1, minimal nutrients or 8, added nutrients)

amd: simulated herbivory or “clipping” (apical meristem damage): unclipped (baseline) or clipped

status: a nuisance factor for germination method (Normal, Petri.Plate, or Transplant)

total.fruits: total fruit set per plant (integer), henceforth TFPP for short.

    We will now visualise the absolute frequencies in all 7 factors and the distribution for TFPP.

# Overview of the variables
par(mfrow = c(2, 4))
barplot(table(reg), ylab = "Frequency", main = "Region")
barplot(table(popu), ylab = "Frequency", main = "Population")
barplot(table(gen), ylab = "Frequency", las = 2, main = "Genotype")
barplot(table(rack), ylab = "Frequency", main = "Rack")
barplot(table(nutrient), ylab = "Frequency", main = "Nutrient")
barplot(table(amd), ylab = "Frequency", main = "AMD")
barplot(table(status), ylab = "Frequency", main = "Status")
hist(total.fruits, col = "grey", main = "Total fruits", xlab = NULL)

    The frequencies are overall balanced, perhaps except for status (i.e. germination method). Genotype, greenhouse rack and fertilizer are incorrectly interpreted as quantitative variables. In addition, the distribution of TFPP is right-skewed. As such, we will encode these three variables as categorical variables and log-transform TFPP to approximate a Gaussian distribution (natural logarithm).

# Transform the three factor variables gen, rack and nutrient
Arabidopsis[, c("gen", "rack", "nutrient")] <- lapply(Arabidopsis[, c("gen", "rack", "nutrient")], factor)
str(Arabidopsis)
# Re-attach after correction, ignore warnings
attach(Arabidopsis)
# Add 1 to total fruits, otherwise log of 0 will prompt error
total.fruits <- log(1 + total.fruits)
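The add-one shift used above is exactly what base R's log1p computes. Strictly speaking, log(0) returns -Inf rather than an error, but either way the shifted transform avoids unusable values:

```r
x <- c(0, 1, 10, 100)

# log(1 + x) and log1p(x) are the same transform
all.equal(log(1 + x), log1p(x))

# plain log(0) yields -Inf, not an error
is.infinite(log(0))
```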

    A closer look into the variables shows that each genotype is exclusive to a single region. The data contain no missing values.

# gen x popu table
table(gen, popu)
# Any NAs?
any( # FALSE

Formula syntax basics

    At this point I hope you are familiar with the formula syntax in R. Note that interaction terms are denoted by : and fully crossed effects with *, so that A*B = A + B + A:B. The following code example

    lm(y ~ x1 + x2*x3)

builds a linear model of y using x1, x2, x3 and the interaction between x2 and x3. In case you want to perform arithmetic operations inside the formula, use the function I. You can also introduce polynomial terms with the function poly. One handy trick I use to expand all pairwise interactions among predictors is

    model.matrix(y ~ .*., data = X)

    provided a matrix X that gathers all predictors and y. You can also simply use .*. inside the lm call, however you will likely need to preprocess the resulting interaction terms.
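A quick way to confirm these expansions is to inspect the columns of the resulting model matrix; the variable names below are illustrative:

```r
set.seed(1)
X <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))

# A*B expands to A + B + A:B
colnames(model.matrix(y ~ x1 + x2 * x3, data = X))
# "(Intercept)" "x1" "x2" "x3" "x2:x3"

# .*. expands all main effects plus every pairwise interaction
colnames(model.matrix(y ~ .*., data = X))
# "(Intercept)" "x1" "x2" "x3" "x1:x2" "x1:x3" "x2:x3"
```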

    While the syntax of lme is identical to lm for fixed effects, its random effects are specified under the argument random as

    random = ~intercept + fixed effect | random effect

    and can be nested using /. In the following example

    random = ~1 + C | A/B

the random effect B is nested within random effect A, altogether with random intercept and slope with respect to C. Therefore, not only will the groups defined by A and A/B have different intercepts, they will also be explained by different slight shifts of the slope of the fixed effect C.

    Classic linear model

Ideally, you should start with a full model (i.e. including all independent variables). Here, however, we cannot use all descriptors in the classic linear model since the fit will be singular due to the redundancy in the levels of reg and popu. For simplicity I will exclude these alongside gen, since it contains a lot of levels and also represents a random sample (from many other extant Arabidopsis genotypes). Additionally, I would rather use rack and status as random effects in the following models, but note that having only two and three levels respectively, it is advisable to keep them as fixed.

LM <- lm(total.fruits ~ rack + nutrient + amd + status)
summary(LM)
par(mfrow = c(2, 2))
plot(LM)

    These diagnostic plots show that the residuals of the classic linear model poorly qualify as normally distributed. Because we have no obvious outliers, the leverage analysis provides acceptable results. We will try to improve the distribution of the residuals using LMMs.

    Generalized linear model

    We need to build a GLM as a benchmark for the subsequent LMMs. This model can be fit without random effects, just like a lm but employing ML or REML estimation, using the gls function. Hence, it can be used as a proper null model with respect to random effects. The GLM is also sufficient to tackle heterogeneous variance in the residuals by leveraging different types of variance and correlation functions, when no random effects are present (see arguments correlation and weights).

GLM <- gls(total.fruits ~ rack + nutrient + amd + status, method = "ML")
summary(GLM)

At this point you might consider comparing the GLM and the classic linear model and note they are identical. Also, you might wonder why we are using ML instead of REML – as hinted in the introduction, REML comparisons are meaningless in LMMs that differ in their fixed effects. Therefore, we will base all of our comparisons on ML and only use the REML estimation on the final, optimal model.

    Optimal random structure

    Let’s fit our first LMM with all fixed effects used in the GLM and introducing reg, popu, gen, reg/popu, reg/gen, popu/gen and reg/popu/gen as random intercepts, separately.

    In order to compare LMMs (and GLM), we can use the function anova (note that it does not work for lmer objects) to compute the likelihood ratio test (LRT). This test will determine if the models are significantly different with respect to goodness-of-fit, as weighted by the trade-off between variance explained and degrees-of-freedom. The model fits are also evaluated based on the Akaike (AIC) and Bayesian information criteria (BIC) – the smaller their value, the better the fit.
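As a concrete preview of how anova(), AIC() and BIC() trade off likelihood against parameters, here is a self-contained example with two nested ordinary linear models on the built-in mtcars data (not part of this tutorial's dataset):

```r
m0 <- lm(mpg ~ wt, data = mtcars)       # smaller model
m1 <- lm(mpg ~ wt + hp, data = mtcars)  # larger, nested model

# Likelihood ratio statistic: 2 * (logLik of larger - logLik of smaller)
lrt <- as.numeric(2 * (logLik(m1) - logLik(m0)))

# AIC = -2 logLik + 2k ; BIC = -2 logLik + k log(n); smaller is better
AIC(m0, m1)
BIC(m0, m1)

# p-value from the chi-squared distribution, 1 df for the one extra parameter
pchisq(lrt, df = 1, lower.tail = FALSE)
```

anova(m0, m1) wraps the same comparison; for ML-fitted lme objects it additionally reports AIC and BIC side by side, as used below.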

lmm1 <- lme(total.fruits ~ rack + nutrient + amd + status, random = ~1|reg, method = "ML")
lmm2 <- lme(total.fruits ~ rack + nutrient + amd + status, random = ~1|popu, method = "ML")
lmm3 <- lme(total.fruits ~ rack + nutrient + amd + status, random = ~1|gen, method = "ML")
lmm4 <- lme(total.fruits ~ rack + nutrient + amd + status, random = ~1|reg/popu, method = "ML")
lmm5 <- lme(total.fruits ~ rack + nutrient + amd + status, random = ~1|reg/gen, method = "ML")
lmm6 <- lme(total.fruits ~ rack + nutrient + amd + status, random = ~1|popu/gen, method = "ML")
lmm7 <- lme(total.fruits ~ rack + nutrient + amd + status, random = ~1|reg/popu/gen, method = "ML")
anova(GLM, lmm1, lmm2, lmm3, lmm4, lmm5, lmm6, lmm7)

We could now base our selection on the AIC, BIC or log-likelihood. Considering most models are indistinguishable with respect to the goodness-of-fit, I will select lmm6 and lmm7 as the two best models so that we have more of a random structure to look at. Take a look into the distribution of the random effects with plot(ranef(MODEL)). We next proceed to incorporate random slopes.

There is the possibility that the different researchers from the different regions might have handled and fertilized plants differently, thereby exerting slightly different impacts. Let’s update lmm6 and lmm7 to include random slopes with respect to nutrient. We first need to set up a control setting that ensures the new models converge.

# Set optimization pars
ctrl <- lmeControl(opt = "optim")
lmm6.2 <- update(lmm6, .~., random = ~nutrient|popu/gen, control = ctrl)
lmm7.2 <- update(lmm7, .~., random = ~nutrient|reg/popu/gen, control = ctrl)
anova(lmm6, lmm6.2, lmm7, lmm7.2) # both models improved
anova(lmm6.2, lmm7.2) # similar fit; lmm6.2 more parsimonious
summary(lmm6.2)

Assuming a significance level of 0.05, the inclusion of random slopes with respect to nutrient improved both lmm6 and lmm7. Comparing lmm6.2 and lmm7.2 head-to-head provides no evidence for differences in fit, so we select the simpler model, lmm6.2. Let’s check how the random intercepts and slopes distribute in the highest level (i.e. gen within popu).

    plot(ranef(lmm6.2, level = 2))

    The random intercepts (left) appear to be normally distributed, except for genotype 34, biased towards negative values. This could warrant repeating the entire analysis without this genotype. The random slopes (right), on the other hand, are rather normally distributed. Interestingly, there is a negative correlation of -0.61 between random intercepts and slopes, suggesting that genotypes with low baseline TFPP tend to respond better to fertilization. Try plot(ranef(lmm6.2, level = 1)) to observe the distributions at the level of popu only. Next, we will use QQ plots to compare the residual distributions between the GLM and lmm6.2 to gauge the relevance of the random effects.

# QQ plots (drawn to the same scale!)
par(mfrow = c(1, 2))
lims <- c(-3.5, 3.5)
qqnorm(resid(GLM, type = "normalized"), xlim = lims, ylim = lims, main = "GLM")
abline(0, 1, col = "red", lty = 2)
qqnorm(resid(lmm6.2, type = "normalized"), xlim = lims, ylim = lims, main = "lmm6.2")
abline(0, 1, col = "red", lty = 2)

    The improvement is clear. Bear in mind these results do not change with REML estimation. Try different arrangements of random effects with nesting and random slopes, explore as much as possible!

    Optimal fixed structure

    Now that we are happy with the random structure, we will look into the summary of the optimal model so far (i.e. lmm6.2) and determine if we need to modify the fixed structure.


All effects are significant at the 0.05 level, except for one of the levels from status that represents transplanted plants. Given the significant effect from the other two levels, we will keep status and all current fixed effects. Just for fun, let’s add the interaction term nutrient:amd and see if there is any significant improvement in fit.

lmm8 <- update(lmm6.2, .~. + nutrient:amd)
summary(lmm8)
anova(lmm8, lmm6.2)

The addition of the interaction term was not significant with respect to either its coefficient or the goodness-of-fit, so we will drop it. Note that it is not a good idea to add new terms after optimizing the random structure; I did so only because otherwise there would be nothing to do with respect to the fixed structure.

    Fit optimal model with REML

    We could play a lot more with different model structures, but to keep it simple let’s finalize the analysis by fitting the lmm6.2 model using REML and finally identifying and understanding the differences in the main effects caused by the introduction of random effects.

finalModel <- update(lmm6.2, .~., method = "REML")
summary(finalModel)

We will now contrast our REML-fitted final model against a REML-fitted GLM and determine the impact of incorporating random intercept and slope, with respect to nutrient, at the level of popu/gen. Therefore, both will be given the same fixed effects and estimated using REML.

# Reset previous graphical pars
par(mfrow = c(1, 1))
# New GLM, updated from the first by estimating with REML
GLM2 <- update(GLM, .~., method = "REML")
# Plot side by side, beta with respective SEs
plot(coef(GLM2), xlab = "Fixed Effects", ylab = expression(beta), axes = F,
     pch = 16, col = "black", ylim = c(-.9, 2.2))
stdErrors <- coef(summary(GLM2))[, 2]
segments(x0 = 1:6, x1 = 1:6, y0 = coef(GLM2) - stdErrors,
         y1 = coef(GLM2) + stdErrors, col = "black")
axis(2)
abline(h = 0, col = "grey", lty = 2)
axis(1, at = 1:6,
     labels = c("Intercept", "Rack", "Nutrient (Treated)", "AMD (Unclipped)",
                "Status (PP)", "Status (Transplant)"), cex.axis = .7)
# LMM
points(1:6 + .1, fixef(finalModel), pch = 16, col = "red")
stdErrorsLMM <- coef(summary(finalModel))[, 2]
segments(x0 = 1:6 + .1, x1 = 1:6 + .1, y0 = fixef(finalModel) - stdErrorsLMM,
         y1 = fixef(finalModel) + stdErrorsLMM, col = "red")
# Legend
legend("topright", legend = c("GLM", "LMM"), text.col = c("black", "red"), bty = "n")

The figure above depicts the estimated β coefficients for the different fixed effects, including the intercept, for the GLM (black) and the final LMM (red). Error bars represent the corresponding standard errors (SE). Overall the results are similar but uncover two important differences. First, for all fixed effects except the intercept and nutrient, the SE is smaller in the LMM. Second, the relative effects from the two levels of status are opposite. With the consideration of random effects, the LMM estimated a more negative effect of culturing in Petri plates on TFPP, and conversely a less negative effect of transplantation.


    The distribution of the residuals as a function of the predicted TFPP values in the LMM is still similar to the first panel in the diagnostic plots of the classic linear model. The usage of additional predictors and generalized additive models would likely improve it.


Now that we account for genotype-within-population random effects, how do we interpret the LMM results?

Plants that were placed in the first rack, left unfertilized, clipped and grown normally have an average TFPP of 2.15. This is the value of the estimated grand mean (i.e. intercept), and the predicted TFPP when all other factors and levels do not apply. For example, a plant grown under the same conditions but placed in the second rack will be predicted to have a smaller yield, more precisely the intercept of 2.15 plus the negative rack coefficient from the model summary. To these reported yield values, we still need to add the random intercepts predicted for population and genotype within population (which are tiny values, by comparison; think of them as a small adjustment). Moreover, we can state that

    • Fertilized plants produce more fruits than those kept unfertilized. This was the strongest main effect and represents a very sensible finding.
• Plants grown in the second rack produce fewer fruits than those in the first rack. This was the second strongest main effect identified. Could this be due to light / water availability?
    • Simulated herbivory (AMD) negatively affects fruit yield. This is also a sensible finding – when plants are attacked, more energy is allocated to build up biochemical defence mechanisms against herbivores and pathogens, hence compromising growth and eventually fruit yield.
    • Both culturing in Petri plates and transplantation, albeit indistinguishable, negatively affect fruit yield as opposed to normal growth. When conditions are radically changed, plants must adapt swiftly and this comes at a cost as well. Thus, these observations too make perfect sense.
    • One important observation is that the genetic contribution to fruit yield, as gauged by gen, was normally distributed and adequately modelled as random. One single outlier could eventually be excluded from the analysis. This makes sense assuming plants have evolved universal mechanisms in response to both nutritional status and herbivory that overrule any minor genetic differences among individuals from the same species.
    • Always check the residuals and the random effects! While both linear models and LMMs require normally distributed residuals with homogeneous variance, the former assumes independence among observations and the latter normally distributed random effects. Use normalized residuals to establish comparisons.
    • One key additional advantage of LMMs we did not discuss is that they can handle missing values.
    • Wide format data should be first converted to long format, using e.g. the R package reshape.
• Variograms are very helpful in determining spatial or temporal dependence in the residuals. In the case of spatial dependence, bubble plots nicely represent residuals in the space the observations were drawn from (e.g. latitude and longitude; refer to Zuur et al. (2009) for more information).
    • REML estimation is unbiased but does not allow for comparing models with different fixed structures. Only use the REML estimation on the optimal model.

    With respect to this particular set of results:

    • The analysis outlined here is not as exhaustive as it should be. Among other things, we did neither initially consider interaction terms among fixed effects nor investigate in sufficient depth the random effects from the optimal model.
    • The dependent variable (total fruit set per plant) was highly right-skewed and required a log-transformation for basic modeling. The large amount of zeros would in rigour require zero inflated GLMs or similar approaches.
    • All predictors used in the analysis were categorical factors. We could similarly use an ANOVA model. LMMs are likely more relevant in the presence of quantitative or mixed types of predictors.

I would like to thank Hans-Peter Piepho for answering my nagging questions over ResearchGate. I hope these superficial considerations were clear and insightful. I look forward to your suggestions and feedback. My next post will cover a joint multivariate model of multiple responses, the graph-guided fused LASSO (GFLASSO), using an R package I am currently developing. Happy holidays!


To leave a comment for the author, please follow the link and comment on their blog: poissonisfish.

    Tennis Grand Slam Tournaments Champions Basic Analysis

    Mon, 12/11/2017 - 15:00

    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

The present tutorial analyses the main results of the Tennis Grand Slam tournaments from a statistical point of view. Specifically, I try to answer the following questions:

– How to fit the distribution of the number of Grand Slam tournament victories across players?
– How to compute the probability that a player’s number of victories exceeds a specific value?
– How does the number of Grand Slam tournament winners increase over time?
– How can we assign a metric to tennis players based on the number of Grand Slam tournaments they won?

Our dataset is provided within ref. [1], whose content is based on what is reported by the ESPN site at: ESPN site tennis history table.

Packages

suppressPackageStartupMessages(library(fitdistrplus))
suppressPackageStartupMessages(library(extremevalues))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(ggplot2))

Analysis

    The analysis within the present tutorial is organized as follows:

    • a basic data exploration is run
    • log-normal distribution is fit against the Grand Slam tournaments victories data
• regression analysis is presented in order to evaluate the increase of tennis Grand Slam champions over time
    • population statistical dispersion is evaluated to determine a Tennis Quotient assigned to tournaments’ winners
    Basic Data Exploration

    Loading the Tennis Grand Slam tournaments dataset and running a basic data exploration.

url_file <- ""
slam_win <- read.delim(url(url_file), sep = "\t", stringsAsFactors = FALSE)
dim(slam_win)
[1] 489 4
kable(head(slam_win), row.names = FALSE)

| YEAR|TOURNAMENT      |WINNER        |RUNNER.UP      |
|----:|:---------------|:-------------|:--------------|
| 2017|U.S. Open       |Rafael Nadal  |Kevin Anderson |
| 2017|Wimbledon       |Roger Federer |Marin Cilic    |
| 2017|French Open     |Rafael Nadal  |Stan Wavrinka  |
| 2017|Australian Open |Roger Federer |Rafael Nadal   |
| 2016|U.S. Open       |Stan Wawrinka |Novak Djokovic |
| 2016|Wimbledon       |Andy Murray   |Milos Raonic   |

kable(tail(slam_win), row.names = FALSE)

| YEAR|TOURNAMENT |WINNER           |RUNNER.UP          |
|----:|:----------|:----------------|:------------------|
| 1881|U.S. Open  |Richard D. Sears |William E. Glyn    |
| 1881|Wimbledon  |William Renshaw  |John Hartley       |
| 1880|Wimbledon  |John Hartley     |Herbert Lawford    |
| 1879|Wimbledon  |John Hartley     |V. St. Leger Gould |
| 1878|Wimbledon  |Frank Hadow      |Spencer Gore       |
| 1877|Wimbledon  |Spencer Gore     |William Marshall   |

nr <- nrow(slam_win)
start_year <- slam_win[nr, "YEAR"]
end_year <- slam_win[1, "YEAR"]
(years_span <- end_year - start_year + 1)
[1] 141
(total_slam_winners <- length(unique(slam_win[, "WINNER"])))
[1] 166

So we have 166 winners spanning 141 years of Tennis Grand Slam tournaments. We observe that during the first and second World Wars, a reduced number of Grand Slam tournaments were played, for obvious reasons.

slam_win_df <-[, "WINNER"]))
slam_win_df = slam_win_df %>% arrange(desc(Freq))

# computing champions' leaderboard position
pos <- rep(0, nrow(slam_win_df))
pos[1] <- 1
for(i in 2:nrow(slam_win_df)) {
  pos[i] <- ifelse(slam_win_df$Freq[i] != slam_win_df$Freq[i-1], i, pos[i-1])
}

last_win_year = sapply(slam_win_df$Var1, function(x) {
  slam_win %>% filter(WINNER == x) %>% dplyr::select(YEAR) %>% max()
})

# creating and showing leaderboard dataframe
slam_winners <- data.frame(RANK = pos,
                           PLAYER = slam_win_df$Var1,
                           WINS = slam_win_df$Freq,
                           LAST_WIN_YEAR = last_win_year)
kable(slam_winners)

| RANK|PLAYER                      | WINS| LAST_WIN_YEAR|
|----:|:---------------------------|----:|-------------:|
|    1|Roger Federer               |   19|          2017|
|    2|Rafael Nadal                |   16|          2017|
|    3|Pete Sampras                |   14|          2002|
|    4|Novak Djokovic              |   12|          2016|
|    4|Roy Emerson                 |   12|          1967|
...
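The position loop above implements standard competition ranking: tied players share a rank, and the next distinct win count skips ahead. Base R's rank() can produce the same result in one call; the win counts below are illustrative:

```r
wins <- c(19, 16, 14, 12, 12, 8, 8, 8)

# ties.method = "min" gives competition ranking: 1, 2, 3, 4, 4, 6, 6, 6
comp_rank <- rank(-wins, ties.method = "min")
comp_rank
```

Ranking the negated vector orders from largest to smallest, so no prior sort is needed, unlike the loop, which assumes the data are already arranged by descending Freq.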

    The WINS and log(WINS) distribution density are shown.

par(mfrow = c(1, 2))
plot(density(slam_winners$WINS), main = "Wins Density")
plot(density(log(slam_winners$WINS)), main = "Log Wins Density")

    Gives this plot:

You may want to arrange the same data frame by the champions’ last win year.

par(mfrow = c(1, 1))
slam_winners_last_win_year = slam_winners %>% arrange(LAST_WIN_YEAR)
kable(slam_winners %>% arrange(desc(LAST_WIN_YEAR)))

| RANK|PLAYER                      | WINS| LAST_WIN_YEAR|
|----:|:---------------------------|----:|-------------:|
|    1|Roger Federer               |   19|          2017|
|    2|Rafael Nadal                |   16|          2017|
|    4|Novak Djokovic              |   12|          2016|
|   43|Andy Murray                 |    3|          2016|
...

    We may want to visualize the timeline of the number of Tennis Grand Slam champions.

df_nwin = data.frame()
for (year in start_year:end_year) {
  n_slam_winners = slam_win %>%
    filter(YEAR <= year) %>%
    dplyr::select(WINNER) %>%
    unique %>%
    nrow
  df_nwin = rbind(df_nwin, data.frame(YEAR = year, N_WINNERS = n_slam_winners))
}
plot(x = df_nwin$YEAR, y = df_nwin$N_WINNERS, type = 's', xlab = "year", ylab = "no_winners")
grid()

    Gives this plot:

    We may want to visualize the timeline of the Grand Slam tournaments wins record.

df2_nwin = data.frame()
for (year in start_year:end_year) {
  slam_win_years = slam_win %>% filter(YEAR <= year)
  slam_win_record =[, "WINNER"]))
  df2_nwin = rbind(df2_nwin, data.frame(YEAR = year, RECORD_WINS = max(slam_win_record$Freq)))
}
plot(x = df2_nwin$YEAR, y = df2_nwin$RECORD_WINS, type = 's', xlab = "year", ylab = "record_wins")
grid()

    Gives this plot:

    It is interesting to have a look at the number of wins frequency.

wins_frequency <-[, "WINS"]))
colnames(wins_frequency) <- c("WINS", "FREQUENCY")
kable(wins_frequency)

|WINS | FREQUENCY|
|:----|---------:|
|1    |        80|
|2    |        25|
|3    |        19|
|4    |        12|
|5    |         5|
|6    |         3|
|7    |         6|
|8    |         8|
|10   |         1|
|11   |         2|
|12   |         2|
|14   |         1|
|16   |         1|
|19   |         1|

summary(slam_winners[, "WINS"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   2.000   2.946   3.750  19.000

Probabilistic Distribution Fit

    We now take advantage of the fitdist() function from the fitdistrplus package to fit a log-normal distribution to our Grand Slam wins data.

    fw <- fitdist(slam_winners$WINS, "lnorm")
    summary(fw)

    Fitting of the distribution ' lnorm ' by maximum likelihood 
    Parameters : 
             estimate Std. Error
    meanlog 0.7047927 0.06257959
    sdlog   0.8062817 0.04425015
    Loglikelihood:  -316.7959   AIC:  637.5918   BIC:  643.8158 
    Correlation matrix:
            meanlog sdlog
    meanlog       1     0
    sdlog         0     1

    Then we can plot the distribution fit results (for a fitdist object, plot() draws the fitted density, CDF, Q-Q and P-P panels).

    plot(fw)
    Gives this plot:

    # left outliers quantile
    left_thresh <- 0.05
    # right outliers quantile
    right_thresh <- 0.95
    # determining the outliers
    slam_outlier <- getOutliersI(as.vector(slam_winners$WINS),
                                 FLim = c(left_thresh, right_thresh),
                                 distribution = "lognormal")
    # outliers are plotted in red color
    outlierPlot(slam_winners$WINS, slam_outlier, mode = "qq")

    Gives this plot:

    The outliers are:

    slam_winners[slam_outlier$iRight,]

      RANK        PLAYER WINS LAST_WIN_YEAR
    1    1 Roger Federer   19          2017
    2    2  Rafael Nadal   16          2017

    The mean and standard deviation associated with the log-normal fit are:

    (mean_log <- fw$estimate["meanlog"])
      meanlog 
    0.7047927 
    (sd_log <- fw$estimate["sdlog"])
        sdlog 
    0.8062817 
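    Note that meanlog and sdlog live on the log scale. As a quick sanity check (a sketch, with the estimates copied from the fit above), we can back-transform them to the original scale: the expected value of a log-normal is exp(mu + sigma^2/2) and its median is exp(mu).

```r
mean_log <- 0.7047927  # fitted meanlog from above
sd_log   <- 0.8062817  # fitted sdlog from above

# expected value of a log-normal: exp(mu + sigma^2 / 2)
exp(mean_log + sd_log^2 / 2)  # about 2.80, close to the sample mean of 2.946

# median of a log-normal: exp(mu)
exp(mean_log)                 # about 2.02, matching the sample median of 2
```

    Both values agree well with the summary statistics computed earlier, which supports the log-normal choice.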

    Now we compute the probability associated with 19 and 16 wins.

    # clearing names
    names(mean_log) <- NULL
    names(sd_log) <- NULL
    # probability associated with a performance of 19 wins or more
    (lnorm_19 <- plnorm(19, mean_log, sd_log, lower.tail = FALSE))
    [1] 0.002736863
    # probability associated with a performance of 16 wins or more
    (lnorm_16 <- plnorm(16, mean_log, sd_log, lower.tail = FALSE))
    [1] 0.005164628
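    As an aside, these log-normal tail probabilities can equivalently be computed on the log scale: plnorm(q, mu, s) is exactly pnorm(log(q), mu, s). A quick base-R check, with the fitted parameters copied from above:

```r
mu <- 0.7047927  # fitted meanlog from above
s  <- 0.8062817  # fitted sdlog from above

# log-normal upper tail at q ...
p_lnorm <- plnorm(19, mu, s, lower.tail = FALSE)
# ... equals the normal upper tail at log(q)
p_norm  <- pnorm(log(19), mu, s, lower.tail = FALSE)

all.equal(p_lnorm, p_norm)  # TRUE: the two are the same quantity
```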

    Recall that if a random variable follows a log-normal distribution, its logarithm follows a normal distribution. Hence we can instead fit the logarithm of the variable under analysis with a normal distribution and compare the results with the log-normal fit above.

    fw_norm <- fitdist(log(slam_winners$WINS), "norm")
    summary(fw_norm)

    Fitting of the distribution ' norm ' by maximum likelihood 
    Parameters : 
          estimate Std. Error
    mean 0.7047927 0.06257959
    sd   0.8062817 0.04425015
    Loglikelihood:  -199.8003   AIC:  403.6006   BIC:  409.8246 
    Correlation matrix:
         mean sd
    mean    1  0
    sd      0  1

    Similarly, we plot the fit results.

    plot(fw_norm)
    Gives this plot:

    # left outliers quantile
    left_thresh <- 0.05
    # right outliers quantile
    right_thresh <- 0.95
    slam_outlier <- getOutliersI(log(as.vector(slam_winners$WINS)),
                                 FLim = c(left_thresh, right_thresh),
                                 distribution = "normal")
    outlierPlot(slam_winners$WINS, slam_outlier, mode = "qq")

    Gives this plot:

    The outliers are:

    slam_winners[slam_outlier$iRight,]

      RANK        PLAYER WINS LAST_WIN_YEAR
    1    1 Roger Federer   19          2017
    2    2  Rafael Nadal   16          2017

    The mean and standard deviation values are:

    # mean and standard deviation of the fitted normal distribution
    (mean_norm <- fw_norm$estimate["mean"])
         mean 
    0.7047927 
    (sd_norm <- fw_norm$estimate["sd"])
           sd 
    0.8062817 

    As shown above, the two approaches yield the same fitted parameters, although with different log-likelihood, AIC and BIC metrics. Now we compute the probabilities associated with 19 and 16 wins, together with their distances from the mean expressed as multiples of the standard deviation.

    # clearing names
    names(mean_norm) <- NULL
    names(sd_norm) <- NULL
    # probability associated with a performance of 19 wins or better
    (norm_19 <- pnorm(log(19), mean_norm, sd_norm, lower.tail = FALSE))
    [1] 0.002736863
    # number of standard deviations from the mean for 19 wins
    (deviation_19 <- (log(19) - mean_norm)/sd_norm)
    [1] 2.777747
    # probability associated with a performance of 16 wins or better
    (norm_16 <- pnorm(log(16), mean_norm, sd_norm, lower.tail = FALSE))
    [1] 0.005164628
    # number of standard deviations from the mean for 16 wins
    (deviation_16 <- (log(16) - mean_norm)/sd_norm)
    [1] 2.564608

    As shown above, we obtain the same probability values as from the log-normal fit. In what follows we adopt the second approach (fitting the log of the original variable), as it makes it easier to express distances from the mean as multiples of the standard deviation.
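    The identical estimates but different log-likelihoods are explained by the change-of-variable Jacobian: since dlnorm(x) = dnorm(log(x))/x, the two log-likelihoods differ by exactly sum(log(x)). Indeed the reported gap, -199.8003 - (-316.7959) ≈ 117.0, is sum(log(WINS)). A quick base-R check of the identity on simulated data (the sample here is arbitrary; any positive sample works):

```r
set.seed(1)
x <- rlnorm(50, meanlog = 0.7, sdlog = 0.8)  # arbitrary positive sample

# maximum-likelihood estimates (the MLE variance uses n, not n - 1)
mu <- mean(log(x))
s  <- sqrt(mean((log(x) - mu)^2))

ll_lnorm <- sum(dlnorm(x, mu, s, log = TRUE))      # log-likelihood of lnorm fit on x
ll_norm  <- sum(dnorm(log(x), mu, s, log = TRUE))  # log-likelihood of norm fit on log(x)

# the gap between the two is exactly the Jacobian term sum(log(x))
all.equal(ll_norm - ll_lnorm, sum(log(x)))  # TRUE
```

    This is why AIC/BIC values are not comparable across the two parameterizations even though they describe the same model.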

    Regression Analysis

    Let us see again the plot of the number of tennis Grand Slam winners against their timeline.

    year <- df_nwin$YEAR
    winners <- df_nwin$N_WINNERS
    plot(x = year, y = winners, type = 'l')
    grid()

    Gives this plot:

    The linear relationship between the two variables is visually evident. Hence, a linear regression can help us understand how many first-time Grand Slam winners to expect each year.

    year_lm <- lm(winners ~ year)
    summary(year_lm)

    Call:
    lm(formula = winners ~ year)

    Residuals:
        Min      1Q  Median      3Q     Max 
    -9.8506 -1.9810 -0.4683  2.6102  6.2866 

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -2.388e+03  1.220e+01  -195.8   <2e-16 ***
    year         1.270e+00  6.264e-03   202.8   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3.027 on 139 degrees of freedom
    Multiple R-squared:  0.9966,	Adjusted R-squared:  0.9966 
    F-statistic: 4.113e+04 on 1 and 139 DF,  p-value: < 2.2e-16

    The coefficients are significant and the R-squared value is very high. On average, 1.27 first-time Grand Slam winners show up each year. The residuals analysis is omitted for brevity. Similarly, we can regress the year against the number of winners.

    n_win_lm <- lm(year ~ winners)
    summary(n_win_lm)

    Call:
    lm(formula = year ~ winners)

    Residuals:
        Min      1Q  Median      3Q     Max 
    -4.8851 -1.9461  0.3268  1.4327  7.9641 

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 1.880e+03  3.848e-01  4886.4   <2e-16 ***
    winners     7.846e-01  3.868e-03   202.8   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 2.379 on 139 degrees of freedom
    Multiple R-squared:  0.9966,	Adjusted R-squared:  0.9966 
    F-statistic: 4.113e+04 on 1 and 139 DF,  p-value: < 2.2e-16
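    Note that the two fitted slopes, 1.270 and 0.7846, are not exact reciprocals: for a pair of simple regressions fit both ways, the product of the slopes equals the squared correlation (b_yx * b_xy = r²). Here 1.270 × 0.7846 ≈ 0.9966, which is precisely the R-squared reported above. A quick base-R check of this identity on arbitrary simulated data:

```r
set.seed(7)
x <- 1:100
y <- 2 * x + rnorm(100, sd = 5)  # any linearly related pair works

b_yx <- unname(coef(lm(y ~ x))["x"])  # slope of the y-on-x regression
b_xy <- unname(coef(lm(x ~ y))["y"])  # slope of the x-on-y regression

# the product of the two slopes is exactly the squared correlation
all.equal(b_yx * b_xy, cor(x, y)^2)  # TRUE
```

    Only when R-squared is close to 1, as in this dataset, can we safely read the second slope as roughly "years per new winner".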

    The coefficients are significant and the R-squared value is very high. On average, a new Grand Slam winner appears every 0.7846 years. The residuals analysis is omitted for brevity. Such a model can be used to predict the year in which a given number of Grand Slam winners should have appeared. For example, considering the win totals of Federer, Nadal and Sampras, we obtain:

    # probability associated with a performance of 14 wins or better
    (norm_14 <- pnorm(log(14), mean_norm, sd_norm, lower.tail = FALSE))
    [1] 0.008220098
    # number of standard deviations from the mean for 14 wins
    (deviation_14 <- (log(14) - mean_norm)/sd_norm)
    [1] 2.398994
    # average number of Grand Slam winners to expect for a 19-win champion
    (x_19 <- round(1/norm_19))
    [1] 365
    # average number of Grand Slam winners to expect for a 16-win champion
    (x_16 <- round(1/norm_16))
    [1] 194
    # average number of Grand Slam winners to expect for a 14-win champion
    (x_14 <- round(1/norm_14))
    [1] 122

    The x_19, x_16 and x_14 values can be interpreted as the average size of the population of Grand Slam winners within which to find a 19-, 16- and 14-time winner, respectively. As a consequence, the predicted calendar years in which to expect players capable of winning 19, 16 and 14 titles are:

    predict(n_win_lm, newdata = data.frame(winners = c(x_19, x_16, x_14)), interval = "p")

           fit      lwr      upr
    1 2166.732 2161.549 2171.916
    2 2032.573 2027.779 2037.366
    3 1976.084 1971.355 1980.813

    The table above shows the earliest year in which, on average, to expect a Grand Slam winner capable of 19, 16 or 14 wins (fit column), together with the lower (lwr) and upper (upr) prediction bounds. In the real world, the 14-win champion showed up a little later than our linear model expected, while the 16- and 19-win champions arrived much earlier.
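    The "average population size" reading of 1/p follows from the geometric distribution: if each new champion independently reaches 19 wins with probability p, the number of champions until the first such player is geometric with mean 1/p. A small simulation sketch, with p copied from the fitted tail probability above:

```r
set.seed(123)
p <- 0.002736863  # fitted P(wins >= 19) from above

# champions observed until the first 19-time winner, simulated many times;
# rgeom() counts failures before the first success, so add 1
waits <- rgeom(100000, prob = p) + 1

mean(waits)  # close to 1/p, i.e. roughly 365 champions on average
```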

    Population Statistical Dispersion Analysis

    In our previous analysis, we computed the distances from the mean associated with the 19, 16 and 14 Grand Slam win probabilities, expressed as multiples of the standard deviation.

    deviation_19
    [1] 2.777747
    deviation_16
    [1] 2.564608
    deviation_14
    [1] 2.398994

    Based on the above values, we can compute the probability of having a 19-, 16- or 14-time winner. As before, we obtain this with the pnorm() function.

    (prob_19 <- pnorm(mean_norm + deviation_19*sd_norm, mean_norm, sd_norm, lower.tail = FALSE))
    [1] 0.002736863
    (prob_16 <- pnorm(mean_norm + deviation_16*sd_norm, mean_norm, sd_norm, lower.tail = FALSE))
    [1] 0.005164628
    (prob_14 <- pnorm(mean_norm + deviation_14*sd_norm, mean_norm, sd_norm, lower.tail = FALSE))
    [1] 0.008220098
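    Note that pnorm(mean + z*sd, mean, sd) depends only on z: after standardization the location and scale cancel, leaving the standard normal upper tail of z. A quick check, with the fitted values copied from above:

```r
mu <- 0.7047927  # fitted mean from above
s  <- 0.8062817  # fitted standard deviation from above
z  <- 2.777747   # deviation computed above for 19 wins

# shifting by the mean and scaling by the sd cancel out ...
p1 <- pnorm(mu + z * s, mu, s, lower.tail = FALSE)
# ... leaving a standard normal upper-tail probability of z
p2 <- pnorm(z, lower.tail = FALSE)

all.equal(p1, p2)  # TRUE
```

    So the probabilities above are pure functions of the deviations, which is exactly why a quotient scale built on those deviations makes sense.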

    Similarly to the Intelligence Quotient (IQ), which assigns a value of 100 to the mean and +/- 15 points for each standard deviation of distance from it, we can define a Tennis Quotient (TQ).

    The plot below recalls how the IQ scale is built:

    We notice that the median of our players’ population scores a TQ equal to 100.

    (median_value <- median(slam_winners$WINS))
    [1] 2
    (deviation_median <- (log(median_value) - mean_norm)/sd_norm)
    [1] -0.01444343
    round(100 + 15*deviation_median)
    [1] 100

    We now compute the Tennis Quotients (TQ) for leading champions.

    (Federer_TQ <- round(100 + 15*deviation_19))
    [1] 142
    (Nadal_TQ <- round(100 + 15*deviation_16))
    [1] 138
    (Sampras_TQ <- round(100 + 15*deviation_14))
    [1] 136

    And what about, for example, a 7-time Grand Slam winner?

    (deviation_7 <- (log(7) - mean_norm)/sd_norm)
    [1] 1.53931
    (TQ_7wins <- round(100 + 15*deviation_7))
    [1] 123

    Let us then compute the Tennis Quotient (TQ) for all our Grand Slam champions.

    tq_compute <- function(x) {
      deviation_x <- (log(x) - mean_norm)/sd_norm
      round(100 + 15*deviation_x)
    }
    slam_winners = slam_winners %>% mutate(TQ = tq_compute(WINS))
    kable(slam_winners)

    | RANK|PLAYER         | WINS| LAST_WIN_YEAR|  TQ|
    |----:|:--------------|----:|-------------:|---:|
    |    1|Roger Federer  |   19|          2017| 142|
    |    2|Rafael Nadal   |   16|          2017| 138|
    |    3|Pete Sampras   |   14|          2002| 136|
    |    4|Novak Djokovic |   12|          2016| 133|
    ....

    We then visualize the top twenty Grand Slam tournaments champions.

    ggplot(data = slam_winners[1:20,], aes(x = reorder(PLAYER, TQ), y = TQ, fill = TQ)) +
      geom_bar(stat = "identity") +
      coord_flip() +
      scale_y_continuous(breaks = seq(0, 150, 10), limits = c(0, 150)) +
      xlab("PLAYER")

    Gives this plot.


    The answers to our initial questions:

    – The log-normal distribution
    – Based upon the fitted distribution, and taking advantage of the plnorm() and pnorm() functions from the stats package, the probabilities have been computed
    – A linear increase fits very well, with significant regression coefficients and high R-squared values
    – Yes, we defined the Tennis Quotient similarly to the Intelligence Quotient and showed the resulting leaderboard. Federer is confirmed a “genius” in that respect; however, a few other very talented players are not far behind him.

    If you have any questions, please feel free to comment below.


      Related Post

      1. Oneway ANOVA Explanation and Example in R; Part 2
      2. Oneway ANOVA Explanation and Example in R; Part 1
      3. One-way ANOVA in R
      4. Cubic and Smoothing Splines in R
      5. Chi-Squared Test – The Purpose, The Math, When and How to Implement?

      To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.