
Shiny server series part 1: setting up

Sun, 04/09/2017 - 17:46

(This article was first published on Jasper Ginn's blog, and kindly contributed to R-bloggers)

This guide is part of a series on setting up your own private server running shiny apps. There are many guides out there with great advice on how to set up an R shiny server and related software. I try to make a comprehensive guide based in part on these resources as well as my own experiences. I always aim to properly attribute information to its respective source. If you notice an issue, please contact me.

Welcome to the first part of the shiny server series!

Recently, I have had to set up several shiny servers that required some custom tweaks like user authentication, running shiny server on multiple ports and setting up SSL. With the help of numerous resources, I got everything running smoothly. I decided to publish these steps on my blog so that others might also use them. In this first part, we’ll be setting up the private server and installing shiny.

Resources used for this part

This guide draws from other guides on the web. The guides below were used to put together this resource.

  1. This guide shows you how to install R on Ubuntu 16.04
  2. Dean Attali’s blog post is one of the best and most comprehensive guides on setting up your own shiny server.
  3. This guide is used to couple a domain name to your server.
What you’ll need

In order to complete this entire series, I expect you have access to the following:

  • A Virtual Private Server (VPS, see below)
  • A domain name. These are available from providers such as namecheap.

If you have these two things, you may proceed!

A VP…What?

A Virtual Private Server (VPS) is a service that you can purchase from providers such as DigitalOcean, Amazon AWS and Google Cloud Services. Instead of buying a physical server and planting it in your home somewhere, you essentially rent a small portion of a server from a provider which then looks like a normal server environment: you have your own OS and you can manage your own users and so on.

In this guide, I’ll be using DigitalOcean, but you can use any VPS from any provider as long as it runs Ubuntu 16.04 and has at least 512MB RAM and one processor. If you sign up to DigitalOcean via a referral link, you’ll receive $10 in credit.

Setting up a VPS on DigitalOcean

After signing up to DO, you will see the following screen

There won’t be much going on here … yet! Click on ‘create droplet’ to set up your VPS. The options are relatively straightforward here. You want to select an Ubuntu 16.04 distribution.

Then, you can decide how much power your droplet will have. I chose the smallest version, which works perfectly fine, but feel free to take a bigger size if you like.

Select an appropriate data centre (usually that means choosing one in or near your own country)

The final set of options can look arcane if you’re not used to them. Fortunately, the only option you really need to pay attention to is the SSH key.

An SSH key functions as a unique identifier for you, the owner of the VPS, and adds a layer of security to your server. DigitalOcean provides a tutorial for Windows users and Unix users on their website. This step is optional: you don’t have to do it, but I strongly recommend it.

After you’ve set up the SSH access, you can choose a suitable name for your droplet and press ‘create’.

Congratulations! You are now the proud owner of your own VPS.

Accessing your server and setting up shiny

Click on ‘droplets’ at the top right of the DO menu. Copy the IP address of your droplet and open up a terminal or PuTTY. Log into your server:

ssh root@<ip-address>

Setting up a user

Since we don’t want to use the root user to install everything, we’ll create a new user called ‘shiny’. When you are prompted to enter a name, office number etc. you can just hit enter:

adduser shiny

Next, we give the shiny user admin rights and switch to the shiny user:

# Give shiny admin rights
gpasswd -a shiny sudo
# Switch to the shiny user
su shiny

Installing dependencies

Firstly, update the list of packages:

sudo apt-get update

Then add the CRAN repository to this list (copy these commands one by one):

# Add a trusted key
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
# Add the CRAN repo
sudo add-apt-repository 'deb [arch=amd64,i386] https://cran.rstudio.com/bin/linux/ubuntu xenial/'
# Update the list again
sudo apt-get update

If your droplet has 1G of RAM or less, we need to set up some swap space. This will allow the server to use hard-drive space as additional RAM when it needs to:

# Set up swap
sudo /bin/dd if=/dev/zero of=/var/swap.1 bs=1M count=1024
sudo /sbin/mkswap /var/swap.1
sudo /sbin/swapon /var/swap.1
sudo sh -c 'echo "/var/swap.1 swap swap defaults 0 0" >> /etc/fstab'

Now, we’re ready to install R and other dependencies for shiny server:

# Install R
sudo apt-get -y install r-base r-base-dev
# Install shiny server dependencies
sudo apt-get -y install libapparmor1 gdebi-core
# These are used by R libraries
sudo apt-get install -y libxml2-dev libcurl4-openssl-dev libssl-dev

Copy and execute the following lines to install these R packages:

# Install shiny
sudo su - -c "R -e \"install.packages('shiny', repos='https://cran.rstudio.com/')\""
# Install devtools
sudo su - -c "R -e \"install.packages('devtools', repos='https://cran.rstudio.com/')\""
# Install digest
sudo su - -c "R -e \"install.packages('digest', repos='https://cran.rstudio.com/')\""

Now we’re ready to install shiny server:

# Download rstudio server
wget https://download2.rstudio.org/rstudio-server-1.0.136-amd64.deb
# Download shiny server (the download URL was missing here; get the current .deb link from rstudio.com)
wget <shiny-server-deb-url>
# Install both
sudo gdebi rstudio-server-1.0.136-amd64.deb
sudo gdebi shiny-server-<version>-amd64.deb

The rstudio and shiny servers are now accessible on ports 8787 and 3838: you can access them by navigating to <ip-address>:8787 in a browser.

While this works, an IP address is difficult to remember and not ideal to share with others. This is easily solved: we can simply point a domain name to our new server.

Pointing a domain name to your server

This step involves two actions: one from the provider you bought the domain name from, and one on the DO dashboard.

Changing name servers

The first thing we’ll do is to change the name servers of your domain name so that they point to the DO name servers. Most providers will give this option and the process is largely the same, but I’ll be using my namecheap account.

Once you are logged into your namecheap account, you will see the following screen:

Click on the ‘manage’ button for the domain name you want to use. You will now see the following screen:

Click on the arrow (surrounded by green box) and select ‘custom DNS’ (surrounded by blue box). Add the DO name servers as shown in the image below:

Save your changes and close the tab.

Setting up A and CNAME records on DigitalOcean

In your DO dashboard, click on the ‘networking’ option

Enter your domain name (without ‘www’) and click ‘add domain’

We’re going to add one A-record and two CNAME records. Firstly, add ‘@’ under the hostname as in the image below and click ‘Create Record’:

Next, create a CNAME record. Under ‘hostname’ enter ‘*’, and under ‘is an alias of’, enter ‘@‘, as shown in the picture below:

Next, create another CNAME record where you input ‘www’ under ‘hostname’ and ‘@’ under ‘is an alias of’:

The entire setup should look like the image below

You are now done setting up your domain name. Officially, it can take up to 24 hours before your domain name points to your server, but it usually takes no more than a few hours.

Configuring nginx on your server

OK. So now your server can be accessed using a custom domain name. This is great, but rstudio and shiny server are still only accessible via the ports on which they are hosted. We’ll now set up nginx on your server to manage the routes to the shiny and rstudio servers.

Go back to the terminal where you’re logged into the server, and execute the following:

sudo apt-get -y install nginx

Now, make a backup of the nginx configuration:

sudo cp /etc/nginx/sites-enabled/default /etc/nginx/sites-enabled/default-backup

Open up the configuration file

sudo nano /etc/nginx/sites-enabled/default

As in the image below, add ‘localhost’ to the server name (green box).

Then, copy the following lines to the config file and paste them in the location shown in the image (area surrounded by pink box).

location /apps/ {
  proxy_pass http://localhost:3838/;
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "upgrade";
}
location /editor/ {
  proxy_pass http://localhost:8787/;
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "upgrade";
}

Hit control+x and then Y+enter, and your changes will be saved. Then reload nginx so the new configuration takes effect: sudo service nginx restart. Your shiny and rstudio servers are now accessible at e.g. www.<your-domain-name>.<extension>/apps/ and /editor/.

Adding shiny applications

The easiest way to install shiny applications is to create a github or bitbucket repository for the shiny application, which you then clone in the shiny server directory. However, first we need to set up user rights for our shiny user.

Setting up permissions

First, navigate to the shiny server directory where the apps live:

cd /srv/shiny-server

Then, give the shiny user read/write permissions:

sudo groupadd shiny-apps
sudo usermod -aG shiny-apps shiny
sudo chown -R shiny:shiny-apps .
sudo chmod g+w .
sudo chmod g+s .

Example: hosting an application

This part shows how you can store your shiny apps on github and host them on your server.

Sign up for a github account if you have not already. You can also choose to use bitbucket: the process is largely the same.

Create a repository and follow the steps to add files to it.

Follow the steps on the next screen to add files to your repository
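If you don’t have an app at hand yet, a minimal two-file shiny app will do as a placeholder. This is a sketch (the app contents are illustrative; the two files follow shiny’s ui.R/server.R convention, where the last expression of each file is the UI object and the server function, respectively):

```r
## ui.R -- defines the user interface
library(shiny)
fluidPage(
  titlePanel("Example app"),
  sliderInput("n", "Number of points:", min = 10, max = 100, value = 50),
  plotOutput("scatter")
)

## server.R -- defines the server logic
library(shiny)
function(input, output) {
  # Draw n random points each time the slider changes
  output$scatter <- renderPlot({
    plot(rnorm(input$n), rnorm(input$n), xlab = "x", ylab = "y")
  })
}
```

Push both files to the root of the repository; shiny server will serve the app under the name of the directory you clone it into.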

When you’ve pushed your server.R and ui.R files, it should look like this

You can now clone this repository in your server by clicking the ‘clone or download’ button (surrounded by the pink box). Copy the url and go back to the terminal where you are logged into your server. Then, execute the following commands:

# Navigate to the folder with shiny applications
cd /srv/shiny-server
# Clone the repository
git clone <repository-url>
# Install dependencies
sudo su - -c "R -e \"devtools::install_github('JasperHG90/TenK')\""

You can now visit the shiny application by navigating to www.<your-domain-name>.<extension>/apps/TenK-example in a browser.

Some notes on installing R packages

When installing R packages, it’s best to use the following code:

sudo su - -c "R -e \"install.packages('<package-name>', repos='https://cran.rstudio.com/')\""

This ensures that packages are installed globally, so that every user has access to them.

Using packrat

You can use packrat – a package management system authored by rstudio – for shiny applications. Packrat is quite useful because it creates a local repository (with, for example, specific package versions) of packages. The advantage of this approach is that it improves the portability of the application and ensures that apps that use the same dependencies but, for example, different versions of that dependency can easily run on the same system. On the flip side, it does not work very well with certain specific libraries such as rgdal. You can read more about packrat on this page.

Setting up packrat

Setting up packrat is easy. In rstudio, start a new project (or open an existing one).

Go to the packrat section and check the box

Packrat will now make a local repository with necessary packages. This can take a while.
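The same can be done from the R console instead of the checkbox. This sketch uses packrat’s standard functions, run from the project directory:

```r
library(packrat)

# Initialize packrat for the current project: creates a packrat/ directory
# with a private library and a snapshot of the packages the project uses
packrat::init()

# After adding or updating packages, record the new state of the library
packrat::snapshot()

# Check whether the private library and the snapshot are in sync
packrat::status()
```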

Don’t forget to push these changes to your github repository.

Cloning the repo and restoring the library

Getting the repository with the packrat library is very similar to the process outlined above.

First, install the packrat package:

sudo su - -c "R -e \"install.packages('packrat', repos='https://cran.rstudio.com/')\""

Navigate to the /srv/shiny-server folder and clone the repository:

# Navigate to the folder
cd /srv/shiny-server
# Clone the repo
git clone <repository-url>

Now, enter the repository and start an R session:

# Enter the folder
cd tenk-example-2
# Start an R session
R

Packrat should automatically kick in and start installing packages. If it doesn’t, execute the following:

packrat::restore()

Execute q() to quit the R session.

Future content in this series

This wraps up the first part of the shiny server series. You now have a VPS with a custom domain name and running rstudio plus shiny servers. This setup forms the basis on which the next installments of this series build.

Subsequent parts in this series will focus on the following topics

Part 2: running shiny server on multiple ports

We’ll run shiny server on multiple ports so that we have one server to host public shiny apps and another server to host private apps for a select group of people. We’ll add user authentication to this private server in part 4 of this series.

Part 3: adding an SSL security certificate

Before we can add user authentication to shiny, we want to set up an SSL certificate to buff security on our server. This encrypts traffic from and to your server.

Part 4: adding user authentication to your server using Auth0

In this part, we’ll use Auth0, a service that is free for up to 7,000 users, to add authentication to our shiny server running private applications. To this end, we’ll slightly adapt the excellent Auth0 tutorial to work with our setup.

Part 5: hosting a static website on your server

Most websites have a front end and a back end. The front end is what you see when you visit a page. The back end usually consists of one or more databases and services such as user authentication. A static website, by contrast, is just a front end: you pre-compile all necessary materials before hosting it on a server. Static websites do not require databases and typically load fast. Jekyll and Pelican are two examples of tools that help you make static websites.

You can use the GitHub approach outlined above for shiny applications for static websites as well. You create a repository, upload the static website, and then clone it in the directory from which nginx hosts its example page.

To leave a comment for the author, please follow the link and comment on their blog: Jasper Ginn's blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Web Scraping and Applied Clustering Global Happiness and Social Progress Index

Sun, 04/09/2017 - 15:02

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

An increasing amount of data is available on the web. Web scraping is a technique for extracting data from web pages automatically and transforming it into a format suitable for further analysis and insights. Applied clustering is an unsupervised learning technique that refers to a family of pattern-discovery and data-mining tools with applications in machine learning, bioinformatics, image analysis, and segmentation of consumer types, among others. R offers several packages and tools for web scraping, data manipulation, statistical analysis and machine learning. The motivation for this post is to illustrate the applications of web scraping, dimension reduction and applied clustering tools in R.
There are two separate data sets for web scraping in this post. The first data set is from the recently released World Happiness Report 2017 by the United Nations Sustainable Development Solutions Network. The 2017 report, launched on March 20, the International Day of Happiness, contains global rankings for happiness and social well-being. The second data set is the 2015 Social Progress Index of countries in the world. The Social Progress Index has been described as measuring the extent to which countries provide for the social and environmental needs of their citizens. In this exercise, the two data sets, joined by the country column, were pre-processed prior to principal component analysis (PCA) and clustering. The goal of the clustering approach in this post is to segment the rows of the over 150 countries in the data into separate groups (clusters). The expectation is that sets of countries within a cluster are as similar as possible to each other with respect to happiness and social progress, and as dissimilar as possible to the sets of countries assigned to different clusters.

Load Required Packages

require(rvest)
require(magrittr)
require(plyr)
require(dplyr)
require(reshape2)
require(ggplot2)
require(FactoMineR)
require(factoextra)
require(cluster)
require(useful)

Web Scraping and Data Pre-processing

# Import the first data set from the site
url1 <- ""
happy <- read_html(url1) %>%
  html_nodes("table") %>%
  extract2(1) %>%
  html_table()
# inspect imported data structure
str(happy)
## 'data.frame': 155 obs. of 12 variables:
##  $ Overall Rank                : int 1 2 3 4 5 6 7 8 9 10 ...
##  $ Change in rank              : int 3 -1 0 -2 0 1 -1 0 0 0 ...
##  $ Country                     : chr "Norway" "Denmark" "Iceland" "Switzerland" ...
##  $ Score                       : num 7.54 7.52 7.5 7.49 7.47 ...
##  $ Change in score             : num 0.039 -0.004 0.003 -0.015 0.056 0.038 -0.088 -0.02 -0.029 -0.007 ...
##  $ GDP per capita              : num 1.62 1.48 1.48 1.56 1.44 ...
##  $ Social support              : num 1.53 1.55 1.61 1.52 1.54 ...
##  $ Healthy life expectancy     : num 0.797 0.793 0.834 0.858 0.809 0.811 0.835 0.817 0.844 0.831 ...
##  $ Freedom to make life choices: num 0.635 0.626 0.627 0.62 0.618 0.585 0.611 0.614 0.602 0.613 ...
##  $ Generosity                  : num 0.362 0.355 0.476 0.291 0.245 0.47 0.436 0.5 0.478 0.385 ...
##  $ Trust                       : num 0.316 0.401 0.154 0.367 0.383 0.283 0.287 0.383 0.301 0.384 ...
##  $ Dystopia                    : num 2.28 2.31 2.32 2.28 2.43 ...

Pre-process the first data set for analysis:

## Exclude columns with ranks and scores, retain the other columns
happy <- happy[c(3, 6:11)]
### rename column headers
colnames(happy) <- gsub(" ", "_", colnames(happy), perl=TRUE)
names(happy)
## [1] "Country"                      "GDP_per_capita"
## [3] "Social_support"               "Healthy_life_expectancy"
## [5] "Freedom_to_make_life_choices" "Generosity"
## [7] "Trust"
### standardize names of selected countries to conform with country names in the map database
happy$Country <- as.character(mapvalues(happy$Country,
  from = c("United States", "Congo (Kinshasa)", "Congo (Brazzaville)", "Trinidad and Tobago"),
  to = c("USA", "Democratic Republic of the Congo", "Democratic Republic of the Congo", "Trinidad")))

Import the second data set from the web

### scrape social progress index data report from the site
url2 <- ""
social <- read_html(url2) %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table(fill=T)
# check imported data structure
str(social)
## 'data.frame': 163 obs. of 9 variables:
##  $ Country                  : chr "Norway" "Sweden" "Switzerland" "Iceland" ...
##  $ Rank (SPI)               : chr "1" "2" "3" "4" ...
##  $ Social Progress Index    : chr "88.36" "88.06" "87.97" "87.62" ...
##  $ Rank (BHN)               : chr "9" "8" "2" "6" ...
##  $ Basic Human Needs        : chr "94.8" "94.83" "95.66" "95" ...
##  $ Rank (FW)                : chr "1" "3" "2" "4" ...
##  $ Foundations of Well-being: chr "88.46" "86.43" "86.5" "86.11" ...
##  $ Rank (O)                 : chr "9" "5" "10" "11" ...
##  $ Opportunity              : chr "81.82" "82.93" "81.75" "81.73" ...

Pre-process the second data set for analysis:

### Again, exclude columns with ranks, keep the rest
social <- social[c(1, 5, 7, 9)]
### rename column headers
names(social) <- c("Country", "basic_human_needs", "foundations_well_being", "opportunity")
### Standardize country names to conform with country names in the map dataset
social$Country <- as.character(mapvalues(social$Country,
  from = c("United States", "Côte d'Ivoire", "Democratic Republic of Congo", "Congo", "Trinidad and Tobago"),
  to = c("USA", "Ivory Coast", "Democratic Republic of the Congo", "Democratic Republic of the Congo", "Trinidad")))
## coerce character data columns to numeric
social[, 2:4] <- sapply(social[, 2:4], as.numeric)
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion

Join the two imported data sets

### perform left join
soc.happy <- left_join(happy, social, by = c('Country' = 'Country'))
### check for missing values in the combined data set
mean(is.na(soc.happy[, 2:10]))
## [1] 0.0353857

Left joining the two data files introduced missing values (approximately 3.5% of the total data set) in the combined data set. R offers a variety of imputation algorithms to populate missing values. In this post, a median imputation strategy was applied by replacing missing values with the median for each column.

### median imputation
for(i in 1:ncol(soc.happy[, 2:10])) {
  soc.happy[, 2:10][is.na(soc.happy[, 2:10][, i]), i] <- median(soc.happy[, 2:10][, i], na.rm = TRUE)
}
### summary of the raw data
summary(soc.happy[, 2:10])
##  GDP_per_capita   Social_support  Healthy_life_expectancy
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000
##  1st Qu.:0.6600   1st Qu.:1.042   1st Qu.:0.3570
##  Median :1.0550   Median :1.252   Median :0.6050
##  Mean   :0.9779   Mean   :1.187   Mean   :0.5474
##  3rd Qu.:1.3150   3rd Qu.:1.412   3rd Qu.:0.7190
##  Max.   :1.8710   Max.   :1.611   Max.   :0.9490
##  Freedom_to_make_life_choices   Generosity       Trust
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.3010   1st Qu.:0.1530   1st Qu.:0.0570
##  Median :0.4370   Median :0.2320   Median :0.0890
##  Mean   :0.4078   Mean   :0.2461   Mean   :0.1224
##  3rd Qu.:0.5140   3rd Qu.:0.3220   3rd Qu.:0.1530
##  Max.   :0.6580   Max.   :0.8380   Max.   :0.4640
##  basic_human_needs foundations_well_being  opportunity
##  Min.   :26.81   Min.   :44.12   Min.   :21.12
##  1st Qu.:61.94   1st Qu.:63.09   1st Qu.:41.43
##  Median :75.73   Median :69.41   Median :49.55
##  Mean   :71.82   Mean   :68.29   Mean   :52.12
##  3rd Qu.:84.73   3rd Qu.:74.79   3rd Qu.:62.83
##  Max.   :96.03   Max.   :88.46   Max.   :86.58

An important procedure for meaningful clustering and dimension reduction steps involves data transformation and scaling variables. The simple function below will transform all variables to values between 0 and 1.

## transform variables to a range of 0 to 1
range_transform <- function(x) {
  (x - min(x))/(max(x) - min(x))
}
soc.happy[, 2:10] <- as.data.frame(apply(soc.happy[, 2:10], 2, range_transform))
### summary of transformed data shows success of transformation
summary(soc.happy[, 2:10])
##  GDP_per_capita   Social_support   Healthy_life_expectancy
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.3528   1st Qu.:0.6468   1st Qu.:0.3762
##  Median :0.5639   Median :0.7772   Median :0.6375
##  Mean   :0.5227   Mean   :0.7367   Mean   :0.5768
##  3rd Qu.:0.7028   3rd Qu.:0.8765   3rd Qu.:0.7576
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
##  Freedom_to_make_life_choices   Generosity       Trust
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.4574   1st Qu.:0.1826   1st Qu.:0.1228
##  Median :0.6641   Median :0.2768   Median :0.1918
##  Mean   :0.6198   Mean   :0.2936   Mean   :0.2638
##  3rd Qu.:0.7812   3rd Qu.:0.3842   3rd Qu.:0.3297
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
##  basic_human_needs foundations_well_being  opportunity
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.5075   1st Qu.:0.4278   1st Qu.:0.3103
##  Median :0.7067   Median :0.5704   Median :0.4343
##  Mean   :0.6503   Mean   :0.5451   Mean   :0.4735
##  3rd Qu.:0.8368   3rd Qu.:0.6917   3rd Qu.:0.6372
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

Although the previous data transformation step could have been adequate for next steps of this analysis, the function shown below would re-scale all variables to a mean of 0 and standard deviation of 1.
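Written out as formulas, the two scalings are the min-max transform used above and the z-score standardization applied below:

```latex
x' = \frac{x - \min(x)}{\max(x) - \min(x)} \in [0, 1],
\qquad
z = \frac{x - \bar{x}}{s_x}
```

where \(\bar{x}\) and \(s_x\) are the sample mean and standard deviation of a column. The z-score version is what makes the PCA below operate on the correlation structure rather than on raw scales.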

## Scaling variables to a mean of 0 and standard deviation of 1
sd_scale <- function(x) {
  (x - mean(x))/sd(x)
}
soc.happy[, 2:10] <- as.data.frame(apply(soc.happy[, 2:10], 2, sd_scale))
### summary of the scaled data
summary(soc.happy[, 2:10])
##  GDP_per_capita    Social_support    Healthy_life_expectancy
##  Min.   :-2.3045   Min.   :-4.1377   Min.   :-2.2984
##  1st Qu.:-0.7492   1st Qu.:-0.5050   1st Qu.:-0.7994
##  Median : 0.1817   Median : 0.2271   Median : 0.2419
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000
##  3rd Qu.: 0.7944   3rd Qu.: 0.7849   3rd Qu.: 0.7206
##  Max.   : 2.1046   Max.   : 1.4787   Max.   : 1.6863
##  Freedom_to_make_life_choices   Generosity        Trust
##  Min.   :-2.7245   Min.   :-1.8320   Min.   :-1.2102
##  1st Qu.:-0.7137   1st Qu.:-0.6929   1st Qu.:-0.6467
##  Median : 0.1949   Median :-0.1048   Median :-0.3304
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000
##  3rd Qu.: 0.7093   3rd Qu.: 0.5652   3rd Qu.: 0.3023
##  Max.   : 1.6713   Max.   : 4.4067   Max.   : 3.3767
##  basic_human_needs foundations_well_being  opportunity
##  Min.   :-2.6059   Min.   :-2.5434   Min.   :-1.9025
##  1st Qu.:-0.5721   1st Qu.:-0.5471   1st Qu.:-0.6559
##  Median : 0.2263   Median : 0.1180   Median :-0.1575
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000
##  3rd Qu.: 0.7474   3rd Qu.: 0.6842   3rd Qu.: 0.6576
##  Max.   : 1.4016   Max.   : 2.1228   Max.   : 2.1154

Simple correlation analysis

Now, the data is ready to explore variable relationships with each other.

corr <- cor(soc.happy[, 2:10], method="pearson")
ggplot(melt(corr, varnames=c("x", "y"), value.name="correlation"),
       aes(x=x, y=y)) +
  geom_tile(aes(fill=correlation)) +
  scale_fill_gradient2(low="green", mid="yellow", high="red",
                       guide=guide_colorbar(ticks=FALSE, barheight = 5),
                       limits=c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title="Heatmap of Correlation Matrix", x=NULL, y=NULL)

Principal Component Analysis

Principal component analysis is a multivariate statistical technique widely used for examining relationships among several quantitative variables. PCA identifies patterns in the variables in order to reduce the dimensionality of the data set, as a precursor to multiple regression and clustering, among others.

soc.pca <- PCA(soc.happy[, 2:10], graph=FALSE)
fviz_screeplot(soc.pca, addlabels = TRUE, ylim = c(0, 65))

Here is the plot:

The percentages on each bar indicate the proportion of total variance explained by the respective principal component. As seen in the scree plot, the first three principal components accounted for ~80% of the total variance.

soc.pca$eig
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1  5.0714898              56.349887                          56.34989
## comp 2  1.4357885              15.953205                          72.30309
## comp 3  0.6786121               7.540134                          79.84323
## comp 4  0.6022997               6.692219                          86.53544
## comp 5  0.4007136               4.452373                          90.98782
## comp 6  0.3362642               3.736269                          94.72409
## comp 7  0.2011131               2.234590                          96.95868
## comp 8  0.1471443               1.634937                          98.59361
## comp 9  0.1265747               1.406386                         100.00000

The output suggests that the first two principal components are worthy of consideration. A practical guide to determining the optimum number of principal component axes could be to consider only those components with eigen values greater than or equal to 1.
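The scree-plot percentages follow directly from these eigenvalues: the nine standardized variables each have unit variance, so the eigenvalues sum to 9 and the proportion of variance explained by component \(i\) is

```latex
\frac{\lambda_i}{\sum_{j=1}^{9}\lambda_j} = \frac{\lambda_i}{9},
\qquad \text{e.g. } \frac{5.0714898}{9} \approx 0.5635 = 56.35\% \text{ for the first component.}
```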

# Contributions of variables to principal components
soc.pca$var$contrib[, 1:3]
##                                   Dim.1        Dim.2        Dim.3
## GDP_per_capita               15.7263477 2.455323e+00 3.162470e-02
## Social_support               12.0654754 5.445993e-01 1.345610e+00
## Healthy_life_expectancy      15.1886385 2.259270e+00 3.317633e+00
## Freedom_to_make_life_choices  6.6999181 2.207049e+01 8.064596e+00
## Generosity                    0.3270114 4.189437e+01 5.406678e+01
## Trust                         4.3688692 2.570658e+01 3.211058e+01
## basic_human_needs            14.5402807 3.836956e+00 1.021076e+00
## foundations_well_being       15.1664220 1.232353e+00 4.169125e-02
## opportunity                  15.9170370 6.170376e-05 4.113192e-04

The first principal component explained approximately 57% of the total variation and appears to represent opportunity, GDP per capita, healthy life expectancy, Foundations of Well-being, and basic human needs.
The second principal component explained a further 16% of the total variation and represented opportunity, social support, and generosity.

Applied Clustering

The syntax for clustering algorithms requires specifying the number of desired clusters (k) as an input. The practical issue is: what value should k take? In the absence of subject-matter knowledge, R offers various empirical approaches for selecting a value of k. One such R tool for suggesting the best number of clusters is the NbClust package.

require(NbClust)
nbc <- NbClust(soc.happy[, 2:10], distance="manhattan",
               min.nc=2, max.nc=30, method="ward.D", index='all')
## Warning in pf(beale, pp, df2): NaNs produced
## *** : The Hubert index is a graphical method of determining the number of clusters.
##       In the plot of Hubert index, we seek a significant knee that corresponds to a
##       significant increase of the value of the measure i.e the significant peak in Hubert
##       index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
##       In the plot of D index, we seek a significant knee (the significant peak in Dindex
##       second differences plot) that corresponds to a significant increase of the value of
##       the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 3 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 16 as the best number of clusters
## * 1 proposed 28 as the best number of clusters
## * 2 proposed 29 as the best number of clusters
## * 1 proposed 30 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
## *******************************************************************

The NbClust algorithm suggested a 3-cluster solution to the combined data set, so we will apply k=3 in the next steps.

set.seed(4653)
pamK3 <- pam(soc.happy[, 2:10], diss=FALSE, 3)
# Number of countries assigned in the three clusters
fviz_silhouette(pamK3)
##   cluster size ave.sil.width
## 1       1   28          0.40
## 2       2   82          0.29
## 3       3   47          0.27

The number of countries assigned in each cluster is displayed in the table above.
An interesting aspect of the K-medoids algorithm is that it finds observations from the data that are typical of each cluster. So, the following code will ask for a list of “the typical countries” as selected by the algorithm to be closest to the center of the cluster.

## which countries were typical of each cluster
soc.happy$Country[pamK3$id.med]
## [1] "Germany"      "Macedonia"    "Burkina Faso"

Putting it together

We can now display individual countries and superimpose their cluster assignments on the plane defined by the first two principal components.

soc.happy['cluster'] <- as.factor(pamK3$clustering)
fviz_pca_ind(soc.pca,
             label="none",
             habillage = soc.happy$cluster, # color by cluster
             palette = c("#00AFBB", "#E7B800", "#FC4E07", "#7CAE00", "#C77CFF", "#00BFC4"),
             addEllipses=TRUE)

Finally, displaying clusters of countries on a world map:

map.world <- map_data("world")
# LEFT JOIN
map.world_joined <- left_join(map.world, soc.happy, by = c('region' = 'Country'))
ggplot() +
  geom_polygon(data = map.world_joined,
               aes(x = long, y = lat, group = group, fill=cluster, color=cluster)) +
  labs(title = "Applied Clustering World Happiness and Social Progress Index",
       subtitle = "Based on data from the World Happiness Report and the Social Progress Index\n",
       x=NULL, y=NULL) +
  coord_equal() +
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks=element_blank(),
        panel.background=element_blank())

The map of clusters:

Concluding Remarks

Cluster analysis and dimension reduction algorithms such as PCA are widely used approaches for splitting data sets into distinct subsets of groups and for reducing the dimensionality of the variables, respectively. Together, these analytics approaches: 1) make the structure of the data easier to understand, and 2) make subsequent data mining tasks easier to perform. Hopefully, this post has illustrated the applications of web scraping, pre-processing data for analysis, PCA, and cluster analysis using various tools available in R. Please let us know your comments and suggestions. Feel free to network with the author on LinkedIn.


    To leave a comment for the author, please follow the link and comment on their blog: DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

    A Python-Like walk() Function for R

    Sat, 04/08/2017 - 22:07

    (This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

    A really nice function available in Python is walk(), which recursively descends a directory tree, calling a user-supplied function in each directory within the tree. It might be used, say, to count the number of files, or maybe to remove all small files and so on. I had students in my undergraduate class write such a function in R for homework, and thought it may be interesting to present it here.

    Among other things, readers who are not familiar with recursive function calls will learn about those here. I must add that all readers, even those with such background, will find this post to be rather subtle and requiring extra patience, but I believe you will find it worthwhile.

    Let’s start with an example, in which we count the number of files with a given name:

    countinst <- function(startdir, flname) {
       walk(startdir, checkname, list(flname = flname, tot = 0))
    }

    checkname <- function(drname, filelist, arg) {
       if (arg$flname %in% filelist) arg$tot <- arg$tot + 1
       arg
    }

    Say we try all this in a directory containing a subdirectory mydir that consists of a file x and two subdirectories, d1 and d2. The latter in turn contains another file named x. We then make the call countinst('mydir','x'). As can be seen above, that call will in turn make the call

    walk('mydir', checkname, list(flname = 'x', tot = 0))

    The walk() function, which I will present below, will start in mydir, and will call the user-specified function checkname() at every directory it encounters in the tree rooted at mydir, in this case mydir, d1 and d2.

    At each such directory, walk() will pass to the user-specified function, in this case checkname(), first the current directory name, then a character vector reporting the names of all files (including directories) in that directory.  In mydir, for instance, this vector will be c(‘d1′,’d2′,’x’).  In addition, walk() will pass to checkname() an argument, formally named arg above, which will serve as a vehicle for accumulating running totals, in this case the total number of files of the given name.

    So, let’s look at walk() itself:

    walk <- function(currdir, f, arg) {
       # "leave trail of bread crumbs"
       savetop <- getwd()
       setwd(currdir)
       fls <- list.files()
       arg <- f(currdir, fls, arg)
       # subdirectories of this directory
       dirs <- list.dirs(recursive = FALSE)
       for (d in dirs) arg <- walk(d, f, arg)
       setwd(savetop)  # go back to calling directory
       arg
    }

    The comments are hopefully self-explanatory, but the key point is that within walk(), there is another call to walk()! This is recursion.

    Note how arg accumulates, as needed in this application. If on the other hand we wanted, say, simply to remove all files named ‘x’, we would just put in a dummy variable for arg. And though we didn’t need drname here, in some applications it would be useful.

    For compute-intensive tasks, recursion is not very efficient in R, but it can be quite handy in certain settings.
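    Incidentally, for this particular counting task base R can also enumerate the whole tree in a single call with list.files(recursive = TRUE), sidestepping recursion entirely. This sketch is mine, not from the original post:

    ```r
    # Non-recursive count of files named flname anywhere under startdir
    countinst2 <- function(startdir, flname) {
       fls <- list.files(startdir, recursive = TRUE)  # relative paths of all files in the tree
       sum(basename(fls) == flname)                   # compare just the file names
    }
    ```

    Of course, the point of the exercise is the recursion itself; walk() generalizes to tasks (such as deleting files as you go) where a flat listing is less natural.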

    If you would like to conveniently try the above example, here is some test code:

    test <- function() {
       unlink('mydir', recursive = TRUE)
       dir.create('mydir')
       file.create('mydir/x')
       dir.create('mydir/d1')
       dir.create('mydir/d2')
       file.create('mydir/d2/x')
       print(countinst('mydir', 'x'))
    }


    To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.

    #4: Simpler shoulders()

    Sat, 04/08/2017 - 21:09

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    Welcome to the fourth post in the repulsively random R ramblings series, or R4 for short.

    My twitter feed was buzzing about a nice (and as yet unpublished, ie not-on-CRAN) package by Dirk Schumacher which compiles a list of packages (ordered by maintainer count) for your current session (or installation or …) with a view towards saying thank you to those whose packages we rely upon. Very nice indeed.

    I had a quick look and ran it twice … and had a reaction of ewwww, really?, as running it twice gave different results: on the second run a boatload of tibblyverse packages appeared. Because apparently kids these days can only slice data that has been tidied or something.

    So I had another quick look … and put together an alternative version using just base R (as there was only one subfunction that needed reworking):

    source(file="")
    format_pkg_df <- function(df) {    # non-tibblyverse variant
        tb <- table(df[,2])
        od <- order(tb, decreasing=TRUE)
        ndf <- data.frame(maint=names(tb)[od], npkgs=as.integer(tb[od]))
        colpkgs <- function(m, df) {
            paste(df[df$maintainer == m, "pkg_name"], collapse=",")
        }
        ndf[, "pkg"] <- sapply(ndf$maint, colpkgs, df)
        ndf
    }

    Running this in the ESS session I had open gives:

    R> shoulders()   ## by Dirk Schumacher, with small modifications
                      maint npkgs                                                                 pkg
    1       R Core Team <>     9 compiler,graphics,tools,utils,grDevices,stats,datasets,methods,base
    2 Dirk Eddelbuettel <>     4                                      RcppTOML,Rcpp,RApiDatetime,anytime
    3        Matt Dowle <>     1                                                          data.table
    R>

    and for good measure a screenshot is below:

    I think we need a catchy moniker for R work using good old base R. SoberVerse? GrumbyOldFolksR? PlainOldR? Better suggestions welcome.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

    Correlation and Correlogram Exercises

    Sat, 04/08/2017 - 18:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    Correlation analysis is one of the most popular techniques for data exploration. This set of exercises is intended to help you extend, speed up, and validate your correlation analysis. It allows you to practice:
    – calculating linear and nonlinear correlation coefficients,
    – testing those coefficients for statistical significance,
    – creating correlation matrices to study interdependence between variables in dataframes,
    – drawing graphical representations of those matrices (correlograms),
    – calculating coefficients for partial correlation between two variables (controlling for their correlation with other variables).
    The exercises make use of functions from the packages Hmisc, corrgram, and ggm. Please install these packages, but do not load them before starting the exercises in which they are needed, to avoid a namespace conflict (the ggm package contains a function called rcorr which masks the rcorr function from the Hmisc package, and vice versa; if you want to return to the rcorr function from Hmisc after loading the ggm package, run detach(package:ggm)).
    Exercises are based on a reduced version of the auto dataset from the corrgram package (download here). The dataset contains characteristics of 1979 automobile models.
    Answers to the exercises are available here
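    As a quick refresher before diving in, the base functions these exercises build on can be sketched with toy data (the vectors below are invented for illustration and are not taken from the auto dataset):

    ```r
    # Toy data: y is a noisy, nonlinear (but monotone) function of x
    set.seed(42)
    x <- 1:30
    y <- x^2 + rnorm(30, sd = 50)

    # Pearson (linear) correlation and its significance test
    cor(x, y)              # single coefficient
    ct <- cor.test(x, y)   # adds a t-test, p-value, and confidence interval
    ct$p.value < 0.05      # TRUE here: significant at the 5% level

    # Spearman's rank correlation relaxes the linearity assumption
    cor(x, y, method = "spearman")

    # cor() also accepts a whole matrix/dataframe of numeric columns
    cor(cbind(x, y))       # 2 x 2 correlation matrix
    ```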

    Exercise 1
    Calculate simple (linear) correlation between car price and its fuel economy (measured in miles per gallon, or mpg).

    Exercise 2
    Use the cor.test function to check whether the obtained coefficient is statistically significant at 5% level.

    Exercise 3
    Simple correlation assumes a linear relationship between variables, but it may be useful to relax this assumption. Calculate Spearman’s correlation coefficient for the same variables, and find its statistical significance.

    Exercise 4
    In R, it is possible to calculate correlation for all pairs of numeric variables in a dataframe at once. However, this requires excluding non-numeric variables first.
    Create a new dataframe, auto_num, that contains only columns with numeric values from the auto dataframe. You can do this using the Filter function.
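    Filter's behaviour is easy to see on a toy dataframe (invented here for illustration, not the auto data):

    ```r
    # Filter() keeps the columns for which the predicate returns TRUE
    df <- data.frame(model = c("a", "b", "c"),
                     price = c(4099, 4749, 3799),
                     mpg   = c(22, 17, 22),
                     stringsAsFactors = FALSE)
    df_num <- Filter(is.numeric, df)
    names(df_num)  # "price" "mpg" - the character column is dropped
    ```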

    Exercise 5
    Use the cor function to create a matrix of correlation coefficients for variables in the auto_num dataframe.

    Exercise 6
    The standard cor.test function does not work with dataframes. However, statistical significance of correlation coefficients for a dataframe can be verified using the rcorr function from the Hmisc package.
    Transform the auto_num dataframe into a matrix (auto_mat), and use it to check significance of the correlation coefficients with the rcorr function.

    Exercise 7
    Use the corrgram function from the corrgram package to create a default correlogram to visualize correlations between variables in the auto dataframe.

    Exercise 8
    Create another correlogram that (1) includes only the lower panel, (2) uses pie diagrams to represent correlation coefficients, and (3) orders the variables using the default order.

    Exercise 9
    Create a new dataframe, auto_subset, by subsetting the auto dataframe to include only the Price, MPG, Hroom, and Rseat variables. Use the new dataframe to create a correlogram that (1) shows correlation coefficients on the lower panel, and (2) shows scatter plots (points) on the upper panel.

    Exercise 10
    Use the the correlations function from the ggm package to create a correlation matrix with both full and partial correlation coefficients for the auto_subset dataframe. Find the partial correlation between car price and its fuel economy.

    Related exercise sets:
    1. Accessing Dataframe Objects Exercises
    2. Cross Tabulation with Xtabs exercises
    3. Merging Dataframes Exercises
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory

    To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

    Exploring propensity score matching and weighting

    Sat, 04/08/2017 - 14:00

    (This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

    This post jots down some playing around with the pros, cons and limits of propensity score matching or weighting for causal social science research.

    Intro to propensity score matching

    One is often faced with an analytical question about causality and effect sizes when the only data around is from a quasi-experiment, not the random controlled trial one would hope for. This is common in many fields, but some of the most important occurrences are in public policy. For example, government programs to help individuals or firms are typically not allocated at random, but go to those with higher need, or higher potential to make something out of the assistance. This makes isolating the effect of the treatment in the data difficult, to say the least.

    A frequently-used family of analytical methods to deal with this is grouped under propensity score matching (although not all of these methods literally “match”). Such methods model the probability of each unit (eg individual or firm) receiving the treatment, and then use these predicted probabilities or “propensities” to somehow balance the sample to make up for the confounding of the treatment with the other variables of interest.

    The simplest to explain member of this family involves creating a pseudo control group from the non-treatment individuals that resembles the treatment group in that they have similar propensities to get the treatment, but differs in that they just didn’t get it. Then analysis can proceed “as though” the data had been generated by an experiment. This approach is easy to explain to non-statisticians and has the great benefit of creating a tangible, concrete “control” group.

    But everything depends on the model of the probabilities of getting the treatment. If it’s a good model, you’re fine. If not – and in particular, if the chance of getting the treatment is related to unobserved variables which also impact on your response variable of interest – the propensity score approach helps very little if at all. A good text on all this (and much more) is Morgan and Winship’s Counterfactuals and Causal Inference: Methods and Principles for Social Research.

    Job training example Basics

    A pretty thorough implementation of various propensity score methods in R comes in Daniel E. Ho, Kosuke Imai, Gary King, Elizabeth A. Stuart (2011). MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. Journal of Statistical
    Software, Vol. 42, No. 8, pp. 1-28
    . There’s a good overview in the MatchIt vignette. I’ll get started with data from one of their examples, which shows a typical application of this technique:

    “Our example data set is a subset of the job training program analyzed in Lalonde (1986)
    and Dehejia and Wahba (1999). MatchIt includes a subsample of the original data consisting
    of the National Supported Work Demonstration (NSW) treated group and the comparison
    sample from the Population Survey of Income Dynamics (PSID).1 The variables in
    this data set include participation in the job training program (treat, which is equal to 1
    if participated in the program, and 0 otherwise), age (age), years of education (educ), race
    (black which is equal to 1 if black, and 0 otherwise; hispan which is equal to 1 if hispanic,
    and 0 otherwise), marital status (married, which is equal to 1 if married, 0 otherwise), high
    school degree (nodegree, which is equal to 1 if no degree, 0 otherwise), 1974 real earnings
    (re74), 1975 real earnings (re75), and the main outcome variable, 1978 real earnings (re78).”

    First, loading up the functionality I need for the rest of this post

    library(tidyverse)
    library(forcats)
    library(scales)
    library(MASS)        # for rlm
    library(boot)        # for inv.logit and bootstrapping
    library(MatchIt)
    library(doParallel)  # for parallel processing
    library(mgcv)        # for gam

    #==============example from the MatchIt vignette=================
    data(lalonde)

    # naive comparison - people who got the job training program
    # have lower incomes in 1978 - because the training was given
    # to people with income problems. So comparison is unfair:
    lalonde %>%
       group_by(treat) %>%
       summarise(Income1978 = mean(re78),
                 n = n())

      treat Income1978     n
    1     0   6984.170   429
    2     1   6349.144   185

    So we see that the 429 people who didn’t get the job training “treatment” had an average income about $635 more than the 185 beneficiaries. Program failure!

    Obviously that’s unfair on the program, so we use matchit to create an artificial control group that resembles the treatment group in terms of age, education, ethnicity, marital status, and income in 1974 and 1975:

    # Choose one of the large variety of propensity score matching methods to model propensity
    match_model <- matchit(treat ~ age + educ + black + hispan + nodegree + married + re74 + re75,
                           data = lalonde, method = "nearest")
    match_data <-

    # Simple comparison is now much fairer
    match_data %>%
       group_by(treat) %>%
       summarise(Income1978 = mean(re78),
                 n = n())

      treat Income1978     n
    1     0   5440.941   185
    2     1   6349.144   185

    That gives us an average treatment effect of $908. Notice that the new control group is the same size as the treatment group; the rest of the observations have been discarded.

    You don’t need to limit yourself to simple comparisons, although in principle they should work. A regression with the matched control and treatment data, even using the same explanatory variables as were used in the matching model, helps address the inevitable lack of complete balance between the two groups. That gives us a result of $960 (output not shown). I use the robust M-estimator rather than ordinary least squares to deal with breaches of the classical regression assumptions (particularly the non-Normal response, with variance increasing as its mean increases).

    # regression model estimate with matched data
    coef(rlm(re78 ~ age + treat + educ + black + hispan + nodegree + married + re74 + re75,
             data = match_data))["treat"]

    Regression is simpler and gives similar results

    However, a similar estimate could have come from a simpler, single-step regression with the original data, skipping the propensity score modelling altogether (there are arguments pro and con). The regression below estimates a treatment effect of $1,183. I call this “similar” because the uncertainty around all these estimates is huge, which I’ll demonstrate further down the post.

    # regression model estimate with original data
    coef(rlm(re78 ~ age + treat + educ + black + hispan + nodegree + married + re74 + re75,
             data = lalonde))["treat"]

    Weighting might be better than matching

    Creating a control group by matching has the distressing side-effect of throwing away large amounts of the data, because the control group is shrunk down to the same size as the treatment group. A possibly better use of the propensity scores is to keep all observations in play but weight them according to the propensity score – one of the methods described by Peter Austin in this article on “An introduction to propensity score methods for reducing the effects of confounding in observational studies”.

    Under “inverse probability of treatment weighting”, proposed by Imbens in 2000, observations that receive the treatment are given a weight of 1/p and those that did not receive the treatment are given a weight of 1/(1-p), where p is the probability of getting the treatment. That is, each observation is weighted by the inverse of the probability of the treatment it actually got. Intuitively, treatment cases that resemble the controls are interesting and given more weight, and control cases that look like they should have got the treatment are interesting and get more weight. Analysis can then proceed via weighted averages, or a regression with explanatory variables (which may or may not be the same variables as those used in the model of propensity for treatment).
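    As a tiny numeric illustration of these weights (the propensity scores below are invented, not taken from the Lalonde model):

    ```r
    # Hypothetical propensity scores and treatment indicators for five units
    p     <- c(0.8, 0.3, 0.6, 0.1, 0.5)
    treat <- c(1,   0,   1,   0,   1)

    # Treated units get weight 1/p; controls get weight 1/(1-p)
    w <- 1 / p * treat + 1 / (1 - p) * (1 - treat)
    round(w, 2)  # 1.25 1.43 1.67 1.11 2.00
    ```

    Note that had the fourth unit (p = 0.1) been treated rather than a control, its weight would have been 1/0.1 = 10, which is exactly the kind of unduly influential observation discussed next.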

    I implemented this with Lalonde’s data without using the MatchIt package, partly to convince myself I understood how it worked, and partly because I wanted more flexibility in modelling propensity than is supported by that package. I’d looked at the generalized linear model with binomial family response that was used for its propensity score matching, and noticed that age was showing up as unhelpful in determining treatment. My expectation is that age is probably related in a non-linear fashion to the probability of getting job training treatment, so I used a generalized additive model to allow for this…

    mod <- gam(treat ~ s(age) + educ + black + hispan + nodegree + married + re74 + re75,
               data = lalonde, family = "binomial")
    plot(mod, shade = TRUE, main = "Non-linear function of age in treatment propensity")

    …which turns out to be the case:

    Converting predicted probabilities into weights is straightforward. In the code below I have a quick look at the resulting density of weights.

    lalonde <- lalonde %>%
       mutate(propensity_gam = predict(mod, type = "response"),
              weight_gam = 1 / propensity_gam * treat + 1 / (1 - propensity_gam) * (1 - treat))

    ggplot(lalonde, aes(x = weight_gam, fill = as.factor(treat))) +
       geom_density(alpha = 0.5, colour = "grey50") +
       geom_rug() +
       scale_x_log10(breaks = c(1, 5, 10, 20, 40)) +
       ggtitle("Distribution of inverse probability weights")

    One of the criticisms of this inverse probability of treatment weighting approach is that individual observations can get very high weights and become unduly influential. For example, in an incomplete article by Posner and Ash (there may be a complete version somewhere else):

    “While this method can be shown to have nice mathematical properties, it does not work well in practice. Consider a lone treated observation that happens to have a very low probability of being treated…. The value of the inverse of the propensity score will be extremely high, asymptotically infinity. The effect size obtained will be dominated by this single value, and any fluctuations in it will produce wildly varied results, which is an undesirable property.”

    Ouch. Posner and Ash go on to suggest alternative ways of weighting less vulnerable to these problems. As a simple starter, in the below I try the obvious first step of truncating the weights at 10. Here are the average incomes of the treatment and non-treatment groups using the full set of inverse probability weights, and another set truncated at 10.

    lalonde %>%
       group_by(treat) %>%
       summarise(Income_allweights = round(weighted.mean(re78, weight_gam)),
                 Income_truncweights = round(weighted.mean(re78, pmin(10, weight_gam))),
                 n = n())

      treat Income_allweights Income_truncweights     n
    1     0              6446                6435   429
    2     1              6814                7167   185

    Note that we now have all 429 of the non-treatment cases, a definite advantage over the matching methods.

    We’re not limited to simple weighted averages of course; we can use these weights in the analysis of our choice including our robust regression model. Ho et al (2007) suggest that (in Morgan and Winship’s paraphrase):

    “the general procedure one should carry out in any multivariate analysis that aspires to generate causal inferences is to first balance one’s data as much as possible with a matching routine and then estimate a regression model on the matched data. From this perspective, matching is a preprocessor, which can be used to prepare the data for subsequent analysis with something such as a regression model.”

    The “doubly robust” approach to regression modelling to identify causal effects uses weights from propensity score matching as well as a regression.

    “There has been considerable debate as to which approach to confounder control is to be preferred, as the first [ie single step regression] is biased if the outcome regression model is misspecified while the second approach [ie propensity score matching] is biased if the treatment regression, ie propensity, model is misspecified. This controversy could be resolved if an estimator were available that was guaranteed to be consistent … whenever at least one of the two models was correct. … We refer to such combined methods as doubly-robust or doubly-protected as they can protect against misspecification of either the outcome or treatment model, although not against simultaneous misspecification of both.”

    Robins and Rotnitzky 2001, quoted in Morgan and Winship Counterfactuals and Causal Analysis

    So here is my robust M estimator regression, using inverse propensity of treatment as weights, where propensity of treatment was modelled earlier as a generalized additive model on all explanatory variables and non-linearly with age:

    coef(rlm(re78 ~ age + treat + educ + black + hispan + nodegree + married + re74 + re75,
             data = lalonde, weights = as.vector(weight_gam)))["treat"]

    This gives a treatment estimate of $910 and I think it’s my best point estimate so far.

    These weighting methods seem better than simply matching, if only because they retain all the data, but crucially they are still vulnerable to model mis-specification and unobserved explanatory variables. There’s a good critical discussion in this article by Freedman and Berk:

    “Regressions can be weighted by propensity scores in order to reduce bias. However,
    weighting is likely to increase random error in the estimates, and to bias the
    estimated standard errors downward, even when selection mechanisms are well understood.
    Moreover, in some cases, weighting will increase the bias in estimated
    causal parameters. If investigators have a good causal model, it seems better just to
    fit the model without weights. If the causal model is improperly specified, there can
    be significant problems in retrieving the situation by weighting, although weighting
    may help under some circumstances.”

    And with respect to the “doubly robust” approach:

    “you need to get at least one of the two models (and preferably both) nearly right in order for weighting to help much. If both models are wrong, weighting could easily be a dead end. There are papers suggesting that under some circumstances, estimating a shaky causal model and a shaky selection model should be doubly robust. Our results indicate that under other circumstances, the technique is doubly frail.”

    When I see multi-stage approaches like propensity score matching or weighting – just like structural equation models, and two or three stage least squares – that aim to deal with causality by adding complexity, I always get very nervous; the more so when I read criticisms like those above. I’ve done some simulations exploring this issue but a write-up will have to wait for a separate post.

    Bootstrap or cross-validation must envelope the propensity modelling

    The literature has a range of (conflicting) views on estimating uncertainty of statistics estimated after propensity score matching or weighting. Morgan and Winship report in a footnote that the bootstrap does not work particularly well for matching in particular, because the resampling process leaves fewer distinct cases to match to during the propensity modelling stage. However, Austin and Small reported in 2014 tests of bootstrap methods that seem quite effective. Definitely, one of the appeals of weighting (rather than matching) is that it should make the overall process more suitable for sitting inside a bootstrap. However, I don’t have time for a really thorough review of this literature, so let’s just note that estimating uncertainty after propensity score matching is a moving area of uncertainty itself.

    I use the method described by Austin and Small as the “complex bootstrap”, which involves resampling from the original data and performing the propensity modelling and matching for each resample. The propensity modelling is a big source of our uncertainty in the final estimates of interest. So that source of uncertainty needs to be repeated each time we mimic the sampling process with our bootstrap (the same applies to other pre-processing steps, like imputation). Austin and Small report that this results in a small overestimate of sampling variability (standard error higher by 7% than it should be; compared to 4% for a simpler bootstrap approach) but my principle, following the approach in Harrell’s Regression Modeling Strategies and elsewhere, is to always include pre-processing inside the bootstrap. I suspect that with less ideal data than Austin and Small use in their simulations (on my reading, they assume all relevant variables are available, amongst other ideals) this will pay off. Anyway, the results reported below look plausible.

    Here’s code that bootstraps four of the methods of estimating treatment effect above:

    • simple comparison of matched data;
    • robust regression with matched data;
    • robust regression with data weighted by inverse propensity of treatment;
    • robust regression with the original data.
    #=================bootstrap for confidence intervals=================
    # Set up a cluster for parallel computing
    cluster <- makeCluster(7) # only any good if you have at least 7 processors :)
    registerDoParallel(cluster)

    clusterEvalQ(cluster, {
       library(MatchIt)
       data(lalonde)
    })

    #' Function to estimate treatment effect by four different methods
    #' @return a vector of four estimates of treatment effect. See comments
    #' in function for description of which is which
    my_function <- function(x, i){
       resampled_data <- x[i, ]
       match_data <- matchit(treat ~ age + educ + black + hispan + nodegree +
                                            married + re74 + re75,
                                         data = resampled_data, method = "nearest"))

       # simple mean of matched groups
       est1 <- with(match_data, mean(re78[treat == 1]) - mean(re78[treat == 0]))

       # regression model estimate with matched groups
       est2 <- coef(rlm(re78 ~ treat + age + educ + black + hispan + nodegree + married + re74 + re75,
                        data = match_data))["treat"]

       # regression model with IPTW
       mod <- gam(treat ~ s(age) + educ + black + hispan + nodegree + married + re74 + re75,
                  data = resampled_data, family = "binomial")
       resampled_data <- resampled_data %>%
          mutate(propensity_gam = predict(mod, type = "response"),
                 weight_gam = 1 / propensity_gam * treat + 1 / (1 - propensity_gam) * (1 - treat))
       est3 <- coef(rlm(re78 ~ age + treat + educ + black + hispan + nodegree + married + re74 + re75,
                        data = resampled_data, weights = as.vector(weight_gam)))["treat"]

       # regression model estimate with original data
       est4 <- coef(rlm(re78 ~ treat + age + educ + black + hispan + nodegree + married + re74 + re75,
                        data = resampled_data))["treat"]

       return(c(est1, est2, est3, est4))
    }

    # test gives same results as when done earlier in the post:
    my_function(lalonde, 1:nrow(lalonde))

    booted <- boot(lalonde, statistic = my_function, R = 5000,
                   parallel = "snow", cl = cluster)

    booted_df <-$t)
    names(booted_df) <- c("Simple difference with matched data",
                          "Regression with matched data",
                          "Regression with weighted data",
                          "Regression with original data")

    booted_tidy <- booted_df %>%
       gather("Method", "Estimate") %>%
       mutate(Method = fct_reorder(Method, Estimate))

    booted_summary <- booted_tidy %>%
       group_by(Method) %>%
       summarise(lower = quantile(Estimate, 0.025),
                 upper = quantile(Estimate, 0.975)) %>%
       gather(type, Estimate, -Method)

    ggplot(booted_tidy, aes(x = Estimate, fill = Method)) +
       geom_density(alpha = 0.4, colour = "grey50") +
       geom_segment(data = booted_summary, aes(xend = Estimate), y = 0, yend = 2e-04) +
       geom_text(data = booted_summary, aes(label = round(Estimate)), y = 2.5e-04, size = 3) +
       facet_wrap(~Method) +
       theme(legend.position = "none") +
       scale_x_continuous(label = dollar) +
       ggtitle("Treatment effects from four different modelling methods",
               "Distribution and 95% confidence intervals of estimated impact on income in 1978 of a job training program") +
       labs(x = "Estimated treatment effect",
            caption = "Lalonde's 1986 data on a job training program",
            fill = "")

    > # biases revealed by the bootstrap:
    > booted

    Bootstrap Statistics :
         original       bias    std. error
    t1*  908.2022  -268.85736    766.5107
    t2*  959.5030   -75.55153    655.4119
    t3*  909.9001    12.09643    838.2337
    t4* 1182.5056    33.35265    682.6993

    Those are big confidence intervals – reflecting the small sample size and the difficulty of picking up an impact of an intervention amongst all the complexities of income determinants. I’m satisfied that those are ok depictions of the uncertainty.

    All three of the two stage methods that model propensity first give extremely similar results; and the simpler one-stage regression on the original unweighted data gives a result within a third of a standard error. In practice which would I do? I’d probably do it both ways and if there are wildly different results I’d worry, treating that as a diagnostic warning.

    Basically, with this sort of data the big gain comes from any modelling strategy that accepts treatment as endogenous and hence controls for other variables that are related to both it and the response. Once you've adopted a method that does that, the benefit of one particular method over another is, to me, not particularly material.

    To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    R code to accompany Real-World Machine Learning (Chapter 5): Event Modeling

    Sat, 04/08/2017 - 13:00

    The rwml-R Github repo is updated with R code for the event modeling examples from Chapter 5 of the book “Real-World Machine Learning” by Henrik Brink, Joseph W. Richards, and Mark Fetherolf. Examples given include reading large data files with the fread function from data.table, optimization of model parameters with caret, computing and plotting ROC curves with ggplot2, engineering date-time features, and determining the relative importance of features with the varImp function in caret.

    Event-Modeling Data

    The data for the event modeling examples in chapter 5 of the book is from the
    Kaggle Event Recommendation Engine Challenge.
    The rules of the challenge prohibit redistribution of the data, so the reader must
    download the data from Kaggle.

    Using fread

    In order to work through the examples, features from the rather large
    events.csv file are processed several times. To save time, an alternative to
    the read.csv function is needed. This is where the fread function from the
    data.table library comes in. It is similar to read.table but
    faster and more convenient. On my MacBook Pro, it took only seven seconds to read
    the event_id, lat, and lng features from the >1GB events.csv data file.

    events <- fread(file.path(dataDir, "events.csv"), sep = ",",
                    colClasses = c("integer64", rep("NULL", 6),
                                   "numeric", "numeric", rep("NULL", 101)))

    Initial cross-validated ROC curve and AUC metric

    Once a training data set is built with features from the events.csv,
    train.csv, and users.csv files, the caret train function is used
    to train a random forest model
    evaluated using 10-fold cross-validation with the
    receiver operating characteristic (ROC) curve as the metric.
    The ROC curve and area under the curve (AUC) for the model (when applied
    to held-out test data from the training data set) are shown in the figure
    below. An AUC of 0.86 is achieved with the initial set of six features.
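A caret setup of the kind described above might look roughly like the following sketch. Since the Kaggle data cannot be redistributed, caret's `twoClassSim()` simulates a stand-in two-class training set here; the formula, features and tuning details of the actual rwml-R code will differ.

```r
library(caret)
set.seed(42)

# simulated two-class data standing in for the real training set
d <- twoClassSim(200)

# 10-fold cross-validation, optimising the ROC (AUC) metric
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

fit <- train(Class ~ ., data = d, method = "rf",
             metric = "ROC", trControl = ctrl)

fit$results$ROC  # cross-validated AUC for each candidate mtry value
```

`classProbs = TRUE` is required so that `twoClassSummary` can compute the ROC curve from class probabilities rather than hard predictions.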


    Inclusion of date-time and bag-of-words features leads to over-fitting

    Ten date-time features (such as ‘‘week of the year’’ and ‘‘day of the week’’)
    are extracted from the timestamp column of the events.csv file.
    When added to the model, the AUC actually decreases slightly,
    indicating over-fitting. The AUC decreases even more when available
    ‘‘bag-of-words’’ data is included in the model.
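Date-time features of this kind can be extracted with, for example, the lubridate package. This is an illustrative sketch only; the column names are my own, and the rwml-R code may derive the features differently.

```r
library(lubridate)

ts <- ymd_hms("2012-10-02 15:53:05")  # example timestamp

# derive simple date-time features of the kind described above
features <- data.frame(
  week_of_year = isoweek(ts),  # ISO 8601 week number -> 40
  day_of_week  = wday(ts),     # 1 = Sunday by default -> 3 (Tuesday)
  hour_of_day  = hour(ts)      # -> 15
)
features
```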

    Feature importance

    The varImp function from the caret package computes the
    relative importance of each variable for objects produced by the train
    method. A plot of the relative importance for the top seven variables
    in the final random forest model is shown below.
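As a self-contained illustration (again with simulated data rather than the Kaggle set), varImp can be applied to any caret train object:

```r
library(caret)
set.seed(123)

# simulated stand-in data; the real model uses the Kaggle features
d <- twoClassSim(200)
fit <- train(Class ~ ., data = d, method = "rf")

imp <- varImp(fit)  # importance scaled to 0-100 per predictor
imp
plot(imp, top = 7)  # lattice dotplot of the seven most important variables
```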

    Feedback welcome

    If you have any feedback on the rwml-R project, please
    leave a comment below or use the Tweet button.
    As with any of my projects, feel free to fork the rwml-R repo
    and submit a pull request if you wish to contribute.
    For convenience, I’ve created a project page for rwml-R with
    the generated HTML files from knitr, including a page with
    all of the event-modeling examples from chapter 5.


    ReinforcementLearning: A package for replicating human behavior in R

    Sat, 04/08/2017 - 09:52

    (This article was first published on R Blog, and kindly contributed to R-bloggers)

    Nicolas Proellochs and Stefan Feuerriegel
    2017-04-06

    Introduction

    Reinforcement learning has recently gained a great deal of traction in studies that call for human-like learning. In settings where an explicit teacher is not available, this method teaches an agent via interaction with its environment without any supervision other than its own decision-making policy. In many cases, this approach appears quite natural by mimicking the fundamental way humans learn. However, implementing reinforcement learning is programmatically challenging, since it relies on continuous interactions between an agent and its environment. In fact, there is currently no package available that performs model-free reinforcement learning in R. As a remedy, we introduce the ReinforcementLearning R package, which allows an agent to learn optimal behavior based on sample experience consisting of states, actions and rewards. The result of the learning process is a highly interpretable reinforcement learning policy that defines the best possible action in each state.

    In the following sections, we present multiple step-by-step examples to illustrate how to take advantage of the capabilities of the ReinforcementLearning package. Moreover, we present methods to customize the learning and action selection behavior of the agent. Main features of ReinforcementLearning include, but are not limited to,

    • Learning an optimal policy from a fixed set of a priori known transition samples
    • Predefined learning rules and action selection modes
    • A highly customizable framework for model-free reinforcement learning tasks
    Reinforcement learning

    Reinforcement learning refers to the problem of an agent that aims to learn optimal behavior through trial-and-error interactions with a dynamic environment. All algorithms for reinforcement learning share the property that the feedback of the agent is restricted to a reward signal that indicates how well the agent is behaving. In contrast to supervised machine learning methods, any instruction concerning how to improve its behavior is absent. Thus, the goal and challenge of reinforcement learning is to improve the behavior of an agent given only this limited type of feedback.

    The reinforcement learning problem

    In reinforcement learning, the decision-maker, i.e. the agent, interacts with an environment over a sequence of observations and seeks a reward to be maximized over time. Formally, the model consists of a finite set of environment states S, a finite set of agent actions A, and a set of scalar reinforcement signals (i.e. rewards) R. At each iteration i, the agent observes some representation of the environment’s state si ∈ S. On that basis, the agent selects an action ai ∈ A(si), where A(si) ⊆ A denotes the set of actions available in state si. After each iteration, the agent receives a numerical reward ri+1 ∈ R and observes a new state si+1.

    Policy learning

    In order to store current knowledge, the reinforcement learning method introduces a so-called state-action function Q(si,ai), which defines the expected value of each possible action ai in each state si. If Q(si,ai) is known, then the optimal policy π*(si,ai) is given by the action ai, which maximizes Q(si,ai) given the state si. Consequently, the learning problem of the agent is to maximize the expected reward by learning an optimal policy function π*(si,ai).

    Experience replay

    Experience replay allows reinforcement learning agents to remember and reuse experiences from the past. The underlying idea is to speed up convergence by replaying observed state transitions repeatedly to the agent, as if they were new observations collected while interacting with a system. Hence, experience replay only requires input data in the form of sample sequences consisting of states, actions and rewards. These data points can be, for example, collected from a running system without the need for direct interaction. The stored training examples then allow the agent to learn a state-action function and an optimal policy for every state transition in the input data. In a next step, the policy can be applied to the system for validation purposes or to collect new data points (e.g. in order to iteratively improve the current policy). As its main advantage, experience replay can speed up convergence by allowing for the back-propagation of information from updated states to preceding states without further interaction.

    Setup of the ReinforcementLearning package

    Even though reinforcement learning has recently gained a great deal of traction in studies that perform human-like learning, the available tools are not living up to the needs of researchers and practitioners. The ReinforcementLearning package is intended to partially close this gap and offers the ability to perform model-free reinforcement learning in a highly customizable framework.


    Using the devtools package, one can easily install the latest development version of ReinforcementLearning as follows.

    # install.packages("devtools")

    # Option 1: download and install latest version from GitHub
    devtools::install_github("nproellochs/ReinforcementLearning")

    # Option 2: install directly from bundled archive
    devtools::install_local("ReinforcementLearning_1.0.0.tar.gz")

    Package loading

    Afterwards, one merely needs to load the ReinforcementLearning package as follows.

    library(ReinforcementLearning)

    Usage

    The following sections present the usage and main functionality of the ReinforcementLearning package.

    Data format

    The ReinforcementLearning package uses experience replay to learn an optimal policy based on past experience in the form of sample sequences consisting of states, actions and rewards. Here each training example consists of a state transition tuple (s,a,r,s_new), as described below.

    • s The current environment state.
    • a The selected action in the current state.
    • r The immediate reward received after transitioning from the current state to the next state.
    • s_new The next environment state.


    The input data must be a dataframe in which each row represents a state transition tuple (s,a,r,s_new).
    Read-in sample experience

    The state transition tuples can be collected from an external source and easily read-in into R. The sample experience can then be used to train a reinforcement learning agent without requiring further interaction with the environment. The following example shows a representative dataset containing game states of 100,000 randomly sampled Tic-Tac-Toe games.

    data("tictactoe")
    head(tictactoe, 5)
    ##       State Action NextState Reward
    ## 1 .........     c7 ......X.B      0
    ## 2 ......X.B     c6 ...B.XX.B      0
    ## 3 ...B.XX.B     c2 .XBB.XX.B      0
    ## 4 .XBB.XX.B     c8 .XBBBXXXB      0
    ## 5 .XBBBXXXB     c1 XXBBBXXXB      0

    Experience sampling using an environment function

    The ReinforcementLearning package is shipped with the built-in capability to sample experience from a function that defines the dynamics of the environment. If the dynamics of the environment are known a priori, one can set up an arbitrarily complex environment function in R and sample state transition tuples. This function has to be manually implemented and must take a state and an action as input. The return value must be a list containing the name of the next state and the reward. As a main advantage, this method of experience sampling allows one to easily validate the performance of reinforcement learning, by applying the learned policy to newly generated samples.

    environment <- function(state, action) { ... return(list("NextState" = newState, "Reward" = reward)) }

    The following example illustrates how to generate sample experience using an environment function. Here we collect experience from an agent that navigates from a random starting position to a goal position on a simulated 2×2 grid (see figure below).

    | s1  | s4  |
    | s2     s3 |

    Each cell on the grid represents one state, which yields a total of 4 states. The grid is surrounded by a wall, which makes it impossible for the agent to move off the grid. In addition, the agent faces a wall between s1 and s4. At each state, the agent randomly chooses one of four possible actions, i.e. to move up, down, left, or right. The agent encounters the following reward structure: crossing each square on the grid leads to a reward of -1, and reaching the goal position earns a reward of 10.

    # Load exemplary environment (gridworld)
    env <- gridworldEnvironment
    print(env)
    ## function(state, action) {
    ##   next_state <- state
    ##   if(state == state("s1") && action == "down") next_state <- state("s2")
    ##   if(state == state("s2") && action == "up") next_state <- state("s1")
    ##   if(state == state("s2") && action == "right") next_state <- state("s3")
    ##   if(state == state("s3") && action == "left") next_state <- state("s2")
    ##   if(state == state("s3") && action == "up") next_state <- state("s4")
    ##
    ##   if(next_state == state("s4") && state != state("s4")) {
    ##     reward <- 10
    ##   } else {
    ##     reward <- -1
    ##   }
    ##
    ##   out <- list("NextState" = next_state, "Reward" = reward)
    ##   return(out)
    ## }
    ## <environment: namespace:ReinforcementLearning>

    # Define state and action sets
    states <- c("s1", "s2", "s3", "s4")
    actions <- c("up", "down", "left", "right")

    # Sample N = 1000 random sequences from the environment
    data <- sampleExperience(N = 1000, env = env, states = states, actions = actions)
    head(data)
    ##   State Action Reward NextState
    ## 1    s4   left     -1        s4
    ## 2    s2  right     -1        s3
    ## 3    s2  right     -1        s3
    ## 4    s3   left     -1        s2
    ## 5    s4     up     -1        s4
    ## 6    s1   down     -1        s2

    Performing reinforcement learning

    The following example shows how to teach a reinforcement learning agent using input data in the form of sample sequences consisting of states, actions and rewards. The ‘data’ argument must be a dataframe object in which each row represents a state transition tuple (s,a,r,s_new). Moreover, the user is required to specify the column names of the individual tuple elements in ‘data’.

    # Define reinforcement learning parameters
    control <- list(alpha = 0.1, gamma = 0.5, epsilon = 0.1)

    # Perform reinforcement learning
    model <- ReinforcementLearning(data, s = "State", a = "Action", r = "Reward",
                                   s_new = "NextState", control = control)

    # Print result
    print(model)
    ## State-Action function Q
    ##         right         up       down       left
    ## s1 -0.6633782 -0.6687457  0.7512191 -0.6572813
    ## s2  3.5806843 -0.6893860  0.7760491  0.7394739
    ## s3  3.5702779  9.1459425  3.5765323  0.6844573
    ## s4 -1.8005634 -1.8567931 -1.8244368 -1.8377018
    ##
    ## Policy
    ##      s1      s2      s3      s4
    ##  "down" "right"    "up" "right"
    ##
    ## Reward (last iteration)
    ## [1] -263

    The result of the learning process is a state-action table and an optimal policy that defines the best possible action in each state.

    # Print policy
    policy(model)
    ##      s1      s2      s3      s4
    ##  "down" "right"    "up" "right"

    Updating an existing policy

    Specifying an environment function to model the dynamics of the environment allows one to easily validate the performance of the agent. In order to do this, one simply applies the learned policy to newly generated samples. For this purpose, the ReinforcementLearning package comes with an additional predefined action selection mode, namely ‘epsilon-greedy’. In this strategy, the agent explores the environment by selecting an action at random with probability ε, and exploits its current knowledge by choosing the optimal action with probability 1-ε.

    The following example shows how to sample new experience from an existing policy using ‘epsilon-greedy’ action selection. The result is new state transition samples (‘data_new’) with significantly higher rewards compared to the original sample (‘data’).

    # Define reinforcement learning parameters
    control <- list(alpha = 0.1, gamma = 0.5, epsilon = 0.1)

    # Sample N = 1000 sequences from the environment using epsilon-greedy action selection
    data_new <- sampleExperience(N = 1000, env = env, states = states, actions = actions,
                                 model = model, actionSelection = "epsilon-greedy",
                                 control = control)
    head(data_new)
    ##   State Action Reward NextState
    ## 1    s2  right     -1        s3
    ## 2    s4  right     -1        s4
    ## 3    s4  right     -1        s4
    ## 4    s4  right     -1        s4
    ## 5    s2  right     -1        s3
    ## 6    s1   down     -1        s2

    # Update the existing policy using new training data
    model_new <- ReinforcementLearning(data_new, s = "State", a = "Action", r = "Reward",
                                       s_new = "NextState", control = control, model = model)

    # Print result
    print(model_new)
    ## State-Action function Q
    ##        right         up       down       left
    ## s1 -0.643587 -0.6320560  0.7657318 -0.6314927
    ## s2  3.530829 -0.6407675  0.7714129  0.7427914
    ## s3  3.548196  9.0608344  3.5521760  0.7382102
    ## s4 -1.939574 -1.8922783 -1.8835278 -1.8856132
    ##
    ## Policy
    ##      s1      s2      s3      s4
    ##  "down" "right"    "up"  "down"
    ##
    ## Reward (last iteration)
    ## [1] 1211

    summary(model_new)
    ##
    ## Model details
    ## Learning rule:        experienceReplay
    ## Learning iterations:  2
    ## Number of states:     4
    ## Number of actions:    4
    ## Total Reward:         1211
    ##
    ## Reward details (per iteration)
    ## Min:                  -263
    ## Max:                  1211
    ## Average:              474
    ## Median:               474
    ## Standard deviation:   1042.275

    # Plot reinforcement learning curve
    plot(model_new)

    Parameter configuration

    The ReinforcementLearning package allows for the adjustment of the following parameters in order to customize the learning behavior of the agent.

    • alpha The learning rate, set between 0 and 1. Setting it to 0 means that the Q-values are never updated and, hence, nothing is learned. Setting a high value, such as 0.9, means that learning can occur quickly.
    • gamma Discount factor, set between 0 and 1. Determines the importance of future rewards. A factor of 0 will render the agent short-sighted by only considering current rewards, while a factor approaching 1 will cause it to strive for a greater reward over the long term.
    • epsilon Exploration parameter, set between 0 and 1. Defines the exploration mechanism in ε-greedy action selection. In this strategy, the agent explores the environment by selecting an action at random with probability ε. Alternatively, the agent exploits its current knowledge by choosing the optimal action with probability 1-ε. This parameter is only required for sampling new experience based on an existing policy.
    • iter Number of repeated learning iterations; iter is an integer greater than 0. The default is set to 1.
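The ε-greedy rule described above can be sketched in a few lines of base R. This is a hypothetical illustration of the selection logic only; the package performs its own action selection internally when actionSelection = "epsilon-greedy" is used.

```r
# Hypothetical sketch of epsilon-greedy action selection over a Q table
select_action <- function(Q, state, epsilon) {
  if (runif(1) < epsilon) {
    sample(colnames(Q), 1)               # explore: uniformly random action
  } else {
    colnames(Q)[which.max(Q[state, ])]   # exploit: current best action
  }
}

# toy Q table: best action in s1 is "up", in s2 it is "down"
Q <- matrix(c(1, 0, 0, 2), nrow = 2, byrow = TRUE,
            dimnames = list(c("s1", "s2"), c("up", "down")))

select_action(Q, "s1", epsilon = 0)  # epsilon = 0 always exploits: "up"
```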
    # Define control object
    control <- list(alpha = 0.1, gamma = 0.1, epsilon = 0.1)

    # Pass learning parameters to reinforcement learning function
    model <- ReinforcementLearning(data, iter = 10, control = control)

    Working example: Learning Tic-Tac-Toe

    The following example shows the use of ReinforcementLearning in an applied setting. More precisely, we utilize a dataset containing 406,541 game states of Tic-Tac-Toe to learn the optimal actions for each state of the board.

    # Load dataset
    data("tictactoe")

    # Define reinforcement learning parameters
    control <- list(alpha = 0.2, gamma = 0.4, epsilon = 0.1)

    # Perform reinforcement learning
    model <- ReinforcementLearning(tictactoe, s = "State", a = "Action", r = "Reward",
                                   s_new = "NextState", iter = 1, control = control)

    # Print optimal policy
    policy(model)


    • All states are observed from the perspective of player X, who is also assumed to have played first.
    • The player who succeeds in placing three of their marks in a horizontal, vertical, or diagonal row wins the game. Reward for player X is +1 for ‘win’, 0 for ‘draw’, and -1 for ‘loss’.


    To leave a comment for the author, please follow the link and comment on their blog: R Blog.

    Mapping waxwings annual migration without Twitter

    Sat, 04/08/2017 - 02:00

    (This article was first published on Maëlle, and kindly contributed to R-bloggers)

    Recently a reader left a comment on this blog mentioning his cool blog post in which he mapped the spread of a migratory bird using Twitter. His data source was the Waxwings UK account which reports sightings of Bohemian waxwings in the UK. I decided to try reproducing and extending his work using the rOpenSci spocc package that interfaces different sources of species occurrence data.

    Getting the occurrence data

    As mentioned above I got the data via spocc. I read the README of the Github repo and learnt that it’s called spocc like species occurrence data. So now I should never forget how many “c” there are in the word occurrence. Now please send help for my remembering it has two “r”.

    The spocc package interacts with so many data sources that I felt a bit overwhelmed. I guess ecologists are not often blessed with so much data. Note that I don’t limit my query to the UK. Indeed, since seeing this map my interest for the Bohemian waxwing became global. I even decided to get data for the two other species of waxwings, although it only worked for the Bohemian and Cedar waxwings in the end.

    I decided to use only data from GBIF. You'll find more information about GBIF, and about other data sources for species occurrences, in this blog post from the rOpenSci blog. Since I wanted to get data for different years and all 3 species of waxwings, which meant a lot of data, I sliced the Earth to get it (I like how evil this sentence makes me sound, ah!). Note that I used overlapping slices because otherwise I didn't get data at the limits between slices. I guess I could have made my slices slightly less overlapping though.

    library("spocc")
    library("dplyr")
    library("purrr")
    library("scrubr")

    get_slice <- function(longitude, year){
      print(paste(year, longitude))
      gbifopts <- list(limit = 200000, year = year)
      waxwings <- occ(query = "Bombycilla", from = c('gbif'),
                      gbifopts = gbifopts,
                      geometry = c(longitude - 0.5, -90, longitude + 0.5, 90))
      waxwings <- occ2df(waxwings)
    }

    longitudes <- seq(-180, 179, by = 0.5)
    years <- rep(2011:2016, length(longitudes))
    longitudes <- rep(longitudes, 6)

    waxwings <- map2(longitudes, years, get_slice)

    for (i in 1:length(waxwings)){
      waxwings[[i]]$latitude <- as.numeric(waxwings[[i]]$latitude)
      waxwings[[i]]$longitude <- as.numeric(waxwings[[i]]$longitude)
    }

    waxwings <- bind_rows(waxwings)
    waxwings <- unique(waxwings)
    waxwings <- filter(waxwings, !
    readr::write_csv(waxwings, path = "uncleaned_waxwings.csv")

    Cleaning the occurrence data

    Now because my MSc of ecology is far behind me (but still close to my heart!) I have no idea how to assess the quality of species occurrence data. And even if I did I would have been delighted by the discovery of another rOpenSci package, scrubr, whose aim is to clean species occurrence records. Because I had so much data, I cleaned each day separately, otherwise it was just too long and hard on my poor computer. Don’t start thinking scrubr is slow, it isn’t and it’s getting faster by the minute thanks to work by its maintainer.

    cleanup <- function(df){
      print(df$date[1])
      df <- df %>%
        coord_impossible() %>%
        coord_incomplete() %>%
        coord_unlikely()
      if(nrow(df) > 1){
        df <- dedup(df)
      }
      df <- df %>%
        date_standardize("%Y-%m-%d") %>%
        date_missing()
      return(df)
    }

    waxwings <- readr::read_csv("uncleaned_waxwings.csv")
    waxwings <- split(waxwings, waxwings$date)
    waxwings <- lapply(waxwings, cleanup)
    waxwings <- bind_rows(waxwings)
    waxwings <- unique(waxwings)
    waxwings <- dplyr::filter(waxwings, name != "Bombycilla japonica", longitude < 50)
    readr::write_csv(waxwings, path = "waxwings.csv")

    I removed the records with impossible, incomplete or unlikely coordinates (unlikely being e.g. a point at a political centroid, impossible being e.g. a longitude outside the valid range), and I also removed duplicate records and records without a date. This was so easy! I'd like the scrubr package to come clean my flat, as a team with the janitor package. Now, in real life, I'd probably be even stricter with data cleaning, but for the scope of this blog post, using that package without any additional check was enough.

    I also removed occurrences of the Japanese waxwing because there were too few of them, and occurrences with a longitude higher than 50 because it seemed weird to have non-Japanese waxwings in Asia. After all this glorious data cleaning, I had 1,015,603 records.

    Exploring the data

    Let the fun begin!

    Number of occurrences by species over time

    library("ggplot2")
    library("hrbrthemes")
    library("dplyr")
    library("lubridate")

    waxwings <- mutate(waxwings, week = update(date, wday = 1))

    waxwings %>%
      group_by(week, name) %>%
      summarize(n = n()) %>%
      ggplot() +
      geom_point(aes(week, n)) +
      facet_grid(name ~ ., scales = "free_y") +
      theme(strip.text.y = element_text(angle = 0, size = 6)) +
      xlab("Time (weeks)") +
      ylab("No. of occurrences")

    I have no idea why I have no data in 2016 for one of the two species. I decided to not investigate it further since I had enough birds for my primary goal which was mapping migration. The number of occurrences of Cedar waxwing increases over time before 2016, maybe because of more birders reporting sightings? For both species there is a clear seasonality, probably because these birds tend to breed in places where less people can observe them as we’ll see later in the post.

    Day-of-the-week effects

    waxwings <- mutate(waxwings, wday = lubridate::wday(date, label = TRUE))

    waxwings %>%
      group_by(update(date, wday = 1), wday) %>%
      summarize(n = n()) %>%
      ggplot() +
      geom_boxplot(aes(wday, n)) +
      labs(x = "Day of the week",
           y = "No. of reported occurrences",
           title = "No. of occurrences of waxwings by day-of-the week",
           caption = "Data from GBIF accessed via rOpenSci package spocc") +
      theme_ipsum()

    So, more birds are reported on week-ends than on weekdays which I assume is due to a difference in human rather than bird behaviour (but who knows?). Note that for finer characterization of days where more people are birding, I could have used the bizdays package, but then I’d have limited my observations for one country only, because mixing holidays from different countries doesn’t sound like fun. Another thing that might influence sightings, beside people having the day off, might be the weather. For getting weather data in R I can recommend a few packages.

    Mapping the migrations!

    I first decided to plot the occurrences themselves on maps, by month, and to make a gif out of it. I then had to choose the colour and shape of the points used to represent the birds. Shape? Bird emojis of course! Regarding the colour, I was quite glad when someone I follow on Twitter posted about the birdcolourbot account created by David Lawrence Miller, because I looked and found this tweet with colours for the Bohemian waxwing, Bombycilla garrulus, the one present in both America and Europe. For the Cedar waxwing I had to create a palette myself, which I’d never thought of doing if I hadn’t seen the birdcolourbot account, I’m not a colour artist and often stick to viridis. I used a pretty pic from the internet. In both cases, I uploaded images on this website to get colour codes. A bit of copy-paste work but much easier than having to mix paint colours for real.

    bohemian_palette <- c("#1A1A1A", "#878787", "#B15929", "#E21A1C", "#FEFF99")
    cedar_palette <- c("#050608", "#5D5C7A", "#AF5F2A", "#8E2F49", "#F3DD31")

    To me both species look quite similar, so I didn't expect the palettes to be really different. I decided the colours would all have the same probability, instead of weighting them according to their presence in the usual patterns of each species.

    set.seed(1)
    library("dplyr")
    waxwings <- mutate(waxwings,
                       colour = factor(sample(1:5, size = nrow(waxwings), replace = TRUE)))
    waxwings <- mutate(waxwings,
                       month = lubridate::month(date),
                       month_name = lubridate::month(date, label = TRUE, abbr = FALSE))
    bohemian <- filter(waxwings, name == "Bombycilla garrulus")
    cedar <- filter(waxwings, name == "Bombycilla cedrorum")

    After this preparation of the two data.frames, I created a function for plotting a month of data for one species. A point I'd like to work on in the future, so as to be less ashamed of my maps, is projections.

    library("ggalt")
    library("magick")
    library("ggmap")
    library("ggthemes")
    library("emojifont")
    load.emojifont('OpenSansEmoji.ttf')

    wax_map <- map_data("world")
    wax_map <- wax_map[wax_map$region != "Antarctica",]

    plot_month_species <- function(df, species, name, palette){
      p <- ggplot()
      p <- p + geom_map(data = wax_map, map = wax_map,
                        aes(x = long, y = lat, map_id = region),
                        color = "white", fill = "#7f7f7f",
                        size = 0.05, alpha = 1/4)
      p <- p + theme_map()
      p <- p + geom_text(aes(longitude, latitude, col = colour),
                         label = emoji("bird"), data = df,
                         family = "OpenSansEmoji", size = 5)
      p <- p + scale_colour_manual(values = palette)
      p <- p + ylim(min(species$latitude), max(species$latitude))
      p <- p + xlim(min(species$longitude), max(species$longitude))
      p <- p + theme(legend.position = "none")
      outfil <- paste0("fig_waxwings/", name, "_", df$month[1], ".png")
      ggsave(outfil, p, width = 5, height = 5)
      image_read(outfil) %>%
        image_annotate(text = as.character(df$month_name[1]), size = 100) %>%
        image_write(outfil)
      outfil
    }

    I chose to annotate the graph after creating it in order not to have to use the emoji font for the month name.

    I first applied the function to the Bohemian waxwings.

    bohemian_l <- split(bohemian, bohemian$month)
    bohemian_l %>%
      purrr::map(plot_month_species, species = bohemian,
                 name = "bohemian", palette = bohemian_palette) %>%
      purrr::map(image_read) %>%
      image_join() %>%
      image_animate(fps = 1) %>%
      image_write("bohemian.gif")

    First observation: I think the colours are pretty, but I also think I've just invented the concept of a confetti map and I'm not sure I'm proud of that. Then, regarding the underlying bird movements, Bohemian waxwings breed in the spring in regions farther up North than the regions where they winter, and this can be seen on the map. There seem to be fewer sightings in the breeding season, probably because fewer people live up there, and apparently Santa's helpers don't even help.

    Then it was the turn of the Cedar waxwing, for which I had many more observations (871,310 vs. 144,293). Brace yourself for a lot of confetti!

    cedar_l <- split(cedar, cedar$month)
    cedar_l %>%
      purrr::map(plot_month_species, species = cedar,
                 name = "cedar", palette = cedar_palette) %>%
      purrr::map(image_read) %>%
      image_join() %>%
      image_animate(fps = 1) %>%
      image_write("cedar.gif")

    I’ve had a lot of fun making my confetti gifs. I could extend the analysis by using spatial smoothing to draw the zone where waxwings are mostly seen each month. I think the general pattern is clear enough now, but there are some outliers like one Bohemian waxwing in Spain. If there really is a Bohemian waxwing in Spain I’d appreciate its visiting me because I’ve never seen a waxwing!


    I was impressed by rOpenSci tools for getting and cleaning occurrence data, and quite thankful for the species occurrence data provided by GBIF. I liked plotting migration thanks to species occurrence, although I guess the best method for knowing migration patterns is tracking birds. But then I’ll let professionals do this and keep my bird migration confetti map making as a nice hobby.

    To leave a comment for the author, please follow the link and comment on their blog: Maëlle.

    Data Visualization – Part 3

    Fri, 04/07/2017 - 22:18

    (This article was first published on R-Projects – Stoltzmaniac, and kindly contributed to R-bloggers)

    What Type of Data Visualization Do You Choose (if any)?

    Determining whether or not you need a visualization is step one. While it seems silly, this is probably something everyone (including myself) should be doing more often. Often a visualization seems like a great way to showcase the amount of work you have been doing, but it winds up being completely ineffective and could even harm what you’re doing. Once you determine that you actually need to visualize your data, you should have a rough idea of the options to look at. This post will explain and demonstrate some of the common types of charts and plots.

    This is Part 3 in a series about Data visualization:

    Determine whether or not you actually need a visualization in the first place.

    Like the best practices I listed in Data Visualization – Part 1, make sure your visualizations:

    • Are clearly illustrating a relevant point
    • Are tailored to the appropriate audience
    • Are tailored to the presentation medium
    • Are memorable to those who care about the material
    • Are increasing the understanding of the subject matter

    If these don’t seem possible, you probably don’t need a data visualization.

    If you do need one, what’s a good first step to take?

    Take a look at the forum in which you’re presenting; it matters! If you are writing for a scientific journal, it will be different from presenting live to a thousand-person audience. Think about a Ted Talk compared to the Journal of Physics.

    Point being: consider your audience!

    Let’s talk about a high-level presentation. Everyone has seen a slideshow with fancy charts that add zero value. Do not be the person presenting something that way! Providing useless content will confuse the audience and/or lead to boredom.

    If your point is to show year-over-year change of a single metric – show it as a simple number on the page in big bold font rather than a chart.

    In this made-up example, I am displaying revenue over the last few years (note: be more specific when it comes to what type of revenue you’re talking about).

    Which of the following makes more sense to put on a slide?

    If you agree with me, the one on the right will be much easier for people to understand in a presentation. It gets the point across without requiring mental processing, which allows people to focus on what is important. Any additional nuggets you would like to point out can be addressed verbally.
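    To make the idea concrete, here is a minimal base-R sketch (all revenue figures are invented for illustration) that computes the one metric worth putting on the slide instead of charting the whole series:

    ```r
    # Hypothetical annual revenue figures (made up for this example)
    revenue <- c(`2014` = 100, `2015` = 110, `2016` = 124)

    # The single year-over-year metric that belongs on the slide
    yoy_pct <- as.integer(round(100 * (revenue[["2016"]] / revenue[["2015"]] - 1)))

    # The "slide" is just the headline number in big bold font, not a chart
    cat(sprintf("Revenue up %d%% year over year\n", yoy_pct))
    ```

    The point of the sketch: reduce the data to the one number the audience needs, and let the slide design (large type) do the rest.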

    Now, let’s talk about publishing content that isn’t for academic use but will reach the public (i.e. newspapers, magazines, blogs, etc.). These types of charts can cover a wide range of topics so we’ll have to stick to the basics. We’re going to look at displaying information which is interesting and adds value.

    Here is a great example from Junk Charts in which the author of the original Daily Kos article is showing a type of “lie detector” chart. The chart does a number of things well: it illustrates a relevant point, it is appropriate to the audience and medium, and it really helps the reader understand the subject matter. However, the original chart is too colorful, which takes away from its effectiveness. Junk Charts took it to the next level by simplifying the colors and axes.

    Original Version (Daily Kos)

    Modified Version (Junk Charts)

    By merely looking at this chart you can see how it is ranked, a sense of scale, the comparison between people, and clearly labeled names. Fantastic work!

    Rather than going over more examples of work others are doing, please visit Chart Porn (don’t worry about the name, it’s a great data visualization site) and Junk Charts. They have phenomenal examples of what to do (and what not to do) when publishing to the public.

    You have a point, now what?

    There is no rulebook as to how to display your data. However, as you have seen, there are both great and poor options. The choice is up to you – so think long and hard before making a decision (and you can always try a number of them out on people before publishing).

    Ask yourself the following questions to help drive your decision:

    • Are you making a comparison?
    • Are you finding a relationship?
    • Are you showing a distribution?
    • Are you finding a trend over time?
    • Are you showing composition?

    Once you know which question you are asking, it will keep your mind focused on the outcome and will quickly narrow down your charting options.

    Rule of Thumb
    • Trend: Column, Line
    • Comparison: Area, Bar, Bullet, Column, Line, Scatter
    • Relationship: Line, Scatter
    • Distribution: Bar, Boxplot, Column
    • Composition: Donut, Pie, Stacked Bar, Stacked Column

    Obviously, there are plenty of choices beyond these, so don’t hesitate to use what works best. I will go over some of these basics and show some comparisons of poor charting techniques vs. slightly better ones.

    For this project, I’ll use some oil production data that I found while digging through (pretty great site). The data can be found here.

    Let’s load up some libraries and get started.

    library(ggplot2)
    library(dplyr)
    library(tidyr)
    library(lubridate)
    library(scales)

    # Custom data preparation
    # GitHub (linked to at bottom of this post)
    source('data_preparation.R')
    data = getData()


    ##   Location Month Year ThousandBarrel       Date
    ## 1  Alabama   Mar 2013            883 2013-03-01
    ## 2  Alabama   Apr 2013            844 2013-04-01
    ## 3  Alabama   May 2013            878 2013-05-01
    ## 4  Alabama   Feb 2013            809 2013-02-01
    ## 5  Alabama   Mar 1982           1687 1982-03-01
    ## 6  Alabama   Apr 1982           1567 1982-04-01

    Trend – Line Chart

    Objective: Visualize a trend in oil production in the US from 1981 – 2016 by year. I want to illustrate the changes over the time period. This is a very high-level view and only shows us a decline followed by a ramp up at the end of the period.

    Poor Version

    The x-axis is a disaster and the y-axis isn’t formatted well. While it gets the point across, it’s still worthless.

    df = data %>%
      group_by(Year) %>%
      summarise(ThousandBarrel = sum(ThousandBarrel))
    p = ggplot(df, aes(x = Year, y = ThousandBarrel, group = 1))
    p + geom_line(stat = 'identity') +
      ggtitle('Oil Production Over Time') +
      theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) +
      xlab('') + ylab('')

    Better Version

    The title gives us a much better understanding of what we’re looking at. The chart is slightly wider and the axes are formatted to be legible.

    p = ggplot(df, aes(x = Year, y = ThousandBarrel, group = 1))
    p + geom_line(stat = 'identity') +
      ggtitle('Thousand Barrel Oil Production By Year in the U.S.') +
      theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) +
      theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
      scale_y_continuous(labels = comma)

    Comparison – Line Chart

    Objective: Identify which states affected the trend the most. Evaluate them simultaneously in order to paint the picture and compare their trends over the time period. From this visual you can see the top states are Alaska, California, Louisiana, Oklahoma, Texas and Wyoming. Texas seems to break the mold quite drastically and drove the spike which occurred after 2010.

    Poor Version

    There are far too many colors going on here. Everything at the bottom of the chart is relatively useless and takes our focus away from the big players.

    df = data %>%
      group_by(Location, Year) %>%
      summarise(ThousandBarrel = sum(ThousandBarrel))
    df$Year = as.numeric(df$Year)
    p = ggplot(df, aes(x = Year, y = ThousandBarrel, col = Location))
    p + geom_line(stat = 'identity') +
      ggtitle(paste('Oil Production By Year By State in the U.S.')) +
      theme(plot.title = element_text(hjust = 0.5)) +
      theme(axis.text.x = element_text(angle = 90, hjust = 1))

    Better Version

    This focuses attention on the top producing states. It compares them to each other and shows the trend per state as well. Using facet_wrap() produces what’s known as “small multiples” – a technique which breaks the visual components of the data into easy-to-understand pieces that make intuitive sense.

    n = 6  # Arbitrary at first; after trying a few, this made the most sense
    topN = data %>%
      group_by(Location) %>%
      summarise(ThousandBarrel = sum(ThousandBarrel)) %>%
      arrange(-ThousandBarrel) %>%
      top_n(n)
    df = data %>%
      filter(Location %in% topN$Location) %>%
      group_by(Year, Location) %>%
      summarise(ThousandBarrel = sum(ThousandBarrel))
    df$Year = as.numeric(df$Year)
    df$Location = as.factor(df$Location)
    p = ggplot(df, aes(x = Year, y = ThousandBarrel, group = 1))
    p + geom_line(stat = 'identity') +
      ggtitle(paste('Top', as.character(n), 'States - Oil Production By Year in the U.S.')) +
      theme(plot.title = element_text(hjust = 0.5)) +
      theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
      facet_wrap(~Location) +
      scale_y_continuous(labels = comma)

    Relationship – Scatter Plot

    Objective: Check to see if data from Alaska and California is correlated. While this isn’t extremely interesting, it does allow us to use this same data set (sorry). The charts indicate that there appears to be a strong positive correlation between the two states.

    Poor Version

    Lots of completely irrelevant data! The size of the point should have nothing to do with the year.

    statesList = c('Alaska', 'California')
    df = data %>%
      filter(Location %in% statesList) %>%
      spread(Location, ThousandBarrel) %>%
      select(Alaska, California, Month, Year)
    p = ggplot(df, aes(x = Alaska, y = California, col = Month, size = Year))
    p + geom_point() +
      scale_y_continuous(labels = comma) +
      scale_x_continuous(labels = comma) +
      ggtitle('Oil Production - CA vs. AK') +
      theme(plot.title = element_text(hjust = 0.5))

    Better Version

    The points are all the same size and a trend line helps to visualize the relationship. While it can sometimes be misleading, it makes sense with our current data.

    df = data %>%
      filter(Location %in% statesList) %>%
      spread(Location, ThousandBarrel) %>%
      select(Alaska, California, Year)
    p = ggplot(df, aes(x = Alaska, y = California))
    p + geom_point() +
      scale_y_continuous(labels = comma) +
      scale_x_continuous(labels = comma) +
      ggtitle('Monthly Thousand Barrel Oil Production 1981-2016 CA vs. AK') +
      theme(plot.title = element_text(hjust = 0.5)) +
      geom_smooth(method = 'lm')

    Distribution – Boxplot

    Objective: Examine the range of production by state (per year) to give us an idea of the variance. While the sums and means are nice, it’s quite important to have an idea of distributions. While it was semi-apparent in the line charts, the variance of Texas is huge compared to the others!

    Poor Version

    Alphabetical order doesn’t add any value, and the names overlap on top of each other. While you can tell who the big players are, this visual does not add the value it should.

    df = data %>%
      group_by(Year, Location) %>%
      summarise(ThousandBarrel = sum(ThousandBarrel))
    p = ggplot(df, aes(x = Location, y = ThousandBarrel))
    p + geom_boxplot() +
      ggtitle('Distribution of Oil Production by State')

    Better Version

    This gives a nice ranking to the plot while still showing their distributions. We could take this a step further and separate out the big players from the small players (I’ll leave that up to you).

    p = ggplot(df, aes(x = reorder(Location, ThousandBarrel), y = ThousandBarrel))
    p + geom_boxplot() +
      scale_y_continuous(labels = comma) +
      ggtitle('Distribution of Annual Oil Production By State (1981 - 2016)') +
      coord_flip()

    Composition – Stacked Bar

    Objective: Check out the composition of total production by state. It’s interesting to see how the composition was relatively similar across decades until the 2010’s. Texas was 50% of the output!

    Poor Version

    My favorite, the beautiful pie chart! There’s nothing better than this… (no need for further commentary).

    df = data %>%
      group_by(Location) %>%
      summarise(ThousandBarrel = sum(ThousandBarrel)) %>%
      mutate(ThousandBarrel = ThousandBarrel / sum(ThousandBarrel))
    df$ThousandBarrel = round(100 * df$ThousandBarrel, 0)
    library(plotrix)
    pie(x = df$ThousandBarrel, labels = df$Location, explode = 0.1,
        col = rainbow(nrow(df)), main = 'Percentage of Oil Production by State')

    Better Version

    The 1980’s and 2010’s will be missing years in terms of a “decade” due to the data provided (and it’s only 2017). While the percentage labels are slightly off center, it’s certainly much better than the pie chart. It’s not quite “apples-to-apples” for a comparison because I created different decades, but you get the idea.

    I also created an “Other” category in order to simplify the output. When you are doing comparisons, it’s typically a good idea to find a way to reduce the number of variables in the output while not removing data by dropping it completely – do this carefully and transparently!

    data$Decade = '1980s'
    data$Decade[data$Year >= 1990] = '1990s'
    data$Decade[data$Year >= 2000] = '2000s'
    data$Decade[data$Year >= 2010] = '2010s'
    data$Decade = as.factor(data$Decade)

    top5 = data %>%
      group_by(Location) %>%
      summarise(ThousandBarrel = sum(ThousandBarrel)) %>%
      arrange(-ThousandBarrel) %>%
      top_n(5) %>%
      select(Location)
    top5List = top5$Location

    data$State = "Other"
    for(i in 1:length(top5List)){
      data$State[data$Location == top5List[i]] = top5List[i]
    }

    df = data %>%
      group_by(Decade, State) %>%
      summarise(ThousandBarrel = sum(ThousandBarrel)) %>%
      mutate(ThousandBarrel = ThousandBarrel / sum(ThousandBarrel))
    df$ThousandBarrel = round(df$ThousandBarrel, 3)
    df$text = paste(round(100 * df$ThousandBarrel, 0), '%', sep = '')

    p = ggplot(df, aes(x = Decade, y = ThousandBarrel,
                       col = reorder(State, ThousandBarrel),
                       fill = reorder(State, ThousandBarrel)))
    p + geom_bar(stat = 'identity') +
      geom_text(aes(label = text), col = 'Black', size = 4, hjust = 0.5, vjust = 3, position = "stack") +
      scale_y_continuous(labels = percent) +
      ggtitle('Percentage of Top Oil Producing States by Decade') +
      guides(fill = guide_legend(title = 'State'), col = guide_legend(title = 'State')) +
      theme(plot.title = element_text(hjust = 0.5))

    Some other fun concepts are below!

    Some of them are nice, others are terrible! I won’t comment on any of them, but I felt it was necessary to include some other ideas I toyed around with.

    Have fun with your data visualizations: be creative, think outside the box, use tools other than computers if it makes sense, fail often but learn quickly. I’m sure I’ll think of a thousand better ways to have illustrated the concepts in this post by tomorrow, so I’ll make updates as I think of them!

    Now it’s your turn!

    As always, the code used in this post is on my GitHub

    df = data %>%
      group_by(Location) %>%
      summarise(ThousandBarrel = sum(ThousandBarrel)) %>%
      arrange(-ThousandBarrel)
    p = ggplot(df, aes(x = reorder(Location, ThousandBarrel), y = ThousandBarrel))
    p + geom_bar(stat = 'identity') +
      ggtitle('Oil Production 1981 - 2016 By Location') +
      theme(plot.title = element_text(hjust = 0.5)) +
      coord_flip()

    top10 = data %>%
      group_by(Location) %>%
      summarise(ThousandBarrel = sum(ThousandBarrel)) %>%
      arrange(-ThousandBarrel) %>%
      top_n(10)

    ## Selecting by ThousandBarrel


    ## # A tibble: 10 × 2
    ##       Location ThousandBarrel
    ##          <chr>          <dbl>
    ## 1        Texas       23447172
    ## 2       Alaska       15775279
    ## 3   California        9988225
    ## 4    Louisiana        4267246
    ## 5     Oklahoma        3701224
    ## 6      Wyoming        2894624
    ## 7       Kansas        1708873
    ## 8     Colorado        1288643
    ## 9         Utah         894657
    ## 10 Mississippi         861999

    df = data %>%
      group_by(Location, Year) %>%
      filter(Location %in% top10$Location) %>%
      summarise(ThousandBarrel = sum(ThousandBarrel))
    p = ggplot(df, aes(x = Year, y = ThousandBarrel, col = Location, fill = Location))
    p + geom_bar(stat = 'identity') +
      ggtitle('Oil Production - Top 10 States') +
      theme(plot.title = element_text(hjust = 0.5)) +
      theme(axis.text.x = element_text(angle = 90, hjust = 1))

    df = data %>%
      filter(Year == 1990) %>%
      group_by(Location) %>%
      summarise(ThousandBarrel = sum(ThousandBarrel))
    df$Location = tolower(df$Location)

    # Add States without data
    States = data.frame(Location = tolower(as.character(state.name)))
    missingStates = States$Location[!(States$Location %in% df$Location)]
    appendData = data.frame(Location = missingStates, ThousandBarrel = 0)
    df = rbind(df, appendData)

    states_map <- map_data("state")
    ggplot(df, aes(map_id = Location)) +
      geom_map(aes(fill = ThousandBarrel), map = states_map) +
      expand_limits(x = states_map$long, y = states_map$lat)

    df = data %>%
      filter(Location == 'Texas') %>%
      group_by(Year, Month) %>%
      summarise(ThousandBarrel = sum(ThousandBarrel))
    p = ggplot(df, aes(x = Month, y = ThousandBarrel))
    p + geom_line(stat = 'identity', aes(group = Year, col = Year)) +
      ggtitle('Oil Production By Year in the U.S.') +
      theme(plot.title = element_text(hjust = 0.5)) +
      theme(axis.text.x = element_text(angle = 90, hjust = 1))

    To leave a comment for the author, please follow the link and comment on their blog: R-Projects – Stoltzmaniac.

    slickR – the last carousel you’ll ever need

    Fri, 04/07/2017 - 20:46

    (This article was first published on, and kindly contributed to R-bloggers)

    We are happy to bring the slick JavaScript library to R. It is self-described as “the last carousel you’ll ever need”. This carousel is based on putting the elements of the carousel in a div HTML tag, which makes it very versatile in what can be placed inside. Typical objects placed in a carousel include images, plots, tables, gifs, videos, iframes and even htmlwidgets.


    This tool helps review multiple outputs in an efficient manner and saves much needed space in documents and Shiny applications, while creating a user friendly experience.

    These carousels can be used directly from the R console, from RStudio, in Shiny apps and R Markdown documents.




    Github (dev)

    devtools::install_github('metrumresearchgroup/slickR')

    Examples

    There are many options within the slick carousel; to get you started with the basics, we will show a few examples in the rest of the post. If you want to try out any of the examples, you can go to this site, where they are rendered and can be tested out.

    library(svglite)
    library(lattice)
    library(ggplot2)
    library(rvest)
    library(reshape2)
    library(dplyr)
    library(htmlwidgets)
    library(slickR)

    Images

    Some web scraping for the images example….

    #NBA Team Logos
    nbaTeams = c("ATL","BOS","BKN","CHA","CHI","CLE","DAL","DEN","DET","GSW",
                 "HOU","IND","LAC","LAL","MEM","MIA","MIL","MIN","NOP","NYK",
                 "OKC","ORL","PHI","PHX","POR","SAC","SAS","TOR","UTA","WAS")
    teamImg = sprintf("", nbaTeams)

    #Player Images
    a1 = read_html('') %>% html_nodes(css = '#my-teams-table a')
    a2 = a1 %>% html_attr('href')
    a3 = a1 %>% html_text()
    team_table = read_html('') %>% html_table()
    team_table = team_table[[1]][-c(1, 2), ]
    playerTable = team_table %>% melt(, id = 'X1') %>% arrange(X1, variable)
    playerName = a2[grepl('[0-9]', a2)]
    playerId = do.call('rbind', lapply(strsplit(playerName, '[/]'), function(x) x[c(8, 9)]))
    playerId = playerId[playerId[, 1] != 'phi', ]
    playerTable$img = sprintf('', playerId[, 1])

    Simple carousel

    Let’s start easy: show the team logos one at a time with dots underneath

    slickR(obj = teamImg, slideId = 'ex1', height = 100, width = '100%')

    Grouped Images

    There are players on each team, so let’s group the starting five together and have each dot correspond with a team:

    slickR(
      obj = playerTable$img,
      slideId = c('ex2'),
      slickOpts = list(
        initialSlide = 0,
        slidesToShow = 5,
        slidesToScroll = 5,
        focusOnSelect = T,
        dots = T
      ),
      height = 100, width = '100%'
    )

    Replacing the dots

    Sometimes the dots won’t be informative enough, so we can replace them with an HTML object, such as text or other images.
    We can pass JavaScript code to the customPaging argument using the htmlwidgets::JS function.

    Replace dots with text

    cP1 = JS("function(slick,index) {return '<a>'+(index+1)+'</a>';}")
    slickR(
      obj = playerTable$img,
      slideId = 'ex3',
      dotObj = teamImg,
      slickOpts = list(
        initialSlide = 0,
        slidesToShow = 5,
        slidesToScroll = 5,
        focusOnSelect = T,
        dots = T,
        customPaging = cP1
      ),
      height = 100, width = '100%'
    )

    Replace dots with Images

    cP2 = JS("function(slick,index) {return '<a><img src= ' + dotObj[index] + ' width=100% height=100%></a>';}")
    #Replace dots with Images
    s1 <- slickR(
      obj = playerTable$img,
      dotObj = teamImg,
      slickOpts = list(
        initialSlide = 0,
        slidesToShow = 5,
        slidesToScroll = 5,
        focusOnSelect = T,
        dots = T,
        customPaging = cP2
      ), height = 100, width = '100%'
    )
    #Putting it all together in one slickR call
    s2 <- htmltools::tags$script(
      sprintf("var dotObj = %s", jsonlite::toJSON(teamImg))
    )
    htmltools::browsable(htmltools::tagList(s2, s1))

    Plots

    To place plots directly into slickR we use svglite to convert a plot into svg code using xmlSVG and then convert it into a character object

    plotsToSVG = list(
      #Standard Plot
      xmlSVG({plot(1:10)}, standalone = TRUE),
      #lattice xyplot
      xmlSVG({print(xyplot(x ~ x, data.frame(x = 1:10), type = "l"))}, standalone = TRUE),
      #ggplot
      xmlSVG({show(ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
                     geom_point())}, standalone = TRUE),
      #lattice dotplot
      xmlSVG({print(dotplot(variety ~ yield | site, data = barley, groups = year,
                            key = simpleKey(levels(barley$year), space = "right"),
                            xlab = "Barley Yield (bushels/acre) ",
                            aspect = 0.5, layout = c(1, 6), ylab = NULL))
      }, standalone = TRUE)
    )
    #make the plot self contained SVG to pass into slickR
    s.in = sapply(plotsToSVG, function(sv){paste0("data:image/svg+xml;utf8,", as.character(sv))})
    slickR(s.in, slideId = 'ex4', slickOpts = list(dots = T), height = 200, width = '100%')

    Synching Carousels

    There are instances when you have many outputs at once and do not want to page through them all one by one, so you can combine two synched carousels: one for viewing and one for navigating.

    slickR(rep(s.in, each = 5), slideId = c('ex5up', 'ex5down'),
           slideIdx = list(1:20, 1:20),
           synchSlides = c('ex5up', 'ex5down'),
           slideType = rep('img', 4),
           slickOpts = list(list(slidesToShow = 1, slidesToScroll = 1),
                            list(dots = F, slidesToScroll = 1, slidesToShow = 5,
                                 centerMode = T, focusOnSelect = T)
           ),
           height = 100, width = '100%'
    )

    Iframes

    Since the carousel can accept any HTML element, we can place iframes in the carousel.
    There are times when you may want to put different DOM elements, rather than images, in slick. Using slideType you can specify which DOM element is used in the slides.
    For example, let’s put the help Rd files from ggplot2 into a slider using the helper function getHelp. (Don’t forget to open the output in a browser to view the iframe contents.)

    geom_filenames = ls("package:ggplot2", pattern = '^geom')
    slickR(unlist(lapply(geom_filenames, getHelp, pkg = 'ggplot2')),
           slideId = 'ex6', slideType = 'iframe',
           height = '400px', width = '100%',
           slickOpts = list(dots = T, slidesToShow = 2, slidesToScroll = 2))

    htmlwidgets

    Finally, we can really leverage R and place self contained htmlwidgets in iframes (like leaflets and plotly) and use them in a carousel. This solves a problem of how to run many htmlwidgets at once outside of Shiny.

    library(leaflet)
    library(plotly)
    l <- leaflet() %>% addTiles()
    htmlwidgets::saveWidget(l, 'leaflet.html')
    p <- iris %>% ggplot(aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
    pL = ggplotly(p)
    htmlwidgets::saveWidget(pL, 'ggplotly.html')
    slickR(c(rep(paste0(readLines('leaflet.html'), collapse = '\n'), 4),
             rep(paste0(readLines('ggplotly.html'), collapse = '\n'), 4)),
           slideId = c('leaf', 'plot'),
           slideIdx = list(1:4, 5:8),
           slideType = rep('iframe', 2),
           slickOpts = list(list(dots = T, slidesToShow = 2, slidesToScroll = 2),
                            list(dots = T, slidesToShow = 2, slidesToScroll = 2)),
           height = '200px', width = '100%')

    Jonathan Sidi joined Metrum Research Group in 2016 after working for several years on problems in applied statistics, financial stress testing and economic forecasting in both industrial and academic settings. To learn more about additional open-source software packages developed by Metrum Research Group please visit the Metrum website. Contact: For questions and comments, feel free to email me at: or open an issue for bug fixes or enhancements at github.

    To leave a comment for the author, please follow the link and comment on their blog.

    R Weekly Bulletin Vol – III

    Fri, 04/07/2017 - 20:25

    (This article was first published on R programming, and kindly contributed to R-bloggers)

    This week’s R bulletin will cover topics like how to read select columns, the difference between boolean operators, converting a factor to a numeric and changing memory available to R. We will also cover functions like data, format, tolower, toupper, and strsplit function. Hope you like this R weekly bulletin. Enjoy reading!

    Shortcut Keys

    1. To change the working directory – Ctrl+Shift+K
    2. To show history – Ctrl+4
    3. To display environment – Ctrl+8

    Problem Solving Ideas

    Read a limited number of columns from a dataset

    To read a specific set of columns from a dataset one can use the “fread” function from the data.table package. The syntax for the function to be used for reading/dropping a limited number of columns is given as:

    fread(input, select, stringsAsFactors=FALSE)
    fread(input, drop, stringsAsFactors=FALSE)

    One needs to specify the column names or column numbers with the select parameter, as shown below.

    library(data.table)
    data = fread("data.txt")
    print(data)




    # Reading select columns
    data = fread("data.txt", select = c("DATE", "OPEN", "CLOSE"))
    print(data)




    Alternatively, you can use the drop parameter to indicate which columns should not be read:

    data = fread("data.txt", drop = c(2:3, 6))
    print(data)




    Difference between the boolean operators & and &&

    In R we have the “&” Boolean operator, which is the equivalent of the AND operator in Excel. Similarly, we have “|” in R, which is the equivalent of OR in Excel. One also finds the Boolean operators && and || in R. Although these operators have the same meaning as & and |, they behave in a slightly different manner.

    Difference between & and &&: The shorter form “&” is vectorized, meaning it can be applied to a vector. See the example below.





    The longer form, &&, evaluates left to right, examining only the first element of each vector. Thus in the example below, it will first evaluate whether -1 >= 0 (which is FALSE); because && short-circuits on the first FALSE, the second condition, -1 <= 0 (which is TRUE), is not even evaluated, and the final answer is FALSE.





    Thus the && operator is to be used when we have vectors of length one. For vectors of length greater than one, we should use the short form.
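    A minimal sketch of the behaviour described above, with made-up values:

    ```r
    x <- c(-1, 0, 1)

    # Vectorized form: applied element-wise, returns one logical per element
    x >= 0 & x <= 1        # FALSE TRUE TRUE

    # Long form: safe only on single elements; it short-circuits at the first FALSE
    x[1] >= 0 && x[1] <= 1 # FALSE
    ```

    Note that recent versions of R raise an error if && or || is applied to vectors of length greater than one, so reserving && for scalar conditions (as in if() statements) is the safe habit.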

    How to convert a factor to a numeric without a loss of information

    Consider the following factor “x” created using the factor, sample, and the runif functions. Let us try to convert this factor to a numeric.


    x = factor(sample(runif(5), 6, replace = TRUE))
    print(x)

    [1] 0.660900804912671 0.735364762600511 0.479244165122509 0.397552277892828
    [5] 0.660900804912671 0.660900804912671
    4 Levels: 0.397552277892828 0.479244165122509 … 0.735364762600511


    as.numeric(x)

    [1] 3 4 2 1 3 3

    As can be seen from the output, upon conversion we do not get the original values. R does not have a function to execute such conversion error-free. To overcome this problem, one can use the following expression:


    as.numeric(levels(x))[x]

    [1] 0.6609008 0.7353648 0.4792442 0.3975523 0.6609008 0.6609008

    One can also use the as.numeric(as.character(x)) expression; however, as.numeric(levels(x))[x] is slightly more efficient in converting the factor to a numeric.
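    A quick self-contained check (using a small fixed factor instead of the random one above) confirms that both expressions recover the original values:

    ```r
    # A reproducible factor whose levels are numeric strings
    x <- factor(c("0.5", "1.5", "0.5"))

    # Convert the levels once, then index by the factor's integer codes
    v1 <- as.numeric(levels(x))[x]

    # Convert each element via its character representation
    v2 <- as.numeric(as.character(x))

    identical(v1, v2)  # TRUE: both give 0.5 1.5 0.5
    ```

    The first form is cheaper because as.numeric runs only over the (usually few) levels rather than over every element of the vector.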

    Replace a part of URL using R

    Suppose you want to scrape data from a site with multiple pages and need to modify the URL from R. We can do this by using the parse_url and build_url functions from the “httr” package. The following example illustrates the method to replace part of a URL.


    library(httr)
    # In the following url, the value 2 at the end signifies page 2 of the books
    # category under the section "bestsellers".
    testURL = ""
    # Entering the url and parsing it using the parse_url function
    parseURL = parse_url(testURL)
    print(parseURL)

    $scheme
    [1] "http"

    $hostname
    [1] ""

    $path
    [1] "gp/bestsellers/books/1318158031/ref=zg_bs_nav_b_1_b"

    $fragment
    [1] "2"

    attr(,"class")
    [1] "url"

    # Assigning the next page number (i.e. page no. 3) and creating the new url
    parseURL$fragment = 3
    newURL <- build_url(parseURL)
    print(newURL)

    [1] “”

    Increasing (or decreasing) the memory available to R processes

    In R there is a memory.limit() function which gives you the amount of available memory in MB. At times, Windows users may get the error that R has run out of memory. In such cases, you may set the amount of available memory with the same function.



    The unit is MB. You may increase this value up to 2GB or the maximum amount of physical RAM you have installed. If you have R already installed and subsequently install more RAM, you may have to reinstall R in order to take advantage of the additional capacity.

    On 32-bit Windows, R can only use up to 3Gb of RAM, regardless of how much you have installed. There is a 64-bit version of R for Windows available from REvolution Computing, which runs on 64-bit Windows, and can use the entire RAM available.

    Functions Demystified data function

    R includes a number of packages which often come with their own data sets. These data sets can be accessed using the data function. This function loads specified data sets or lists the available data sets. To check the available data sets from all the R packages installed in your application, use the data() expression. This will display all the available data sets under each package.

    Example: data()
    To view data sets from a particular package, specify the package name as the argument to the data function.


    data(package = "lubridate")

    To load a particular dataset, use the name of the dataset as the argument to the data function.


    # loading the USDCHF dataset from the timeSeries package
    library(timeSeries)
    data(USDCHF)
    x = USDCHF
    print(head(USDCHF))





    # loading the sample_matrix dataset from the xts package
    library(xts)
    data(sample_matrix)
    x = sample_matrix
    print(head(sample_matrix))








    format function

    The format function is used to format the numbers so that they can appear clean and neat on the reports. The function takes a number of arguments to control the formatting of the result. The main arguments for the format function are given below.

    format(x, trim, digits, decimal.mark, nsmall, big.mark, big.interval, small.mark,
    small.interval, justify)

    1) x: the number to be formatted.
    2) trim: takes a logical value. If FALSE, it adds spaces to right-justify the result. If TRUE, it suppresses the leading spaces.
    3) digits: how many significant digits of numeric values to display.
    4) decimal.mark: the format in which to display the decimal mark.
    5) nsmall: the minimum number of digits after the decimal point.
    6) big.mark: the mark between intervals before the decimal point.
    7) small.mark: the mark between intervals after the decimal point.


    format(34562.67435, digits = 7, decimal.mark = ".", big.mark = " ",
           small.mark = ",", small.interval = 2)

    [1] “34 562.67”

    In the example above we chose to display 7 digits, and are using point(.) as a decimal mark. The small interval value is 2, which means we want 2 digits to be displayed after the decimal point. The big.mark format displays a space after the first two digits.

    tolower, toupper, and strsplit

    The tolower and toupper functions help convert strings to lowercase and uppercase respectively.
    Example 1:

    toupper("I coded a strategy in R")

    [1] "I CODED A STRATEGY IN R"

    tolower("I coded a strategy in R")

    [1] "i coded a strategy in r"

    Example 2:

    df = tolower(colnames(read.csv("NIFTY.csv")))
    print(df)







    [1] "date" "time" "close" "high" "low" "open" "volume"

    The strsplit function is used to break/split strings at the specified split points. The function returns a list and not a character vector or a matrix. In the example below, we are splitting the expression on the spaces.


    strsplit("I coded a strategy in R", split = " ")

    [[1]]
    [1] "I" "coded" "a" "strategy" "in" "R"
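    Since strsplit returns a list, a common follow-up step (a minimal base-R sketch) is to flatten the result into a character vector with unlist:

```r
parts <- strsplit("I coded a strategy in R", split = " ")
# parts is a list with one element per input string;
# unlist() flattens it into a plain character vector
words <- unlist(parts)
length(words)   # 6 words
```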

    The post R Weekly Bulletin Vol – III appeared first on .

    To leave a comment for the author, please follow the link and comment on their blog: R programming. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    The faces of R, analyzed with R

    Fri, 04/07/2017 - 20:12

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    Maëlle Salmon recently created a collage of profile pictures of people who use the #rstats hashtag in their Twitter bio to indicate their use of R. (I've included a detail below; click to see the complete version at Maëlle's blog.)

    Naturally, Maëlle created the collage using R itself. Matching Twitter bios were found using the search_users function in the rtweet package, which also provides the URL of the profile image to be downloaded using the httr package. From there, Maëlle used the magick package to resize the pictures and assemble the collage.

    Now, you'll notice that while many people use their face as their Twitter profile picture, others use a logo or some other kind of design. Colin Fay used the Microsoft Computer Vision API to analyze the profile pictures and generate a descriptive caption for each. (Once again, the process was automated using R; you can find the R code at Colin's blog post.) Some of the generated captions are straightforward: "a woman posing for a picture". Some of the captions are, well, a bit off the mark: "a person on a surf board in a skate park". (The API apparently thinks the R logo looks like a surfboard; captions like this at least had a lower confidence score.) Nonetheless, the captions provide a tool for collecting together similar images; here, for example, are those given the caption "a person on a surf board in a skate park":

    If you'd like to play around with the computer vision captions yourself, you'll just need a free API key and the code from Colin's blog post, linked below.

    Colin FAY: Playing with #RStats and Microsoft Computer Vision API

    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    Bio7 2.5 for Windows and Linux Released

    Fri, 04/07/2017 - 17:30

    (This article was first published on R, and kindly contributed to R-bloggers)


    A new release of Bio7 is available for Windows (64-bit) and Linux (64-bit).
    The MacOSX version will be released soon, too. This release comes with a plethora of new functions for R.

    • Bio7 is now based on Eclipse 4.6.3
    • Redesigned all Bio7 icons and created new icons, too
    • Nearly all icons are now available for high dpi displays
    • Added a dark Theme for Bio7

    • Added the default Eclipse theme as optional to Bio7
    • Added a new action for Linux and MacOSX to download and install Rserve (see Video in R section)
    • Added JavaScript editor support (opened in the text editor by default, but if the Eclipse JavaScript editor is installed, scripts can be opened and executed in the advanced editor)
    • Added a new action for the execution of JavaScript (and ImageJ macro)
    • JavaScript can be executed in the current opened browser if enabled in the preferences
    • Added JavaScript support to the console

    • Custom views are now entitled with the id of the view
    • Added a new "Document" perspective for LaTeX, Sweave, knitr and rmarkdown documents
    • Added a new PDF viewer in the JavaFX browser to display PDFs from LaTeX, knitr, rmarkdown and R plots ("Display")

    • Added options and actions to fullscreen a PDF on different monitors (press the key ‘t’ and the PDF will be centered and the toolbar will be removed)
    • Added several PDF options (scroll to selected page, etc.)
    • Added a new JavaFX browser for *.html and *.pdf files based on PDF.js
    • PDF files can be opened within the Navigator view with a double-click of the mouse device (also popular image types with ImageJ).
    • Added drag and drop support for the browser for *.html and *.pdf files
    • All editor fonts can now be scaled with the key combination 'Ctrl +'
    • Added new global preferences for the editor fonts
    • Added preferences links for an improved navigation
    • Updated the default scientific Java libraries
    • Implemented a reload of the JavaFX Browser to avoid a cached display
    • Improved the resizing of the Quadgrid (Hexgrid) view
    • Added a new preference to disable the scrollbars of the Quadgrid (Hexgrid) view
    • Updated R to 3.3.3 on Windows
    • Added an action to download and install ‘Rserve cooperative mode’  for Linux and MacOSX
    • Improved the speed of the "Load packages" dialog
    • Improved the speed of the "Install packages" dialog
    • Improved the speed of the refresh action in the R-Shell view
    • Added browser preferences to select the browser and the display
    • Added a new toolbar action for Sweave
    • Added an extra view for plots
    • Added the package install dialog as a view
    • Added an option to open a HTML package info for a selected package in the package install view (see video below).

    • Added a preference for a custom Rserve client connection port if you start the Rserve server with a different port
    • Added code completion in the R-Shell view for s3 and s4 methods (see video below)

    • Added an option to automatically open code completion when typing in the R-Shell
    • R-Shell R help now recognizes the selected browser
    • Added several actions to create, built, and test R packages to the context menu of the Navigator view (using the ‘devtools’ package)
    • Added a Shiny test action in the context menu of the Navigator view which opens a browser with the running Shiny application (select a project folder with the server and ui files – to stop the server use the interrupt action in the R-Shell console or press 'Ctrl c' in the Console).
    • Improved the transfer of variables and sheet names from the Table view and LibreOffice
    • Improved the display of the R preferences (opening the default preferences structure)
    R editor
    • Improved the code completion in general
    • Code completion dialogs are now synchronized with the other dialogs
    • Added code completion for
      • S3 objects (after $)
      • S4 object slots (after @)
      • the library function (library())
      • the data function (data())
    • Added new individual images to the code completion
    • Improved the close parentheses functions
    • Improved the visual appearance of the hover dialog
    • Corrected the code completion images to match the Outline view images
    • Improved the display of markers
    • Added a popup display for the 'str' function when hovering over a workspace variable
    • Fonts can now be scaled with the key shortcut 'Ctrl +'
    • Added automatic indentation after parentheses (loops, functions, etc.)
    • Added preferences for the formatR package
    • Added a formatR action to format editor selections only
    • Added a refactor rename method for a selected scope
    • Added roxygen code completion in comments

    • Added an action to create roxygen templates from selected function, S3, S4, Reference and R6 classes (see video below)

    • In the popup menu added several actions of the devtools package to create, test and install R packages (select a project or regular folder)
    • Added plot preferences to open plot images individually and an option to execute an ImageJ macro after plot creation (e.g. to scale a plot after creation, etc.)
    • Added an ‘Open File’ and ‘Save File’ dialog to create file templates with selected files in the R editor and the R-Shell view (see video)
    • Added a new shortcut to create an assign operator (‘ALT   _’)
    • Added a new shortcut to create a pipe operator (‘CTRL SHIFT M’)
    • Added simple editor templates for S3, S4, Reference and R6 classes
    • Improved the preferences navigation
    • Reorganized some preferences
    • Updated the parser to ANTLR 4.6
    R Markdown
    • Improved the R markdown editor
    • Header and R chunks are now displayed in the Outline view
    • Added spell checking to the rmarkdown editor

    • Compiled documents are now displayed in a special custom view of Bio7 if enabled (by default enabled)
    • Added font and color preferences to the editor
    • Added new browser preferences (use of JavaFX webkit browser or system browser) and additional preferences for the control of the output
    • Added preferences to compile the document automatically after a specific time interval (adjustable in the preferences – see video below)

    • Added code templates for popular markdown commands
    • Added an option to open word documents embedded in a custom view with Word (Windows only)

    • Added a key shortcut for the action to compile markdown documents
    • Improved the compile actions in the Bio7 toolbar (new icon) and removed the popup actions in the Navigator view
    • Added a key shortcut for the compilation
    • By default Latex documents are opened in the new PDF viewer (custom view)
    • Added new Bio7 LaTeX preferences to compile documents easily (without the need to create a LaTeX project with Texclipse)
    • Added new preferences to optional use XeLaTex and cleanup auxiliary files in the project folder
    • Added an option to compile BibTeX along with the *.tex file (several compilation iterations are performed)

    • Updated ImageJ to version 1.51m
    • Improved the ImageJ menus to display all ImageJ menus correctly and nested (with key shortcuts and separators)
    • ImageJ plugins now extend the respective SWT menu analogous to the ImageJ AWT menu.
    • Plugins, scripts and macros are now displayed in their defined menus or submenus.
    • Some popular image types can now be opened (double-click) directly from a Navigator folder (registered as a fake editor type)
    • Added two preferences to define the install location of plugins and macros (e.g external locations for teams).
    • Plugins now contribute to all ImageJ menus (SWT) like it is known from ImageJ
    • Dramatically improved the compatibility for ImageJ plugins (partially AWT)
    • Added a menu option to open images in a nested JavaFX panel (fullscreen on different monitors possible)
    • Added JavaScript support for ImageJ (the default editor is the text editor, but an advanced Eclipse JavaScript editor can easily be installed)
    • Added an ImageJ macro wizard action to create ImageJ macros
    • Added a JavaScript wizard action to create a JavaScript file
    • JavaScript and ImageJ macros can now be executed with a toolbar (JavaScript) action.


    • Updated WorldWind to version 2.1
    • Made improvements for the dark theme
    Installation: Windows:

    Just download the *.zip distribution file from and unzip it in your preferred location. Bio7 comes bundled with the latest Java Runtime Environment, R and Rserve distribution and works out of the box.


    Linux:

    Download and extract the installation file from

    For Linux you have to install R and Rserve.

    To install Rserve, open the R shell and then execute the menu action "Options->Install Rserve (coop. mode)". This will download and install Rserve in your default R library location; see the video below (please make sure that your default Linux R library install location has write permissions!).

    The special version of Rserve can also be downloaded here:

    For a manual installation, type the following command in the R prompt to install the compiled package (replace the path with your own file path!):

    install.packages("/Users/yourName/Downloads/Rserve_1.8-4_Mac_cooperative.tgz", repos = NULL)

    For more information please consult the Bio7 User Guide.


    To leave a comment for the author, please follow the link and comment on their blog: R.

    R⁶ — RStudio Server Client? Make An App For That!

    Fri, 04/07/2017 - 17:05

    (This article was first published on R –, and kindly contributed to R-bloggers)

    RStudio is a great way to work through analysis tasks, and I suspect most folks use the "desktop" version of the product on their local workstations.

    The fine folks at RStudio also make a server version (the codebase for RStudio is able to generate server or desktop and they are generally in 100% feature parity when it comes to interactive use). You only need to set it up on a compatible server and then access it via any modern web browser to get virtually the same experience you have on the desktop.

    I use RStudio Server as well as RStudio Desktop and have never been comfortable mixing web-browsing tasks and analysis tasks in the same browser (it's one of the many reasons I dislike Jupyter notebooks). I also keep many apps open and would inevitably try to cmd-tab (I'm on macOS) between apps to find the RStudio Server one, only to realize I should have been keyboard-tabbing through Chrome tabs.

    Now, it’s not too hard to fire up a separate Chrome or Safari instance to get a separate server but it’d be great if there was a way to make it “feel” more like an app — just like RStudio Desktop. Well, it turns out there is a way using nativefier.

    If you use the Slack standalone desktop client, the Atom text editor or a few other “modern” apps, they are pretty much just web pages wrapped in a browser shell using something like Electron. Jia Hao came up with the idea of being able to do the same thing for any web page.

    To create a standalone RStudio Server client for a particular RStudio Server instance, just do the following after installing nativefier:

    nativefier "" --name "RStudio @ Example"

    Replace the URL with the one you currently use in-browser (and, please consider using SSL/TLS when connecting over the public internet) and use any name that will be identifiable to you. You get a safe, standalone application and will never have to worry about browser crashes impacting your workflow.

    There are many customizations you can make to the app shell and you can even use your own icons to represent servers differently. It’s all super-simple to setup and get working.

    Note that for macOS folks there has been a way pre-nativefier to do this same thing with a tool called Fluid but it uses the Apple WebKit/Safari shell vs the Chrome shell and I prefer the Chrome shell and cross-platform app-making ability.

    Hopefully this quick R⁶ tip will help you corral your RStudio Server connections. And, don’t forget to join in on the R⁶ bandwagon and share your own quick tips, snippets and hints to help the broader R community level-up their #rstats work.

    To leave a comment for the author, please follow the link and comment on their blog: R –.

    Data Structures Exercises (Part-1)

    Fri, 04/07/2017 - 17:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    R programming has various data structures for efficient manipulation of data.
    The following is the list of data structures supported by R:

    1. Vectors
    2. Lists
    3. Matrix
    4. Data frame
    These exercises walk you through various operations on R data structures.

    Answers to the exercises are available here.
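    Before attempting the exercises, here is a minimal sketch of the four structures (all names used here are illustrative):

```r
v  <- c(1.5, 2.5, 3.5)                        # atomic vector (double type)
l  <- list(num = 1, chr = "a", lgl = TRUE)    # list: holds heterogeneous data
m  <- matrix(1:12, nrow = 3, ncol = 4,        # 3 x 4 matrix filled row-wise
             byrow = TRUE)
df <- data.frame(id = 1:2,                    # data frame: equal-length columns
                 name = c("Ann", "Bob"))

# quick type checks for each structure
is.vector(v); is.list(l); is.matrix(m); is.data.frame(df)
```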

    Exercise 1

    Create an atomic vector of character type, double type, logical type, integer type, complex type and raw type.

    Exercise 2

    Check whether each of the created data structures is of vector type, and check the class of each of them.

    Exercise 3

    Create a list of heterogeneous data, including numeric, character and logical vectors, and print the list.

    Exercise 4

    Create a matrix of 3 rows and 4 columns which stores the values 1 to 12, arranged row-wise.

    Exercise 5

    Create a matrix of 3 rows and 4 columns which stores random data, arranged column-wise.

    Exercise 6

    Create two 2 x 2 matrices with random values and perform all arithmetic operations with those two and print the resultant matrices

    Exercise 7

    Create a random matrix having values between 1 and 1000 and print the matrix and transpose of the matrix

    Exercise 8

    Create a data frame which contains details of 5 employees and display it.

    Exercise 9

    Create a data frame which contains details of 5 employees and display a summary of the data.

    Exercise 10

    Create a data frame of 5 employees, and display the details of the first and last employees and the names of all employees.

    Related exercise sets:
    1. Matrix exercises
    2. 3D plotting exercises
    3. Basic Operations Exercises
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory

    To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

    packcircles version 0.2.0 released

    Fri, 04/07/2017 - 12:05

    (This article was first published on Last Resort Software, and kindly contributed to R-bloggers)

    Version 0.2.0 of the packcircles package has just been published on CRAN. This package provides functions to find non-overlapping arrangements of circles in bounded and unbounded areas.

    The package now has a new circleProgressiveLayout function. It uses an efficient deterministic algorithm to arrange circles by consecutively placing each one externally tangent to two previously placed circles while avoiding overlaps. It was adapted from a version written in C by Peter Menzel, who lent his support to creating this R/Rcpp version.

    The underlying algorithm is described in the paper: Visualization of large hierarchical data by circle packing by Weixin Wang, Hui Wang, Guozhong Dai, and Hongan Wang. Published in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2006, pp. 517-520 (Available from ACM).

    Here is a small example of the new function, taken from the package vignette:

    library(packcircles)
    library(ggplot2)

    t <- theme_bw() +
      theme(panel.grid = element_blank(),
            axis.title = element_blank(),
            axis.text = element_blank(),
            axis.ticks = element_blank())

    theme_set(t)

    # circle areas
    areas <- 1:1000

    # arrange circles from small to large
    packing1 <- circleProgressiveLayout(areas)
    dat1 <- circleLayoutVertices(packing1)

    # arrange same circles from large to small
    packing2 <- circleProgressiveLayout( rev(areas) )
    dat2 <- circleLayoutVertices(packing2)

    dat <- rbind(
      cbind(dat1, set = 1),
      cbind(dat2, set = 2) )

    ggplot(data = dat, aes(x, y)) +
      geom_polygon(aes(group = id, fill = -id),
                   colour = "black", show.legend = FALSE) +
      scale_fill_distiller(palette = "RdGy") +
      coord_equal() +
      facet_wrap(~set,
                 labeller = as_labeller(
                   c('1' = "small circles first",
                     '2' = "big circles first")))

    The package vignette for the new function has more examples including how to use ggplot2 and ggiraph to create an SVG image of a circle layout with pre-defined colours and dynamic labelling.

    For those who prefer simpler things, the package can also be used quite successfully with base graphics.

    Note for existing users

    This new version has also seen a general tidy up of the existing functions in an attempt to make things consistent across the package. In particular:

    • circleLayout has been replaced by a new function circleRepelLayout. The new function accepts circle sizes as either areas or radii, with the default being area, unlike circleLayout which assumed sizes were radii. The type of size value can be specified using the sizetype argument.
    • circlePlotData has been replaced by a new function circleLayoutVertices. The new function has a sizetype argument to specify whether the input circle sizes are areas or radii. By default, radius is assumed as this matches the output of the layout functions (I’ll wait to hear if this causes any confusion).

    In both cases the original function has been retained for backwards compatibility but is deprecated and will be removed in a future release.

    To leave a comment for the author, please follow the link and comment on their blog: Last Resort Software.

    #3: Follow R-devel

    Fri, 04/07/2017 - 05:10

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    Welcome to the third post in the rarely relevant R recommendation series, or R4 for short.

    Today will be brief, but of some importance. In order to know where R is going next, few places provide a better vantage point than the actual ongoing development.

    A few years ago, I mentioned to Duncan Murdoch how straightforward the setup of my CRANberries feed (and site) was. After all, static blog compilers converting textual input to html, rss feed and whatnot have been around for fifteen years (though they keep getting reinvented). He took this to heart and built the (not too pretty) R-devel daily site (which also uses a fancy diff tool as it shows changes in NEWS) as well as a more general description of all available sub-feeds. I follow this mostly through blog aggregations — Google Reader in its day, now Feedly. A screenshot is below just to show that it doesn’t have to be ugly just because it is on them intertubes:

    This shows a particularly useful day when R-devel folded into the new branch for what will be the R 3.4.0 release come April 21. The list of upcoming changes is truly impressive and quite comprehensive — and the package registration helper, focus of posts #1 and #2 here, is but one of these many changes.

    One function I learned about that day is tools::CRAN_package_db(), a helper to get a single (large) data.frame with all package DESCRIPTION information. Very handy. Others may have noticed that CRAN repos now have a new top-level file PACKAGES.rds and this function does indeed just fetch it–which you could do with a similar one-liner in R-release as well. Still very handy.

    But do read about changes in R-devel and hence upcoming changes in R 3.4.0. Lots of good things coming our way.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

    Microsoft R Open 3.3.3 now available

    Fri, 04/07/2017 - 00:38

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    Microsoft R Open (MRO), Microsoft's enhanced distribution of open source R, has been upgraded to version 3.3.3, and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to R 3.3.3, upgrades the installer, and updates the bundled packages.

    R 3.3.3 makes just a few minor fixes compared to R 3.3.2 (see the full list of changes here), so you shouldn't encounter any compatibility issues when upgrading from MRO 3.3.2. For CRAN packages, MRO 3.3.3 points to a CRAN snapshot taken on March 15, 2017, but as always, you can use the built-in checkpoint package to access packages from an earlier date (for compatibility) or a later date (to access new and updated packages).

    MRO is supported on Windows, Mac and Linux. MRO 3.3.3 is 100% compatible with R 3.3.3, and you can use it with any of the 10,000+ packages available on CRAN. Here are some highlights of new packages released since the last MRO update.

    We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. To download Microsoft R Open, simply follow the link below.

    MRAN: Download Microsoft R Open


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    Weighted Linear Support Vector Machine

    Fri, 04/07/2017 - 00:25

    (This article was first published on DataScience+, and kindly contributed to R-bloggers)

    Consider the spam vs ham data from this site. Let us do some basic analysis of the data with R version 3.3.3, 64-bit, on a Windows machine.

    setwd("your directory")
    sms_data <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)

    Next, check the proportion of spam and ham in your data set

    prop.table(table(sms_data$type))

          ham      spam 
    0.8659849 0.1340151

    As you can see, the proportion of ham in the data set is almost 6.5 times higher than that of spam. Let us select some amount of data from this entire data set to train a model for our classifier.

    library(tm)
    simply_text <- Corpus(VectorSource(sms_data$text))
    cleaned_corpus <- tm_map(simply_text, tolower)
    cleaned_corpus <- tm_map(cleaned_corpus, removeNumbers)
    cleaned_corpus <- tm_map(cleaned_corpus, removeWords, stopwords())
    sms_dtm <- DocumentTermMatrix(cleaned_corpus)
    sms_train <- cleaned_corpus[1:3000]
    sms_test <- cleaned_corpus[3001:5574]
    freq_term <- findFreqTerms(sms_dtm, lowfreq = 10, highfreq = Inf)
    sms_freq_train <- DocumentTermMatrix(sms_train, list(dictionary = freq_term))
    sms_freq_test <- DocumentTermMatrix(sms_test, list(dictionary = freq_term))
    print(NCOL(sms_freq_train))
    print(NCOL(sms_freq_test))
    y_train <- factor(sms_data$type[1:3000])
    y_test <- factor(sms_data$type[3001:5574])
    prop.table(table(y_train))

    [1] 856
    [1] 856

    y_train
          ham      spam 
    0.8636667 0.1363333

    Notice that the proportion of spam and ham in the training data set is similar to that of the entire data set. One of the most widely used classifiers is the Linear Support Vector Machine. From my last post on Linear Support Vector Machines, you can see that for a Linear SVM we solve the following optimization problem.

    $$Minimize_{w,b}\frac{\vert\vert w\vert\vert^{2}}{2}$$
    subject to the constraint
    $$y_{i}(w^{T}x_{i}+b)\geq 1~\forall ~x_{i}~in ~Training~Data~set$$.

    Here \(w\) is a weight vector while \(b\) is a scalar also known as the bias. For the linearly inseparable case, we introduce slack penalties \(\xi_{i} \geq 0\) in the objective function and our constraints.
    $$Minimize_{w,b}(\frac{\vert\vert w\vert\vert^{2}}{2}+ C\sum_{i=1}^{n}\xi_{i})$$
    subject to the constraints
    $$(1)~y_{i}(w^{T}x_{i}+b)\geq 1-\xi_{i}~\forall ~x_{i}~in ~Training~Data~set~of~size~n$$ and
    $$(2)~\xi_{i}\geq 0~\forall ~n~points~in~the~Training~Data~set$$

    \(C\) is a user defined parameter which is known as regularization parameter. The regularization parameter is a special parameter. It tells the optimizer what it should minimize more between \(\frac{\vert\vert w\vert\vert^{2}}{2}\) and \(\sum_{i=1}^{n}\xi_{i}\). For example if \(C=0\), then only \(\frac{\vert\vert w\vert\vert^{2}}{2}\) will be minimized whereas if \(C\) is a large value, then \(\sum_{i=1}^{n}\xi_{i}\) will be minimized.
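    The role of \(C\) can be illustrated with a toy calculation (the numbers here are invented purely for exposition, not taken from the spam data):

```r
# toy values: squared norm of the weight vector and total slack
margin_term <- 4 / 2    # ||w||^2 / 2, with ||w||^2 = 4
slack_sum   <- 10       # sum of the slack variables xi

# the soft-margin objective as a function of C
objective <- function(C) margin_term + C * slack_sum

objective(0)     # 2: only the margin term counts
objective(100)   # 1002: the slack term dominates
```

With \(C = 0\) the slack term vanishes from the objective, while a large \(C\) makes mis-classification penalties dominate, which is why the choice of \(C\) matters for skewed classes.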

    In this example case, where the class proportions are so skewed, the choice of the regularization parameter is a concern. Notice that \(C\) is the scalar weight of the penalty of mis-classification. Intuitively, since the proportion of ham is more than six times higher than that of spam in the training data, the linear SVM may get biased towards classifying hams more accurately, as mis-classifying lots of ham will lead to more aggregated penalty.

    Thus, instead of using Linear SVM directly on such data set, it is better to use weighted Linear SVM where instead of using one regularization parameter, we use two separate regularization parameters, \(C_{1}, C_{2}\) where \(C_{1}\) (respectively \(C_{2}\)) is the weight on penalty of mis-classifying a ham sample (respectively a spam sample). So this time our objective function becomes,

    $$Minimize_{w,b}(\frac{\vert\vert w\vert\vert^{2}}{2}+ C_{1}\sum_{i=1}^{n_{1}}\xi_{i} +C_{2}\sum_{j=1}^{n_{2}}\xi_{j})$$
    where \(n_{1}\) (respective \(n_{2}\)) are the number of hams (respectively spams) in our training data.

    How to Fix the Values for \(C_{1},C_{2}\)

    One simple way is to set
    $$C_{1}=\frac{1}{n_{1}}$$ and
    $$C_{2}=\frac{1}{n_{2}}$$
    The R code for doing this is as follows:

    y <- sms_data$type
    wts <- 1/table(y)
    print(wts)

    y
            ham        spam 
    0.000207168 0.001338688

    Now train your weighted linear SVM. An easy way of doing so is as follows:

    library(e1071)
    sms_freq_matrx <- as.matrix(sms_freq_train)
    sms_freq_dtm <- as.data.frame(sms_freq_matrx)
    sms_freq_matrx_test <- as.matrix(sms_freq_test)
    sms_freq_dtm_test <- as.data.frame(sms_freq_matrx_test)
    trained_model <- svm(sms_freq_dtm, y_train, type = "C-classification",
                         kernel = "linear", class.weights = wts)

    Let us use the model for some predictions

    y_predict <- predict(trained_model, sms_freq_dtm_test)
    library(gmodels)
    CrossTable(y_predict, y_test, prop.chisq = FALSE)

                 | y_test 
       y_predict |       ham |      spam | Row Total | 
    -------------|-----------|-----------|-----------|
             ham |      2002 |       233 |      2235 | 
                 |     0.896 |     0.104 |     0.868 | 
                 |     0.895 |     0.689 |           | 
                 |     0.778 |     0.091 |           | 
    -------------|-----------|-----------|-----------|
            spam |       234 |       105 |       339 | 
                 |     0.690 |     0.310 |     0.132 | 
                 |     0.105 |     0.311 |           | 
                 |     0.091 |     0.041 |           | 
    -------------|-----------|-----------|-----------|
    Column Total |      2236 |       338 |      2574 | 
                 |     0.869 |     0.131 |           | 
    -------------|-----------|-----------|-----------|

    Of the 2236 hams, 2002 were correctly classified as ham while 234 were mis-classified as spam. The picture is less rosy for spams: of the 338 spams, only 105 were correctly classified as spam while 233 were mis-classified as ham.
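The per-class accuracies can be recomputed directly from the confusion counts with base R; the matrix below simply hard-codes the counts from the weighted-SVM cross table above:

```r
# Per-class recall from the weighted-SVM confusion counts
conf <- matrix(c(2002, 234,    # actual ham:  predicted ham / spam
                 233,  105),   # actual spam: predicted ham / spam
               nrow = 2,
               dimnames = list(predicted = c("ham", "spam"),
                               actual    = c("ham", "spam")))
recall <- diag(conf) / colSums(conf)
round(recall, 3)   # ham ~0.895, spam ~0.311
```

These are the same column proportions (0.895 and 0.311) reported in the cross table.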

    Now let us see what happens without using the weights.

    trained_model2 <- svm(sms_freq_dtm, y_train, type = "C-classification",
                          kernel = "linear")
    y_predict <- predict(trained_model2, sms_freq_dtm_test)
    library(gmodels)
    CrossTable(y_predict, y_test, prop.chisq = FALSE)

                 | y_test
       y_predict |       ham |      spam | Row Total |
    -------------|-----------|-----------|-----------|
             ham |      1922 |       224 |      2146 |
                 |     0.896 |     0.104 |     0.834 |
                 |     0.860 |     0.663 |           |
                 |     0.747 |     0.087 |           |
    -------------|-----------|-----------|-----------|
            spam |       314 |       114 |       428 |
                 |     0.734 |     0.266 |     0.166 |
                 |     0.140 |     0.337 |           |
                 |     0.122 |     0.044 |           |
    -------------|-----------|-----------|-----------|
    Column Total |      2236 |       338 |      2574 |
                 |     0.869 |     0.131 |           |
    -------------|-----------|-----------|-----------|

    Here the number of spams correctly classified increased, but the number of hams correctly classified decreased.
    Let us repeat our experiment by increasing the values of \(C_{1}\) and \(C_{2}\).

    wts <- 100/table(y)
    print(wts)

          ham      spam
    0.0207168 0.1338688

    The results presented below show that fewer spams are mis-classified, so the accuracy for spam classification has increased; however, the accuracy for ham has decreased.

                 | y_test
       y_predict |       ham |      spam | Row Total |
    -------------|-----------|-----------|-----------|
             ham |      1908 |       219 |      2127 |
                 |     0.897 |     0.103 |     0.826 |
                 |     0.853 |     0.648 |           |
                 |     0.741 |     0.085 |           |
    -------------|-----------|-----------|-----------|
            spam |       328 |       119 |       447 |
                 |     0.734 |     0.266 |     0.174 |
                 |     0.147 |     0.352 |           |
                 |     0.127 |     0.046 |           |
    -------------|-----------|-----------|-----------|
    Column Total |      2236 |       338 |      2574 |
                 |     0.869 |     0.131 |           |
    -------------|-----------|-----------|-----------|

    We continued our experiments further by setting

    wts <- 10/table(y)
    print(wts)

           ham       spam
    0.00207168 0.01338688

    The results are:

                 | y_test
       y_predict |       ham |      spam | Row Total |
    -------------|-----------|-----------|-----------|
             ham |      1931 |       230 |      2161 |
                 |     0.894 |     0.106 |     0.840 |
                 |     0.864 |     0.680 |           |
                 |     0.750 |     0.089 |           |
    -------------|-----------|-----------|-----------|
            spam |       305 |       108 |       413 |
                 |     0.738 |     0.262 |     0.160 |
                 |     0.136 |     0.320 |           |
                 |     0.118 |     0.042 |           |
    -------------|-----------|-----------|-----------|
    Column Total |      2236 |       338 |      2574 |
                 |     0.869 |     0.131 |           |
    -------------|-----------|-----------|-----------|

    One can repeat the experiment with different values of wts and observe how the trade-off between ham and spam accuracy shifts.
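Such a sweep over the weight scale is easy to automate. The sketch below runs it on a small synthetic imbalanced data set, since the post's document-term matrices are not reproduced here; the data and all object names are illustrative:

```r
library(e1071)
set.seed(1)
# Synthetic imbalanced two-class problem standing in for the ham/spam data
n1 <- 400; n2 <- 60
x <- rbind(matrix(rnorm(2 * n1), ncol = 2),
           matrix(rnorm(2 * n2, mean = 1.5), ncol = 2))
y <- factor(c(rep("ham", n1), rep("spam", n2)))

# Sweep the scale s in wts <- s/table(y) and report per-class recall
for (s in c(1, 10, 100)) {
  wts <- s / table(y)
  fit <- svm(x, y, type = "C-classification", kernel = "linear",
             class.weights = wts)
  conf <- table(predicted = predict(fit, x), actual = y)
  cat("s =", s, "| recall:", round(diag(conf) / colSums(conf), 3), "\n")
}
```

Because `class.weights` multiplies the `cost` parameter per class, scaling all weights by `s` also scales the overall regularization, which is why the different scales above give different results even though the ham/spam weight ratio stays fixed.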

      Related Post

      1. Logistic Regression Regularized with Optimization
      2. Analytical and Numerical Solutions to Linear Regression Problems
      3. How to create a loop to run multiple regression models
      4. Regression model with auto correlated errors – Part 3, some astrology
      5. Regression model with auto correlated errors – Part 2, the models

      To leave a comment for the author, please follow the link and comment on their blog: DataScience+.