R-bloggers: R news and tutorials contributed by 750 R bloggers

Singular Value Decomposition (SVD): Tutorial Using Examples in R

Wed, 08/02/2017 - 05:38

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

If you have ever looked with any depth at statistical computing for multivariate analysis, there is a good chance you have come across the singular value decomposition (SVD). It is a workhorse for techniques that decompose data, such as correspondence analysis and principal components analysis. In this post I explain, at an intuitive level, how it works. I demonstrate this using examples in R. If you have not come across the SVD before, skip this post! It is only for that rare connoisseur, who has heard of it, wants to understand it a bit better, but is averse to lots of maths.

A singular value decomposition case study in R

The table below shows the standardized residuals from a contingency table showing the relationship between education and readership of a newspaper. The R code used to generate the table is below. More about this data and R code, and why it is interesting, will be available in my forthcoming post about the maths of correspondence analysis.

N = matrix(c(5, 18, 19, 12, 3, 7, 46, 29, 40, 7, 2, 20, 39, 49, 16),
           nrow = 5,
           dimnames = list(
               "Level of education" = c("Some primary", "Primary completed", "Some secondary", "Secondary completed", "Some tertiary"),
               "Category of readership" = c("Glance", "Fairly thorough", "Very thorough")))
O = N / sum(N)                  # table of proportions
E = rowSums(O) %o% colSums(O)   # expected proportions
Z = (O - E) / sqrt(E)           # standardized residuals

How to compute the SVD

The table above is a matrix of numbers. I am going to call it Z. The singular value decomposition is computed using the svd function. The following code computes the singular value decomposition of the matrix Z, and assigns it to a new object called SVD, which contains one vector, d, and two matrices, u and v. The vector, d, contains the singular values. The first matrix, u, contains the left singular vectors, and v contains the right singular vectors. The left singular vectors represent the rows of the input table, and the right singular vectors represent their columns.

SVD = svd(Z)

Recovering the data

The singular value decomposition (SVD) has four useful properties. The first is that these two matrices and vector can be “multiplied” together to re-create the original input data, Z. In the data we started with (Z), we have a value of -0.064751 in the 5th row, 2nd column. We can work this out from the results of the SVD by multiplying each element of d with the corresponding elements of the 5th row of u and the 2nd row of v.

That is: -0.064751  = 0.2652708*0.468524*(-0.4887795) + 0.1135421*(-0.0597979)*0.5896041 + 0*(-0.6474922)*(-0.6430097)

This can be achieved in R using the code:

sum(SVD$d * SVD$u[5, ] * SVD$v[2, ])

Better yet, if we want to recompute the whole table of numbers at once, we can use a bit of matrix algebra:

SVD$u %*% diag(SVD$d) %*% t(SVD$v)

Now, at first glance this property may not seem so useful. Indeed, it does not even seem very clever. We started with a table of 15 numbers. Now, we have one vector and two tables, containing a total of 27 numbers. We seem to be going backwards!

Reducing the data

The second useful property of the SVD relates to the values in d. They are sorted in descending order (ties are possible). Why is this important? Take a look at the last value in d. It is 2.71825390754254E-17. In reality, this is 0 (computers struggle to compute 0 exactly). When recovering the data, we can ignore the last value of d, and also the last column of each of u and v. Their values are multiplied by 0 and thus are irrelevant. Now, we only have 18 numbers to look at. This is still more than the 15 we started with.

The values of d tell us the relative importance of each of the columns in u and v in describing the original data. We can compute the variance in the original data (Z) that is explained by the columns by first squaring the values in d, and then expressing these as proportions. If you run the following R code, it shows that the first dimension explains 85% of the variance in the data.

variance.explained = prop.table(svd(Z)$d^2)

So, if we are happy to ignore 15% of the information in the original data, we only need to look at the first column in u and the first column in v. Now we have to look at less than half the numbers that we started with.
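To make this concrete, here is a minimal sketch of keeping only the first column of u and v. The matrix Z is rebuilt from the earlier code so the snippet runs on its own:

```r
# Rebuild Z from the earlier code so this snippet is self-contained
N <- matrix(c(5, 18, 19, 12, 3, 7, 46, 29, 40, 7, 2, 20, 39, 49, 16), nrow = 5)
O <- N / sum(N)
E <- rowSums(O) %o% colSums(O)
Z <- (O - E) / sqrt(E)
SVD <- svd(Z)

# Rank-1 approximation: first singular value and first columns of u and v only
Z1 <- SVD$d[1] * SVD$u[, 1, drop = FALSE] %*% t(SVD$v[, 1, drop = FALSE])

# Proportion of variance captured by the first dimension (about 85%)
prop.table(SVD$d^2)[1]
```

Comparing Z1 with Z shows how much of each residual the first dimension alone recovers.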

Halving the number of numbers to consider may not seem like a sufficient benefit. However, the bigger the data set, the bigger the saving. For example, if we had a table with 20 rows and 20 columns, we may only need to look at the first couple of columns, only needing to consider 10% of the number of values that we started with. This is the basic logic of techniques like principal components analysis and correspondence analysis. In addition to reducing the number of values we need to look at, this also allows us to chart the values, which saves more time. There is rarely a good way to chart 20 columns of data, but charting 2 columns is usually straightforward.

Two more properties

The third property of the SVD is that the rows of u represent the row categories of the original table, and the rows of v represent the column categories. The fourth property is that the columns of u are orthogonal to each other, and the columns of v are orthogonal to each other. With these two properties combined, we end up with considerable simplicity in future analyses. For example, this allows us to compute uncorrelated principal components in principal components analysis and to produce plots of correspondence analysis. I will walk through this in detail in my forthcoming post on the math of correspondence analysis.
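The orthogonality property is easy to verify numerically. A minimal check (again rebuilding Z so the snippet runs on its own):

```r
# Rebuild Z and its SVD from the earlier code
N <- matrix(c(5, 18, 19, 12, 3, 7, 46, 29, 40, 7, 2, 20, 39, 49, 16), nrow = 5)
O <- N / sum(N)
E <- rowSums(O) %o% colSums(O)
Z <- (O - E) / sqrt(E)
SVD <- svd(Z)

# crossprod(u) computes t(u) %*% u; orthonormal columns give the identity matrix
round(crossprod(SVD$u), 10)
round(crossprod(SVD$v), 10)
```

Both calls return (up to rounding error) a 3 x 3 identity matrix, confirming that the columns are mutually orthogonal unit vectors.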

All the R code in this post has been run using Displayr. Anyone can explore SVD and the R code used in this post by logging into Displayr.


To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Five kinds of weather you’ll meet in America

Wed, 08/02/2017 - 05:06

(This article was first published on R – Decision Science News, and kindly contributed to R-bloggers)



The USA is a large country. How different are people’s experiences of the weather depending on where they live?

To look into this question, we downloaded high temperature data for over 1,300 airport weather stations in the contiguous USA for every day for five years (2012-2016 inclusive).

We then used k-means clustering, a workhorse of machine learning, to cluster weather stations according to how similar their high temperatures are.

After some exploring, we settled on five clusters because it captures the gist of what is going on.
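The clustering step can be sketched with base R's kmeans. The data below is simulated (three synthetic station groups rather than the authors' actual 1,300-station dataset, which used five clusters), but the mechanics are the same:

```r
set.seed(1)
# 90 hypothetical stations in three climate groups, 52 weekly mean highs each
base <- rep(c(30, 55, 80), each = 30)   # cold, mild, hot baselines (degrees F)
temps <- t(sapply(base, function(b)
  b + 25 * sin(2 * pi * (1:52) / 52) + rnorm(52, sd = 2)))

# Cluster stations on their full weekly temperature profiles
km <- kmeans(temps, centers = 3, nstart = 25)
table(km$cluster, rep(c("cold", "mild", "hot"), each = 30))
```

With well-separated groups like these, each cluster lines up cleanly with one climate group; on real data the boundaries are fuzzier, which is why the authors explored several values of k before settling on five.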

The result is shown above, where the letters A through E denote the different clusters (which were ordered by their average temperature in the last week of the year). We see broad East-West stripes, with a few patches of cooler temperatures in the Rocky Mountains, and something unusual going on in coastal California and Oregon.

How different are the clusters? To look at this, we plot the average high temperature in each cluster for each week of the year.


This was eye-opening, and gave us two basic generalizations about the weather in the USA:

1. As you move North and South, the temperature patterns are similar, just vertically shifted.
2. The Pacific coast is different.

On the Pacific coast, temperatures are pretty steady over the year. California and Florida both have nice warm winters, but when you look at the summers, you can see why they put the movie studios in Hollywood. Low variance makes it easy to plan.

Speaking of variance, look how cluster A (Minnesota and Maine) is actually hotter than cluster D (Pacific Coast) around the middle of the year.

Another cool factoid is that the American experience is pretty similar in summer (less than 20 degrees between cluster A and E) and highly varied in winter (about 45 degrees between cluster D and cluster A).

R, ggplot2, tidyverse, etc. code for those who wish to reproduce the analysis.

We scraped the temperatures ourselves, but we’ll save you the trouble and let you download the temperature data here. Just create a subdirectory called “data” and expand it there. Leave the 5 yearly files in gzip (.gz) format, because R reads and writes .gz files seamlessly.

The post Five kinds of weather you’ll meet in America appeared first on Decision Science News.


To leave a comment for the author, please follow the link and comment on their blog: R – Decision Science News.

A Postcard from JSM

Wed, 08/02/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Baltimore has the reputation of being a tough town: hot in the summer and gritty, but the convention center hosting the Joint Statistical Meetings is a pretty cool place to be. There are thousands of people here and so many sessions (over 600) that it’s just impossible to get an overview of all that’s going on. So, here are a couple of snapshots from an R-focused statistical tourist.

First Snapshot: What’s in a Vector?

Yesterday, Gabe Becker showed some eye opening work he is doing with Luke Tierney and Thomas Kalibera on features that are making their way into the R core. Gabe started his presentation by running the following code:

x = 1:1e15
head(x)
tail(x)
x < 4
sort(x, decreasing = TRUE)

Don’t try this at home. It will almost certainly be unpleasant if you do, but it all ran instantaneously on Gabe’s laptop. This is a simple example of the new “ALTREP Framework” that promises considerable performance improvements to basic statistical calculations by exploiting two insights:

  1. Sometimes things are easy if you have some previous knowledge about your data
  2. Maybe it’s not always necessary to work with contiguous blocks of memory

For example, Gabe’s code ran so quickly because his new experimental version of R “knew” that x was sorted and that there were no NAs. As Gabe put it: “Knowing is a lot more than half the battle”. Another thing that R will eventually “know” is not to convert numbers to strings until absolutely necessary. Preliminary work suggests that this feature will provide a significant speedup to ordinary things like working with the model design matrix.

A really exciting aspect of the new work is that R will export the ALTREP Framework, making it possible for package developers to take advantage of knowledge about their particular data structures. If things go well, the new framework should be part of a 2018 release.

Second Snapshot: Statistical Learning with Big Data

In his talk on Monday, Professor Rob Tibshirani covered considerable ground that included definitions of Machine Learning and Statistical Learning, “A Consumer Report” for five popular machine learning algorithms and three case studies, one of which involved the use of a relatively new algorithm: Generalized Random Forests. The definitions:

  • Machine learning constructs algorithms that can learn from data
  • Statistical learning is a branch of Statistics that was born in response to Machine Learning, emphasizing statistical models and assessment of uncertainty

capture, I think, the difference between the computer science and statistics mindsets.

The Consumer Report, captured in the following table, summarizes Professor Tibshirani’s current thinking on the strengths and weaknesses of the machine learning algorithms.

Although the table does offer definite advice on model selection, Professor Tibshirani also advised that practitioners experiment with several methods when dealing with difficult problems.

That’s it for now: wish you were here. I will write again soon.



To leave a comment for the author, please follow the link and comment on their blog: R Views.

Transfer Learning with Keras in R

Wed, 08/02/2017 - 02:00

(This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

In my last posts (here and here), I described how one can detect logos in images with R. The first results were promising, achieving a classification accuracy of ~50%. In this post I will detail how to do transfer learning (using a pre-trained network) to further improve the classification accuracy.

The dataset is a combination of the Flickr27-dataset, with 270 images of 27 classes and self-scraped images from google image search. In case you want to reproduce the analysis, you can download the set here.

In addition to the previous post, this time I wanted to use pre-trained image models, to see how they perform on the task of identifying brand logos in images.
First, I tried to adapt the official example on the Keras-rstudio website.

Here is a copy of the instructions:

# create the base pre-trained model
base_model <- application_inception_v3(weights = 'imagenet', include_top = FALSE)

# add our custom layers
predictions <- base_model$output %>%
  layer_global_average_pooling_2d() %>%
  layer_dense(units = 1024, activation = 'relu') %>%
  layer_dense(units = 200, activation = 'softmax')

# this is the model we will train
model <- keras_model(inputs = base_model$input, outputs = predictions)

# first: train only the top layers (which were randomly initialized)
# i.e. freeze all convolutional InceptionV3 layers
for (layer in base_model$layers)
  layer$trainable <- FALSE

# compile the model (should be done *after* setting layers to non-trainable)
model %>% compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy')

# train the model on the new data for a few epochs
model %>% fit_generator(...)

# at this point, the top layers are well trained and we can start fine-tuning
# convolutional layers from inception V3. We will freeze the bottom N layers
# and train the remaining top layers.

# let's visualize layer names and layer indices to see how many layers
# we should freeze:
layers <- base_model$layers
for (i in 1:length(layers))
  cat(i, layers[[i]]$name, "\n")

# we chose to train the top 2 inception blocks, i.e. we will freeze
# the first 172 layers and unfreeze the rest:
for (i in 1:172)
  layers[[i]]$trainable <- FALSE
for (i in 173:length(layers))
  layers[[i]]$trainable <- TRUE

# we need to recompile the model for these modifications to take effect
# we use SGD with a low learning rate
model %>% compile(
  optimizer = optimizer_sgd(lr = 0.0001, momentum = 0.9),
  loss = 'categorical_crossentropy'
)

# we train our model again (this time fine-tuning the top 2 inception blocks
# alongside the top Dense layers)
model %>% fit_generator(...)

It is basically a three-step process: 1) load an existing model and add some layers, 2) train the extended model on your own data, 3) set more layers trainable and fine-tune the model on your own data.

I have to admit that I struggled to make it work on my data. That’s why I will try to detail my approach. Hopefully it helps you to get better started with transfer learning.

Here is the full example that I got to work:

require(keras)

### Keras transfer learning example

################### Section 1 #########################
img_width <- 32
img_height <- 32
batch_size <- 8
train_directory <- "flickrData/train"
test_directory <- "flickrData/test"

train_generator <- flow_images_from_directory(train_directory,
  generator = image_data_generator(),
  target_size = c(img_width, img_height), color_mode = "rgb",
  class_mode = "categorical", batch_size = batch_size,
  shuffle = TRUE, seed = 123)

validation_generator <- flow_images_from_directory(test_directory,
  generator = image_data_generator(),
  target_size = c(img_width, img_height), color_mode = "rgb",
  classes = NULL, class_mode = "categorical", batch_size = batch_size,
  shuffle = TRUE, seed = 123)

train_samples = 498
validation_samples = 177

################### Section 2 #########################
# base_model <- application_inception_v3(weights = 'imagenet', include_top = FALSE)
base_model <- application_vgg16(weights = 'imagenet', include_top = FALSE)  # use vgg16, as inception won't converge

################### Section 3 #########################
# add your custom layers
predictions <- base_model$output %>%
  layer_global_average_pooling_2d(trainable = T) %>%
  layer_dense(64, trainable = T) %>%
  layer_activation("relu", trainable = T) %>%
  layer_dropout(0.4, trainable = T) %>%
  layer_dense(27, trainable = T) %>%   # important: adapt to fit the 27 classes in the dataset!
  layer_activation("softmax", trainable = T)

# this is the model we will train
model <- keras_model(inputs = base_model$input, outputs = predictions)

################### Section 4 #########################
for (layer in base_model$layers)
  layer$trainable <- FALSE

################### Section 5 #########################
model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(lr = 0.003, decay = 1e-6),  # play with the learning rate
  metrics = "accuracy"
)

hist <- model %>% fit_generator(
  train_generator,
  steps_per_epoch = as.integer(train_samples / batch_size),
  epochs = 20,
  validation_data = validation_generator,
  validation_steps = as.integer(validation_samples / batch_size),
  verbose = 2
)

# saveable data frame object
histDF <- data.frame(acc = unlist(hist$history$acc),
                     val_acc = unlist(hist$history$val_acc),
                     val_loss = unlist(hist$history$val_loss),
                     loss = unlist(hist$history$loss))

The key change to the RStudio sample code is to use a different pre-trained model. I use VGG16 instead of Inception_v3, as the latter would not converge. (The Inception model would not pick up any information, and accuracy remained around the base rate.)
But let’s see what the code does step by step.
In section 1 the image data is prepared and loaded. One setting that is important to set while loading the images is class_mode = "categorical", as we have 27 different image classes/labels to assign. In section 2 we load the pre-trained image model.
In section 3 we add custom layers. (In this case I copied the last layers from the previous post.) It is important to set the last layers to the number of labels (27) and the activation function to softmax.
In section 4 we set the layers of the loaded image model to non-trainable.
Section 5 sets the optimiser and finally trains the model. If the training process does not show improvements in terms of decreasing loss, try to increase the learning rate.
The learning process is documented in the hist-object, which can be easily plotted.

After 50 training epochs, the accuracy is at 55% on the training set and 35% on the validation set.
I assume that the accuracy can be further improved by training the full model, or at least setting more layers trainable and fine-tuning the full model as detailed in the RStudio example. However, so far I have tried various parameter settings without success. There is basically no improvement on the established 50% yet.

I will update the post, as soon as I have figured it out.


To leave a comment for the author, please follow the link and comment on their blog: Florian Teschner.

A modern database interface for R

Tue, 08/01/2017 - 23:41

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

At the useR! conference last month, Jim Hester gave a talk about two packages that provide a modern database interface for R. Those packages are the odbc package (developed by Jim and other members of the RStudio team), and the DBI package (developed by Kirill Müller with support from the R Consortium).

To communicate with databases, a common protocol is ODBC. ODBC is an open, cross-platform standard for interfacing with databases, whatever the database software and variant of the SQL language it implements. R has long had the RODBC package created by Brian Ripley, but the new odbc package provides an updated alternative.

The odbc package is a from-the-ground-up implementation of an ODBC interface for R that provides native support for additional data types (including dates, timestamps, raw binary, and 64-bit integers) and parameterized queries. The odbc package provides connections with any ODBC-compliant database, and has been comprehensively tested on SQL Server, PostgreSQL and MySQL. Benchmarks show that it's also somewhat faster than RODBC: 3.2 times faster for reads, and 1.9 times faster for writes.

With the odbc package (and the DBI package, which provides low-level connectivity to the database), you create a database connection like this:

con <- dbConnect( ... ) # provide server, username, password

With the connection established, you can then use functions like dbListTables (get a list of tables in the database), dbReadTable (return the data from a table), and dbGetQuery (execute a SQL query in the database, and return the results). That's pretty standard stuff. But the real power comes in being able to use high-level functions from the dplyr package and have the data processing run in the database, instead of in the local R session.
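A minimal, self-contained sketch of those functions, using an in-memory SQLite database via the RSQLite package as a stand-in for a real server (the table name and data here are invented for illustration):

```r
library(DBI)

# In-memory SQLite database (assumed stand-in; the talk used SQL Server)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "ussales", data.frame(state = c("CA", "NY"), amount = c(100, 200)))

tabs     <- dbListTables(con)                                     # tables in the database
all_rows <- dbReadTable(con, "ussales")                           # pull back a whole table
n        <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM ussales")  # run an arbitrary query

dbDisconnect(con)
```

The same DBI calls work unchanged against Postgres, MySQL, or an ODBC connection; only the dbConnect line differs.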

When the dbplyr package is installed, you can create a "tibble" from a database connection, like this:

salesdata <- tbl(con, "ussales")

It doesn't matter how big the ussales table is, because no data is brought into R until you perform an operation on the salesdata object. For example, you can use a dplyr pipe to get the first 20 records from the state of California, like this:

salesdata %>% filter(state == "CA") %>% head(20)

Behind the scenes, this creates a SQL query (which you can inspect with the show_query function, if you like) to move the computation from within R to the database. This process works for all the dplyr verbs: filter becomes WHERE, left_join becomes LEFT JOIN, group_by becomes GROUP BY, etc. In every case (including the example above) all the filtering happens in the database, and only 20 rows of data are actually returned to R.
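Here is a self-contained sketch of that lazy behavior, again with an in-memory SQLite database standing in for SQL Server (table name and data are invented):

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "ussales",
             data.frame(state = c("CA", "CA", "NY"), amount = c(100, 150, 200)))

salesdata <- tbl(con, "ussales")    # a lazy reference; no data pulled into R yet
q <- salesdata %>% filter(state == "CA") %>% head(20)

show_query(q)         # prints the SQL that will run in the database
result <- collect(q)  # only now are the (at most 20) matching rows returned to R

dbDisconnect(con)
```

Printing the query before collecting is a good habit: it shows exactly which work has been pushed down to the database.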

You can see all of this in action in Jim's talk embedded below. (Jim's slides are also available here.) In the demo starting at the 8:20 mark, Jim connects to SQL Server (here running in a VM on his laptop, in non-demo situations more likely to be on a remote machine). You'll then see him query tables on the database, and perform dplyr operations on the data in the process. You'll even get a lesson on the importance of sanitizing user inputs to prevent SQL injection, and a solution based on parameterized queries. 

The odbc package is built on top of the DBI package, which provides a low-level interface to databases and some of the functions used in Jim's demo above. The DBI package has been around for a while, but has recently undergone a major upgrade thanks to Kirill Müller and funding from the R Consortium. If you're interested in the implementation of the DBI package and how it's evolving, check out Kirill's talk from useR! 2017: Improving DBI.

The odbc and DBI packages are available on CRAN now. To use them with your database, you'll also need an ODBC driver installed on the local machine. Most likely you already have one (Windows includes an ODBC driver compatible with SQL Server, for example), but if not RStudio provides a useful guide to finding ODBC drivers (which also come bundled with commercial RStudio products). 

Channel9: odbc – A modern database interface


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

HebRew (using Hebrew in R)

Tue, 08/01/2017 - 22:54

(This article was first published on, and kindly contributed to R-bloggers)

Adi Sarid (Tel Aviv University and the Sarid Research Institute LTD.), July 2017

Background

A while back I participated in an R workshop, in the annual convention of the Israeli Association for Statistics. I had the pleasure of talking with Tal Galili and Jonathan Rosenblatt which indicated that a lot of Israeli R users run into difficulties with Hebrew with R.
My firm opinion is that it’s best to keep everything in English, but sometimes you simply don’t have a choice. For example, I had to prepare a number of R Shiny dashboards for Hebrew-speaking clients, hence Hebrew was the way to go, in a kind of English-Hebrew “mishmash” (mix).
I happened to run into a lot of such difficulties in the past, so here are a few pointers to get you started when working in R with Hebrew.
This post deals with Reading and writing files which contain Hebrew characters.
Note, there is also a bit to talk about in the context of Shiny apps which contain Hebrew and using right-to-left in shiny apps, and using Hebrew variable names. Both work with some care, but I won’t cover them here.
If you have any other questions you’d like to see answered, feel free to contact me.

Reading and writing files with Hebrew characters

R can read and write files in many formats. The common formats for small to medium data sets include comma separated values (*.csv) and Excel files (*.xlsx, *.xls). Each such read/write action is facilitated using some kind of “encoding”. Encoding, in simple terms, is a definition of a character set which helps your operating system interpret and represent characters as they should appear (for example, Hebrew characters: לדוגמה, תווים בעברית). There are a number of relevant character sets (encodings) where Hebrew is concerned:

  • UTF-8
  • ISO 8859-8
  • Windows-1255

When you try to read files in which there are Hebrew characters, I usually recommend trying them in that order: UTF-8 is commonly used by a lot of applications, since it covers a lot of languages.
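That trial order can be turned into a small loop. A sketch, with a hypothetical file name; the snippet writes a UTF-8 sample file first so it can run on its own. Note that a wrong encoding often produces garbled text rather than an error, so you should still inspect the result of whichever read succeeds:

```r
# Write a small UTF-8 CSV (hypothetical example file, Hebrew values)
f <- file("example_utf8.csv", open = "w", encoding = "UTF-8")
writeLines(c("name,city", "\u05e8\u05d5\u05e0\u05d9,\u05d7\u05d9\u05e4\u05d4"), f)
close(f)

# Try the recommended encodings in order until one parses without error
for (enc in c("UTF-8", "ISO-8859-8", "Windows-1255")) {
  dat <- tryCatch(read.csv("example_utf8.csv", fileEncoding = enc),
                  error = function(e) NULL)
  if (!is.null(dat)) {
    message("Read succeeded with encoding: ", enc)
    break  # inspect head(dat) to confirm the text is not garbled
  }
}
```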

Using csv files with Hebrew characters

Here’s an example for something that can go wrong, and a possible solution. In this case I’ve prepared a csv file encoded with UTF-8. When using R’s standard read.csv function, this is what happens: <- read.csv("")
## ן...... X...... X............
## 1 ׳¨׳•׳ ׳™ 25 ׳—׳™׳₪׳”
## 2 ׳ž׳•׳˜׳™ 77 ׳”׳¨׳¦׳œ׳™׳”
## 3 ׳“׳ ׳™ 13 ׳×׳œ-׳׳‘׳™׳‘ ׳™׳₪׳•
## 4 ׳¨׳¢׳•׳× 30 ׳§׳¨׳™׳× ׳©׳ž׳•׳ ׳”
## 5 ׳“׳ ׳” 44 ׳‘׳™׳× ׳©׳׳Ÿ

Oh boy, that’s probably not what the file’s author had in mind. Let’s try to instruct read.csv to use a different encoding. <- read.csv("", encoding = "UTF-8")
## X.U.FEFF.שם גיל מגורים
## 1 רוני 25 חיפה
## 2 מוטי 77 הרצליה
## 3 דני 13 תל-אביב יפו
## 4 רעות 30 קרית שמונה
## 5 דנה 44 בית שאן

A bit better, isn’t it? However, not perfect. We can read the Hebrew, but there is a weird thing in the header: “X.U.FEFF” (the file’s byte-order mark leaking into the first column name).
A better way to read and write files (for much more than just encoding aspects – it’s also quicker at reading large files) is using the readr package, which is part of the tidyverse. On a side note, if you haven’t already, run install.packages("tidyverse"); it’s a must. It includes readr but a lot more goodies (read on).
Now, for some tools you get with readr:

library(readr)
locale("he")
## <locale>
## Numbers:  123,456.78
## Formats:  %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days:   יום ראשון (יום א׳), יום שני (יום ב׳), יום שלישי (יום ג׳), יום
##         רביעי (יום ד׳), יום חמישי (יום ה׳), יום שישי (יום ו׳), יום
##         שבת (שבת)
## Months: ינואר (ינו׳), פברואר (פבר׳), מרץ (מרץ), אפריל (אפר׳), מאי (מאי),
##         יוני (יוני), יולי (יולי), אוגוסט (אוג׳), ספטמבר (ספט׳),
##         אוקטובר (אוק׳), נובמבר (נוב׳), דצמבר (דצמ׳)
## AM/PM:  לפנה״צ/אחה״צ

guess_encoding("")
## # A tibble: 2 × 2
##   encoding confidence
##      <chr>      <dbl>
## 1    UTF-8       1.00
## 2   KOI8-R       0.98

First we used locale(), which knows the date format and default encoding for the language (UTF-8 in this case). On its own, locale() does nothing more than output the specs of the locale, but when used in conjunction with read_csv it tells read_csv everything it needs to know. Also note the use of guess_encoding, which reads the first “few” lines of a file (10,000 is the default) and helps us, well… guess the encoding of a file. You can see that readr is pretty confident we need UTF-8 here (and 98% confident we need KOI8-R, a Russian encoding, but the first option wins here…). <- read_csv(file = "",
                locale = locale(date_names = "he", encoding = "UTF-8"))
## Parsed with column specification:
## cols(
##   שם = col_character(),
##   גיל = col_integer(),
##   מגורים = col_character()
## )
## # A tibble: 5 × 3
##       שם   גיל      מגורים
##    <chr> <int>       <chr>
## 1   רוני    25        חיפה
## 2   מוטי    77      הרצליה
## 3    דני    13 תל-אביב יפו
## 4   רעות    30  קרית שמונה
## 5    דנה    44     בית שאן

Awesome, isn’t it? Note that the result is a tibble and not a data.frame (read about tibbles).
The readr package has tons of functions and features to help us with reading (and writing) files and controlling the encoding, so I definitely recommend it. By the way, try using read_csv without setting the locale parameter and see what happens.

What about files saved by Excel?

Excel files are not the best choice for storing datasets, but the format is extremely common for obvious reasons.

CSV files which were saved by excel

In the past, I ran into a lot of difficulties trying to load CSV files which were saved by Excel into R. Excel seems to save them in either “Windows-1255” or “ISO-8859-8”, instead of “UTF-8”. The default read by read_csv might yield something like “” instead of “שלום”. In other cases you might get a “multibyte error”. Just make sure you check the “Windows-1255” or “ISO-8859-8” encodings if the standard UTF-8 doesn’t work well (i.e., use read_csv(file, locale = locale(encoding = "ISO-8859-8"))).

Reading directly from Excel

Also, if the original is in Excel, you might want to consider reading it directly from the excel file (skipping CSVs entirely). There are a number of packages for reading excel files and I recommend using readxl, specifically read_xlsx or read_xls will do the trick (depending on file format). You don’t even have to specify the encoding, if there are Hebrew characters they will be read as they should be.
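A minimal sketch of that approach. The file and its contents here are invented; the writexl package is used only to create a demo workbook to read back, so the snippet can run on its own:

```r
library(writexl)  # used only to create a hypothetical demo file
library(readxl)

# Create an .xlsx file containing Hebrew text
write_xlsx(data.frame(name = "\u05e8\u05d5\u05e0\u05d9",   # "רוני"
                      city = "\u05d7\u05d9\u05e4\u05d4"),  # "חיפה"
           "hebrew_demo.xlsx")

# Read it back: no encoding argument is needed, Hebrew comes through as-is
dat <- read_xlsx("hebrew_demo.xlsx")
dat
```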


For reading CSV files with Hebrew characters, it’s very convenient to use readr. The package has a lot of utilities for language encoding and localization, such as guess_encoding() and locale().
If the original data is in Excel, you might want to skip the CSV and read the data directly from the Excel file using the readxl package.
Sometimes reading files involves a lot of trial and error, but eventually it will work.
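As a self-contained illustration of the encoding round-trip described above (a sketch assuming readr is installed and your platform’s iconv supports Windows-1255; the sample file is created on the fly and the column names are made up):

```r
library(readr)

# Write a tiny CSV to disk in Windows-1255, as Excel on Windows might,
# then read it back by telling read_csv() the encoding explicitly.
f <- tempfile(fileext = ".csv")
con <- file(f, open = "w", encoding = "CP1255")
writeLines(c("name,city", "רוני,חיפה"), con)
close(con)

guess_encoding(f)   # readr's best guesses for the file's encoding

d <- read_csv(f, locale = locale(encoding = "Windows-1255"))
d$city              # "חיפה", decoded correctly
```

The same two-step routine (guess the encoding, then pass it to locale()) is usually all the trial and error you need.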

Don’t give up!


To leave a comment for the author, please follow the link and comment on their blog: offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R⁶ — Reticulating Parquet Files

Tue, 08/01/2017 - 21:50

(This article was first published on R –, and kindly contributed to R-bloggers)

The reticulate package provides a very clean & concise interface bridge between R and Python which makes it handy to work with modules that have yet to be ported to R (going native is always better when you can do it). This post shows how to use reticulate to create parquet files directly from R using reticulate as a bridge to the pyarrow module, which has the ability to natively create parquet files.

Now, you can create parquet files through R with Apache Drill — and, I’ll provide another example for that here — but, you may have need to generate such files and not have the ability to run Drill.

The Python parquet process is pretty simple since you can convert a pandas DataFrame directly to a pyarrow Table which can be written out in parquet format with pyarrow.parquet. We just need to follow this process through reticulate in R:

library(reticulate)

pd <- import("pandas", "pd")
pa <- import("pyarrow", "pa")
pq <- import("pyarrow.parquet", "pq")

mtcars_py  <- r_to_py(mtcars)
mtcars_df  <- pd$DataFrame$from_dict(mtcars_py)
mtcars_tab <- pa$Table$from_pandas(mtcars_df)
pq$write_table(mtcars_tab, path.expand("~/Data/mtcars_python.parquet"))

I wouldn’t want to do that for ginormous data frames, but it should work pretty well for modest use cases (you’re likely using Spark, Drill, Presto or other “big data” platforms for creation of larger parquet structures). Here’s how we’d do that with Drill via the sergeant package:

readr::write_csv(mtcars, "~/Data/mtcars_r.csvh")

dc <- drill_connection("localhost")
drill_query(dc, "CREATE TABLE dfs.tmp.`/mtcars_r.parquet` AS SELECT * FROM dfs.root.`/Users/bob/Data/mtcars_r.csvh`")

Without additional configuration parameters, the reticulated-Python version (above) generates larger parquet files and also has an index column since they’re needed in Python DataFrames (ugh), but small-ish data frames will end up in a single file whereas the Drill created ones will be in a directory with an additional CRC file (and, much smaller by default). NOTE: You can use preserve_index=False on the call to Table.from_pandas to get rid of that icky index.

It’s fairly efficient even for something like nycflights13::flights which has ~330K rows and 19 columns:

system.time(
  r_to_py(nycflights13::flights) %>%
    pd$DataFrame$from_dict() %>%
    pa$Table$from_pandas() %>%
    pq$write_table(where = "/tmp/flights.parquet")
)
##    user  system elapsed
##   1.285   0.108   1.398

If you need to generate parquet files in a pinch, reticulate seems to be a good way to go.


To leave a comment for the author, please follow the link and comment on their blog: R –

Let’s Talk Drawdowns (And Affiliates)

Tue, 08/01/2017 - 21:40

(This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

This post will be directed towards those newer in investing, with an explanation of drawdowns–in my opinion, a simple and highly important risk statistic.

Would you invest in this?

As it turns out, millions of people do, and did. That is the S&P 500, from 2000 through 2012, more colloquially referred to as “the stock market”. Plenty of people around the world invest in it, and for a risk to reward payoff that is very bad, in my opinion. This is an investment that, in ten years, lost half of its value–twice!

At its simplest, an investment (placing your money in an asset like a stock or a savings account instead of spending it) has two things you need to look at.

First, what’s your reward? If you open up a bank CD, you might be fortunate to get 3%. If you invest in the stock market, you might get 8% per year (on average) if you hold it for 20 years. In other words, you stow away $100 on January 1st, and you might come back and find $108 in your account on December 31st. This is often called the compound annualized growth rate (CAGR): if you have $100 one year, earn 8%, you have $108, then earn 8% on that, and so on.

The second thing to look at is the risk. What can you lose? The simplest answer to this is “the maximum drawdown”. If this sounds complicated, it simply means “the biggest loss”. So, if you had $100 one month, $120 next month, and $90 the month after that, your maximum drawdown (that is, your maximum loss) would be 1 – 90/120 = 25%.
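A minimal base-R sketch of that definition, using the article’s numbers:

```r
# Maximum drawdown: the largest peak-to-trough loss, as a fraction of the peak.
max_drawdown <- function(values) {
  peaks <- cummax(values)        # "highest value seen so far" at each point
  max(1 - values / peaks)        # worst percentage drop from any prior peak
}

max_drawdown(c(100, 120, 90))    # 0.25, i.e. the 25% from the example
```

cummax() tracks the running peak, so the function works on a whole price history, not just three points.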

When you put the reward and risk together, you can create a ratio, to see how your rewards and risks line up. This is called a Calmar ratio, and you get it by dividing your CAGR by your maximum drawdown. The Calmar Ratio is a ratio that I interpret as “for every dollar you lose in your investment’s worst performance, how many dollars can you make back in a year?” For my own investments, I prefer this number to be at least 1, and know of a strategy for which that number is above 2 since 2011, or higher than 3 if simulated back to 2008.
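The ratio itself is a single division; a tiny sketch with illustrative numbers (an 8% CAGR and a 50% worst loss):

```r
# Calmar ratio = CAGR / maximum drawdown (both expressed as fractions).
calmar_ratio <- function(cagr, max_dd) cagr / max_dd

calmar_ratio(0.08, 0.50)   # 0.16, i.e. for every dollar of worst-case loss,
                           # this investment earns back 16 cents per year
```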

Most stocks don’t even have a Calmar ratio of 1 (a ratio of 1 means that an investment makes back in a year as much as it can possibly lose). Even Amazon, the company whose stock made Jeff Bezos the richest man in the world, has a Calmar Ratio of less than 2/5, with a maximum loss of more than 90% in the dot-com crash. The S&P 500, again, “the stock market”, has had a Calmar Ratio of around 1/6 since 1993. That is, the worst losses can take *years* to make back.

A lot of wealth advisers like to say that they recommend a large holding of stocks for young people. In my opinion, whether you’re young or old, losing half of everything hurts, and there are much better ways to make money than to simply buy and hold a collection of stocks.


For those with coding skills, one way to gauge just how good or bad an investment is, is this:

An investment has a history–that is, in January, it made 3%, in February, it lost 2%, in March, it made 5%, and so on. By shuffling that history around, so that say, January loses 2%, February makes 5%, and March makes 3%, you can create an alternate history of the investment. It will start and end in the same place, but the journey will be different. For investments that have existed for a few years, it is possible to create many different histories, and compare the Calmar ratio of the original investment to its shuffled “alternate histories”. Ideally, you want the investment to be ranked among the highest possible ways to have made the money it did.
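A minimal base-R sketch of this shuffling idea, with made-up monthly returns:

```r
# Build one "alternate history" by shuffling a set of monthly returns.
set.seed(42)                               # reproducible shuffle
monthly_returns <- c(0.03, -0.02, 0.05, 0.01, -0.04, 0.02)

alternate <- sample(monthly_returns)       # same returns, different order

# The journey differs, but both histories end at the same total growth:
prod(1 + monthly_returns)
prod(1 + alternate)                        # identical number
```

Repeating the sample() call many times yields the distribution of alternate histories against which the real investment’s Calmar ratio can be ranked.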

To put it simply: would you rather fall one inch a thousand times, or fall a thousand inches once? Well, the first one is no different than jumping rope. The second one will kill you.

Here is some code I wrote in R (if you don’t code in R, don’t worry) to see just how the S&P 500 (the stock market) did compared to how it could have done.

require(downloader)
require(quantmod)
require(PerformanceAnalytics)
require(TTR)
require(Quandl)
require(data.table)

SPY <- Quandl("EOD/SPY", start_date = "1990-01-01", type = "xts")
SPYrets <- na.omit(Return.calculate(SPY$Adj_Close))

spySims <- list()
set.seed(123)
for(i in 1:999) {
  simulatedSpy <- xts(sample(coredata(SPYrets), size = length(SPYrets), replace = FALSE),
            = index(SPYrets))  # align shuffled returns to the original dates
  colnames(simulatedSpy) <- paste("sampleSPY", i, sep = "_")
  spySims[[i]] <- simulatedSpy
}
spySims <-, spySims)  # merge the list of simulations into one xts
spySims <- cbind(spySims, SPYrets)
colnames(spySims)[1000] <- "Original SPY"

dailyReturnCAGR <- function(rets) {
  return(prod(1 + rets)^(252/length(rets)) - 1)
}

rets <- sapply(spySims, dailyReturnCAGR)
drawdowns <- maxDrawdown(spySims)
calmars <- rets/drawdowns
ranks <- rank(calmars)

plot(density(as.numeric(calmars)),
     main = 'Calmars of reshuffled SPY, realized reality in red')
abline(v = as.numeric(calmars[1000]), col = 'red')

This is the resulting plot:

That red line is the actual performance of the S&P 500 compared to what could have been. And of the 1000 different simulations, only 91 did worse than what happened in reality.

This means that the stock market isn’t a particularly good investment, and that you can do much better using tactical asset allocation strategies.


One site I’m affiliated with is AllocateSmartly. It is a cheap investment subscription service ($30 a month) that compiles a collection of asset allocation strategies that perform better than many wealth advisers. When you combine some of those strategies, the performance is better still. To put it into perspective, one model strategy I’ve come up with has this performance:

In this case, the compound annualized growth rate is nearly double that of the maximum loss. For those interested in something a bit more aggressive, this strategy ensemble uses some fairly conservative strategies in its approach.


In conclusion, when considering how to invest your money, keep in mind both the reward, and the risk. One very simple and important way to understand risk is how much an investment can possibly lose, from its highest, to its lowest value following that peak. When you combine the reward and the risk, you can get a ratio that tells you about how much you can stand to make for every dollar lost in an investment’s worst performance.

Thanks for reading.

NOTE: I am interested in networking opportunities, projects, and full-time positions related to my skill set. If you are looking to collaborate, please contact me on my LinkedIn here.


To leave a comment for the author, please follow the link and comment on their blog: R – QuantStrat TradeR.

Showing Some Respect for Data Munging

Tue, 08/01/2017 - 18:42

(This article was first published on Method Matters, and kindly contributed to R-bloggers)

In this post, I’d like to focus on data munging, i.e., the process of acquiring and arranging data (typically in a tidy manner) prior to data analysis. It’s common knowledge that data scientists spend an enormous amount of time munging data, but data analysis, modeling, and visualization get most of the attention at presentations, on blogs, and in the popular imagination. In my current role as a “data scientist,” I spend far more time munging data than I do modeling. So let’s show some respect for this process by dedicating a post to it!

The data come from a website on which consumers can rate their perceptions of different products in the consumer beverage category. Consumers select a specific beverage, give it an overall rating, and can indicate their perceptions of the drink (specifically, the presence or absence of different characteristics).

Our end goal, which I will discuss in my next post, is to use the ratings to perform a product segmentation, which will give us insight into how consumers perceive products within the beverage category. In order to reach that goal, we have to distill the information about the different beverages and the information from the consumer reviews into a single matrix that will allow us to derive these answers (with the right type of analysis).

As the past couple of posts focused on data analysis using Python, in this exercise I’ll switch back to R.

The Data

As is often the case, the data we need do not exist in a single tidy dataset. There are two different tables that contain different pieces of information about the beverages and the consumer evaluations.

The Review Data

The first dataset contains basic information on the reviews. These data contain one line per consumer review, and have a column describing the review id number, the drink name, the drink category (there are three different categories of beverages in these data), and the overall rating score. Here’s a sample of what the data look like:

The Perceptions Data

The second dataset contains information on the consumer perceptions. Consumers can indicate their perceptions of each drink on a number of different dimensions. In this post, I’ll focus on the feelings induced by each beverage, characterized by sensations such as “happy” and “energetic.” The dataset contains one line per perception, and has columns indicating the perception “Group” (here we’re focusing on feelings, but other perceptions such as flavor are also present in the data), and the specific sensation within the feeling group (e.g. happy, energetic, etc.). This dataset can contain multiple lines per consumer review, if consumers indicated the presence of multiple characteristics for a given beverage. The id number in the perceptions data corresponds to the id number in the review data (though, as is often the case with real-world data, they do not share the same variable name). The head of the second dataset looks like this:

Step 1: Aggregate the Perceptions Data to the Review Level

We first need to aggregate the perceptions data to the review level, in order to merge them in with the review dataset.

We have a number of different “Groups” of perceptions (e.g. feeling, flavor, etc.), and within each different group, we can have different sensations (e.g. happy, energetic, etc.). We can therefore see the combination of group and sensation to be the sensible unit of aggregation; the combination of both variables provides the best descriptor of the information provided by the consumers.

The perceptions dataset is called “perceptions_reviewid”. We can simply concatenate the group and sensation columns using base R:

# combine group and sensation into one variable called "grp_sens"
perceptions_reviewid$grp_sens <- paste0(perceptions_reviewid$Group, "_",
                                        perceptions_reviewid$Sensation)  # second column name assumed to be "Sensation"
The head of our dataset now looks like this:

In aggregating this information to the review level, we are faced with two basic issues: 1) transforming the data from the “long” to the “wide” format and 2) representing the information contained in the combination of group and sensation (the “grp_sens” variable) in a numeric format.

The “long” to the “wide” format conversion expresses our desire to aggregate the data to the level of the review, in order to merge it in with our review dataset. We currently have multiple rows per id (a “long” format); we want to end up with a single row per id with the different levels of “grp_sens” (representing the group and sensation information) each contained in one column (a “wide” format).

The representation of the data in numeric format is critical for further aggregation and analysis. Currently, we have the presence or absence of a given perception indicated with text; in our aggregated dataset we would like the presence of a given sensation (one sensation per column) indicated as a 1, and its absence as a 0.

I used the tidyr and dplyr packages, along with this super-helpful StackOverflow response, in order to achieve this.

library(tidyr); library(dplyr)  # spread() comes from tidyr

# long-to-wide transformation with numeric indicators
# for presence/absence of each group/sensation combination
# at the review level
perceptions_review_level <- perceptions_reviewid %>%
group_by(id_perceptions, grp_sens) %>%
summarize(n = n()) %>% ungroup() %>%
spread(grp_sens, n, fill=0)

Essentially, we first group the data by id and grp_sens and count the frequency of each combination (as each person can only choose each combination once, the only value this can take will be 1). We then ungroup the data and spread out the grp_sens variable into the columns, filling each column with the number of observations per review (which, as we note above, will only take the value of 1). We then fill missing values (which occur when a given level of grp_sens did not occur in a review) with 0. Voilà!
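For readers without tidyr, the same long-to-wide 0/1 reshaping can be sketched in base R with table(); the data below are made up for illustration:

```r
# Toy data: one row per (review id, perception), to be reshaped to
# one row per id with a 0/1 column per perception.
long <- data.frame(
  id       = c(1, 1, 2, 3, 3, 3),
  grp_sens = c("feeling_happy", "feeling_energetic",
               "feeling_happy",
               "feeling_happy", "feeling_energetic", "flavor_sweet")
)

# table() counts id/perception combinations; since each combination can
# occur at most once, the counts are exactly the 0/1 indicators we want.
wide <- as.data.frame.matrix(table(long$id, long$grp_sens))
wide
#   feeling_energetic feeling_happy flavor_sweet
# 1                 1             1            0
# 2                 0             1            0
# 3                 1             1            1
```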

Step 2: Merge the Review Data and Aggregated Perceptions Data

Now we need to merge the two review-level datasets. Our review dataset is simply called “reviews” and our wide-format perceptions dataset is called “perceptions_review_level.”

Before merging datasets, I always like to do some basic descriptive statistics on the key variables used in the merge. This type of checking can save lots of trouble down the road in the analysis pipeline, and is one of the “insider” tips and tricks that isn’t mentioned enough when discussing data munging.

We will be merging on the id variable which is present in both datasets (albeit with a different name in each dataset). Let’s see how many id’s overlap with each other, and ensure that there are no duplicate id’s in either dataset. (Duplicate id’s will mess up many merges/joins, causing a dataset to increase dramatically in length, depending on the extent of the duplication).
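A toy demonstration (made-up ids) of how duplicate ids inflate a merge, and the check that catches them:

```r
# id 2 is duplicated in both tables
left  <- data.frame(id = c(1, 2, 2), x = c("a", "b", "c"))
right <- data.frame(id = c(1, 2, 2), y = c("p", "q", "r"))

nrow(merge(left, right, by = "id"))   # 5 rows, not 3: duplicates multiply
sum(duplicated(left$id))              # 1 -- the check worth running first
```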

# number of perceptions id's that correspond to an id in the review data
sum(reviews$id_reviews %in% perceptions_review_level$id_perceptions)
# 105324

# number of review id's that correspond to an id in the perception data
sum(perceptions_review_level$id_perceptions %in% reviews$id_reviews )
# 105324

# are there duplicate ids for the review data?
sum(duplicated(reviews$id_reviews))
# 0

# are there duplicate ids for the perceptions data?
sum(duplicated(perceptions_review_level$id_perceptions))
# 0

We have the same number of matching id’s in both datasets: 105,324, and no duplicate id’s in either dataset, which means that we can proceed with our merge.

We will use dplyr to merge the datasets, specifying the id variables as the “key” in the merge. We’ll use an inner join, because we can only analyze beverages with information from both sources. We will immediately check to see whether our merged dataset has the correct number of rows (105,324, the number of matching id’s in both datasets).

# merge the review and perception data

# inner join because we only want those with values in both datasets
# it does us no good to have information in only one
# (e.g. drink name but no perceptions, perceptions but no drink name)
review_level_omnibus <- dplyr::inner_join(reviews, perceptions_review_level,
by = c("id_reviews" = "id_perceptions"))

# check the dimensions: rows should be equal to
# the number of matching id's above
nrow(review_level_omnibus)
# 105324

As expected, our merged dataset (called “review_level_omnibus”) has 105,324 rows, the same as the number of matching id’s between the two datasets. The head of our merged dataset looks like this:

Step 3: Aggregate the Merged Data to the Beverage Level

We now have 1 row per review, with differing numbers of reviews per beverage. This is simply due to how the data were generated: consumers (website visitors) could leave reviews of whatever beverages they wanted to, and so our sample is inherently unbalanced in this regard. While many techniques are available for investigating consumer preferences with perfectly balanced samples (e.g. where every member of a consumer panel evaluates every beverage on every perceptive dimension), we will have to aggregate our data to the beverage level for our analysis. This will allow us to understand consumer perceptions of the beverage category across all raters, but it ignores potential variation in perceptions between individual consumers.

There are two different aggregations I would like to do on these data. The first concerns the numeric consumer evaluations. For each drink, we will take the average of the rating variable, as well as the average of the 0/1 perception variables. Because the perceptions data are coded with zeroes and ones, when we take the average of these variables, we are simply calculating the percentage of the reviews per beverage that contained a given perception (e.g. the percentage of  reviews for a given beverage that indicated “happy” feelings).

# get the mean values per beverage: rating and perceptions
drink_level_perceptions <- review_level_omnibus %>%
group_by(Name) %>% summarise_each_(funs(mean),
names(review_level_omnibus[,c(4:ncol(review_level_omnibus))]) )

This aggregation yields a dataset whose head looks like this:

The second aggregation concerns the Category (there are 3 different beverage categories in our data) and the number of reviews for each beverage. Category is constant across each beverage (e.g. each beverage only has one category), while the number of reviews can vary across beverages. We aggregate the data to the beverage level using dplyr:

# get the category and number of reviews per beverage
drink_level_rest <- review_level_omnibus %>%
group_by(Name) %>% summarize(Category = Category[1],
Num_Obs = n())

Which yields a dataset whose head looks like this:

(Note- I didn’t think it was possible to combine these two different aggregations in a single dplyr chain. If I’m wrong, please let me know in the comments!)
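For what it’s worth, newer dplyr versions (>= 1.0, which postdates this post) can likely combine both aggregations in one chain via across(); a sketch on toy data with illustrative column names:

```r
library(dplyr)

# Toy data in the same shape as the review-level dataset (names made up):
toy <- tibble(
  Name          = c("A", "A", "B"),
  Category      = c("juice", "juice", "soda"),
  Rating        = c(4, 2, 5),
  feeling_happy = c(1, 0, 1)
)

combined <- toy %>%
  group_by(Name) %>%
  summarize(
    across(where(is.numeric), mean),  # mean rating and perception shares
    Category = first(Category),       # constant within each beverage
    Num_Obs  = n()                    # number of reviews per beverage
  )
combined
```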

Step 4: Merge the Beverage-Level Dataset

We will now merge these two aggregated datasets together on the beverage name (“Name”). (I’ve done the type of checks I describe above on the merging variable, but am not showing them here for the sake of brevity).

# merge drink-level datasets together
drink_level_omnibus <- dplyr::left_join(drink_level_rest,
drink_level_perceptions, by = "Name")

The head of our final merged beverage-level dataset looks like this:

This is exactly the data format we need to perform our segmentation of the beverage category.

Conclusion: Respect the Munging!

In this post, we took consumer review data about beverages and did an extensive data munging exercise in order to distill the information into a smaller, tidy dataset which we will use for analysis. We first examined the review dataset and the perceptions dataset and understood their structure. We next transformed the perceptions dataset from the long to the wide format, and coded the perceptions with numeric values. We then merged the review and perceptions data together to create a review-level dataset with data from both sources. Finally, we aggregated the review-level dataset to the beverage level, creating a single dataset which contains information on ratings, perceptions, category and number of reviews per beverage. After all of this work, we now have a dataset that we can use to analyze the different beverages!

Data munging and data analysis go hand-in-hand, and we can’t do any analyses without first getting our data in shape. While impressive modeling and visualization will probably always attract more attention, I’m glad I took the time to devote a post to data munging, the less-glamorous-but-equally-important component of any good data analysis.

In the next post, I will describe some analyses I have done on the dataset created here, in order to understand consumer perceptions of products within the beverage category. Stay tuned!


To leave a comment for the author, please follow the link and comment on their blog: Method Matters.

Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-5)

Tue, 08/01/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Statistics is often taught in school by and for people who like mathematics. As a consequence, in those classes the emphasis is put on learning equations, solving calculus problems and creating mathematical models, instead of building an intuition for probabilistic problems. But if you’re reading this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistical problems by writing simple R queries, and how to think in probabilistic terms.
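As a taste of that approach, one line of R replaces a hand-expanded binomial formula (a generic coin-toss example, not one of the exercises below):

```r
# Probability of exactly 2 heads in 10 tosses of a fair coin,
# asked directly via the built-in binomial density:
dbinom(2, size = 10, prob = 0.5)
# 0.04394531

# ...which matches the textbook formula C(10,2) * p^2 * (1-p)^8:
choose(10, 2) * 0.5^2 * 0.5^8
```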

In today’s set you will have to use what you’ve seen in the first four installments of this series of exercise sets, but in a more practical setting. Take this as a fun test before we start learning cool stuff like A/B testing, conditional probability and Bayes’ theorem. I hope you enjoy doing it!

Answers to the exercises are available here.

Exercise 1
A company makes windows that should be able to withstand winds of 120 km/h. The quality assurance department of that company has a mandate to make sure that the failure rate of those windows is less than 1% for each batch produced by their factory. To do so, they randomly choose 10 windows per batch of 150 and place them in a wind tunnel where they are tested.

  1. Which probability function should be used to compute the number of failing windows in a QA test if the failure rate is 1%?
  2. What is the probability that a window works correctly during the QA test?
  3. What is the probability that no windows break during the test?
  4. What is the probability that up to 3 windows break during the test?
  5. Simulate this process to estimate the average number of window failures during the test.

Exercise 2
A team of biologists is interested in a type of bacteria that seems to be resistant to extreme changes in its environment. In a particular study, they put a culture of bacteria in an acidic solution, observed how many days each of 250 individual bacteria survived, and created this dataset. Find the 90% confidence interval for the mean of this dataset.

Exercise 3
The MNIST database is a large dataset of handwritten digits used by data scientists and computer science experts as a reference to test and compare the effectiveness of different machine learning and computer vision algorithms. If a state-of-the-art algorithm can identify the handwritten digits in this dataset 99.79% of the time and we use this algorithm on a set of 1000 digits:

  1. What is the probability that this algorithm doesn’t recognize 4 digits?
  2. What is the probability that this algorithm doesn’t recognize 6 or 7 digits?
  3. What is the probability that this algorithm doesn’t recognize 3 digits or less?
  4. If we use this algorithm on a set of 3000 digits, what is the probability that it fails more than 10 times?

Exercise 4
A customs officer in an airport has to check the luggage of every passenger that goes through customs. If 5% of all passengers travel with forbidden substances or objects:

  1. What is the chance that the fourth traveler who is checked has a forbidden item in his luggage?
  2. What is the probability that the first traveler caught with a forbidden item is caught before the fourth traveler?
Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to:

  • Work with different binomial and logistic regression techniques
  • Know how to compare regression models and choose the right fit
  • And much more

Exercise 5
A start-up wants to know if their marketing push in a specific market has been successful. To do so, they interview 1000 people in a survey and ask them if they know their product. Of that number, 710 were able to identify or name the product. Since the start-up has limited resources, they decided that they would reallocate half the marketing budget to their data science department if more than 70% of the market knew about their product.

  1. Simulate the result of the survey by creating a matrix containing 710 ones, representing the positive responses, and 290 zeros, representing the negative responses.
  2. Use bootstrapping to compute the proportion of positive answers that is smaller than 95% of the other possible proportions.
  3. What is the percentage of bootstrapped proportions smaller than 70%?
  4. As a consequence of your last answer, what should the start-up do?

Exercise 6
A data entry position needs to be filled at a tech company. After the interview process, human resources selected the two ideal candidates for a final test where they had to complete a sample day of work (they take data entry really seriously at this company). The first candidate did the work with an average time of 5 minutes per form and a variance of 35 minutes, while the second did it with a mean of 6.5 minutes and a variance of 25. Assuming that the time needed by an employee to fill in a form follows a normal distribution:

  1. Simulate the work of both candidates by generating 200 data points from each distribution.
  2. Use bootstrapping to compute the 95% confidence interval for both means.
  3. Can we conclude that one candidate is faster than the other?

Exercise 7
A business wants to launch a product in a new market. Their study shows that, to be viable, a market must be composed of at least 60% of potential consumers making more than $35,000. If the last census shows that the salary of this population follows an exponential distribution with a mean of $60,000, and the rate of an exponential distribution is equal to 1/mean, should this business launch their product in this market?

Exercise 8
A batch of 1000-ohm resistors is scheduled to be soldered to two 200-ohm resistors each, to create series circuits of 1400 ohms. But no manufacturing process is perfect, and no resistor has exactly the value it is supposed to have. Suppose the first resistor comes from a process that produces resistors with a mean of 998 ohms and a standard deviation of 5.2 ohms, while the two others come from a process that produces resistors with a mean of 202 ohms and a variance of 2.25. What percentage of circuits will have a total resistance in the interval [1385, 1415]? (Note: you can use the bootstrap to solve this problem, or you can use the fact that the sum of two independent normal distributions is another normal distribution whose mean is the sum of the two means; the variance of the new distribution is calculated the same way. You can learn more here.)
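To illustrate the hint about summing normal distributions (with generic numbers, deliberately not the exercise’s):

```r
# If X ~ N(m1, v1) and Y ~ N(m2, v2) are independent,
# then X + Y ~ N(m1 + m2, v1 + v2).
m1 <- 0; v1 <- 1
m2 <- 0; v2 <- 1

# P(X + Y <= 1.96 * sqrt(2)) under the combined N(0, 2):
pnorm(1.96 * sqrt(2), mean = m1 + m2, sd = sqrt(v1 + v2))
# ~0.975, exactly as for a single standard normal evaluated at 1.96
```

Note that variances add, so the standard deviation of the sum is sqrt(v1 + v2), not sd1 + sd2.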

Exercise 9
A probiotic supplement company claims that three kinds of bacteria are present in equal parts in each of their pills. An independent laboratory is hired to test whether the company respects this claim. After taking a small sample of five pills, they get the following dataset, where the numbers are in millions.

In this dataset, the rows represent pills used in the sample and each column represents a different kind of bacteria. For each kind of bacteria:

  1. Compute the mean.
  2. Compute the variance.
  3. Compute the quartiles.
  4. Compute the range, which is defined as the maximum value minus the minimum value.
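The original dataset is not reproduced here, so the sketch below uses a hypothetical 5 x 3 stand-in matrix (rows = pills, columns = kinds of bacteria, counts in millions); the `apply` calls are what matters:

```r
# 'pills' is a made-up stand-in for the exercise's dataset
pills <- matrix(c(32, 35, 33,
                  30, 36, 31,
                  31, 34, 35,
                  33, 33, 32,
                  30, 35, 34), nrow = 5, byrow = TRUE)
apply(pills, 2, mean)                        # 1. mean per kind of bacteria
apply(pills, 2, var)                         # 2. variance
apply(pills, 2, quantile)                    # 3. quartiles
apply(pills, 2, function(x) diff(range(x)))  # 4. range = max - min
```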

Exercise 10
A shipping company estimates that the delivery delays of its shipments, in hours, follow a Student's t distribution with 6 degrees of freedom. What is the proportion of deliveries that are between 1 hour late and 3 hours late?
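With the stated distribution, this is one call to the t cumulative distribution function:

```r
# P(1 <= delay <= 3) for a Student's t distribution with 6 degrees of freedom
prop <- pt(3, df = 6) - pt(1, df = 6)
prop
```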

Related exercise sets:
  1. Nonparametric Tests Exercises
  2. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-2)
  3. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-3)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Building a website with pkgdown: a short guide

Tue, 08/01/2017 - 14:16

As promised in my last post, here is a short guide with some tips and tricks for building a documentation website for an R package using pkgdown.

In the end, this guide ended up way longer than I was expecting, but I hope you'll find it useful, although it often replicates information already available in the pkgdown documentation!

Prerequisites

To build a website using pkgdown, all you need is an R package hosted on GitHub, with a file structure "tweaked" with some functionality provided by devtools. Assuming you are using RStudio, and that you didn't already do this, open the project corresponding to your package and (as a minimum) run: require(devtools)
use_vignette("test") #substitute "test" with the name of your package

Since to use pkgdown your package must be on GitHub, you may also want use_github_links(), which will automatically populate some fields in the DESCRIPTION file that are later used to build the home page of your website. Other possibly useful commands (depending on the status of your package) may be: use_travis()
use_cran_badge() (see the devtools documentation for further info)

At this point, within your package file structure you should have a README.Rmd file (which is also used to create the "README.md" file for GitHub), and a NEWS.md file. You should also have a man subfolder containing the .Rd files documenting your functions (usually auto-generated from roxygen comments using devtools::document()). Finally, you should have a my_package.Rmd file in the vignettes subfolder (which is used, for example, by devtools::build_vignettes() to automatically create a vignette for the package).

Getting Started: creating a bare-bones website

To create a standard pkgdown site, just run:

pkgdown::build_site()
build_site() will do several things:

  • create a “docs” subfolder in your file structure, where it will place all the material needed for rendering the website;
  • knit README.Rmd to “docs/index.html”. This will be the home page of your website;
  • knit NEWS.md to "docs/news/index.html" (in this way, any time you update NEWS.md, the news section of the website can be automatically updated);
  • knit all your "Rd" files to "docs/reference/" as html files inheriting the name of the function (e.g., "docs/reference/myfun_1.html", "docs/reference/myfun_2.html", etc.). A "docs/reference/index.html" page is also created: this is a simple html page linking to the html documentation pages for the different functions (see for example here). By default, all functions will be included and listed in alphabetical order.
  • knit any Rmd files in your “vignettes” subfolder to “docs/articles” as single html files;
  • Scrape your package for various other info (e.g., Authors, Git Hub address, License, Citation, badges, etc.) to create additional material on the right-hand side bar of the home page;
  • Put everything together by some magic to build a working website, and open a preview in RStudio Viewer or your browser.

The resulting website will look more or less like this:

“Standard” website built by pkgdown::build_site()

with your main vignette linked in Getting Started, and all the other Rmds found in vignettes (if any) linked to in the Articles drop-down menu.

Considering that all that is needed is to run build_site() (in particular if the package is already using the devtools tweaks), I'd say that this is already a nice result! However, spending some time better configuring the structure of the site and tweaking some small things allows you to achieve a much nicer result, as explained below.

Customizing the appearance and structure of the website: the _pkgdown.yaml file

Your pkgdown website can be further customized by creating and customizing a text file named _pkgdown.yaml in the root folder of your project. The file needs to have three main sections, which I will describe here using the current .yaml file used in the MODIStsp Website as an example (the complete file can be found here).

Preamble Section

This is quite straightforward: first of all, give a title to your website and provide its URL. Then, select a template to customize its appearance from the ones available at bootswatch. Finally, add the GoogleAnalytics tracking code if you wish.

title: MODIStsp
bootswatch: flatly
ganalytics: UA-12345678-0

Reference Section

Here, you can configure the page of the website containing the index to the documentation of your functions (docs/reference/index.html). Instead of having a simple list in alphabetical order, you can define different groups and decide which functions to put in each group.

Each group is defined by a title, a description (use ~ for no description), and a contents section containing an indented list of functions to be included in that group. Syntax and indentation rules must be strictly followed but are very simple: looking at the example below should suffice for understanding them. In this case, I decided to use only two groups, one for exported functions and one for everything else, but you can set up as many groups as you need.

reference:
  - title: Exported Functions
    desc: Functions exported by MODIStsp
    contents:
    - '`MODIStsp`'
    - '`MODIStsp_extract`'
    - '`MODIStsp_addindex`'
    - '`install_MODIStsp_launcher`'
  - title: Internals
    desc: Internal functions and helpers
    contents:
    - '`MODIStsp_GUI`'
    - '`MODIStsp_process`'
    - '`MODIStsp_process_indexes`'
    - '`MODIStsp_process_QA_bits`'
    - '`MODIStsp_read_xml`'
    - '`lpdaac_getmod_dirs`'
    - '`lpdaac_getmod_dates`'
    - '`lpdaac_getmod_names`'
    - '`MODIStsp_check_files`'
    - '`MODIStsp_vrt_create`'
    - '`bbox_from_file`'
    - '`reproj_bbox`'

Navbar Section

The last section of _pkgdown.yaml is named navbar, and is where most of the customization occurs.

In the first line, you can choose whether to use a standard or inverse color scheme for your bootstrap template. The only way to understand what this does is to try both and see which one you prefer.

type: inverse

Next, you can define what content will be accessible through buttons or menus on the left side of the top-menu of the website.

  • Buttons linking to a single page are described by:
    1. a title text or an icon name (chosen from the Font Awesome set) that will be shown on the button;
    2. a hyperlink to the page that will be accessed through the button (note that the hyperlinks are built relative to the root of the "docs" folder created by pkgdown – no need to specify full paths!).
  • Dropdown menus giving access to multiple pages are described by:
    1. a title text or an icon name;
    2. a “menu:” line;  
    3. an indented list of the text entries that will appear in the menu, with the associated hyperlinks.

In the example below, you can see how the content should be indented and organized:

left:
- icon: fa-home
  href: index.html
- text: "How To"
  menu:
  - text: Installation
    href: articles/installation.html
  - text: Interactive Execution - the MODIStsp GUI
    href: articles/interactive_execution.html
  - text: Non-Interactive Execution from within R
    href: articles/noninteractive_execution.html
  - text: Standalone Execution and Scheduled Processing
    href: articles/standalone_execution.html
  - text: Outputs Format and Naming Conventions
    href: articles/output.html
  - text: Accessing and Analyzing Processed Data from R
    href: articles/analyze.html
- text: "Examples of Use"
  href: articles/examples.html

In the case of the MODIStsp website, I decided not to provide a link to the "full" vignette (which was linked from "Getting Started" in the "bare-bones" website). Instead, I divided the contents of that vignette into six shorter articles accessible from the "How To" drop-down menu.

To do that, I just had to create six separate Rmd files within the vignettes folder. All Rmds within that folder are automatically knitted by pkgdown when launching either build_site() or build_articles(), and are therefore available for linking from the menu.

Finally, in the last section of _pkgdown.yaml you can specify what content should be accessible from the right-hand side of the top menu. Syntax and indentation are identical to what is described above.

right:
- text: "faq"
  icon: fa-question-circle-o
  href: articles/faq.html
- icon: fa-newspaper-o
  text: "news"
  href: news/index.html
- icon: fa-file-code-o
  text: "functions"
  href: reference/index.html
- icon: fa-github fa-lg
  text: "github"

On the MODIStsp website, we are using the right side of the top menu bar to provide access to:

  • a static page containing a FAQ; 
  • the auto-generated news page (by linking to news/index.html); 
  • the function documentation index page (by linking to reference/index.html); 
  • an icon linking to the MODIStsp repository on Git Hub. 

(From the example, you can see that it is possible to give the "buttons" both an icon and a short title text. In that case, the icon and the text are shown one after the other.)

Once your .yaml file is complete, just run build_site() again and check the results. Then iterate ad libitum until you are satisfied with the resulting structure.

Fine tuning

When you are satisfied with the structure of the website, you can start tweaking its contents to achieve a better-looking final result. Here I’m just sharing some tips and tricks I learnt while building our website:

  • If (like me) you wish to have a different "home page" on the website and on the main GitHub page, you can create a new index.Rmd file in the root of the package. If index.Rmd is found, it is used instead of README.Rmd to populate the home page;
  • Text in the "standard" output is a bit too "compressed" for my taste. You can increase the spacing between the main sections of an article by simply adding a <br> at the end of each main section;
  • Similarly, you can add line separators between sections by simply adding a line of underscores under each section;
  • To reduce the "wall of text" effect, you can put any of the icons in an article by inserting its full html specification in the text of the corresponding Rmd. (For example, "I want a github icon here: <i class="fa fa-github" aria-hidden="true"></i>" would render in the website with a "GitHub octopus" icon at the end);
  • Of course, you can add any image/R plot by linking/creating it in the Rmds of the different articles or of the home page;
  • If you build the site using build_site(run_dont_run = TRUE), the examples with the \dontrun specification in the roxygen comments will be run, and their results will appear in the documentation page of each function. I find this really useful to illustrate the typical behaviour of functions;
  • To provide modifiers to the standard pkgdown.css and pkgdown.js files, create a new folder named "pkgdown" in your project root. As described in the pkgdown documentation, the content of the files "extra.css" and "extra.js" placed in this folder will be copied to docs/ and inserted into the <head> after the default pkgdown CSS and JS. For example, I added the following lines in extra.css to have some additional syntax highlighting in the code snippets:
  • .fl {color: blue;}
    .fu {color: blue;} /* function */
    .ch,.st {color: orange;} /* string */
    .kw {color: black;} /* keyword */
    .co {color: green;} /* comment */

    .message { color: gray; font-weight: bolder;}
    .error { color: red; font-weight: bolder;}
    .warning { color: purple; font-weight: bolder;}

But if you are more fluent than me in CSS and JS, you can probably tweak the appearance a lot more.

  • You don’t need to run the whole build_site() every time you wish to adjust something and check the results. You can use build_reference(), build_news(), build_articles() and build_home() to update just some sections.

Below, you can see how the .yaml file described before and the other small "tweaks" improved the appearance of the MODIStsp homepage:

Final Home Page of the MODIStsp website

Deploying the website to Git Hub and the web

Finally, it’s time to deploy the website. To do that:

  1. Commit and push your changes to the remote;
  2. Log in to GitHub, navigate to your repo and select "Settings";
  3. Scroll down to find the "GitHub Pages" subsection and, under "Source", select "master branch /docs folder" (from this, you can see that it is fundamental that your website material is pushed to the master branch);
  4. Click on the link of your website and… congratulations: you just published your new pkgdown website!
(PS: if you find any errors in this guide, or you think additional clarification is needed, please leave a comment on this post!)

How to use H2O with R on HDInsight

Mon, 07/31/2017 - 23:24

(This article was first published on Revolutions, and kindly contributed to R-bloggers) H2O is an open-source AI platform that provides a number of machine-learning algorithms that run on the Spark distributed computing framework. Azure HDInsight is Microsoft's fully-managed Apache Hadoop platform in the cloud, which makes it easy to spin up and manage Azure clusters of any size. It's also easy to run H2O on HDInsight: the H2O AI Platform is available as an application on HDInsight, which pre-installs everything you need as the cluster is created.

You can also drive H2O from R, but the R packages don't come auto-installed on HDInsight. To make this easy, the Azure HDInsight team has provided a couple of scripts that will install the necessary components on the cluster for you. These include RStudio (to provide an R IDE on the cluster) and the rsparkling package. With these components installed, from R you can:

  • Query data in Spark using the dplyr interface, and add new columns to existing data sets.
  • Convert data for training, validation, and testing to "H2O Frames" in preparation for modeling.
  • Apply any of the machine learning models provided by Sparkling Water to your data, using the distributed computing capabilities provided by the HDInsight platform.

For details on how to install the R components on HDInsight, follow the link below.

Azure Data Lake & Azure HDInsight Blog: Run in R on Azure HDInsight



To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Counterfactual estimation on nonstationary data, be careful!!!

Mon, 07/31/2017 - 22:00

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos

In a recent paper that can be downloaded here, Carvalho, Masini and Medeiros show that estimating counterfactuals in a non-stationary framework (when I say non-stationary, it means integrated) is a tricky task. It is intuitive that the models will not work properly in the absence of cointegration (the spurious case), but what the authors show is that even with cointegration, the average treatment effect (ATE) converges to a non-standard distribution. As a result, standard tests on the ATE will identify treatment effects in several cases where there is no effect at all.

For those who are not familiar with counterfactual models: normally, these models have a treated unit (we want to know the effects of the treatment) and several untreated units that we call peers. These units may be cities, countries, companies, etc. Assuming that only one unit was treated and that there is no contamination, we can use information from the peers to project the behaviour of the treated unit as if it had not been treated, which is the counterfactual. The models should also be able to identify if the treatment had no effect on the treated unit.

Simulation tests

Here I am going to generate cointegrated data with no treatment to show the behaviour of the Synthetic Control and the ArCo (other techniques will have the same features). The data is very simple and is based on the mentioned article's example. We are going to generate 100 observations of 10 time-series: 9 random walks and one that is the sum of these random walks plus an error, which is the treated unit. However, I will not include any treatment, and I want the models to show that there was no treatment. The hypothesis to be tested is that at t0 = 51 we had some treatment in the treated unit.

library(ArCo)
library(glmnet)
library(Synth)

T = 100  # Number of Observations
n = 10   # Number of Time Series
set.seed(1)  # Seed for replication

# Generate random walk peers
peers = matrix(0, T, n - 1)
for (i in 2:T) {
  peers[i, ] = peers[i - 1, ] + rnorm(n - 1)
}

# Generate the treated unit
y1 = rowSums(peers) + rnorm(T)

# Put all the TS together
Y = cbind(y1, peers)

# Plot all TS. The black line is the "treated unit"
matplot(Y, type = "l")

First, I am going to estimate the ArCo. The $delta element shows the average treatment effect with a 95% confidence interval. As you can see, the test indicated that the treatment was statistically significant (zero is very far from the confidence interval). The plot also shows a dislocated counterfactual (blue line).

# Estimate the ArCo on the non-stationary data
arco = fitArCo(list(Y), cv.glmnet, predict, treated.unity = 1, t0 = 51)
arco$delta
##                  LB     delta        UB
## Variable1 -6.157713 -5.236705 -4.315697
plot(arco)

Now I am going to estimate the Synthetic Control. The Synth package does not implement a formal test. However, we can get a good idea from the counterfactual plot, which also indicates that the treatment was effective (even though there was no treatment).

# Estimate the Synthetic Control on the nonstationary data
# Put the data in the Synth package notation
sY = as.vector(Y)
timeid = rep(1:T, ncol(Y))
unitid = sort(rep(1:ncol(Y), T))
synthdata = data.frame(time = timeid, unit = unitid, Y = sY)
synthdataprep <- dataprep(
  foo = synthdata,
  predictors = c("Y"),
  predictors.op = "mean",
  dependent = "Y",
  unit.variable = "unit",
  time.variable = "time",
  treatment.identifier = 1,
  controls.identifier = c(2:10),
  time.predictors.prior = c(1:50),
  time.optimize.ssr = c(1:50),
  time.plot = 1:100
)
SC = synth(synthdataprep)
path.plot(SC, synthdataprep)
abline(v = 50, col = "blue", lty = 2)

As you can see, both the ArCo and the Synthetic Control wrongly rejected the null hypothesis of no treatment. The best way to solve the problem is to take the first difference of all non-stationary time-series. However, there is an important thing to point out here. Imagine the random walk:

y_t = y_{t-1} + e_t

Suppose we have a treatment at t0 that simply makes y_t = c + y_{t-1} + e_t only for t = t0. For any t < t0 we will have y_t = y_{t-1} + e_t, and for t >= t0 we will have the same random walk shifted by the constant c. Therefore, if we take the first difference, the treatment will have an impact only at t0, which makes it impossible for any counterfactual model to identify an intervention. In this example, we would need an intervention that changed the drift of the Random Walk in a way that for all t >= t0 we have:

y_t = d + y_{t-1} + e_t

This will be better illustrated in an example. The plot below shows the difference between the two treatments. The red line is the case we can't identify if we take the first difference. The blue line changes the drift of the Random Walk, and therefore we can identify it. After differencing, it is impossible, with the tools I am using here, to tell whether we are on the black line or on the red line.

y = rep(0, T)
yt1 = yt2 = y
d = 1   # drift treatment
c = 10  # one-off constant treatment
for (i in 2:T) {
  e = rnorm(1)
  y[i] = y[i - 1] + e
  if (i == 51) {
    yt1[i] = c + yt1[i - 1] + e
  } else {
    yt1[i] = yt1[i - 1] + e
  }
  if (i >= 51) {
    yt2[i] = d + yt2[i - 1] + e
  } else {
    yt2[i] = yt2[i - 1] + e
  }
}
plot(yt2, type = "l", col = 4, ylab = "")
lines(yt1, col = 2)
lines(y, col = 1)
legend("topleft", legend = c("untreated", "treated by a constant in t0", "treated in the drift"),
       col = c(1, 2, 4), bty = "n", seg.len = 0.5, cex = 0.8, lty = 1)

Now let’s get back to the first example. The results clearly indicated an intervention where there was none. Now I will show what happens if we take the first difference.

# Take the first difference
dY = diff(Y, 1)
# Estimate the ArCo
darco = fitArCo(list(dY), cv.glmnet, predict, treated.unity = 1, t0 = 50)
darco$delta
##                   LB      delta        UB
## Variable1 -0.6892162 0.02737802 0.7439722
plot(darco)

# Adjust the data to Synth and estimate the model
dsY = as.vector(dY)
timeid = rep(2:T, ncol(Y))
unitid = sort(rep(1:ncol(Y), T - 1))
dsynthdata = data.frame(time = timeid, unit = unitid, Y = dsY)
dsynthdataprep <- dataprep(
  foo = dsynthdata,
  predictors = c("Y"),
  predictors.op = "mean",
  dependent = "Y",
  unit.variable = "unit",
  time.variable = "time",
  treatment.identifier = 1,
  controls.identifier = c(2:10),
  time.predictors.prior = c(2:50),
  time.optimize.ssr = c(2:50),
  time.plot = 2:100
)
dSC = synth(dsynthdataprep)
##
## X1, X0, Z1, Z0 all come directly from dataprep object.
##
## ****************
##  optimization over w weights: computing synthtic control unit
## ****************
## ****************
## ****************
##
## MSPE (LOSS V): 8.176593
##
## solution.v:
##  1
##
## solution.w:
##  2.9e-09 1.4e-08 3.7e-09 0.9999999 7.4e-09 2.9e-09 1.16e-08 6.9e-09 4e-09
path.plot(dSC, dsynthdataprep)
abline(v = 50, col = "blue", lty = 2)

As you can see, the $delta in the ArCo showed that the treatment effect was statistically zero, which is true. The Synthetic Control plot also showed that there was no difference caused by the intervention, even though the model adjustment was not very good.

Finally, note that we have two scenarios. If there is no intervention and we do not take the first difference, we will probably wrongly reject the null hypothesis of no intervention. However, if we are in the "red line" case of the plot and we take the first difference, we will not reject the null when it may be false, depending on the type of variable we are dealing with.


Carvalho, Carlos, Ricardo Masini, and Marcelo C. Medeiros. “The perils of counterfactual analysis with integrated processes.” (2016).


To leave a comment for the author, please follow the link and comment on their blog: R – insightR.

15 Jobs for R users (2017-07-31) – from all over the world

Mon, 07/31/2017 - 20:41
To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs


More New Jobs
  1. Full-Time
    Financial Analyst/Modeler @ Mesa, Arizona, U.S. MD Helicopters – Posted by swhalen
    Mesa, Arizona, United States
    31 Jul 2017
  2. Full-Time
    Research volunteer in Cardiac Surgery @ Philadelphia, Pennsylvania, U.S. Thomas Jefferson University – Posted by CVSurgery
    Philadelphia, Pennsylvania, United States
    31 Jul 2017
  3. Freelance
    Fit GLM distribution models to data using R nandini arora
    31 Jul 2017
  4. Freelance
    Financial services startup looking for freelance Shiny/R/UI developer incuhed
    29 Jul 2017
  5. Freelance
    Data Scientists – PhD Paradise – Germany Data Science Talent – Posted by datasciencetalent
    Frankfurt am Main, Hessen, Germany
    22 Jul 2017
  6. Full-Time
    Technical Project Manager Virginia Tech Applied Research Corporation – Posted by miller703
    Arlington, Virginia, United States
    15 Jul 2017
  7. Full-Time
    Software Developer Virginia Tech Applied Research Corporation – Posted by miller703
    Arlington, Virginia, United States
    15 Jul 2017
  8. Full-Time
    Software Developer (with R experience) @ Arlington, Virginia, U.S. Virginia Tech Applied Research Corporation – Posted by miller703
    Arlington, Virginia, United States
    14 Jul 2017
  9. Full-Time
    Marketing Manager RStudio – Posted by
    14 Jul 2017
  10. Full-Time
    Solutions Engineer RStudio – Posted by nwstephens
    14 Jul 2017
  11. Full-Time
    Lead Data Scientist Golden Rat Studios – Posted by goldenrat
    Los Angeles, California, United States
    13 Jul 2017
  12. Full-Time
    Customer Success Rep RStudio – Posted by jclemens1
    12 Jul 2017
  13. Freelance
    R/Shiny and JavaScript Developer Leibniz-Institute of Freshwater Ecology and Inland Fisheries (IGB) – Posted by vbremerich
    10 Jul 2017
  14. Part-Time
    Data Science Bootcamp Mentor Thinkful – Posted by baronowicz
    9 Jul 2017
  15. Part-Time
    Call for Help: R/Shiny Developer Fantasy Football Analytics – Posted by val.pinskiy
    9 Jul 2017


On R-users you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts).


Machine Learning Explained: Dimensionality Reduction

Mon, 07/31/2017 - 20:13

(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

Dealing with a lot of dimensions can be painful for machine learning algorithms. High dimensionality will increase the computational complexity, increase the risk of overfitting (as your algorithm has more degrees of freedom), and increase the sparsity of the data. Hence, dimensionality reduction projects the data into a space with fewer dimensions to limit these phenomena.
In this post, we will first try to get an intuition of what dimensionality reduction is, then we will focus on the most widely used techniques.

Spaces and variables with many dimensions

Let’s say we own a shop and want to collect some data on our clients. We can collect their age, how frequently they come to our shop, how much they spend on average and when was the last time they came to our shop. Hence each of our clients can be represented by four variables (age, frequency, spending and last purchase date) and can be seen as a point in a four-dimensional space. Later on, variables and dimensions will have the same meaning.

Now, let’s add some complexity and some variables by using images. How many dimensions are used to represent the image below?

There are 8 by 8 pixels, hence 64 pixels, and each of these pixels is represented by a quantitative variable: 64 variables are needed!
Even bigger: how many variables do you need to represent the picture below?

Its resolution is 640 by 426 and you need three color channels (Red, Green, and Blue). So this picture requires no fewer than … 640*426*3 = 817,920 variables. That’s how you go from four to almost one million dimensions (and you can go well above!)

Dimensionality reduction in everyday life

Before seeing any algorithm, everyday life provides us a great example of dimensionality reduction.

Each of these people can be represented as a point in a 3-dimensional space. As a gross approximation, each person fits in a 50*50*200 (cm) box. If we use a resolution of 1 cm and three color channels, each person can be represented by roughly 1,000,000 variables.
On the other hand, the shadow is only in 2 dimensions and in black and white, so each shadow only needs 50*200 = 10,000 variables.
The number of variables was divided by 100! And if your goal is to detect human vs cat, or even men vs women, the data from the shadow may be enough.

Why should you reduce the number of dimensions?

Dimensionality reduction has several advantages from a machine learning point of view.

  • Since your model has fewer degrees of freedom, the likelihood of overfitting is lower. The model will generalize more easily on new data.
  • If you use feature selection or linear methods (such as PCA), the reduction will promote the most important variables, which will improve the interpretability of your model.
  • Most feature extraction techniques are unsupervised. You can train your autoencoder or fit your PCA on unlabeled data. This can be really helpful if you have a lot of unlabeled data (for instance text or images) and labeling is time-consuming and expensive.
Features selection as a basic reduction

The most obvious way to reduce dimensionality is to remove some dimensions and to select the more suitable variables for the problem.

Here are some ways to select variables:

  • Greedy algorithms which add and remove variables until some criterion is met. For instance, stepwise regression with forward selection will add, at each step, the variable which improves the fit in the most significant way.
  • Shrinking and penalization methods, which add a cost for having too many variables. For instance, L1 regularization will cut some variables’ coefficients to zero. Regularization limits the space in which the coefficients can live.
  • Selection of models based on criteria that take the number of dimensions into account, such as the adjusted R², AIC or BIC. Contrary to regularization, the model is not trained to optimize these criteria. The criteria are computed after fitting to compare several models.
  • Filtering of variables using correlation, VIF or some “distance measure” between the features.
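As one concrete illustration of the greedy, criterion-based approaches listed above, base R's step() performs stepwise selection by AIC (the mtcars dataset here is just a convenient built-in example, not from the post):

```r
# Backward stepwise selection by AIC on a built-in dataset
full  <- lm(mpg ~ ., data = mtcars)
small <- step(full, direction = "backward", trace = 0)
names(coef(small))  # the variables the procedure kept
```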
Features Engineering and extraction to reduce dimensionality

While features selection is efficient, it is brutal since variables are either kept or removed. However, on some intervals or for some modalities, a removed variable may be useful while a kept variable may be redundant. Features extraction (or engineering) seeks to keep only the intervals or modalities of the features which contain information.

PCA and Kernel PCA

Principal component analysis (or PCA) is a linear transformation of the data which looks for the axes along which the data has the most variance. PCA creates new variables which are linear combinations of the original ones; these new variables are orthogonal (i.e., their correlation is zero). PCA can be seen as a rotation of the initial space to find axes better suited to express the variability in the data.
Once the new variables have been created, you can select the most important ones. The threshold is up to you and depends on how much variance you want to keep. You can check the tutorial below to see a working R example:

R Basics: PCA with R
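For reference, the whole pipeline fits in a few lines of base R with prcomp() (a minimal sketch, not taken from the linked tutorial):

```r
# PCA on the four numeric columns of iris, standardized first
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)            # proportion of variance explained by each component
reduced <- pca$x[, 1:2] # keep only the first two principal components
```

summary() reports the cumulative variance, which is what you compare against your chosen threshold.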

Since PCA is linear, it mostly works on linearly separable data. Hence, if you want to perform classification on data that is not linearly separable (like the donut below), linear PCA will probably make you lose a lot of information.

Example of Kernel PCA from Scikit-Learn

On the other hand, kernel PCA can work on nonlinearly separable data. The trick is simple:

  • Before applying PCA, a function is applied to project the initial space into a bigger space where the data are separable. The function can be polynomial, radial, or some activation function.
  • Then PCA is applied in this new space. (NB: it is not exactly PCA, but close to it.)

Hence you can easily separate the two rings of the donut, which will improve the performance of classifiers. Kernel PCA raises two problems: you need to find the right kernel, and the new variables are hard for humans to interpret.
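In R, kernel PCA is available through the kernlab package (an assumption of this sketch, not something used in the post). With a radial (RBF) kernel:

```r
# install.packages("kernlab")
library(kernlab)

# Kernel PCA with an RBF kernel; sigma is the kernel parameter you must tune
kp <- kpca(~ ., data = iris[, 1:4],
           kernel = "rbfdot", kpar = list(sigma = 0.1), features = 2)
head(rotated(kp))  # the data projected onto the first two kernel components
```

Choosing kernel and kpar is exactly the "find the right kernel" problem mentioned above.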

LDA and Kernel LDA

Linear discriminant analysis (LDA) is similar to PCA but supervised (PCA does not require labels). The goal of LDA is not to maximize variance but to create linear surfaces that separate the different groups. The new features are axes that separate the data well when the data are projected onto them.

As with PCA, you can also apply the kernel trick to use LDA on non-linearly separable data.
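A minimal linear LDA example in R uses MASS::lda() (the MASS package is an assumption here; the post itself shows no code for LDA):

```r
library(MASS)

# Supervised: lda() uses the Species labels, unlike prcomp()
fit <- lda(Species ~ ., data = iris)
proj <- predict(fit)$x  # data projected onto the discriminant axes
head(proj)              # 3 classes -> at most 2 discriminant axes
```

With k classes, LDA yields at most k - 1 axes, so for iris the reduction is from four variables to two.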

ICA and Kernel ICA

Independent component analysis aims to create independent variables from the original ones.
The typical example is the cocktail party problem: you are in a room where lots of different people are having different conversations, and your goal is to separate the individual conversations. If you have many microphones in the room, each of them will record a linear combination of all the conversations (some kind of noise). The goal of ICA is to disentangle the individual conversations from this noise.
This can be seen as dimensionality reduction: if you have 200 microphones but only ten independent conversations, you should be able to represent them with ten independent variables from the ICA.
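The cocktail party setup can be reproduced with the fastICA package (assumed installed; the signal construction below is an illustrative invention, not from the post):

```r
# install.packages("fastICA")
library(fastICA)

# Two "conversations" (source signals)...
S <- cbind(sin((1:1000) / 20),
           rep(c(-1, 1), each = 5, length.out = 1000))
# ...mixed into two "microphone" recordings
A <- matrix(c(0.8, 0.2, 0.4, 0.6), 2, 2)
X <- S %*% A

# ICA tries to recover the independent sources from the mixtures
ica <- fastICA(X, n.comp = 2)
head(ica$S)  # estimated independent components
```

With 200 microphones and ten conversations you would call fastICA(X, n.comp = 10), keeping ten columns instead of 200.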


Autoencoder

An autoencoder is a powerful method to reduce the dimensionality of data. It is composed of a neural network (it can be feed-forward, convolutional or recurrent; most architectures can be adapted into an autoencoder) which tries to learn to reconstruct its input. For instance, an autoencoder trained on images will try to reconstruct these images.

In addition to this, the autoencoder has a bottleneck: the number of neurons in the autoencoder’s hidden layers is smaller than the number of input variables. Hence, the autoencoder has to learn a compressed representation of the data, where the number of dimensions is the number of neurons in the smallest hidden layer.

The main advantages of autoencoders are their non-linearity and their flexibility.
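A bottleneck autoencoder can be sketched with the keras R package (this assumes keras is installed and configured, and the 10-to-3 dimensions are arbitrary illustration, not from the post):

```r
library(keras)

# 10 input variables squeezed through a 3-neuron bottleneck
input   <- layer_input(shape = c(10))
encoded <- input %>% layer_dense(units = 3, activation = "relu")      # bottleneck
decoded <- encoded %>% layer_dense(units = 10, activation = "linear") # reconstruction

autoencoder <- keras_model(input, decoded)
autoencoder %>% compile(optimizer = "adam", loss = "mse")
# autoencoder %>% fit(x, x, epochs = 50)  # trained to reproduce its own input

encoder <- keras_model(input, encoded)    # use this model to get the 3-D representation
```

After training, predict() on the encoder model gives the compressed, three-dimensional representation of the data.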


T-SNE

T-SNE, or t-distributed stochastic neighbor embedding, is mainly used to embed high-dimensional data in a 2D or 3D space. That is the technique we used to fit and visualize the Twitter data in our analysis of the French election. The main idea behind t-SNE is that points which are nearby in the original space should be close in the low-dimensional space, while distant points should remain distant.
t-SNE is highly non-linear, originally non-parametric, dependent on the random seed, and does not preserve distances. Hence, I mainly use it to plot high-dimensional data and to visualize clusters and similarity.

NB: There is also a parametric variant, which seems less widely used than the original t-SNE.
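In R, the usual implementation is the Rtsne package (assumed installed; the iris example is an illustration, not the Twitter analysis mentioned above):

```r
# install.packages("Rtsne")
library(Rtsne)

dat <- unique(iris[, 1:4])  # Rtsne refuses duplicate rows
set.seed(42)                # results depend on the random seed

ts <- Rtsne(as.matrix(dat), dims = 2, perplexity = 30)
plot(ts$Y)                  # 2D embedding of the 4D data
```

Setting the seed matters precisely because of the random-seed dependence noted above: two runs with different seeds give different (though usually similar-looking) maps.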

Here ends our presentation of the most widely used dimensionality reduction techniques.




The post Machine Learning Explained: Dimensionality Reduction appeared first on Enhance Data Science.


To leave a comment for the author, please follow the link and comment on their blog: Enhance Data Science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Google Vision API in R – RoogleVision

Mon, 07/31/2017 - 19:54

(This article was first published on R-Projects – Stoltzmaniac, and kindly contributed to R-bloggers)

Using the Google Vision API in R Utilizing RoogleVision

After doing my post last month on OpenCV and face detection, I started looking into other algorithms used for pattern detection in images. As it turns out, Google has done a phenomenal job with their Vision API. It’s absolutely incredible the amount of information it can spit back to you by simply sending it a picture.

Also, it’s 100% free! I believe that includes 1000 images per month. Amazing!

In this post I’m going to walk you through the absolute basics of accessing the power of the Google Vision API using the RoogleVision package in R.

As always, we’ll start off loading some libraries. I wrote some extra notation around where you can install them within the code.

# Normal Libraries
library(tidyverse)

# devtools::install_github("flovv/RoogleVision")
library(RoogleVision)
library(jsonlite) # to import credentials

# For image processing
# source("")
# biocLite("EBImage")
library(EBImage)

# For Latitude Longitude Map
library(leaflet)

Google Authentication

In order to use the API, you have to authenticate. There is plenty of documentation out there about how to setup an account, create a project, download credentials, etc. Head over to Google Cloud Console if you don’t have an account already.

# Credentials file I downloaded from the cloud console
creds = fromJSON('credentials.json')

# Google Authentication - Use Your Credentials
# options("googleAuthR.client_id" = "")
# options("googleAuthR.client_secret" = "")
options("googleAuthR.client_id" = creds$installed$client_id)
options("googleAuthR.client_secret" = creds$installed$client_secret)
options("googleAuthR.scopes.selected" = c(""))
googleAuthR::gar_auth()

Now You’re Ready to Go

The function getGoogleVisionResponse takes three arguments:

  1. imagePath
  2. feature
  3. numResults

Numbers 1 and 3 are self-explanatory; “feature” has 5 options:

  • LABEL_DETECTION
  • LANDMARK_DETECTION
  • FACE_DETECTION
  • LOGO_DETECTION
  • TEXT_DETECTION
These are self-explanatory but it’s nice to see each one in action.

As a side note: there are also other features that the API has which aren’t included (yet) in the RoogleVision package such as “Safe Search” which identifies inappropriate content, “Properties” which identifies dominant colors and aspect ratios and a few others can be found at the Cloud Vision website

Label Detection

This is used to help determine content within the photo. It can basically add a level of metadata around the image.

Here is a photo of our dog when we hiked up to Audubon Peak in Colorado:

dog_mountain_label = getGoogleVisionResponse('dog_mountain.jpg',
                                             feature = 'LABEL_DETECTION')
head(dog_mountain_label)
##            mid           description     score
## 1     /m/09d_r              mountain 0.9188690
## 2 /g/11jxkqbpp mountainous landforms 0.9009549
## 3    /m/023bbt            wilderness 0.8733696
## 4     /m/0kpmf             dog breed 0.8398435
## 5    /m/0d4djn            dog hiking 0.8352048

All 5 responses were incredibly accurate! The “score” that is returned is how confident the Google Vision algorithms are, so there’s a 91.9% chance a mountain is prominent in this photo. I like “dog hiking” the best – considering that’s what we were doing at the time. Kind of a little bit too accurate…

Landmark Detection

This is a feature designed to specifically pick out a recognizable landmark! It provides the position in the image along with the geolocation of the landmark (in longitude and latitude).

My wife and I took this selfie at Linderhof Castle in Bavaria, Germany.

us_castle <- readImage('us_castle_2.jpg')
plot(us_castle)

The response from the Google Vision API was spot on. It returned “Linderhof Palace” as the description. It also provided a score (I reduced the resolution of the image which hurt the score), a boundingPoly field and locations.

  • Bounding Poly – gives x,y coordinates for a polygon around the landmark in the image
  • Locations – provides longitude,latitude coordinates
us_landmark = getGoogleVisionResponse('us_castle_2.jpg',
                                      feature = 'LANDMARK_DETECTION')
head(us_landmark)
##         mid      description     score
## 1 /m/066h19 Linderhof Palace 0.4665011
##                               vertices          locations
## 1 25, 382, 382, 25, 178, 178, 659, 659 47.57127, 10.96072

I plotted the polygon over the image using the coordinates returned. It does a great job (certainly not perfect) of getting the castle identified. It’s a bit tough to say what the actual “landmark” would be in this case due to the fact the fountains, stairs and grounds are certainly important and are a key part of the castle.

us_castle <- readImage('us_castle_2.jpg')
plot(us_castle)
xs = us_landmark$boundingPoly$vertices[[1]][1][[1]]
ys = us_landmark$boundingPoly$vertices[[1]][2][[1]]
polygon(x = xs, y = ys, border = 'red', lwd = 4)

Turning to the locations – I plotted this using the leaflet library. If you haven’t used leaflet, start doing so immediately. I’m a huge fan of it due to speed and simplicity. There are a lot of customization options available as well that you can check out.

The location is spot on! While it isn’t a shock to me that Google could provide the location of “Linderhof Castle” – it is amazing to me that I don’t have to write a web crawler search function to find it myself! That’s just one of many little luxuries they have built into this API.

latt = us_landmark$locations[[1]][[1]][[1]]
lon = us_landmark$locations[[1]][[1]][[2]]
m = leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  setView(lng = lon, lat = latt, zoom = 5) %>%
  addMarkers(lng = lon, lat = latt)
m

Face Detection

My last blog post showed the OpenCV package utilizing the haar cascade algorithm in action. I didn’t dig into Google’s algorithms to figure out what is under the hood, but it provides similar results. However, rather than layering in each subsequent “find the eyes” and “find the mouth” and …etc… it returns more than you ever needed to know.

  • Bounding Poly = highest level polygon
  • FD Bounding Poly = polygon surrounding each face
  • Landmarks = (funny name) includes each feature of the face (left eye, right eye, etc.)
  • Roll Angle, Pan Angle, Tilt Angle = all of the different angles you’d need per face
  • Confidence (detection and landmarking) = how certain the algorithm is that it’s accurate
  • Joy, sorrow, anger, surprise, under exposed, blurred, headwear likelihoods = how likely it is that each face contains that emotion or characteristic

The likelihoods are another amazing piece of information returned! I have run about 20 images through this API and every single one has been accurate – very impressive!

I wanted to showcase the face detection and headwear first. Here’s a picture of my wife and I at “The Bean” in Chicago (side note: it’s awesome! I thought it was going to be really silly, but you can really have a lot of fun with all of the angles and reflections):

us_hats_pic <- readImage('us_hats.jpg')
plot(us_hats_pic)

us_hats = getGoogleVisionResponse('us_hats.jpg',
                                  feature = 'FACE_DETECTION')
head(us_hats)
##                                 vertices
## 1 295, 410, 410, 295, 164, 164, 297, 297
## 2 353, 455, 455, 353, 261, 261, 381, 381
##                                 vertices
## 1 327, 402, 402, 327, 206, 206, 280, 280
## 2 368, 439, 439, 368, 298, 298, 370, 370
##   landmarks ... landmarks
##   rollAngle panAngle tiltAngle detectionConfidence landmarkingConfidence
## 1  7.103324 23.46835 -2.816312           0.9877176             0.7072066
## 2  2.510939 -1.17956 -7.393063           0.9997375             0.7268016
##   joyLikelihood sorrowLikelihood angerLikelihood surpriseLikelihood
## 1   VERY_LIKELY    VERY_UNLIKELY   VERY_UNLIKELY      VERY_UNLIKELY
## 2   VERY_LIKELY    VERY_UNLIKELY   VERY_UNLIKELY      VERY_UNLIKELY
##   underExposedLikelihood blurredLikelihood headwearLikelihood
## 1          VERY_UNLIKELY     VERY_UNLIKELY        VERY_LIKELY
## 2          VERY_UNLIKELY     VERY_UNLIKELY        VERY_LIKELY

us_hats_pic <- readImage('us_hats.jpg')
plot(us_hats_pic)
xs1 = us_hats$fdBoundingPoly$vertices[[1]][1][[1]]
ys1 = us_hats$fdBoundingPoly$vertices[[1]][2][[1]]
xs2 = us_hats$fdBoundingPoly$vertices[[2]][1][[1]]
ys2 = us_hats$fdBoundingPoly$vertices[[2]][2][[1]]
polygon(x = xs1, y = ys1, border = 'red', lwd = 4)
polygon(x = xs2, y = ys2, border = 'green', lwd = 4)

Here’s a shot that should be familiar (copied directly from my last blog) – and I wanted to highlight the different features that can be detected. Look at how many points are perfectly placed:

my_face_pic <- readImage('my_face.jpg')
plot(my_face_pic)

my_face = getGoogleVisionResponse('my_face.jpg',
                                  feature = 'FACE_DETECTION')
head(my_face)
##                               vertices
## 1 456, 877, 877, 456, NA, NA, 473, 473
##                               vertices
## 1 515, 813, 813, 515, 98, 98, 395, 395
##   landmarks ...
##    rollAngle  panAngle tiltAngle detectionConfidence landmarkingConfidence
## 1 -0.6375801 -2.120439  5.706552            0.996818             0.8222974
##   joyLikelihood sorrowLikelihood angerLikelihood surpriseLikelihood
## 1   VERY_LIKELY    VERY_UNLIKELY   VERY_UNLIKELY      VERY_UNLIKELY
##   underExposedLikelihood blurredLikelihood headwearLikelihood
## 1          VERY_UNLIKELY     VERY_UNLIKELY      VERY_UNLIKELY

head(my_face$landmarks)
## [[1]]
##                      type position.x position.y    position.z
## 1                LEFT_EYE   598.7636   192.1949  -0.001859295
## 2               RIGHT_EYE   723.1612   192.4955  -4.805475700
## 3    LEFT_OF_LEFT_EYEBROW   556.1954   165.2836  15.825399000
## 4   RIGHT_OF_LEFT_EYEBROW   628.8224   159.9029 -23.345352000
## 5   LEFT_OF_RIGHT_EYEBROW   693.0257   160.6680 -25.614508000
## 6  RIGHT_OF_RIGHT_EYEBROW   767.7514   164.2806   7.637372000
## 7   MIDPOINT_BETWEEN_EYES   661.2344   185.0575 -29.068363000
## 8                NOSE_TIP   661.9072   260.9006 -74.153710000
## ...

my_face_pic <- readImage('my_face.jpg')
plot(my_face_pic)
xs1 = my_face$fdBoundingPoly$vertices[[1]][1][[1]]
ys1 = my_face$fdBoundingPoly$vertices[[1]][2][[1]]
xs2 = my_face$landmarks[[1]][[2]][[1]]
ys2 = my_face$landmarks[[1]][[2]][[2]]
polygon(x = xs1, y = ys1, border = 'red', lwd = 4)
points(x = xs2, y = ys2, lwd = 2, col = 'lightblue')

Logo Detection

To continue along the Chicago trip, we drove by Wrigley Field and I took a really bad photo of the sign from a moving car while it was under construction. It’s a nice test image because it has a lot of different lines and writing, and the Toyota logo isn’t incredibly prominent or necessarily true to brand colors.

This call returns:

  • Description = Brand name of the logo detected
  • Score = Confidence of prediction accuracy
  • Bounding Poly = (Again) coordinates of the logo
wrigley_image <- readImage('wrigley_text.jpg')
plot(wrigley_image)

wrigley_logo = getGoogleVisionResponse('wrigley_text.jpg',
                                       feature = 'LOGO_DETECTION')
head(wrigley_logo)
##           mid description     score                               vertices
## 1 /g/1tk6469q      Toyota 0.3126611 435, 551, 551, 435, 449, 449, 476, 476

wrigley_image <- readImage('wrigley_text.jpg')
plot(wrigley_image)
xs = wrigley_logo$boundingPoly$vertices[[1]][[1]]
ys = wrigley_logo$boundingPoly$vertices[[1]][[2]]
polygon(x = xs, y = ys, border = 'green', lwd = 4)

Text Detection

I’ll continue using the Wrigley Field picture. There is text all over the place and it’s fun to see what is captured and what isn’t. It appears as if the curved text at the top “field” isn’t easily interpreted as text. However, the rest is caught and the words are captured.

The response sent back is a bit more difficult to interpret than the rest of the API calls – it breaks things apart by word but also returns everything as one line. Here’s what comes back:

  • Locale = language, returned as source
  • Description = the text (the first line is everything, and then the rest are individual words)
  • Bounding Poly = I’m sure you can guess by now
wrigley_text = getGoogleVisionResponse('wrigley_text.jpg',
                                       feature = 'TEXT_DETECTION')
head(wrigley_text)
##   locale
## 1     en
##   description
## 1 RIGLEY F\nICHICAGO CUBS\nORDER ONLINE AT GIORDANOS.COM\nTOYOTA\nMIDWEST\nFENCE\n773-722-6616\nCAUTION\nCAUTION\n ORDER
##                                 vertices
## 1   55, 657, 657, 55, 210, 210, 852, 852
## 2 343, 482, 484, 345, 217, 211, 260, 266

wrigley_image <- readImage('wrigley_text.jpg')
plot(wrigley_image)
for(i in 1:length(wrigley_text$boundingPoly$vertices)){
  xs = wrigley_text$boundingPoly$vertices[[i]]$x
  ys = wrigley_text$boundingPoly$vertices[[i]]$y
  polygon(x = xs, y = ys, border = 'green', lwd = 2)
}

That’s about it for the basics of using the Google Vision API with the RoogleVision library. I highly recommend tinkering around with it a bit, especially because it won’t cost you a dime.

While I do enjoy the math under the hood and the thinking required to understand algorithms, I do think these sorts of APIs will become the way of the future for data science. Outside of specific use cases or special industries, it seems hard to imagine wanting to try to create algorithms that would be better than ones created for mass consumption. As long as they’re fast, free and accurate, I’m all about making my life easier! From the hiring perspective, I much prefer someone who can get the job done over someone who can slightly improve performance (as always, there are many cases where this doesn’t apply).

Please comment if you are utilizing any of the Google APIs for business purposes – I would love to hear about it!

As always you can find this on my GitHub


To leave a comment for the author, please follow the link and comment on their blog: R-Projects – Stoltzmaniac.

Upcoming Talk at the Bay Area R Users Group (BARUG)

Mon, 07/31/2017 - 17:00

(This article was first published on R –, and kindly contributed to R-bloggers)

Next Tuesday (August 8) I will be giving a talk at the Bay Area R Users Group (BARUG). The talk is titled Beyond Popularity: Monetizing R Packages. Here is an abstract of the talk:

This talk will cover my initial foray into monetizing an open source R package. In 2015, a year after publishing the initial version of choroplethr on CRAN, I made a concerted effort to try and monetize the package. In this workshop you will learn:

1. Why I decided to monetize choroplethr.
2. The monetization strategy I used.
3. Three tactics that can help you monetize your own R package.

I first gave this talk at the San Francisco EARL conference in June. The talk was well-received there, and I am looking forward to giving it again!

The talk is free to attend, but requires registration.

Learn More Here

The post Upcoming Talk at the Bay Area R Users Group (BARUG) appeared first on


To leave a comment for the author, please follow the link and comment on their blog: R –.

sparklyr 0.6

Mon, 07/31/2017 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re excited to announce a new release of the sparklyr package, available in CRAN today! sparklyr 0.6 introduces new features to:

  • Distribute R computations using spark_apply() to execute arbitrary R code across your Spark cluster. You can now use all of your favorite R packages and functions in a distributed context.
  • Connect to External Data Sources using spark_read_source(), spark_write_source(), spark_read_jdbc() and spark_write_jdbc().
  • Use the Latest Frameworks including dplyr 0.7, DBI 0.7, RStudio 1.1 and Spark 2.2.

and several improvements across:

  • Spark Connections: a new Databricks connection enables using sparklyr in Databricks through mode = "databricks"; support was added for Yarn Cluster through master = "yarn-cluster"; and connection speed was improved.
  • Dataframes: support was added for sdf_pivot(), sdf_broadcast(), cbind(), rbind(), sdf_separate_column(), sdf_bind_cols(), sdf_bind_rows(), sdf_repartition() and sdf_num_partitions().
  • Machine Learning: support was added for multinomial regression in ml_logistic_regression(), a weights.column for GLM, ml_model_data() and a new ft_count_vectorizer() function for ml_lda().
  • Many other improvements: initial support for broom over ml_linear_regression() and ml_generalized_linear_regression(); dplyr support for %like%, %rlike% and %regexp%; sparklyr extensions now support download_scalac() to help you install the required Scala compilers while developing extensions; and Hive database management was simplified with tbl_change_db() and src_databases() to query and switch between Hive databases. RStudio also started a joint effort with Microsoft to support a cross-platform Spark installer.

Additional changes and improvements can be found in the sparklyr NEWS file.

Updated documentation and examples are available at For questions or feedback, please feel free to open a sparklyr github issue or a sparklyr stackoverflow question.

Distributed R

sparklyr 0.6 provides support for executing distributed R code through spark_apply(). For instance, after connecting and copying some data:

library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris)

We can apply an arbitrary R function, say jitter(), to each column over each row as follows:

iris_tbl %>%
  spark_apply(function(e) sapply(e[, 1:4], jitter))
## # Source:   table [?? x 4]
## # Database: spark_connection
##    Sepal_Length Sepal_Width Petal_Length Petal_Width
##  1     5.102223    3.507372     1.406654   0.1990680
##  2     4.900148    3.002006     1.396052   0.2002922
##  3     4.699807    3.204126     1.282652   0.2023850
##  4     4.618854    3.084675     1.508538   0.2119644
##  5     4.985322    3.596079     1.388837   0.1846146
##  6     5.381947    3.881051     1.686600   0.3896673
##  7     4.613713    3.400265     1.404120   0.2954829
##  8     4.995116    3.408897     1.493193   0.1945901
##  9     4.418538    2.916306     1.392230   0.1976161
## 10     4.891340    3.096591     1.498078   0.1174069
## # ... with 140 more rows

One can also group by columns to apply an operation over each group of rows, say, to perform linear regression over each group as follows:

spark_apply(
  iris_tbl,
  function(e) broom::tidy(lm(Petal_Width ~ Petal_Length, e)),
  names = c("term", "estimate", "std.error", "statistic", "p.value"),
  group_by = "Species"
)
## # Source:   table [?? x 6]
## # Database: spark_connection
##      Species         term    estimate  std.error  statistic      p.value
## 1 versicolor  (Intercept) -0.08428835 0.16070140 -0.5245029 6.023428e-01
## 2 versicolor Petal_Length  0.33105360 0.03750041  8.8279995 1.271916e-11
## 3  virginica  (Intercept)  1.13603130 0.37936622  2.9945505 4.336312e-03
## 4  virginica Petal_Length  0.16029696 0.06800119  2.3572668 2.253577e-02
## 5     setosa  (Intercept) -0.04822033 0.12164115 -0.3964146 6.935561e-01
## 6     setosa Petal_Length  0.20124509 0.08263253  2.4354220 1.863892e-02

Packages can be used since they are automatically distributed to the worker nodes; however, using spark_apply() requires R to be installed over each worker node. Please refer to Distributing R Computations for additional information and examples.

External Data Sources

sparklyr 0.6 adds support for connecting Spark to databases. This feature is useful if your Spark environment is separate from your data environment, or if you use Spark to access multiple data sources. You can use spark_read_source() and spark_write_source() with any data connector available in Spark Packages. Alternatively, you can use spark_read_jdbc() and spark_write_jdbc() with a JDBC driver for almost any data source.

For example, you can connect to Cassandra using spark_read_source(). Notice that the Cassandra connector version needs to match the Spark version as defined in their version compatibility section.

config <- spark_config()
config[["sparklyr.defaultPackages"]] <- c(
  "datastax:spark-cassandra-connector:2.0.0-RC1-s_2.11")
sc <- spark_connect(master = "local", config = config)
spark_read_source(sc, "emp", "org.apache.spark.sql.cassandra",
                  list(keyspace = "dev", table = "emp"))

To connect to MySQL, one can download the MySQL connector and use spark_read_jdbc() as follows:

config <- spark_config()
config$`` <- "~/Downloads/mysql-connector-java-5.1.41/mysql-connector-java-5.1.41-bin.jar"
sc <- spark_connect(master = "local", config = config)
spark_read_jdbc(sc, "person_jdbc", options = list(
  url = "jdbc:mysql://localhost:3306/sparklyr",
  user = "root", password = "",
  dbtable = "person"))

See also crassy, a sparklyr extension being developed to read data from Cassandra with ease.

Dataframe Functions

sparklyr 0.6 includes many improvements for working with DataFrames. Here are some additional highlights.

x_tbl <- sdf_copy_to(sc, data.frame(a = c(1, 2, 3), b = c(2, 3, 4)))
y_tbl <- sdf_copy_to(sc, data.frame(b = c(3, 4, 5), c = c("A", "B", "C")))

Pivoting DataFrames

It is now possible to pivot (i.e. cross tabulate) one or more columns using sdf_pivot().

sdf_pivot(y_tbl, b ~ c, list(b = "count"))
## # Source:   table [?? x 4]
## # Database: spark_connection
##       b     A     B     C
## 1     4   NaN     1   NaN
## 2     3     1   NaN   NaN
## 3     5   NaN   NaN     1

Binding Rows and Columns

Binding DataFrames by rows and columns is supported through sdf_bind_rows() and sdf_bind_cols():

sdf_bind_rows(x_tbl, y_tbl)
## # Source:   table [?? x 3]
## # Database: spark_connection
##       a     b     c
## 1     1     2
## 2     2     3
## 3     3     4
## 4   NaN     3     A
## 5   NaN     4     B
## 6   NaN     5     C

sdf_bind_cols(x_tbl, y_tbl)
## # Source:   lazy query [?? x 4]
## # Database: spark_connection
##       a   b.x   b.y     c
## 1     1     2     3     A
## 2     3     4     5     C
## 3     2     3     4     B

Separating Columns

Separate lists into columns with ease. This is especially useful when working with model predictions that are returned as lists instead of scalars. In this example, each element in the probability column contains two items. We can use sdf_separate_column() to isolate the item that corresponds to the probability that vs equals one.

library(magrittr)
mtcars[, c("vs", "mpg")] %>%
  sdf_copy_to(sc, .) %>%
  ml_logistic_regression("vs", "mpg") %>%
  sdf_predict() %>%
  sdf_separate_column("probability", list("P[vs=1]" = 2))
## # Source:   table [?? x 7]
## # Database: spark_connection
##       vs   mpg id58fb64e07a38 rawPrediction probability prediction
##  1     0  21.0              0                                    1
##  2     0  21.0              1                                    1
##  3     1  22.8              2                                    1
##  4     1  21.4              3                                    1
##  5     0  18.7              4                                    0
##  6     1  18.1              5                                    0
##  7     0  14.3              6                                    0
##  8     1  24.4              7                                    1
##  9     1  22.8              8                                    1
## 10     1  19.2              9                                    0
## # ... with 22 more rows, and 1 more variables: `P[vs=1]`

Machine Learning

Multinomial Regression

sparklyr 0.6 adds support for multinomial regression for Spark 2.1.0 or higher:

iris_tbl %>%
  ml_logistic_regression("Species",
                         features = c("Sepal_Length", "Sepal_Width"))
## Call: Species ~ Sepal_Length + Sepal_Width
##
## Coefficients:
##      (Intercept) Sepal_Length Sepal_Width
## [1,]   -201.5540     73.19269   -59.83942
## [2,]   -214.6001     75.09506   -59.43476
## [3,]    416.1541   -148.28775   119.27418

Improved Text Mining with LDA

ft_tokenizer() was introduced in sparklyr 0.5 but sparklyr 0.6 introduces ft_count_vectorizer() and vocabulary.only to simplify LDA:

library(janeaustenr)
lines_tbl <- sdf_copy_to(sc, austen_books()[c(1, 3), ])
lines_tbl %>%
  ft_tokenizer("text", "tokens") %>%
  ft_count_vectorizer("tokens", "features") %>%
  ml_lda("features", k = 4)
## An LDA model fit on 1 features
##
## Topics Matrix:
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.8996952 0.8569472 1.1249431 0.9366354
## [2,] 0.9815787 1.1721218 1.0795771 0.9990090
## [3,] 1.1738678 0.8600233 0.9864862 0.9573433
## [4,] 0.8951603 1.0730703 0.9562389 0.8899160
## [5,] 1.0528748 1.0291708 1.0699833 0.8731401
## [6,] 1.1857015 1.0441299 1.0987713 1.1247574
##
## Estimated Document Concentration:
## [1] 13.52071 13.52172 13.52060 13.51963

The vocabulary can be printed with:

lines_tbl %>%
  ft_tokenizer("text", "tokens") %>%
  ft_count_vectorizer("tokens", "features", vocabulary.only = TRUE)
## [1] "jane"        "sense"       "austen"      "by"          "sensibility"
## [6] "and"

That’s all for now, disconnecting:

spark_disconnect(sc)

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

Data visualization with googleVis exercises part 9

Sun, 07/30/2017 - 18:26

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Histogram & Calendar chart

This is part 9 of our series and we are going to explore the features of two interesting types of charts that googleVis provides like histogram and calendar charts.

Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let’s begin!

Answers to the exercises are available here.

Package & Data frame

As you already know, the first thing you have to do is install and load the googleVis package with:

install.packages("googleVis")
library(googleVis)

To run this example we will first create an experimental data frame with:
Hist = data.frame(A = rpois(100, 10),
                  B = rpois(100, 20),
                  C = rpois(100, 30))

NOTE: The charts are created locally by your browser. In case they are not displayed at once press F5 to reload the page. All charts require an Internet connection.


Histogram

It is quite simple to create a histogram with googleVis. We will use the “Hist” data frame we just created. You can see the variables of your data frame with head().
Look at the example below to create a simple histogram:
HistC <- gvisHistogram(Hist)

Exercise 1

Create a list named “HistC” and pass to it the “Hist” data frame as a histogram. HINT: Use gvisHistogram().

Exercise 2

Plot the histogram. HINT: Use plot().


To add a legend to your chart you can use:
HistC <- gvisHistogram(Hist, options = list(legend = "{ position: 'top' }"))

Exercise 3

Add a legend to the bottom of your histogram and plot it. HINT: Use list().

To choose the colours of your bars you can use:
HistC <- gvisHistogram(Hist, options = list(colors = "['black', 'green', 'yellow']"))


Exercise 4

Change the colours of the histogram’s bars to red, green and blue and plot it. HINT: Use colors.

To set the dimensions of your histogram you can use:
options=list(width=400, height=400)

Exercise 5

Set the width of your histogram to 500, its height to 400, and plot it.
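The option fragments shown above all go into a single options list passed to gvisHistogram(). Here is a sketch combining them (the values are the ones from the fragments, chosen for illustration; the call is guarded so it only runs if googleVis is installed):

```r
# Experimental data frame, as created earlier in the post
Hist <- data.frame(A = rpois(100, 10),
                   B = rpois(100, 20),
                   C = rpois(100, 30))

# Legend position, bar colours, and chart dimensions in one options list
opts <- list(legend = "{ position: 'top' }",
             colors = "['black', 'green', 'yellow']",
             width  = 400, height = 400)

if (requireNamespace("googleVis", quietly = TRUE)) {
  HistC <- googleVis::gvisHistogram(Hist, options = opts)
  # plot(HistC)  # renders the chart in the browser
}
```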

Calendar chart

It is quite simple to create a Calendar Chart with googleVis. We will use the “Cairo” data set. You can see the variables of “Cairo” with head().
Look at the example below to create a simple calendar chart:
CalC <- gvisCalendar(Cairo)

Exercise 6

Create a list named “CalC” and pass to it the “Cairo” data set as a calendar chart. HINT: Use gvisCalendar().

Exercise 7

Plot the calendar chart. HINT: Use plot().


You can add a title to your chart and set its dimensions through the options list.

Exercise 8

Add a title to your calendar chart, set height to 500 and plot it. HINT: Use list().

You can change the features of your labels with:
options=list(calendar="{yearLabel: { fontName: 'Times-Roman',
fontSize: 26, color: 'black', bold: false} }")

Exercise 9

Add labels to your chart, set the font of your labels to “Times-Roman”, their size to 30, their color to black, make them bold, and plot the chart.

To find more options about the cells you can use:
cellSize: 15,
cellColor: { stroke: 'red', strokeOpacity: 0.5 },
focusedCellColor: {stroke:'red'}
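Putting the calendar options together: the label and cell settings from the fragments above nest inside one calendar string. A sketch (the call is guarded so it only runs if googleVis, which ships the Cairo temperature data set, is installed):

```r
# Calendar options: year-label styling plus cell size and colours
calendar_opts <- list(
  height = 400,
  calendar = "{ yearLabel: { fontName: 'Times-Roman', fontSize: 26,
                             color: 'black', bold: false },
                cellSize: 15,
                cellColor: { stroke: 'red', strokeOpacity: 0.5 },
                focusedCellColor: { stroke: 'red' } }")

if (requireNamespace("googleVis", quietly = TRUE)) {
  data("Cairo", package = "googleVis")  # daily temperature data used in the examples
  CalC <- googleVis::gvisCalendar(Cairo, options = calendar_opts)
  # plot(CalC)
}
```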

Exercise 10

Set the size of the cells to 10, the focused color to green and plot the chart.

Related exercise sets:
  1. Data Visualization with googleVis exercises part 2
  2. Data Visualization with googleVis exercises part 4
  3. Data Visualization with googleVis exercises part 3
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Matching, Optimal Transport and Statistical Tests

Sun, 07/30/2017 - 14:54

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

To explain the “optimal transport” problem, we usually start with Gaspard Monge’s “Mémoire sur la théorie des déblais et des remblais“, which considers the problem of transporting a given distribution of matter (a pile of sand, for instance) into another (an excavation, for instance). This problem is usually formulated using distributions, and we seek the “optimal” transport from one distribution to the other. The formulation in the context of distributions was introduced in the 1940s by Leonid Kantorovich, e.g. from the distribution on the left to the distribution on the right.

Consider now the context of finite sets of points. We want to transport mass from one set of points to another. It is a complicated combinatorial problem: with 4 points there are only 24 possible transports to consider, but the number exceeds 20 billion with 15 points (on each side). For instance, the following one is usually seen as inefficient
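The factorial growth behind those counts can be checked directly in R (a quick illustration, not from the original post):

```r
# The number of one-to-one matchings of n source points
# with n target points is n!, which explodes quickly
factorial(4)    # 24 possible transports for 4 points
factorial(15)   # about 1.3e12, indeed beyond 20 billion
```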

while the following is usually seen as much better

Of course, it depends on the cost of the transport, which depends on the distance between the origin and the destination. That cost is usually either linear or quadratic in the distance.
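As a quick base-R illustration of those two usual cost choices (the points here are made up for the example):

```r
# Two source points and two destination points (illustrative values)
x1 <- matrix(c(0, 0,
               1, 0), ncol = 2, byrow = TRUE)
x2 <- matrix(c(0, 1,
               2, 2), ncol = 2, byrow = TRUE)

# Euclidean distance between every source and every destination
D <- as.matrix(dist(rbind(x1, x2)))[1:2, 3:4]

C_linear    <- D     # cost linear in the distance
C_quadratic <- D^2   # cost quadratic in the distance
```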

There are many applications of optimal transport in economics; see e.g. Alfred’s book Optimal Transport Methods in Economics. And there are also applications in statistics, which is what I saw while discussing with Pierre when I was in Boston in June. For instance, if we want to test whether two samples were drawn from the same distribution,

library(mnormt)   # for rmnorm()

npoints <- 25
mu1 <- c(1, 1)
mu2 <- c(0, 2)
Sigma1 <- diag(1, 2, 2)
Sigma2 <- diag(1, 2, 2)
Sigma2[2,1] <- Sigma2[1,2] <- -0.5
Sigma1 <- 0.4 * Sigma1
Sigma2 <- 0.4 * Sigma2
X1 <- rmnorm(npoints, mean = mu1, Sigma1)
X2 <- rmnorm(npoints, mean = mu2, Sigma2)
plot(X1[,1], X1[,2], col = "blue")
points(X2[,1], X2[,2], col = "red")

Here we use a parametric model to generate our sample (as always), and we might think of a parametric test (testing whether mean and variance parameters of the two distributions are equal).

or we might prefer a nonparametric test. The idea Pierre mentioned was based on optimal transport. Consider some quadratic loss

library(transport)  # for transport()
# cost_matrix_Lp() is assumed here to come from the winference package

ground_p <- 2
p <- 1
w1 <- rep(1/npoints, npoints)
w2 <- rep(1/npoints, npoints)
C <- cost_matrix_Lp(t(X1), t(X2), ground_p)
a <- transport(w1, w2, costm = C^p, method = "shortsimplex")

then it is possible to match points in the two samples

nonzero <- which(a$mass != 0)
from_indices <- a$from[nonzero]
to_indices <- a$to[nonzero]
for (i in seq_along(from_indices)) {
  segments(X1[from_indices[i],1], X1[from_indices[i],2],
           X2[to_indices[i],1], X2[to_indices[i],2])
}

Here we can observe two things. The total cost can be seen as rather large

> cost = function(a, X1, X2){
+   nonzero <- which(a$mass != 0)
+   d = function(i) (X1[a$from[i],1] - X2[a$to[i],1])^2 + (X1[a$from[i],2] - X2[a$to[i],2])^2
+   sum(sapply(nonzero, d))
+ }
> cost(a, X1, X2)
[1] 9.372472

and the transport is always (more or less) in the same direction

> angle = function(a, X1, X2){
+   nonzero <- which(a$mass != 0)
+   d = function(i) (X1[a$from[i],2] - X2[a$to[i],2]) / (X1[a$from[i],1] - X2[a$to[i],1])
+   sapply(nonzero, d)
+ }
> mean(angle(a, X1, X2))
[1] -0.3266797
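Note that a slope conflates opposite directions (a displacement and its reverse have the same slope); atan2 on the displacement itself distinguishes them. A small self-contained sketch:

```r
# Direction of a displacement (dx, dy), in degrees in [0, 360)
direction <- function(dx, dy) (atan2(dy, dx) * 180 / pi) %% 360
direction(1, 1)    # north-east: 45 degrees
direction(-1, -1)  # south-west: 225 degrees, although the slope is the same
```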

> library(plotrix)
> ag = (atan(angle(a, X1, X2)) / pi) * 180   # slopes converted to angles, in degrees
> ag[ag < 0] = ag[ag < 0] + 360
> dag = hist(ag, breaks = seq(0, 361, by = 1) - .5)
> polar.plot(dag$counts, seq(0, 360, by = 1), main = "Test Polar Plot", lwd = 3, line.col = 4)

(actually, the following plot has been obtained by generating a thousand samples of size 25)

In order to have a decent test, we need to see what happens under the null assumption (when drawing samples from the same distribution), see

Here is the optimal matching

Here is the distribution of the total cost, when drawing a thousand samples,

costs <- rep(NA, 1000)
for(s in 1:1000){
  X1a <- rmnorm(npoints, mean = mu1, Sigma1)
  X1b <- rmnorm(npoints, mean = mu1, Sigma2)
  ground_p <- 2
  p <- 1
  w1 <- rep(1/npoints, npoints)
  w2 <- rep(1/npoints, npoints)
  C <- cost_matrix_Lp(t(X1a), t(X1b), ground_p)
  ab <- transport(w1, w2, costm = C^p, method = "shortsimplex")
  costs[s] <- cost(ab, X1a, X1b)
}

So our cost of 9 obtained initially was not that high. Observe that when drawing from the same distribution, there is now no pattern in the optimal transport
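To turn this comparison into a formal test, the simulated null costs give an empirical p-value: the share of null costs at least as large as the observed one. A sketch with made-up numbers standing in for the thousand simulated costs:

```r
set.seed(1)
null_costs <- rgamma(1000, shape = 20, rate = 2)  # stand-in for the simulated null costs
observed   <- 9.372472                            # cost obtained on the original samples
p_value    <- mean(null_costs >= observed)        # empirical p-value of the transport test
```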

polar.plot(dag$counts,seq(0,360,by=1),main="Test Polar Plot",lwd=3,line.col=4)


Nice, isn’t it? I guess I will spend some time next year working on those transport algorithms, since we have great R packages, and hundreds of applications in economics…


To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.