R-bloggers
R news and tutorials contributed by hundreds of R bloggers
Updated: 4 hours 15 min ago

Using R: reshape2 to tidyr

Sun, 12/17/2017 - 12:35

(This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers)

Tidy data — it’s one of those terms that tend to confuse people, and certainly confused me. It’s Codd’s third normal form, but you can’t go around telling that to people and expect to be understood. One form is ”long”, the other is ”wide”. One form is ”melted”, another ”cast”. One form is ”gathered”, the other ”spread”. To make matters worse, I often botch the explanation and mix up at least two of the terms.

The word is also associated with the tidyverse suite of R packages in a somewhat loose way. But you don’t need to write in a tidyverse-style (including the %>%s and all) to enjoy tidy data.

But Hadley Wickham’s definition is straightforward:

In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

In practice, I don’t think people always take their data frames all the way to tidy. For example, to make a scatterplot, it is convenient to keep a couple of variables as different columns. The key is that we need to move between different forms rapidly (brain time-rapidly, more than computer time-rapidly, I might add).

And not everything should be organized this way. If you’re a geneticist, genotypes are notoriously inconvenient in normalized form. Better keep that individual by marker matrix.

The first serious piece of R code I wrote for someone else was a function to turn data into long form for plotting. I suspect plotting is often the gateway to tidy data. The function was like you’d expect from R code written by a beginner who comes from C-style languages: It reinvented the wheel, and I bet it had nested for loops, a bunch of hard bracket indices, and so on. Then I discovered reshape2.

library(reshape2)

fake_data <- data.frame(id = 1:20,
                        variable1 = runif(20, 0, 1),
                        variable2 = rnorm(20))

melted <- melt(fake_data, id.vars = "id")

The id.vars argument is to tell the function that the id column is the key, a column that tells us which individual each observation comes from. As the name suggests, id.vars can name multiple columns in a vector.
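Since id.vars takes a vector, a data frame with an extra key column (say, a grouping variable) could be melted like this. This is a hypothetical extension of the fake data, not part of the original example:

fake_data_grouped <- data.frame(id = 1:20,
                                group = rep(c("a", "b"), 10),
                                variable1 = runif(20, 0, 1),
                                variable2 = rnorm(20))

## Both id and group travel along with every observation
melted_grouped <- melt(fake_data_grouped, id.vars = c("id", "group"))

The examples below continue with the single-key fake_data.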

So this is the data before:

  id   variable1    variable2
1  1 0.938173781  0.852098580
2  2 0.408216233  0.261269134
3  3 0.341325188  1.796235963
4  4 0.958889279 -0.356218000

And this is after. This time the data frame doesn’t become wider. There are still three columns. But we go from 20 rows to 40: two variables times 20 individuals.

  id  variable       value
1  1 variable1 0.938173781
2  2 variable1 0.408216233
3  3 variable1 0.341325188
4  4 variable1 0.958889279

And now: tidyr. tidyr is the new tidyverse package for rearranging data like this.

The tidyr equivalent of the melt function is called gather. There are two important differences that messed with my mind at first.

The melt and gather functions take the opposite default assumption about what columns should be treated as keys and what columns should be treated as containing values. In melt, as we saw above, we need to list the keys to keep them with each observation. In gather, we need to list the value columns, and the rest will be treated as keys.

Also, the second and third arguments (and they would be the first and second if you piped something into it) are the variable names that will be used in the long form data. In this case, to get a data frame that looks exactly the same as the one above, we will stick with ”variable” and ”value”.

Here are five different ways to get the same long form data frame as above:

library(tidyr)

melted <- gather(fake_data, variable, value, 2:3)

## Column names instead of indices
melted <- gather(fake_data, variable, value, variable1, variable2)

## Excluding instead of including
melted <- gather(fake_data, variable, value, -1)

## Excluding using column name
melted <- gather(fake_data, variable, value, -id)

## With pipe
melted <- fake_data %>% gather(variable, value, -id)

Usually, this is the transformation we need: wide to long. If we need to go the other way, we can use reshape2’s cast functions or tidyr’s spread. This code recovers the original data frame:

## reshape2
dcast(melted, id ~ variable)

## tidyr
spread(melted, variable, value)

Posted in: computer stuff, data analysis, english. Tagged: R, reshape2, tidyr


To leave a comment for the author, please follow the link and comment on their blog: R – On unicorns and genes.

Teaching the tidyverse to beginners

Sun, 12/17/2017 - 01:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

At the end of October I tweeted this:

will teach #rstats soon again but this time following @drob 's suggestion of the tidyverse first as laid out here: https://t.co/js8SsUs8Nv

— Bruno Rodrigues (@brodriguesco) October 24, 2017

and it generated some discussion. Some people believe that this is the right approach, while others think that one should first present base R and then show how the tidyverse complements it. This year, I taught three classes: a 12-hour class to colleagues who work with me, a 15-hour class to master’s students, and 3 hours again to some of my colleagues. Each time, I decided to focus (almost) entirely on the tidyverse, and I must say that I am not disappointed with the results!

The 12-hour class was divided into two 6-hour days. It was a bit intense, especially the last 3 hours, which took place on Friday afternoon. The crowd was composed of some economists who had experience with Stata, others who were mostly using Excel, and finally some colleagues from the IT department who sometimes need to dig into some data themselves. Apart from 2 people, none of the others had ever had any experience with R.

We went from 0 to being able to produce the plot below by the end of the first day (so 6 hours in). Keep in mind that practically none of them had even opened RStudio before. I show the code so you can see the progress made in just a few hours:

library(Ecdat)
library(tidyverse)
library(ggthemes)

data(Bwages)

bwages = Bwages %>%
  mutate(educ_level = case_when(educ == 1 ~ "Primary School",
                                educ == 2 ~ "High School",
                                educ == 3 ~ "Some university",
                                educ == 4 ~ "Master's degree",
                                educ == 5 ~ "Doctoral degree"))

ggplot(bwages) +
  geom_smooth(aes(exper, wage, colour = educ_level)) +
  theme_minimal() +
  theme(legend.position = "bottom",
        legend.title = element_blank())

## `geom_smooth()` using method = 'loess'

Of course some of them needed help here and there, and I also gave them hints (for example, I told them about case_when() and to try it inside mutate() instead of nested ifs), but the difficulties were mostly due to lack of experience and to the fact that they hadn’t had the time to fully digest R’s syntax, which was completely new to most people involved.

On the second day I showed purrr::map() and purrr::reduce(), and overall it went quite well too. I even showed list-columns, and this is where I started losing some of them; I did not insist too much on it though, I only wanted to show them the flexibility of data.frame objects. Some of them were quite impressed by list-columns! Then I showed (for and while) loops and writing functions. I even showed them tidyeval and, again, it went relatively well. Once they had the opportunity to play around with it a bit, I think it clicked (plus they have lots of code examples to go back to).
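For readers who want a taste of what that part looked like, here is a minimal sketch of my own (not the actual course material) showing map(), reduce() and a list-column:

library(tidyverse)

## purrr::map() applies a function to each element of a list or vector
map(1:3, ~ .x^2)

## purrr::reduce() collapses a list or vector to a single value, here by summing
reduce(1:10, `+`)

## A list-column: each row of the nested tibble holds a whole data frame
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  nest()
by_cyl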

At the end, people seemed to have enjoyed the course, but told me that Friday was heavy; indeed it was, but I feel that this was mostly because 12 hours spread over 2 days is not the best format for this type of material. We all had time constraints, though.

The 15-hour Master’s course was spread over 4 days and covered basically the same material. I just used the last 3 hours to show the students some basic functions for model estimation (linear, count, logit/probit and survival models). Again, the students were quite impressed by how easily they could get descriptive statistics by first grouping by some variables. Through their questions, I even got to show them scoped versions of dplyr verbs, such as select_if() and summarise_at(). I was expecting to lose them there, but actually most of them got these scoped versions quite fast. These students already had some experience with R, though none with the tidyverse.
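As an illustration of the kind of scoped verbs that came up, here is a small sketch of my own using the starwars data set that ships with dplyr (not the actual course material):

library(dplyr)

## select_if() keeps only the columns matching a predicate
starwars %>%
  select_if(is.numeric) %>%
  head()

## summarise_at() applies the same summary to a chosen set of columns
starwars %>%
  group_by(species) %>%
  summarise_at(vars(height, mass), mean, na.rm = TRUE)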

Finally, the 3-hour course was perhaps the most interesting; the participants were 100% total beginners. Some just knew R by name and had never heard of, seen, or opened RStudio (with the exception of one person)! I did not show them any loops, function definitions or plots. I only showed them how RStudio looks and works, what packages are (and how to install them, as well as the CRAN Task Views), and then how to import data with rio and do descriptive statistics with dplyr alone. They were really interested and quite impressed by rio ("what do you mean I can use the same code for importing any dataset, in any format?") but also by the simplicity of dplyr.
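A minimal sketch of that workflow (the file and column names below are made up for illustration):

library(rio)
library(dplyr)

## rio::import() guesses the format from the file extension;
## "wages.csv" is a hypothetical file, it could just as well be .xlsx, .dta or .sav
wages <- import("wages.csv")

wages %>%
  group_by(education) %>%                 # hypothetical column names
  summarise(mean_wage = mean(wage),
            n = n())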

In all the courses, I did show the $ primitive for referring to columns inside a data.frame. First I showed them lists, which is where I introduced $. Then it was easy to explain why the same syntax works for a column inside a data.frame: a data.frame is simply a list! This is also a difference from previous years: I only mentioned (and showed really quickly) matrices and focused almost entirely on lists. Most participants, if not all, had learned to program statistics by thinking about linear algebra and matrices. Nothing wrong with that, but I feel that R really shines when you focus on lists and on how to work with them.
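A quick illustration of the point (mine, not from the course):

## A list can hold elements of different types, accessed with $
person <- list(name = "Jane", scores = c(1, 5, 3))
person$scores

## A data.frame is just a list of equal-length vectors...
is.list(iris)

## ...so column access works exactly the same way
head(iris$Sepal.Length)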

Overall, as the teacher, I think that focusing on the tidyverse might be a very good strategy. I might have to make some adjustments here and there for future courses, but my hunch is that the difficulties some participants had were not necessarily due to the tidyverse but simply to a lack of time to digest what was shown, as well as a total lack of experience with R. I do not think these participants would have understood a more traditional, base, matrix-oriented course any better.


To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

Gold-Mining Week 15 (2017)

Sat, 12/16/2017 - 23:16

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Week 15 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

The post Gold-Mining Week 15 (2017) appeared first on Fantasy Football Analytics.


To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

drat 0.1.4

Sat, 12/16/2017 - 18:26

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version of drat just arrived on CRAN as another no-human-can-delay-this automatic upgrade directly from the CRAN prechecks (though I did need a manual reminder from Uwe to remove a now stale drat repo URL — bad @hrbrmstr — from the README in a first attempt).

This release is mostly the work of Neal Fultz, who kindly sent me two squeaky-clean pull requests addressing two open issue tickets. As drat is reasonably small and simple, that was enough to motivate a quick release. I also ensured that PACKAGES.rds will always be committed along (if we’re in commit mode), which is a follow-up to an initial change in 0.1.3 in September.

drat stands for drat R Archive Template, and helps with easy-to-create and easy-to-use repositories for R packages. Since its inception in early 2015 it has found reasonably widespread adoption among R users, because repositories with marked releases are the better way to distribute code.
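For readers new to the package, typical usage looks roughly like the sketch below; see the drat page and vignettes for the authoritative examples, and note that the file and account names are only placeholders:

library(drat)

## As a package author: insert a built package into your local drat repository
insertPackage("myPackage_1.0.tar.gz")   # hypothetical file name

## As a user: register someone's drat repository and install from it
addRepo("eddelbuettel")
install.packages("somePackage")         # hypothetical package name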

The NEWS file summarises the release as follows:

Changes in drat version 0.1.4 (2017-12-16)
  • Changes in drat functionality

    • Binaries for macOS are now split by R version into two different directories (Neal Fultz in #67 addressing #64).

    • The target branch can now be set via a global option (Neal Fultz in #68 addressing #61).

    • In commit mode, add file PACKAGES.rds unconditionally.

  • Changes in drat documentation

    • Updated ‘README.md’, removing another stale example URL.

Courtesy of CRANberries, there is a comparison to the previous release. More detailed information is on the drat page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

More Pipes in R

Sat, 12/16/2017 - 15:58

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Was enjoying Gabriel’s article Pipes in R Tutorial For Beginners and wanted to call attention to a few more pipes in R (not all for beginners).


Image: The Kleisli arrow (Khaleesi and dragon).


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

digest 0.6.13

Sat, 12/16/2017 - 13:43

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A small maintenance release, version 0.6.13, of the digest package arrived on CRAN and in Debian yesterday.

digest creates hash digests of arbitrary R objects (using the ‘md5’, ‘sha-1’, ‘sha-256’, ‘crc32’, ‘xxhash’ and ‘murmurhash’ algorithms) permitting easy comparison of R language objects.

This release accommodates a request by Luke and Tomas to make the version argument of serialize() an argument to digest() too, which was easy enough to accommodate. The value 2L is the current default (and for now the only permitted value). The ALTREP changes in R 3.5 will bring us a new, and more powerful, format with value 3L. The value can be set in each call, or globally via options(). Other than that, we just clarified one aspect of raw vector usage in the manual page.
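A quick sketch of what this looks like in practice. The digest() and serialize() calls are standard usage; the name of the new argument shown at the end is my assumption based on the description above, not a quote from the manual page:

library(digest)

x <- list(a = 1:10, b = letters)

## Standard usage: hash an arbitrary R object
digest(x)

## Base R's serialize() already takes an explicit format version
serialize(x, connection = NULL, version = 2L)

## The new release passes the same choice through digest();
## argument name assumed here, and 2L is for now the only permitted value
digest(x, serializeVersion = 2L)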

CRANberries provides the usual summary of changes to the previous version.

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

The 3rd paperback editions of my books on Cricket, now on Amazon

Sat, 12/16/2017 - 04:44

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

The 3rd paperback editions of both my books on cricket are now available on Amazon for $12.99 each:

a) Cricket analytics with cricketr, Third Edition ($12.99). This book is based on my R package ‘cricketr‘, available on CRAN and uses ESPN Cricinfo Statsguru

b) Beaten by sheer pace! Cricket analytics with yorkr, 3rd edition ($12.99). This is based on my R package ‘yorkr‘ on CRAN and uses data from Cricsheet

Pick up your copies today!!

Note: In the 3rd edition of the paperback books, the charts will be in black and white. If you would like the charts to be in color, please check out the 2nd edition of these books.

You may also like
1. My book ‘Practical Machine Learning with R and Python’ on Amazon
2. A crime map of India in R: Crimes against women
3.  What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
4.  Bend it like Bluemix, MongoDB with autoscaling – Part 2
5. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
6. Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data

To see all posts see Index of posts


To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….

“Print hello”​ is not enough. A collection of Hello world functions.

Fri, 12/15/2017 - 22:09

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)


I guess I wrote my R “hello world!” function 7 or 8 years ago, when I was approaching R for the first time. A “hello world” is too little to illustrate the basic syntax of a programming language to a wannabe R programmer who wants a working program. Thus, here follows a collection of basic functions that may help a bit more than the famed piece of code.

######################################################
############### Hello world functions ################
######################################################

##################################
# General info
fun <- function( arguments ) {
  body
}
##################################

foo.add <- function(x, y){
  x + y
}
foo.add(7, 5)

----------------------------------

foo.above <- function(x){
  x[x > 10]
}
foo.above(1:100)

----------------------------------

foo.above_n <- function(x, n){
  x[x > n]
}
foo.above_n(1:20, 12)

----------------------------------

foo = seq(1, 100, by = 2)
foo.squared = NULL
for (i in 1:50) {
  foo.squared[i] = foo[i]^2
}
foo.squared

----------------------------------

a <- c(1, 6, 7, 8, 8, 9, 2)
s <- 0
for (i in 1:length(a)){
  s <- s + a[[i]]
}
s

----------------------------------

a <- c(1, 6, 7, 8, 8, 9, 2, 100)
s <- 0
i <- 1
while (i <= length(a)){
  s <- s + a[[i]]
  i <- i + 1
}
s

----------------------------------

FunSum <- function(a){
  s <- 0
  i <- 1
  while (i <= length(a)){
    s <- s + a[[i]]
    i <- i + 1
  }
  print(s)
}
FunSum(a)

-----------------------------------

SumInt <- function(n){
  s <- 0
  for (i in 1:n){
    s <- s + i
  }
  print(s)
}
SumInt(14)

-----------------------------------

# find the maximum
# right to left assignment
x <- c(3, 9, 7, 2)

# trick: it is necessary to use a temporary variable to allow the comparison by pairs of
# each number of the sequence, i.e. the process of comparison is incremental: each time
# a bigger number compared to the previous in the sequence is found, it is assigned as the
# temporary maximum
# Since the process has to start somewhere, the first (temporary) maximum is assigned to be
# the first number of the sequence
max <- x[1]
for(i in x){
  tmpmax = i
  if(tmpmax > max){
    max = tmpmax
  }
}
max

x <- c(-20, -14, 6, 2)
x <- c(-2, -24, -14, -7)
min <- x[1]
for(i in x){
  tmpmin = i
  if(tmpmin < min){
    min = tmpmin
  }
}
min

----------------------------------

# n is the nth Fibonacci number
# temp is the temporary variable
Fibonacci <- function(n){
  a <- 0
  b <- 1
  for(i in 1:n){
    temp <- b
    b <- a
    a <- a + temp
  }
  return(a)
}
Fibonacci(13)

----------------------------------

# R available factorial function
factorial(5)

# recursive function: ff
ff <- function(x) {
  if(x <= 0) {
    return(1)
  } else {
    return(x * ff(x - 1)) # function uses the fact it knows its own name to call itself
  }
}
ff(5)

----------------------------------

say_hello_to <- function(name){
  paste("Hello", name)
}
say_hello_to("Roberto")

----------------------------------

foo.colmean <- function(y){
  nc <- ncol(y)
  means <- numeric(nc)
  for(i in 1:nc){
    means[i] <- mean(y[,i])
  }
  means
}
foo.colmean(airquality)

----------------------------------

foo.colmean <- function(y, removeNA = FALSE){
  nc <- ncol(y)
  means <- numeric(nc)
  for(i in 1:nc){
    means[i] <- mean(y[,i], na.rm = removeNA)
  }
  means
}
foo.colmean(airquality, TRUE)

----------------------------------

foo.contingency <- function(x, y){
  nc <- ncol(x)
  out <- list()
  for (i in 1:nc){
    out[[i]] <- table(y, x[,i])
  }
  names(out) <- names(x)
  out
}

set.seed(123)
v1 <- sample(c(rep("a", 5), rep("b", 15), rep("c", 20)))
v2 <- sample(c(rep("d", 15), rep("e", 20), rep("f", 5)))
v3 <- sample(c(rep("g", 10), rep("h", 10), rep("k", 20)))
data <- data.frame(v1, v2, v3)
foo.contingency(data, v3)

That’s all folks!

#R #rstats #maRche #Rbloggers

This post is also shared in LinkedIn and www.r-bloggers.com


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

NIPS 2017 Summary

Fri, 12/15/2017 - 21:52

(This article was first published on Rstats – bayesianbiologist, and kindly contributed to R-bloggers)

Some (opinionated) themes and highlights from this year’s NIPS conference:


To leave a comment for the author, please follow the link and comment on their blog: Rstats – bayesianbiologist.

Inter-operate with ‘MQTT’ Message Brokers With R (a.k.a. Live! BBC! Subtitles!)

Fri, 12/15/2017 - 05:31

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Most of us see the internet through the lens of browsers and apps on our laptops, desktops, watches, TVs and mobile devices. These displays are showing us — for the most part — content designed for human consumption. Sure, apps handle API interactions, but even most of that communication happens over ports 80 or 443. But, there are lots of ports out there; 0:65535, in fact (at least TCP-wise). And, all of them have some kind of data, and most of that is still targeted to something for us.

What if I told you the machines are also talking to each other using a thin/efficient protocol that allows one, tiny sensor to talk to hundreds — if not thousands — of systems without even a drop of silicon-laced sweat? How can a mere, constrained sensor do that? Well, it doesn’t do it alone. Many of them share their data over a fairly new protocol dubbed MQTT (Message Queuing Telemetry Transport).

An MQTT broker watches for devices to publish data under various topics and then also watches for other systems to subscribe to said topics and handles the rest of the interchange. The protocol is lightweight enough that fairly low-powered (CPU- and literal electric-use-wise) devices can easily send their data chunks up to a broker, and the entire protocol is robust enough to support a plethora of connections and an equal plethora of types of data.

Why am I telling you all this?

Devices that publish to MQTT brokers tend to be in the spectrum of what folks sadly call the “Internet of Things”. It’s a terrible, ambiguous name, but it’s all over the media and most folks have some idea what it means. In the context of MQTT, you can think of it as, say, a single temperature sensor publishing its data to an MQTT broker so many other things — including programs written by humans to capture, log and analyze that data — can receive it. This is starting to sound like something that might be right up R’s alley.

There are also potential use-cases where an online data processing system might want to publish data to many clients without said clients having to poll a poor, single-threaded R server constantly.

Having MQTT connectivity for R could be really interesting.

And, now we have the beginnings of said connectivity with the mqtt package.

Another Package? Really?

Yes, really.

Besides the huge potential for having a direct R-bridge to the MQTT world, I’m work-interested in MQTT since we’ve found over 35,000 of them on the default, plaintext port for MQTT (1883) alone:

There are enough of them that I don’t even need to show a base map.

Some of these servers require authentication and others aren’t doing much of anything. But, there are a number of them hosted by corporations and individuals that are exposing real data. OwnTracks seems to be one of the more popular self-/badly-hosted ones.

Then, there are others — like test.mosquitto.org — which deliberately run open MQTT servers for “testing”. There definitely is testing going on there, but there are also real services using it as a production broker. The mqtt package is based on the mosquitto C library, so it’s only fitting that we show a few examples from its own test site here.

For now, there’s really one function: topic_subscribe(). Eventually, R will be able to publish to a broker and do more robust data collection operations (say, to make a live MQTT dashboard in Shiny). The topic_subscribe() function is an all-in one tool that enables you to:

  • connect to a broker
  • subscribe to a topic
  • pass in R callback functions which will be executed on connect, disconnect and when new messages come in

That’s plenty of functionality to do some fun things.

What's the frequency, er, temperature, Kenneth?

The mosquitto test server has one topic — /outbox/crouton-demo/temperature — which is a fake temperature sensor that just sends data periodically so you have something to test with. Let’s capture 50 samples and plot them.

Since we’re using a callback we have to use the tricksy <<- operator to store/update variables outside the callback function. And, we should pre-allocate space for said data to avoid needlessly growing objects. Here’s a complete code-block:

library(mqtt) # devtools::install_github("hrbrmstr/mqtt")
library(jsonlite)
library(hrbrthemes)
library(tidyverse)

i <- 0          # initialize our counter
max_recs <- 50  # max number of readings to get
readings <- vector("character", max_recs)

# our callback function
temp_cb <- function(id, topic, payload, qos, retain) {
  i <<- i + 1                                   # update the counter
  readings[i] <<- readBin(payload, "character") # update our larger-scoped vector
  return(if (i == max_recs) "quit" else "go")   # need to send at least "". "quit" == done
}

topic_subscribe(
  topic = "/outbox/crouton-demo/temperature",
  message_callback = temp_cb
)

# each reading looks like this:
# {"update": {"labels":[4631],"series":[[68]]}}

map(readings, fromJSON) %>%
  map(unlist) %>%
  map_df(as.list) %>%
  ggplot(aes(update.labels, update.series)) +
  geom_line() +
  geom_point() +
  labs(x="Reading", y="Temp (F)", title="Temperature via MQTT") +
  theme_ipsum_rc(grid="XY")

We setup temp_cb() to be our callback and topic_subscribe() ensures that the underlying mosquitto library will call it every time a new message is published to that topic. The chart really shows how synthetic the data is.

Subtitles from the Edge

Temperature sensors are just the sort of thing that MQTT was designed for. But, we don’t need to be stodgy about our use of MQTT.

Just about a year before this post, the BBC launched live subtitles for iPlayer. Residents of the Colonies may not know what iPlayer is, but it’s the “app” that lets UK citizens watch BBC programmes on glowing rectangles that aren’t proper tellys. Live subtitles are hard to produce well (and get right), and the BBC making the effort to do so on its digital platform as well is quite commendable. We U.S. folks will likely be charged $0.99 for each set of digital subtitles now that net neutrality is gone.

Now, some clever person(s) wired up some of these live subtitles to MQTT topics. We can wire up our own code in R to read them live:

bbc_callback <- function(id, topic, payload, qos, retain) {
  cat(crayon::green(readBin(payload, "character")), "\n", sep="")
  return("") # ctrl-c will terminate
}

mqtt::topic_subscribe(topic = "bbc/subtitles/bbc_news24/raw",
                      connection_callback = mqtt::mqtt_silent_connection_callback,
                      message_callback = bbc_callback)

In this case, control-c terminates things (cleanly).

You could easily modify the above code to have a bot that monitors for certain keywords then sends windowed chunks of subtitled text to some other system (Slack, database, etc). Or, create an online tidy text analysis workflow from them.
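As a sketch of that idea, building on the callback above (the keyword list and the downstream delivery are placeholders):

keywords <- c("weather", "election")   # hypothetical watch list

keyword_cb <- function(id, topic, payload, qos, retain) {
  txt <- readBin(payload, "character")
  if (any(sapply(keywords, grepl, x = txt, ignore.case = TRUE))) {
    ## here you could post to Slack, append to a database, etc.
    cat("MATCH:", txt, "\n")
  }
  return("")   # keep listening; ctrl-c to stop
}

mqtt::topic_subscribe(topic = "bbc/subtitles/bbc_news24/raw",
                      connection_callback = mqtt::mqtt_silent_connection_callback,
                      message_callback = keyword_cb)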

FIN

The code is on GitHub and all input/contributions are welcome and encouraged. Some necessary TBDs are authentication & encryption. But, how would you like the API to look for using it, say, in Shiny apps? What should publishing look like? What helper functions would be useful (ones to slice & dice topic names or another to convert raw message text more safely)? Should there be an R MQTT “DSL”? Lots of things to ponder and so many sites to “test”!

P.S.

In case you are concerned about the unusually boring R package name, I wanted to use RIoT (lower-cased, of course) but riot is, alas, already taken.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

Getting started with seplyr

Thu, 12/14/2017 - 18:36

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working with R and big data, I think the seplyr package can be a valuable tool.



For how and why, please check out our new introductory article.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Pipes in R Tutorial For Beginners

Thu, 12/14/2017 - 18:12

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

You might have already seen or used the pipe operator when you're working with packages such as dplyr, magrittr,… But do you know where pipes and the famous %>% operator come from, what they exactly are, or how, when and why you should use them? Can you also come up with some alternatives?

This tutorial will give you an introduction to pipes in R and will cover the following topics:

Are you interested in learning more about manipulating data in R with dplyr? Take a look at DataCamp's Data Manipulation in R with dplyr course.

Pipe Operator in R: Introduction

To understand what the pipe operator in R is and what you can do with it, it's necessary to consider the full picture, to learn the history behind it. Questions such as "where does this weird combination of symbols come from and why was it made like this?" might be on top of your mind. You'll discover the answers to these and more questions in this section.

Now, you can look at the history from three perspectives: from a mathematical point of view, from a holistic point of view of programming languages, and from the point of view of the R language itself. You'll cover all three in what follows!

History of the Pipe Operator in R Mathematical History

If you have two functions, let's say $f : B → C$ and $g : A → B$, you can chain these functions together by taking the output of one function and inserting it into the next. In short, "chaining" means that you pass an intermediate result onto the next function, but you'll see more about that later.

For example, you can say, $f(g(x))$: $g(x)$ serves as an input for $f()$, while $x$, of course, serves as input to $g()$.
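To make the notation concrete, here is a small base R illustration of the same idea, with sin standing in for $g$ and cos for $f$ (my own example, not part of the original tutorial):

g <- sin   # g : A -> B
f <- cos   # f : B -> C

x <- pi / 2
f(g(x))    # "f follows g": g is applied first, then f

h <- function(x) f(g(x))   # the composed function corresponds to f ◦ g
h(x)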

If you would want to note this down, you will use the notation $f ◦ g$, which reads as "f follows g". Alternatively, you can visually represent this as:

Image Credit: James Balamuta, "Piping Data"

Pipe Operators in Other Programming Languages

As mentioned in the introduction to this section, this operator is not new in programming: in the Shell or Terminal, you can pass the output of one command to the next with the pipeline character |. Similarly, F# has a forward pipe operator, which will prove to be important later on! Lastly, it's also good to know that Haskell contains many piping operations that are derived from the Shell or Terminal.

Pipes in R

Now that you have seen some history of the pipe operator in other programming languages, it's time to focus on R. The history of this operator in R starts, according to this fantastic blog post written by Adolfo Álvarez, on January 17th, 2012, when an anonymous user asked the following question in this Stack Overflow post:

How can you implement F#'s forward pipe operator in R? The operator makes it possible to easily chain a sequence of calculations. For example, when you have an input data and want to call functions foo and bar in sequence, you can write data |> foo |> bar?

The answer came from Ben Bolker, professor at McMaster University, who replied:

I don't know how well it would hold up to any real use, but this seems (?) to do what you want, at least for single-argument functions …

"%>%" <- function(x,f) do.call(f,list(x)) pi %>% sin [1] 1.224606e-16 pi %>% sin %>% cos [1] 1 cos(sin(pi)) [1] 1

About nine months later, Hadley Wickham started the dplyr package on GitHub. You might now know Hadley, Chief Scientist at RStudio, as the author of many popular R packages (such as this last package!) and as the instructor for DataCamp's Writing Functions in R course.

Be that as it may, it wasn't until 2013 that the first pipe %.% appeared in this package. As Adolfo Álvarez rightfully mentions in his blog post, the function was named chain(), and its purpose was to simplify the notation for applying several functions to a single data frame in R.

The %.% pipe would not be around for long, as Stefan Bache proposed an alternative on the 29th of December 2013, that included the operator as you might now know it:

iris %>% subset(Sepal.Length > 5) %>% aggregate(. ~ Species, ., mean)

Bache continued to work with this pipe operation and at the end of 2013, the magrittr package came to being. In the meantime, Hadley Wickham continued to work on dplyr and in April 2014, the %.% operator got replaced with the one that you now know, %>%.

Later that year, Kun Ren published the pipeR package on GitHub, which incorporated a different pipe operator, %>>%, which was designed to add more flexibility to the piping process. However, it's safe to say that the %>% is now established in the R language, especially with the recent popularity of the Tidyverse.

What Is It?

Knowing the history is one thing, but that still doesn't give you an idea of what F#'s forward pipe operator is nor what it actually does in R.

In F#, the pipe-forward operator |> is syntactic sugar for chained method calls. Or, stated more simply, it lets you pass an intermediate result onto the next function.

Remember that "chaining" means that you invoke multiple method calls. As each method returns an object, you can actually allow the calls to be chained together in a single statement, without needing variables to store the intermediate results.

In R, the pipe operator is, as you have already seen, %>%. If you're not familiar with F#, you can think of this operator as being similar to the + in a ggplot2 statement. Its function is very similar to that one that you have seen of the F# operator: it takes the output of one statement and makes it the input of the next statement. When describing it, you can think of it as a "THEN".

Take, for example, following code chunk and read it aloud:

iris %>% subset(Sepal.Length > 5) %>% aggregate(. ~ Species, ., mean)

You're right, the code chunk above will translate to something like "you take the Iris data, then you subset the data and then you aggregate the data".

This is one of the most powerful things about the Tidyverse. In fact, having a standardized chain of processing actions is called "a pipeline". Making pipelines for a data format is great, because you can apply that pipeline to incoming data that has the same formatting and have it output in a ggplot2 friendly format, for example.

Why Use It?

R is a functional language, which means that your code often contains a lot of parentheses, ( and ). When you have complex code, this often means that you will have to nest those parentheses together. This makes your R code hard to read and understand. Here's where %>% comes to the rescue!

Take a look at the following example, which is a typical example of nested code:

# Initialize `x`
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# Compute the logarithm of `x`, return suitably lagged and iterated differences,
# compute the exponential function and round the result
round(exp(diff(log(x))), 1)

[1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1

With the help of %>%, you can rewrite the above code as follows:

# Import `magrittr`
library(magrittr)

# Perform the same computations on `x` as above
x %>% log() %>% diff() %>% exp() %>% round(1)

Does this seem difficult to you? No worries! You'll learn more on how to go about this later on in this tutorial.

Note that you need to import the magrittr library to get the above code to work. That's because the pipe operator is, as you read above, part of the magrittr library and is, since 2014, also a part of dplyr. If you forget to import the library, you'll get an error like Error in eval(expr, envir, enclos): could not find function "%>%".

Also note that it isn't a formal requirement to add the parentheses after log, diff and exp, but that, within the R community, some will use it to increase the readability of the code.

In short, here are four reasons why you should be using pipes in R:

  • You'll structure the sequence of your data operations from left to right, as opposed to from the inside out;
  • You'll avoid nested function calls;
  • You'll minimize the need for local variables and function definitions; And
  • You'll make it easy to add steps anywhere in the sequence of operations.

These reasons are taken from the magrittr documentation itself. Implicitly, you see the arguments of readability and flexibility returning.

Additional Pipes

Even though %>% is the (main) pipe operator of the magrittr package, there are a couple of other operators that you should know and that are part of the same package:

  • The compound assignment operator %<>%;
# Initialize `x`
x <- rnorm(100)

# Update value of `x` and assign it to `x`
x %<>% abs %>% sort
  • The tee operator %T>%;
rnorm(200) %>% matrix(ncol = 2) %T>% plot %>% colSums

Note that it's good to know for now that the above code chunk is actually a shortcut for:

rnorm(200) %>% matrix(ncol = 2) %T>% { plot(.); . } %>% colSums

But you'll see more about that later on!

  • The exposition pipe operator %$%.
data.frame(z = rnorm(100)) %$% ts.plot(z)

Of course, these three operators work slightly differently than the main %>% operator. You'll see more about their functionalities and their usage later on in this tutorial!

Note that, even though you'll most often see the magrittr pipes, you might also encounter other pipes as you go along! Some examples are wrapr's dot arrow pipe %.>% or to dot pipe %>.%, or the Bizarro pipe ->.;.

How to Use Pipes in R

Now that you know how the %>% operator originated, what it actually is and why you should use it, it's time for you to discover how you can actually use it to your advantage. You will see that there are quite some ways in which you can use it!

Basic Piping

Before you go into the more advanced usages of the operator, it's good to first take a look at the most basic examples that use the operator. In essence, you'll see that there are 3 rules that you can follow when you're first starting out:

  • f(x) can be rewritten as x %>% f

In short, this means that functions that take one argument, function(argument), can be rewritten as follows: argument %>% function(). Take a look at the following, more practical example to understand how these two are equivalent:

# Compute the logarithm of `x`
log(x)

# Compute the logarithm of `x`
x %>% log()
  • f(x, y) can be rewritten as x %>% f(y)

Of course, there are a lot of functions that don't just take one argument, but multiple. This is the case here: you see that the function takes two arguments, x and y. Similar to what you have seen in the first example, you can rewrite the function following the structure argument1 %>% function(argument2), where argument1 becomes the first argument of the function call and argument2 is passed as the second.

This all seems quite theoretical. Let's take a look at a more practical example:

# Round pi
round(pi, 6)

# Round pi
pi %>% round(6)
  • x %>% f %>% g %>% h can be rewritten as h(g(f(x)))

This might seem complex, but it isn't quite like that when you look at a real-life R example:

# Import `babynames` data
library(babynames)

# Import `dplyr` library
library(dplyr)

# Load the data
data(babynames)

# Count how many young boys with the name "Taylor" are born
sum(select(filter(babynames, sex == "M", name == "Taylor"), n))

# Do the same but now with `%>%`
babynames %>%
  filter(sex == "M", name == "Taylor") %>%
  select(n) %>%
  sum

Note how you work from the inside out when you rewrite the nested code: you first put in the babynames, then you use %>% to first filter() the data. After that, you'll select n and lastly, you'll sum() everything.

Remember also that you already saw another example of such a nested code that was converted to more readable code in the beginning of this tutorial, where you used the log(), diff(), exp() and round() functions to perform calculations on x.

Functions that Use the Current Environment

Unfortunately, there are some exceptions to the more general rules that were outlined in the previous section. Let's take a look at some of them here.

Consider this example, where you use the assign() function to assign the value 10 to the variable x.

# Assign `10` to `x`
assign("x", 10)

# Assign `100` to `x`
"x" %>% assign(100)

# Return `x`
x

10

You see that the second call with the assign() function, in combination with the pipe, doesn't work properly. The value of x is not updated.

Why is this?

That's because the function assigns the new value 100 to a temporary environment used by %>%. So, if you want to use assign() with the pipe, you must be explicit about the environment:

# Define your environment
env <- environment()

# Add the environment to `assign()`
"x" %>% assign(100, envir = env)

# Return `x`
x

100

Functions with Lazy Evaluation

Arguments within functions are only computed when the function uses them in R. This means that no arguments are computed before you call your function! It also means that the pipe computes each element of the pipeline in turn.

One place that this is a problem is tryCatch(), which lets you capture and handle errors, like in this example:

tryCatch(stop("!"), error = function(e) "An error")

stop("!") %>% tryCatch(error = function(e) "An error")

'An error'

Error in eval(expr, envir, enclos): !
Traceback:
1. stop("!") %>% tryCatch(error = function(e) "An error")
2. eval(lhs, parent, parent)
3. eval(expr, envir, enclos)
4. stop("!")

You'll see that the nested way of writing down this line of code works perfectly, while the piped alternative returns an error. Other functions with the same behavior are try(), suppressMessages(), and suppressWarnings() in base R.

Argument Placeholder

There are also instances where you can use the pipe operator as an argument placeholder. Take a look at the following examples:

  • f(x, y) can be rewritten as y %>% f(x, .)

In some cases, you won't want the value or the magrittr placeholder to appear as the first argument of the function call, which has been the case in every example that you have seen up until now. Reconsider this line of code:

pi %>% round(6)

If you would rewrite this line of code, pi would be the first argument in your round() function. But what if you would want to replace the second, third, … argument and use that one as the magrittr placeholder to your function call?

Take a look at this example, where the value is actually at the third position in the function call:

"Ceci n'est pas une pipe" %>% gsub("une", "un", .)

'Ceci n\'est pas un pipe'

  • f(y, z = x) can be rewritten as x %>% f(y, z = .)

Likewise, you might want to make the value of a specific argument within your function call the magrittr placeholder. Consider the following line of code:

6 %>% round(pi, digits = .)

Re-using the Placeholder for Attributes

It is straightforward to use the placeholder several times in a right-hand side expression. However, when the placeholder only appears in a nested expression, magrittr will still apply the first-argument rule. The reason is that, in most cases, this results in cleaner code.

Here are some general "rules" that you can take into account when you're working with argument placeholders in nested function calls:

  • f(x, y = nrow(x), z = ncol(x)) can be rewritten as x %>% f(y = nrow(.), z = ncol(.))
# Initialize a matrix `ma`
ma <- matrix(1:12, 3, 4)

# Return the maximum of the values inputted
max(ma, nrow(ma), ncol(ma))

# Return the maximum of the values inputted
ma %>% max(nrow(ma), ncol(ma))

12

12

The behavior can be overruled by enclosing the right-hand side in braces:

  • f(y = nrow(x), z = ncol(x)) can be rewritten as x %>% {f(y = nrow(.), z = ncol(.))}
# Only return the maximum of the `nrow(ma)` and `ncol(ma)` input values
ma %>% {max(nrow(ma), ncol(ma))}

4

To conclude, also take a look at the following example, where you could possibly want to adjust the workings of the argument placeholder in the nested function call:

# The function that you want to rewrite
paste(1:5, letters[1:5])

# The nested function call with dot placeholder
1:5 %>% paste(., letters[.])
  1. '1 a'
  2. '2 b'
  3. '3 c'
  4. '4 d'
  5. '5 e'
  1. '1 a'
  2. '2 b'
  3. '3 c'
  4. '4 d'
  5. '5 e'

You see that if the placeholder is only used in a nested function call, the magrittr placeholder will also be placed as the first argument! If you want to avoid this from happening, you can use the curly brackets { and }:

# The nested function call with dot placeholder and curly brackets
1:5 %>% { paste(letters[.]) }

# Rewrite the above function call
paste(letters[1:5])
  1. 'a'
  2. 'b'
  3. 'c'
  4. 'd'
  5. 'e'
  1. 'a'
  2. 'b'
  3. 'c'
  4. 'd'
  5. 'e'
Building Unary Functions

Unary functions are functions that take one argument. Any pipeline that consists of a dot (.) followed by functions chained together with %>% can be stored and applied to values later. Take a look at the following example of such a pipeline:

. %>% cos %>% sin

This pipeline would take some input, after which both the cos() and sin() functions would be applied to it.

But you're not there yet! If you want this pipeline to do exactly that which you have just read, you need to assign it first to a variable f, for example. After that, you can re-use it later to do the operations that are contained within the pipeline on other values.

# Unary function
f <- . %>% cos %>% sin

f
structure(function (value) freduce(value, `_function_list`),
class = c("fseq", "function"))

Remember also that you could put parentheses after the cos() and sin() functions in the line of code if you want to improve readability. Consider the same example with parentheses: . %>% cos() %>% sin().

You see, building functions in magrittr is very similar to building functions with base R! If you're not sure how similar they actually are, check out the line above and compare it with the next line of code; both lines have the same result!

# is equivalent to
f <- function(.) sin(cos(.))

f
function (.) sin(cos(.))

Compound Assignment Pipe Operations

There are situations where you want to overwrite the value of the left-hand side, just like in the example right below. Intuitively, you will use the assignment operator <- to do this.

# Load in the Iris data
iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE)

# Add column names to the Iris data
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

# Compute the square root of `iris$Sepal.Length` and assign it to the variable
iris$Sepal.Length <- iris$Sepal.Length %>% sqrt()

However, there is a compound assignment pipe operator, which allows you to use a shorthand notation to assign the result of your pipeline immediately to the left-hand side:

# Compute the square root of `iris$Sepal.Length` and assign it to the variable
iris$Sepal.Length %<>% sqrt

# Return `Sepal.Length`
iris$Sepal.Length

Note that the compound assignment operator %<>% needs to be the first pipe operator in the chain for this to work. This is completely in line with what you just read about the operator being a shorthand notation for a longer notation with repetition, where you use the regular <- assignment operator.

As a result, this operator will assign a result of a pipeline rather than returning it.

Tee Operations with The Tee Operator

The tee operator works exactly like %>%, but it returns the left-hand side value rather than the potential result of the right-hand side operations.

This means that the tee operator can come in handy in situations where you have included functions that are used for their side effect, such as plotting with plot() or printing to a file.

In other words, functions like plot() typically don't return anything. That means that, after calling plot(), for example, your pipeline would end. However, in the following example, the tee operator %T>% allows you to continue your pipeline even after you have used plot():

set.seed(123)
rnorm(200) %>% matrix(ncol = 2) %T>% plot %>% colSums

Exposing Data Variables with the Exposition Operator

When you're working with R, you'll find that many functions take a data argument. Consider, for example, the lm() function or the with() function. These functions are useful in a pipeline where your data is first processed and then passed into the function.

For functions that don't have a data argument, such as the cor() function, it's still handy if you can expose the variables in the data. That's where the %$% operator comes in. Consider the following example:

iris %>%
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)

0.336696922252551

With the help of %$% you make sure that Sepal.Length and Sepal.Width are exposed to cor(). Likewise, you see that the data in the data.frame() function is passed to the ts.plot() to plot several time series on a common plot:

data.frame(z = rnorm(100)) %$% ts.plot(z)

dplyr and magrittr

In the introduction to this tutorial, you already learned that the development of dplyr and magrittr occurred around the same time, namely, around 2013-2014. And, as you have read, the magrittr package is also part of the Tidyverse.

In this section, you will discover how exciting it can be when you combine both packages in your R code.

For those of you who are new to the dplyr package, you should know that this R package was built around five verbs, namely, "select", "filter", "arrange", "mutate" and "summarize". If you have already manipulated data for some data science project, you will know that these verbs make up the majority of the data manipulation tasks that you generally need to perform on your data.

Take an example of some traditional code that makes use of these dplyr functions:

class="lang-R">library(hflights) grouped_flights <- group_by(hflights, Year, Month, DayofMonth) flights_data <- select(grouped_flights, Year:DayofMonth, ArrDelay, DepDelay) summarized_flights <- summarise(flights_data, arr = mean(ArrDelay, na.rm = TRUE), dep = mean(DepDelay, na.rm = TRUE)) final_result <- filter(summarized_flights, arr > 30 | dep > 30) final_result Year Month DayofMonth arr dep 2011 2 4 44.08088 47.17216 2011 3 3 35.12898 38.20064 2011 3 14 46.63830 36.13657 2011 4 4 38.71651 27.94915 2011 4 25 37.79845 22.25574 2011 5 12 69.52046 64.52039 2011 5 20 37.02857 26.55090 2011 6 22 65.51852 62.30979 2011 7 29 29.55755 31.86944 2011 9 29 39.19649 32.49528 2011 10 9 61.90172 59.52586 2011 11 15 43.68134 39.23333 2011 12 29 26.30096 30.78855 2011 12 31 46.48465 54.17137

When you look at this example, you immediately understand why dplyr and magrittr are able to work so well together:

class="lang-R">hflights %>% group_by(Year, Month, DayofMonth) %>% select(Year:DayofMonth, ArrDelay, DepDelay) %>% summarise(arr = mean(ArrDelay, na.rm = TRUE), dep = mean(DepDelay, na.rm = TRUE)) %>% filter(arr > 30 | dep > 30)

Both code chunks are fairly long, but you could argue that the second one is clearer if you want to follow along through all of the operations. With the creation of intermediate variables in the first code chunk, you can easily lose the "flow" of the code. By using %>%, you get a clearer overview of the operations that are being performed on the data!

In short, dplyr and magrittr are your dreamteam for manipulating data in R!

RStudio Keyboard Shortcuts for Pipes

Adding all these pipes to your R code can be a challenging task! To make your life easier, John Mount, co-founder and Principal Consultant at Win-Vector, LLC and DataCamp instructor, has released a package with some RStudio add-ins that let you create keyboard shortcuts for pipes in R. Addins are R functions with a bit of special registration metadata: a simple addin might just insert a commonly used snippet of text, but addins can also get very complex!

With these addins, you'll be able to execute R functions interactively from within the RStudio IDE, either by using keyboard shortcuts or by going through the Addins menu.

Note that this package is actually a fork of RStudio's original add-in package, which you can find here. Be careful though: support for addins is available only in the most recent release of RStudio! If you want to know more about how to install these RStudio addins, check out this page.

You can download the add-ins and keyboard shortcuts here.

When Not To Use the Pipe Operator in R

In the above, you have seen that pipes are definitely something you should be using when you're programming with R. More specifically, you have seen this by covering some cases in which pipes prove to be very useful! However, there are some situations, outlined by Hadley Wickham in "R for Data Science", in which it's best to avoid them:

  • Your pipes are longer than (say) ten steps.

In cases like these, it's better to create intermediate objects with meaningful names. It will not only be easier for you to debug your code, but you'll also understand your code better and it'll be easier for others to understand your code.

  • You have multiple inputs or outputs.

If you aren't transforming one primary object, but two or more objects are combined together, it's better not to use the pipe.

  • You are starting to think about a directed graph with a complex dependency structure.

Pipes are fundamentally linear and expressing complex relationships with them will only result in complex code that will be hard to read and understand.

  • You're doing internal package development.

Using pipes in internal package development is a no-go, as it makes it harder to debug!

For more reflections on this topic, check out this Stack Overflow discussion. Other situations that appear in that discussion are loops, package dependencies, argument order and readability.

In short, you could summarize it all as follows: keep in mind the two things that make this construct so great, namely readability and flexibility. As soon as one of these two big advantages is compromised, you might want to consider alternatives to pipes.

Alternatives to Pipes in R

After all that you have read by now, you might also be interested in some alternatives that exist in the R programming language. Some of the solutions that you have seen in this tutorial were the following:

  • Create intermediate variables with meaningful names;

Instead of chaining all operations together and outputting one single result, break up the chain and make sure you save intermediate results in separate variables. Be careful with the naming of these variables: the goal should always be to make your code as understandable as possible!

  • Nest your code so that you read it from the inside out;

One of the possible objections that you could have against pipes is that they go against the "flow" you have been accustomed to with base R. The solution is then to stick with nesting your code! But what if you don't like pipes and you also find nesting quite confusing? The solution here can be to use indentation to highlight the hierarchy (see the sketch after this list).

  • … Do you have more suggestions? Make sure to let me know – Drop me a tweet @willems_karlijn
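To illustrate the nesting-plus-indentation alternative mentioned above, here is the hflights summary from earlier rewritten in fully nested form; it computes exactly the same result as the piped version, with indentation marking the call hierarchy:

filter(
  summarise(
    select(
      group_by(hflights, Year, Month, DayofMonth),
      Year:DayofMonth, ArrDelay, DepDelay
    ),
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)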
Conclusion

You have covered a lot of ground in this tutorial: you have seen where %>% comes from, what it exactly is, why you should use it and how you should use it. You've seen that the dplyr and magrittr packages work wonderfully together and that there are even more operators out there! Lastly, you have also seen some cases in which you shouldn't use it when you're programming in R and what alternatives you can use in such cases.

If you're interested in learning more about the Tidyverse, consider DataCamp's Introduction to the Tidyverse course.



An introduction to seplyr

Thu, 12/14/2017 - 18:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by John Mount, Win-Vector LLC

seplyr is an R package that supplies improved standard evaluation interfaces for many common data wrangling tasks.

The core of seplyr is a re-skinning of dplyr's functionality to seplyr conventions (similar to how stringr re-skins the implementing package stringi).

Standard Evaluation and Non-Standard Evaluation

"Standard evaluation" is the name we are using for the value oriented calling convention found in many programming languages. The idea is: functions are only allowed to look at the values of their arguments and not how those values arise (i.e., they can not look at source code or variable names). This evaluation principle allows one to transform, optimize, and reason about code.

It is what lets us say the following two snippets of code are equivalent.

  • x <- 4; sqrt(x)
  • x <- 4; sqrt(4)

The mantra is:

"variables can be replaced with their values."

This property is called referential transparency.

"Non-standard evaluation" is the name used for code that more aggressively inspects its environment. It is often used for harmless tasks such as conveniently setting axis labels on plots. For example, notice the following two plots have different y-axis labels (despite plotting identical values).

plot(x = 1:3)

plot(x = c(1,2,3))
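The different y-axis labels arise because plot() captures the expression you typed, not just its value. A minimal sketch of the same mechanism in a made-up helper function (show_label() is hypothetical, for illustration only):

show_label <- function(x) {
  # substitute(x) captures the unevaluated expression passed to the function;
  # deparse() turns that expression back into a string
  deparse(substitute(x))
}

show_label(1:3)         # returns "1:3"
show_label(c(1, 2, 3))  # returns "c(1, 2, 3)"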

dplyr and seplyr

The dplyr authors appear to strongly prefer a non-standard evaluation interface. Many in the dplyr community have come to think a package such as dplyr requires a non-standard interface. seplyr started as an experiment to show this is not actually the case.

Syntactically the packages are deliberately similar.

We can take a dplyr pipeline:

suppressPackageStartupMessages(library("dplyr"))

starwars %>%
  select(name, height, mass) %>%
  arrange(desc(height)) %>%
  head()

## # A tibble: 6 x 3
##           name height  mass
##
## 1  Yarael Poof    264    NA
## 2      Tarfful    234   136
## 3      Lama Su    229    88
## 4    Chewbacca    228   112
## 5 Roos Tarpals    224    82
## 6     Grievous    216   159

And re-write it in seplyr notation:

library("seplyr") starwars %.>% select_se(., c("name", "height", "mass")) %.>% arrange_se(., "desc(height)") %.>% head(.) ## # A tibble: 6 x 3 ## name height mass ## ## 1 Yarael Poof 264 NA ## 2 Tarfful 234 136 ## 3 Lama Su 229 88 ## 4 Chewbacca 228 112 ## 5 Roos Tarpals 224 82 ## 6 Grievous 216 159

For the common dplyr-verbs (excluding mutate(), which we will discuss next) all the non-standard evaluation is saving us is a few quote marks and array designations (and we have ways of getting rid of the need for quote marks). In exchange for this small benefit the non-standard evaluation is needlessly hard to program over. For instance in the seplyr pipeline it is easy to accept the list of columns from an outside source as a simple array of names.
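For example, the column set can live in an ordinary character vector supplied from elsewhere (a configuration file, a function argument) and be passed straight to the standard-evaluation verb; a small sketch reusing the pipeline above (cols_to_keep is a made-up name):

cols_to_keep <- c("name", "height", "mass")  # could come from a config file or a function argument

starwars %.>%
  select_se(., cols_to_keep) %.>%
  head(.)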

Until you introduce a substitution system such as rlang or wrapr::let() (which we recommend over rlang, and which pre-dates the public release of rlang), you have some difficulty writing re-usable programs that use the dplyr verbs over "to be specified later" column names.

We are presumably not the only ones who considered this a limitation:

seplyr is an attempt to make programming a primary concern by making the value-oriented (standard) interfaces the primary interfaces.

mutate()

The earlier "standard evaluation costs just a few quotes" becomes a bit strained when we talk about the dplyr::mutate() operator. It doesn't seem worth the effort unless you get something more in return. In seplyr 0.5.0 we introduced "the something more": planning over and optimizing dplyr::mutate() sequences.

A seplyr mutate looks like the following:

select_se(., c("name", "height", "mass")) %.>% mutate_se(., c( "height" := "height + 1", "mass" := "mass + 1", "height" := "height + 2", "mass" := "mass + 2", "height" := "height + 3", "mass" := "mass + 3" )) %.>% arrange_se(., "name") %.>% head(.) ## # A tibble: 6 x 3 ## name height mass ## ## 1 Ackbar 186 89 ## 2 Adi Gallia 190 56 ## 3 Anakin Skywalker 194 90 ## 4 Arvel Crynyd NA NA ## 5 Ayla Secura 184 61 ## 6 Bail Prestor Organa 197 NA

seplyr::mutate_se() always uses ":=" to denote assignment (dplyr::mutate() prefers "=" for assignment, except in cases where ":=" is required).

The advantage is: once we go to the trouble to capture the mutate expressions we can treat them as data and apply procedures to them. For example, we can re-group and optimize the mutate assignments.

plan <- partition_mutate_se(
  c("name" := "tolower(name)",
    "height" := "height + 0.5",
    "height" := "floor(height)",
    "mass" := "mass + 0.5",
    "mass" := "floor(mass)"))

print(plan)

## $group00001
##            name          height           mass
## "tolower(name)"  "height + 0.5"   "mass + 0.5"
##
## $group00002
##          height            mass
## "floor(height)"  "floor(mass)"

Notice seplyr::partition_mutate_se() re-ordered and re-grouped the assignments so that:

  • In each group each value used is independent of values produced in other assignments.
  • All dependencies between assignments are respected by the group order.

The "safe block" assignments can then be used in a pipeline:

starwars %.>%
  select_se(., c("name", "height", "mass")) %.>%
  mutate_seb(., plan) %.>%
  arrange_se(., "name") %.>%
  head(.)

## # A tibble: 6 x 3
##                  name height  mass
##
## 1              ackbar    180    83
## 2          adi gallia    184    50
## 3    anakin skywalker    188    84
## 4        arvel crynyd     NA    NA
## 5         ayla secura    178    55
## 6 bail prestor organa    191    NA

This may not seem like much. However, when using dplyr with a SQL database (such as PostgreSQL or even Sparklyr) keeping the number of dependencies in a block low is critical for correct calculation (which is why I recommend keeping dependencies low). Furthermore, on Sparklyr sequences of mutates are simulated by nesting of SQL statements, so you must also keep the number of mutates at a moderate level (i.e., you want a minimal number of blocks or groups).

Machine Generated Code

Because we are representing mutate assignments as user manipulable data we can also enjoy the benefit of machine generated code. seplyr 0.5.* uses this opportunity to introduce a simple function named if_else_device(). This device uses R's ifelse() statement (which conditionally chooses values in a vectorized form) to implement a more powerful block-if/else statement (which conditionally simultaneously controls blocks of values and assignments; SAS has such a feature).

For example: suppose we want to NA-out one of height or mass for each row of the starwars data uniformly at random. This can be written naturally using the if_else_device.

if_else_device(
  testexpr = "runif(n())>=0.5",
  thenexprs = "height" := "NA",
  elseexprs = "mass" := "NA")

## ifebtest_30etsitqqutk
## "runif(n())>=0.5"
## height
## "ifelse( ifebtest_30etsitqqutk, NA, height)"
## mass
## "ifelse( !( ifebtest_30etsitqqutk ), NA, mass)"

Notice the if_else_device translates the user code into a sequence of dplyr::mutate() expressions (using only the weaker operator ifelse()). Obviously the user could perform this translation, but if_else_device automates the record keeping and can even be nested. Also many such steps can be chained together and broken into a minimal sequence of blocks by partition_mutate_se() (not forcing a new dplyr::mutate() step for each if-block encountered).

When we combine the device with the partitioner we get performant, database-safe code where the number of blocks is only the level of variable dependence (and not the possibly much larger number of initial value uses that a straightforward non-reordering split would give; note: seplyr::mutate_se() 0.5.1 and later incorporate partition_mutate_se() in mutate_se()).

starwars %.>%
  select_se(., c("name", "height", "mass")) %.>%
  mutate_se(., if_else_device(
    testexpr = "runif(n())>=0.5",
    thenexprs = "height" := "NA",
    elseexprs = "mass" := "NA")) %.>%
  arrange_se(., "name") %.>%
  head(.)

## # A tibble: 6 x 4
##                  name height  mass ifebtest_wwr9k0bq4v04
##
## 1              Ackbar     NA    83                  TRUE
## 2          Adi Gallia    184    NA                 FALSE
## 3    Anakin Skywalker     NA    84                  TRUE
## 4        Arvel Crynyd     NA    NA                  TRUE
## 5         Ayla Secura    178    NA                 FALSE
## 6 Bail Prestor Organa    191    NA                 FALSE

Conclusion

The value oriented notation is a bit clunkier, but this is offset by its greater flexibility in terms of composition and working parametrically.

Our group has been using seplyr::if_else_device() and seplyr::partition_mutate_se() to greatly simplify porting powerful SAS procedures to R/Sparklyr/Apache Spark clusters.

This is new code, but we are striving to supply sufficient initial documentation and examples.



Comparing smooths in factor-smooth interactions II

Thu, 12/14/2017 - 17:00

(This article was first published on From the Bottom of the Heap - R, and kindly contributed to R-bloggers)

In a previous post I looked at an approach for computing the differences between smooths estimated as part of a factor-smooth interaction using s()’s by argument. When a common-or-garden factor variable is passed to by, gam() estimates a separate smooth for each level of the by factor. Using the Xp matrix approach, we previously saw that we can post-process the model to generate estimates for pairwise differences of smooths. However, the by variable approach of estimating a separate smooth for each level of the factor may be quite inefficient in terms of degrees of freedom used by the model. This is especially so in situations where the estimated curves are quite similar but wiggly; why estimate many separate wiggly smooths when one, plus some simple difference smooths, will do the job just as well? In this post I look at an alternative to estimating separate smooths, using an ordered factor for the by variable.

When an ordered factor is passed to by, mgcv does something quite different to the model I described previously, although the end results should be similar. What mgcv does in the ordered factor case is to fit L - 1 difference smooths, where l = 1, ..., L indexes the levels of the factor and L is the number of levels. These smooths model the difference between the smooth estimated for the reference level and the lth level of the factor. Additionally, the by variable smooth doesn’t itself estimate the smoother for the reference level; so we are required to add a second smooth to the model that estimates that particular smooth.

In pseudo code, for an ordered factor of, our model would be something like this:

model <- gam(y ~ of + s(x) + s(x, by = of), data = df)

As with any by factor smooth we are required to include a parametric term for the factor because the individual smooths are centered for identifiability reasons. The first s(x) in the model is the smooth effect of x on the reference level of the ordered factor of. The second smoother, s(x, by = of), is the set of L - 1 difference smooths, which model the smooth differences between the reference level smoother and those of the individual levels (excluding the reference one).

Note that this model still estimates a separate smoother for each level of the ordered factor, it just does it in a different way. The smoother for the reference level is estimated via contribution from s(x) only, whilst the smoothers for the other levels are formed from the additive combination of s(x) and the relevant difference smoother from the set created by s(x, by = of). This is analogous to the situation we have when estimating an ANOVA using the default contrasts and lm(); the intercept is then an estimate of the mean response for the reference level of the factor, and the remaining model coefficients estimate the differences between the mean response of the reference level and that of the other factor levels.
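To make the lm() analogy concrete, here is a small self-contained sketch (toy data, nothing to do with the loch records): with the default treatment contrasts, the intercept estimates the mean of the reference level and the remaining coefficient estimates the difference from it.

set.seed(42)
toy <- data.frame(group = factor(rep(c("ref", "other"), each = 10),
                                 levels = c("ref", "other")),
                  y = rnorm(20))

coef(lm(y ~ group, data = toy))
# (Intercept): estimated mean response for the "ref" group
# groupother:  estimated difference between "other" and "ref"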

This ordered-factor-smooth interaction is most directly applicable to situations where you have a reference category and you are interested in difference between that category and the other levels. If you are interested in pair-wise comparison of smooths you could use the ordered factor approach — it may be more parsimonious than estimating separate smoothers for each level — but you will still need to post-process the results in a manner similar to that described in the previous post1.

To illustrate the ordered factor difference smooths, I’ll reuse the example from the Geochimica paper I wrote with my colleagues at UCL, Neil Rose, Handong Yang, and Simon Turner (Rose et al., 2012), and which formed the basis for the previous post.

Neil, Handong, and Simon had collected sediment cores from several Scottish lochs and measured metal concentrations, especially of lead (Pb) and mercury (Hg), in sediment slices covering the last 200 years. The aim of the study was to investigate sediment profiles of these metals in three regions of Scotland; north east, north west, and south west. A pair of lochs in each region was selected, one in a catchment with visibly eroding peat/soil, and the other in a catchment without erosion. The different regions represented variations in historical deposition levels, whilst the hypothesis was that cores from eroded and non-eroded catchments would show differential responses to reductions in emissions of Pb and Hg to the atmosphere. The difference, it was hypothesized, was that the eroding soil acts as a secondary source of pollutants to the lake. You can read more about it in the paper — if you’re interested but don’t have access to the journal, send me an email and I’ll pass on a pdf.

Below I make use of the following packages

  • readr
  • dplyr
  • ggplot2, and
  • mgcv

You’ll more than likely have these installed, but if you get errors about missing packages when you run the code chunk below, install any missing packages and run the chunk again

library('readr')
library('dplyr')
library('ggplot2')
theme_set(theme_bw())
library('mgcv')

Next, load the data set and convert the SiteCode variable to a factor

uri <- 'https://gist.githubusercontent.com/gavinsimpson/eb4ff24fa9924a588e6ee60dfae8746f/raw/geochimica-metals.csv'
metals <- read_csv(uri, skip = 1, col_types = c('ciccd'))
metals <- mutate(metals, SiteCode = factor(SiteCode))

This is a subset of the data used in Rose et al. (2012) — the Hg concentrations in the sediments for just three of the lochs are included here in the interests of simplicity. The data set contains 5 variables

metals

# A tibble: 44 x 5
   SiteCode  Date SoilType Region        Hg
 1     CHNA  2000     thin     NW  3.843399
 2     CHNA  1990     thin     NW  5.424618
 3     CHNA  1980     thin     NW  8.819730
 4     CHNA  1970     thin     NW 11.417457
 5     CHNA  1960     thin     NW 16.513540
 6     CHNA  1950     thin     NW 16.512047
 7     CHNA  1940     thin     NW 11.188840
 8     CHNA  1930     thin     NW 11.622222
 9     CHNA  1920     thin     NW 13.645853
10     CHNA  1910     thin     NW 11.181711
# ... with 34 more rows
  • SiteCode is a factor indexing the three lochs, with levels CHNA, FION, and NODH,
  • Date is a numeric variable of sediment age per sample,
  • SoilType and Region are additional factors for the (natural) experimental design, and
  • Hg is the response variable of interest, and contains the Hg concentration of each sediment sample.

Neil gave me permission to make these data available openly should you want to try this approach out for yourself. If you make use of the data for other purposes, please cite the source publication (Rose et al., 2012) and recognize the contribution of the data creators; Handong Yang, Simon Turner, and Neil Rose.

To proceed, we need to create an ordered factor. Here I’m going to use the SoilType variable as that is easier to relate to conditions of the soil (rather than the Site Code I used in the previous post). I set the non-eroded level to be the reference and as such the GAM will estimate a full smooth for that level and then smooth differences between the non-eroded, and each of the eroded and thin lakes.

metals <- mutate(metals, oSoilType = ordered(SoilType, levels = c('non-eroded','eroded','thin')))

The ordered-factor GAM is fitted to the three lochs using the following

m <- gam(Hg ~ oSoilType + s(Date) + s(Date, by = oSoilType), data = metals, method = 'REML')

and the resulting smooths can be drawn using the plot() method

plot(m, shade = TRUE, pages = 1, scale = 0, seWithMean = TRUE)

Figure: Estimated smooth trend for the non-eroded site (top, left), and difference smooths reflecting estimated differences between the non-eroded site and the eroded site (top, right) and thin soil site (bottom, left), respectively.

The smooth in the top left is the reference smooth trend for the non-eroded site. The other two smooths are the difference smooths between the non-eroded site and the eroded site (top right), and between the non-eroded site and the thin soil site (bottom left).

It is immediately clear that the difference between the non-eroded and eroded sites is not significant under this model. The estimated difference is linear, which suggests the trend in the eroded site is stronger than the one estimated for the non-eroded site. However, this difference is not so large as to be an identifiably different trend.

The difference smooth for the thin soil site is considerably different to that estimated for the non-eroded site; the principal difference being the much reduced trend in the thin soil site, as indicated by the difference smooth acting in opposition to the estimated trend for the non-eroded site.

A nice feature of the ordered factor approach is that inference on these difference can be performed formally and directly using the summary() output of the estimated GAM

summary(m)

Family: gaussian
Link function: identity

Formula:
Hg ~ oSoilType + s(Date) + s(Date, by = oSoilType)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  13.2231     0.6789  19.478  < 2e-16 ***
oSoilType.L  -1.6948     1.1608  -1.460  0.15399
oSoilType.Q  -4.2847     1.1990  -3.573  0.00114 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                          edf Ref.df      F  p-value
s(Date)                 4.843  5.914 10.862 2.67e-07 ***
s(Date):oSoilTypeeroded 1.000  1.000  0.471    0.498
s(Date):oSoilTypethin   3.047  3.779 10.091 1.84e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.76   Deviance explained = 82.1%
-REML = 126.5  Scale est. = 20.144    n = 44

The impression we formed about the differences in trends is reinforced with actual test statistics; this is a clear advantage of the ordered-factor approach if your problem suits this different-from-reference situation.

One feature to note: because we used an ordered factor, the parametric term for oSoilType uses polynomial contrasts; the .L and .Q refer to the linear and quadratic terms used to represent the factor. This makes it harder to read off differences in mean Hg concentration directly. If you want to retain that readily interpreted parameterisation, use the SoilType factor for the parametric part:

m <- gam(Hg ~ SoilType + s(Date) + s(Date, by = oSoilType),
         data = metals, method = 'REML')

summary(m)

Family: gaussian
Link function: identity

Formula:
Hg ~ SoilType + s(Date) + s(Date, by = oSoilType)

Parametric coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)          16.722      1.213  13.788 4.88e-15 ***
SoilTypenon-eroded   -4.049      1.684  -2.405 0.022115 *
SoilTypethin         -6.446      1.681  -3.835 0.000553 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                          edf Ref.df      F  p-value
s(Date)                 4.843  5.914 10.862 2.67e-07 ***
s(Date):oSoilTypeeroded 1.000  1.000  0.471    0.498
s(Date):oSoilTypethin   3.047  3.779 10.091 1.84e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.76   Deviance explained = 82.1%
-REML = 125.95  Scale est. = 20.144    n = 44

Now the output in the parametric terms section is easier to interpret yet we retain the behavior of the reference smooth plus difference smooths part of the fitted GAM.
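For completeness, the polynomial contrast coding behind those .L and .Q terms can be inspected directly; a quick sketch with a generic three-level ordered factor (not the metals data):

of <- ordered(c("low", "mid", "high"), levels = c("low", "mid", "high"))
contrasts(of)
# the .L and .Q columns hold the linear and quadratic contrast coefficients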

References

Rose, N. L., Yang, H., Turner, S. D., and Simpson, G. L. (2012). An assessment of the mechanisms for the transfer of lead and mercury from atmospherically contaminated organic soils to lake sediments with particular reference to Scotland, UK. Geochimica et Cosmochimica Acta 82, 113–135. doi:10.1016/j.gca.2010.12.026.

  1. Except now you need to be sure to include the right set of basis functions that correspond to the pair of levels you want to compare. You can’t do that with the function I included in that post; it requires something a bit more sophisticated, but the principles are the same.



archivist: Boost the reproducibility of your research

Thu, 12/14/2017 - 08:05

(This article was first published on SmarterPoland.pl » English, and kindly contributed to R-bloggers)

A few days ago the Journal of Statistical Software published our article (in collaboration with Marcin Kosiński) archivist: An R Package for Managing, Recording and Restoring Data Analysis Results.

Why should you care? Let’s see.

Starter


Would you want to retrieve a ggplot2 object with the plot on the right?
Just call the following line in your R console.

archivist::aread('pbiecek/Eseje/arepo/65e430c4180e97a704249a56be4a7b88')

Want to check versions of packages loaded when the plot was created?
Just call

archivist::asession('pbiecek/Eseje/arepo/65e430c4180e97a704249a56be4a7b88')

Wishful Thinking?

When people talk about reproducibility, they usually focus on tools like packrat, MRAN, docker or RSuite. These are great tools that help to manage the execution environment in which analyses are performed. The common belief is that if one is able to replicate the execution environment then the same R source code will produce the same results.

And it’s true in most cases, maybe even more than 99% of cases. Except that there are things in the environment that are hard to control or easy to miss. Things like external system libraries or dedicated hardware or user input. No matter what you will copy, you will never know if it was enough to recreate exactly same results in the future. So you can hope that results will be replicated, but do not bet too high.
Even if some result will pop up eventually, how can you check if it’s the same result as previously?

Literate programming is not enough

There are other great tools like knitr, Sweave, Jupyter and others. Their advantage is that results are rendered as tables or plots in your report. This gives you a chance to verify whether results obtained now and some time ago are identical.
But what about more complicated results, like a random forest with 100k trees created from 100k variables, or some deep neural network? It will be hard to verify by eye that the results are identical.

So, what can I do?

The safest solution would be to store copies of every object, ever created during the data analysis. All forks, wrong paths, everything. Along with detailed information which functions with what parameters were used to generate each result. Something like the ultimate TimeMachine or GitHub for R objects.

With such detailed information, every analysis would be auditable and replicable.
Right now the full tracking of all created objects is not possible without deep changes in the R interpreter.
The archivist package is a light-weight version of such a solution.

What can you do with archivist?

  • Use the saveToRepo() function to store selected R objects in the archivist repository.
  • Use the addHooksToPrint() function to automatically keep copies of every plot, model or data object created in a knitr report.
  • Use the aread() function to share your results with others or with future you. It’s the easiest way to access objects created by a remote shiny application.
  • Use the asearch() function to browse objects that fit specified search criteria, like class, date of creation, used variables etc.
  • Use asession() to access session info with detailed information about versions of packages available during the object creation.
  • Use ahistory() to trace how a given object was created (see the sketch below).
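A minimal local-repository sketch of how a few of these functions might fit together. The directory path is made up, and createLocalRepo() / loadFromLocalRepo() plus their arguments are assumptions to be checked against the package documentation:

library(archivist)

repo <- "~/arepo"          # hypothetical local repository directory
createLocalRepo(repo)      # initialise an empty repository (assumed helper)

fit <- lm(Sepal.Width ~ Sepal.Length, data = iris)

# saveToRepo() stores the object and returns its md5 hash
hash <- saveToRepo(fit, repoDir = repo)

# later, retrieve the stored object by its hash
# (value = TRUE returns the object itself; assumed argument)
fit_again <- loadFromLocalRepo(hash, repoDir = repo, value = TRUE)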

Lots of functions! Do you have a cheatsheet?

Yes! It’s here.
If it’s not enough, find more details in the JSS article.



#13: (Much) Faster Package (Re-)Installation via Binaries

Thu, 12/14/2017 - 03:07

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Welcome to the thirteenth post in the ridiculously rapid R recommendation series, or R4 for short. A few days ago we riffed on faster installation thanks to ccache. Today we show another way to get equally drastic gains for some (if not most) packages.

In a nutshell, there are two ways to get your R packages off CRAN. Either you install as a binary, or you use source. Most people do not think too much about this as on Windows, binary is the default. So why wouldn’t one? Precisely. (Unless you are on Windows, and you develop, or debug, or test, or … and need source. Another story.) On other operating systems, however, source is the rule, and binary is often unavailable.

Or is it? Exactly how to find out what is available will be left for another post as we do have a tool just for that. But today, just hear me out when I say that binary is often an option even when source is the default. And it matters. See below.

As a (mostly-to-always) Linux user, I sometimes whistle between my teeth that we "lost all those battles" (i.e. for the desktop(s) or laptop(s)) but "won the war". That topic merits a longer post I hope to write one day, and I won’t do it justice today, but my main gist is that everybody (and here I mean mostly developers/power users) now at least also runs on Linux. And by that I mean that we all test our code in Linux environments such as e.g. Travis CI, and that many of us run deployments on cloud instances (AWS, GCE, Azure, …) which are predominantly based on Linux. Or on local clusters. Or, if one may dream, the top500. And on and on. And frequently these are Ubuntu machines.

So here is an Ubuntu trick: Install from binary, and save loads of time. As an illustration, consider the chart below. It carries over the logic from the ‘cached vs non-cached’ compilation post and contrasts two ways of installing: from source, or as a binary. I use pristine and empty Docker containers as the base, and rely of course on the official r-base image which is supplied by Carl Boettiger and yours truly as part of our Rocker Project (and for which we have a forthcoming R Journal piece I might mention). So for example the timings for the ggplot2 installation were obtained via

time docker run --rm -ti r-base /bin/bash -c 'install.r ggplot2'

and

time docker run --rm -ti r-base /bin/bash -c 'apt-get update && apt-get install -y r-cran-ggplot2'

Here docker run --rm -ti just means to launch Docker, in ‘remove leftovers at end’ mode, use terminal and interactive mode and invoke a shell. The shell command then is, respectively, to install a CRAN package using install.r from my littler package, or to install the binary via apt-get after updating the apt indices (as the Docker container may have been built a few days or more ago).

Let’s not focus on Docker here—it is just a convenient means to an end of efficiently measuring via a simple (wall-clock counting) time invocation. The key really is that install.r is just a wrapper to install.packages() meaning source installation on Linux (as used inside the Docker container). And apt-get install ... is how one gets a binary. Again, I will try post another piece to determine how one finds if a suitable binary for a CRAN package exists. For now, just allow me to proceed.

So what do we see then? Well have a look:

A few things stick out. RQuantLib really is a monster. And dplyr is also fairly heavy—both rely on Rcpp, BH and lots of templating. At the other end, data.table is still a marvel. No external dependencies, and just plain C code make the source installation essentially the same speed as the binary installation. Amazing. But I digress.

We should add that one of the source installations also required installing additional libraries: QuantLib is needed along with Boost for RQuantLib. Similar for another package (not shown) which needed curl and libcurl.

So what is the upshot? If you can, consider binaries. I will try to write another post on how I do that e.g. for Travis CI where all my tests use binaries. (Yes, I know. This mattered more in the past when they did not cache. It still matters today as you a) do not need to fill the cache in the first place and b) do not need to worry about details concerning compilation from source which still throws enough people off. But yes, you can of course survive as is.)

The same approach is equally valid on AWS and related instances: I answered many StackOverflow questions where folks were failing to compile "large-enough" pieces from source on minimal installations with minimal RAM, running out of resources and failing with bizarre errors. In short: Don’t. Consider binaries. It saves time and trouble.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



RVowpalWabbit 0.0.10

Thu, 12/14/2017 - 02:02

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A boring little RVowpalWabbit package update to version 0.0.10 came in response to another CRAN request: We were switching directories to run tests (or examples), which is now discouraged, so we no longer do this, as it turns out that we can of course refer to the files directly as well. Much cleaner.

No new code or features were added.

We should mention once more that there is parallel work ongoing in a higher-level package interfacing the vw binary — rvw — as well as a plan to redo this package via the external libraries. If that sounds interesting to you, please get in touch.

More information is on the RVowpalWabbit page. Issues and bugreports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



Team Rtus wins Munich Re Datathon with mlr

Thu, 12/14/2017 - 01:00

(This article was first published on mlr-org, and kindly contributed to R-bloggers)

Blog Post “Rtus @ Munich Re Datathon”

On the weekend of November 17-19, five brave data-knights from team “Rtus and the knights of the data.table” took on the challenge to compete in a datathon organized by Munich Re in its Munich-based innovation lab. Team Rtus was formed in April this year by a group of statistics students from LMU with the purpose of proving their data skills in competitions with other teams from various backgrounds. The datathon was centered around the topic “the effects of climate change and hurricane events on modern reinsurance business” and, after two days of very intensive battles with databases, web-crawlers and advanced machine learning modelling (using the famous R-library mlr), they managed to prevail against strong professional competitors and won the best overall prize. After victories at the Datafest at Mannheim University and the Telefonica data challenge earlier this year, this was the last step to a hattrick for the core members of team Rtus.

Their way to success was the implementation of an interactive web app that could serve as a decision-support system for underwriting and tariffing units at Munich Re that deal with natural catastrophes. Munich Re provided the participants with databases on claims exposure in the Florida bay, footprints of past hurricanes and tons of data on climate variable measurements over the past decades. One of the core tasks of the challenge was to define and calculate the maximum foreseeable loss and the probability of such a worst-case event taking place.

To answer the first question, they created a web app that calculates the expected loss of a hurricane in a certain region. To give the decision makers the opportunity to include their expert domain knowledge, they can interact with the app and set the shape and the location of the hurricane, which was modelled as a spatial Gaussian process. This is depicted in the first screenshot. Due to an NDA, the true figures and descriptions in the app were altered.

The team recognised the existence of several critical nuclear power plants in this area. The shocking event of Fukushima in 2011 showed the disastrous effects that storm surges, a side effect of hurricanes, can have in combination with nuclear power plants. To account for this, team Rtus implemented the “Nuclear Power Plant” mode in the app. The function of this NPP-mode is shown in this figure:

In a next step, the team tried to provide evidence for the plausibility of such a worst-case event. The following image, based on the footprints of past hurricanes, shows that there were indeed hurricanes crossing the locations of the nuclear power plants:

To answer the second part of the question, they also created a simulation of several weather variables to forecast the probability of such heavy category 5 hurricane events. One rule of reliable statistical modelling is the inclusion of uncertainty measures in any prediction, which was integrated via prediction intervals. Also, the user of the simulation is able to increase or decrease the estimated temperature trend that underlies the model. This screenshot illustrates the simulation app:

The 36 hours of intensive hacking, discussions, reiterations and fantastic team work, combined with the consumption of an estimated 19.68 litres of Club Mate, were finally rewarded with the first-place ranking and a Microsoft Surface Pro for each of the knights. “We will apply this augmentation of our weapon arsenal directly in the next data battle”, one of the knights proudly stated during the awards ceremony.

This portrait shows the tired but happy data knights (from left to right: Niklas Klein, Moritz Herrmann, Jann Goschenhofer, Markus Dumke and Daniel Schalk):



Image Classification on Small Datasets with Keras

Thu, 12/14/2017 - 01:00

(This article was first published on TensorFlow for R, and kindly contributed to R-bloggers)

Deep Learning with R

This post is an excerpt from Chapter 5 of François Chollet’s and J.J. Allaire’s book, Deep Learning with R (Manning Publications). Use the code fccallaire for a 42% discount on the book at manning.com.

Training a convnet with a small dataset

Having to train an image-classification model using very little data is a common situation, which you’ll likely encounter in practice if you ever do computer vision in a professional context. A “few” samples can mean anywhere from a few hundred to a few tens of thousands of images. As a practical example, we’ll focus on classifying images as dogs or cats, in a dataset containing 4,000 pictures of cats and dogs (2,000 cats, 2,000 dogs). We’ll use 2,000 pictures for training, 1,000 for validation, and 1,000 for testing.

In Chapter 5 of the Deep Learning with R book we review three techniques for tackling this problem. The first of these is training a small model from scratch on what little data you have (which achieves an accuracy of 82%). Subsequently we use feature extraction with a pretrained network (resulting in an accuracy of 90% to 96%) and fine-tuning a pretrained network (with a final accuracy of 97%). In this post we’ll cover only the second and third techniques.

The relevance of deep learning for small-data problems

You’ll sometimes hear that deep learning only works when lots of data is available. This is valid in part: one fundamental characteristic of deep learning is that it can find interesting features in the training data on its own, without any need for manual feature engineering, and this can only be achieved when lots of training examples are available. This is especially true for problems where the input samples are very high-dimensional, like images.

But what constitutes lots of samples is relative – relative to the size and depth of the network you’re trying to train, for starters. It isn’t possible to train a convnet to solve a complex problem with just a few tens of samples, but a few hundred can potentially suffice if the model is small and well regularized and the task is simple. Because convnets learn local, translation-invariant features, they’re highly data efficient on perceptual problems. Training a convnet from scratch on a very small image dataset will still yield reasonable results despite a relative lack of data, without the need for any custom feature engineering. You’ll see this in action in this section.

What’s more, deep-learning models are by nature highly repurposable: you can take, say, an image-classification or speech-to-text model trained on a large-scale dataset and reuse it on a significantly different problem with only minor changes. Specifically, in the case of computer vision, many pretrained models (usually trained on the ImageNet dataset) are now publicly available for download and can be used to bootstrap powerful vision models out of very little data. That’s what you’ll do in the next section. Let’s start by getting your hands on the data.

Downloading the data

The Dogs vs. Cats dataset that you’ll use isn’t packaged with Keras. It was made available by Kaggle as part of a computer-vision competition in late 2013, back when convnets weren’t mainstream. You can download the original dataset from https://www.kaggle.com/c/dogs-vs-cats/data (you’ll need to create a Kaggle account if you don’t already have one – don’t worry, the process is painless).

The pictures are medium-resolution color JPEGs. Here are some examples:

Unsurprisingly, the dogs-versus-cats Kaggle competition in 2013 was won by entrants who used convnets. The best entries achieved up to 95% accuracy. Below you’ll end up with a 97% accuracy, even though you’ll train your models on less than 10% of the data that was available to the competitors.

This dataset contains 25,000 images of dogs and cats (12,500 from each class) and is 543 MB (compressed). After downloading and uncompressing it, you’ll create a new dataset containing three subsets: a training set with 1,000 samples of each class, a validation set with 500 samples of each class, and a test set with 500 samples of each class.

Following is the code to do this:

original_dataset_dir <- "~/Downloads/kaggle_original_data"

base_dir <- "~/Downloads/cats_and_dogs_small"
dir.create(base_dir)

train_dir <- file.path(base_dir, "train")
dir.create(train_dir)
validation_dir <- file.path(base_dir, "validation")
dir.create(validation_dir)
test_dir <- file.path(base_dir, "test")
dir.create(test_dir)

train_cats_dir <- file.path(train_dir, "cats")
dir.create(train_cats_dir)
train_dogs_dir <- file.path(train_dir, "dogs")
dir.create(train_dogs_dir)
validation_cats_dir <- file.path(validation_dir, "cats")
dir.create(validation_cats_dir)
validation_dogs_dir <- file.path(validation_dir, "dogs")
dir.create(validation_dogs_dir)
test_cats_dir <- file.path(test_dir, "cats")
dir.create(test_cats_dir)
test_dogs_dir <- file.path(test_dir, "dogs")
dir.create(test_dogs_dir)

fnames <- paste0("cat.", 1:1000, ".jpg")
file.copy(file.path(original_dataset_dir, fnames), file.path(train_cats_dir))
fnames <- paste0("cat.", 1001:1500, ".jpg")
file.copy(file.path(original_dataset_dir, fnames), file.path(validation_cats_dir))
fnames <- paste0("cat.", 1501:2000, ".jpg")
file.copy(file.path(original_dataset_dir, fnames), file.path(test_cats_dir))

fnames <- paste0("dog.", 1:1000, ".jpg")
file.copy(file.path(original_dataset_dir, fnames), file.path(train_dogs_dir))
fnames <- paste0("dog.", 1001:1500, ".jpg")
file.copy(file.path(original_dataset_dir, fnames), file.path(validation_dogs_dir))
fnames <- paste0("dog.", 1501:2000, ".jpg")
file.copy(file.path(original_dataset_dir, fnames), file.path(test_dogs_dir))

Using a pretrained convnet

A common and highly effective approach to deep learning on small image datasets is to use a pretrained network. A pretrained network is a saved network that was previously trained on a large dataset, typically on a large-scale image-classification task. If this original dataset is large enough and general enough, then the spatial hierarchy of features learned by the pretrained network can effectively act as a generic model of the visual world, and hence its features can prove useful for many different computer-vision problems, even though these new problems may involve completely different classes than those of the original task. For instance, you might train a network on ImageNet (where classes are mostly animals and everyday objects) and then repurpose this trained network for something as remote as identifying furniture items in images. Such portability of learned features across different problems is a key advantage of deep learning compared to many older, shallow-learning approaches, and it makes deep learning very effective for small-data problems.

In this case, let’s consider a large convnet trained on the ImageNet dataset (1.4 million labeled images and 1,000 different classes). ImageNet contains many animal classes, including different species of cats and dogs, and you can thus expect to perform well on the dogs-versus-cats classification problem.

You’ll use the VGG16 architecture, developed by Karen Simonyan and Andrew Zisserman in 2014; it’s a simple and widely used convnet architecture for ImageNet. Although it’s an older model, far from the current state of the art and somewhat heavier than many other recent models, I chose it because its architecture is similar to what you’re already familiar with and is easy to understand without introducing any new concepts. This may be your first encounter with one of these cutesy model names – VGG, ResNet, Inception, Inception-ResNet, Xception, and so on; you’ll get used to them, because they will come up frequently if you keep doing deep learning for computer vision.

There are two ways to use a pretrained network: feature extraction and fine-tuning. We’ll cover both of them. Let’s start with feature extraction.

Feature extraction

Feature extraction consists of using the representations learned by a previous network to extract interesting features from new samples. These features are then run through a new classifier, which is trained from scratch.

As you saw previously, convnets used for image classification comprise two parts: they start with a series of pooling and convolution layers, and they end with a densely connected classifier. The first part is called the convolutional base of the model. In the case of convnets, feature extraction consists of taking the convolutional base of a previously trained network, running the new data through it, and training a new classifier on top of the output.

Why only reuse the convolutional base? Could you reuse the densely connected classifier as well? In general, doing so should be avoided. The reason is that the representations learned by the convolutional base are likely to be more generic and therefore more reusable: the feature maps of a convnet are presence maps of generic concepts over a picture, which is likely to be useful regardless of the computer-vision problem at hand. But the representations learned by the classifier will necessarily be specific to the set of classes on which the model was trained – they will only contain information about the presence probability of this or that class in the entire picture. Additionally, representations found in densely connected layers no longer contain any information about where objects are located in the input image: these layers get rid of the notion of space, whereas the object location is still described by convolutional feature maps. For problems where object location matters, densely connected features are largely useless.

Note that the level of generality (and therefore reusability) of the representations extracted by specific convolution layers depends on the depth of the layer in the model. Layers that come earlier in the model extract local, highly generic feature maps (such as visual edges, colors, and textures), whereas layers that are higher up extract more-abstract concepts (such as “cat ear” or “dog eye”). So if your new dataset differs a lot from the dataset on which the original model was trained, you may be better off using only the first few layers of the model to do feature extraction, rather than using the entire convolutional base.

In this case, because the ImageNet class set contains multiple dog and cat classes, it’s likely to be beneficial to reuse the information contained in the densely connected layers of the original model. But we’ll choose not to, in order to cover the more general case where the class set of the new problem doesn’t overlap the class set of the original model.

Let’s put this in practice by using the convolutional base of the VGG16 network, trained on ImageNet, to extract interesting features from cat and dog images, and then train a dogs-versus-cats classifier on top of these features.

The VGG16 model, among others, comes prepackaged with Keras. Here’s the list of image-classification models (all pretrained on the ImageNet dataset) that are available as part of Keras:

  • Xception
  • Inception V3
  • ResNet50
  • VGG16
  • VGG19
  • MobileNet

Let’s instantiate the VGG16 model.

library(keras)

conv_base <- application_vgg16(
  weights = "imagenet",
  include_top = FALSE,
  input_shape = c(150, 150, 3)
)

You pass three arguments to the function:

  • weights specifies the weight checkpoint from which to initialize the model.
  • include_top refers to including (or not) the densely connected classifier on top of the network. By default, this densely connected classifier corresponds to the 1,000 classes from ImageNet. Because you intend to use your own densely connected classifier (with only two classes: cat and dog), you don’t need to include it.
  • input_shape is the shape of the image tensors that you’ll feed to the network. This argument is purely optional: if you don’t pass it, the network will be able to process inputs of any size.

Here’s the detail of the architecture of the VGG16 convolutional base. It’s similar to the simple convnets you’re already familiar with:

summary(conv_base)

Layer (type)                     Output Shape          Param #
================================================================
input_1 (InputLayer)             (None, 150, 150, 3)   0
block1_conv1 (Convolution2D)     (None, 150, 150, 64)  1792
block1_conv2 (Convolution2D)     (None, 150, 150, 64)  36928
block1_pool (MaxPooling2D)       (None, 75, 75, 64)    0
block2_conv1 (Convolution2D)     (None, 75, 75, 128)   73856
block2_conv2 (Convolution2D)     (None, 75, 75, 128)   147584
block2_pool (MaxPooling2D)       (None, 37, 37, 128)   0
block3_conv1 (Convolution2D)     (None, 37, 37, 256)   295168
block3_conv2 (Convolution2D)     (None, 37, 37, 256)   590080
block3_conv3 (Convolution2D)     (None, 37, 37, 256)   590080
block3_pool (MaxPooling2D)       (None, 18, 18, 256)   0
block4_conv1 (Convolution2D)     (None, 18, 18, 512)   1180160
block4_conv2 (Convolution2D)     (None, 18, 18, 512)   2359808
block4_conv3 (Convolution2D)     (None, 18, 18, 512)   2359808
block4_pool (MaxPooling2D)       (None, 9, 9, 512)     0
block5_conv1 (Convolution2D)     (None, 9, 9, 512)     2359808
block5_conv2 (Convolution2D)     (None, 9, 9, 512)     2359808
block5_conv3 (Convolution2D)     (None, 9, 9, 512)     2359808
block5_pool (MaxPooling2D)       (None, 4, 4, 512)     0
================================================================
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0

The final feature map has shape (4, 4, 512). That’s the feature map on top of which you’ll stick a densely connected classifier.
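As an aside, if your data were very different from ImageNet, you could follow the earlier advice and reuse only the lower, more generic blocks of the base. Here is a hypothetical sketch; the truncated_base name and the choice of block3_pool are mine, taken from the layer names in the summary above:

# Hypothetical: keep only the base up to block3_pool (earlier, more generic features)
truncated_base <- keras_model(
  inputs = conv_base$input,
  outputs = get_layer(conv_base, "block3_pool")$output
)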

At this point, there are two ways you could proceed:

  • Running the convolutional base over your dataset, recording its output to an array on disk, and then using this data as input to a standalone, densely connected classifier similar to those you saw in part 1 of this book. This solution is fast and cheap to run, because it only requires running the convolutional base once for every input image, and the convolutional base is by far the most expensive part of the pipeline. But for the same reason, this technique won’t allow you to use data augmentation. (A minimal sketch of this approach follows this list.)

  • Extending the model you have (conv_base) by adding dense layers on top, and running the whole thing end to end on the input data. This will allow you to use data augmentation, because every input image goes through the convolutional base every time it’s seen by the model. But for the same reason, this technique is far more expensive than the first.
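For reference, here is a minimal sketch of what the first approach could look like, assuming the same train_dir directory and roughly 2,000 training images used elsewhere in this post; the extract_features helper is mine, not code from the post:

datagen <- image_data_generator(rescale = 1/255)
batch_size <- 20

extract_features <- function(directory, sample_count) {
  features <- array(0, dim = c(sample_count, 4, 4, 512))
  labels <- array(0, dim = c(sample_count))
  generator <- flow_images_from_directory(
    directory,
    datagen,
    target_size = c(150, 150),
    batch_size = batch_size,
    class_mode = "binary"
  )
  i <- 0
  while (TRUE) {
    batch <- generator_next(generator)
    inputs_batch <- batch[[1]]
    labels_batch <- batch[[2]]
    # Record the convolutional base's output for this batch
    features_batch <- conv_base %>% predict(inputs_batch)
    index_range <- ((i * batch_size) + 1):((i + 1) * batch_size)
    features[index_range,,,] <- features_batch
    labels[index_range] <- labels_batch
    i <- i + 1
    if (i * batch_size >= sample_count) break  # generators loop indefinitely
  }
  list(features = features, labels = labels)
}

train <- extract_features(train_dir, 2000)

The resulting features, of shape (samples, 4, 4, 512), would then be flattened and used to train a small standalone densely connected classifier.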

In this post we’ll cover the second technique in detail (in the book we cover both). Note that this technique is so expensive that you should only attempt it if you have access to a GPU – it’s absolutely intractable on a CPU.

Feature extraction with data augmentation

Because models behave just like layers, you can add a model (like conv_base) to a sequential model just like you would add a layer.

model <- keras_model_sequential() %>%
  conv_base %>%
  layer_flatten() %>%
  layer_dense(units = 256, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

This is what the model looks like now:

summary(model)

Layer (type)                     Output Shape          Param #
================================================================
vgg16 (Model)                    (None, 4, 4, 512)     14714688
flatten_1 (Flatten)              (None, 8192)          0
dense_1 (Dense)                  (None, 256)           2097408
dense_2 (Dense)                  (None, 1)             257
================================================================
Total params: 16,812,353
Trainable params: 16,812,353
Non-trainable params: 0

As you can see, the convolutional base of VGG16 has 14,714,688 parameters, which is very large. The classifier you’re adding on top has 2 million parameters.
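The classifier’s parameter count is easy to verify by hand:

# dense_1: 4 * 4 * 512 = 8,192 flattened inputs, times 256 units, plus 256 biases
4 * 4 * 512 * 256 + 256   # 2,097,408
# dense_2: 256 inputs to a single unit, plus 1 bias
256 * 1 + 1               # 257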

Before you compile and train the model, it’s very important to freeze the convolutional base. Freezing a layer or set of layers means preventing their weights from being updated during training. If you don’t do this, then the representations that were previously learned by the convolutional base will be modified during training. Because the dense layers on top are randomly initialized, very large weight updates would be propagated through the network, effectively destroying the representations previously learned.

In Keras, you freeze a network using the freeze_weights() function:

length(model$trainable_weights)
[1] 30

freeze_weights(conv_base)
length(model$trainable_weights)
[1] 4

With this setup, only the weights from the two dense layers that you added will be trained. That’s a total of four weight tensors: two per layer (the main weight matrix and the bias vector). Note that these changes take effect only when you compile the model. If you modify weight trainability after compilation, you must recompile the model, or the changes will be ignored.

Using data augmentation

Overfitting is caused by having too few samples to learn from, rendering you unable to train a model that can generalize to new data. Given infinite data, your model would be exposed to every possible aspect of the data distribution at hand: you would never overfit. Data augmentation takes the approach of generating more training data from existing training samples, by augmenting the samples via a number of random transformations that yield believable-looking images. The goal is that at training time, your model will never see the exact same picture twice. This helps expose the model to more aspects of the data and generalize better.

In Keras, this can be done by configuring a number of random transformations to be performed on the images read by an image_data_generator(). For example:

train_datagen <- image_data_generator(
  rescale = 1/255,
  rotation_range = 40,
  width_shift_range = 0.2,
  height_shift_range = 0.2,
  shear_range = 0.2,
  zoom_range = 0.2,
  horizontal_flip = TRUE,
  fill_mode = "nearest"
)

These are just a few of the options available (for more, see the Keras documentation). Let’s quickly go over this code:

  • rotation_range is a value in degrees (0–180), a range within which to randomly rotate pictures.
  • width_shift and height_shift are ranges (as a fraction of total width or height) within which to randomly translate pictures horizontally or vertically.
  • shear_range is for randomly applying shearing transformations.
  • zoom_range is for randomly zooming inside pictures.
  • horizontal_flip is for randomly flipping half the images horizontally – relevant when there are no assumptions of horizontal asymmetry (for example, real-world pictures).
  • fill_mode is the strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift.
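To see what these transformations actually do, you can push a single image through the generator repeatedly and plot the results. This is a sketch of my own; img_path is assumed to point at one of your training images:

img <- image_load(img_path, target_size = c(150, 150))
img_array <- image_to_array(img)
img_array <- array_reshape(img_array, c(1, 150, 150, 3))

# Generates batches of randomly transformed versions of this one image
augmentation_generator <- flow_images_from_data(
  img_array,
  generator = train_datagen,
  batch_size = 1
)

op <- par(mfrow = c(2, 2), pty = "s", mar = c(1, 0, 1, 0))
for (i in 1:4) {
  batch <- generator_next(augmentation_generator)
  plot(as.raster(batch[1,,,]))   # values are already in [0, 1] thanks to rescale
}
par(op)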

Now we can train our model using the image data generator:

# Note that the validation data shouldn't be augmented!
test_datagen <- image_data_generator(rescale = 1/255)

train_generator <- flow_images_from_directory(
  train_dir,                     # Target directory
  train_datagen,                 # Data generator
  target_size = c(150, 150),     # Resizes all images to 150 × 150
  batch_size = 20,
  class_mode = "binary"          # binary_crossentropy loss for binary labels
)

validation_generator <- flow_images_from_directory(
  validation_dir,
  test_datagen,
  target_size = c(150, 150),
  batch_size = 20,
  class_mode = "binary"
)

model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_rmsprop(lr = 2e-5),
  metrics = c("accuracy")
)

history <- model %>% fit_generator(
  train_generator,
  steps_per_epoch = 100,
  epochs = 30,
  validation_data = validation_generator,
  validation_steps = 50
)

Let’s plot the results. As you can see, you reach a validation accuracy of about 96%.
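In case you want to reproduce the curves yourself, the history object returned by fit_generator() can be plotted directly:

plot(history)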

Fine-tuning

Another widely used technique for model reuse, complementary to feature extraction, is fine-tuning. Fine-tuning consists of unfreezing a few of the top layers of a frozen model base used for feature extraction, and jointly training both the newly added part of the model (in this case, the fully connected classifier) and these top layers. This is called fine-tuning because it slightly adjusts the more abstract representations of the model being reused, in order to make them more relevant for the problem at hand.

I stated earlier that it’s necessary to freeze the convolution base of VGG16 in order to be able to train a randomly initialized classifier on top. For the same reason, it’s only possible to fine-tune the top layers of the convolutional base once the classifier on top has already been trained. If the classifier isn’t already trained, then the error signal propagating through the network during training will be too large, and the representations previously learned by the layers being fine-tuned will be destroyed. Thus the steps for fine-tuning a network are as follows:

  1. Add your custom network on top of an already-trained base network.
  2. Freeze the base network.
  3. Train the part you added.
  4. Unfreeze some layers in the base network.
  5. Jointly train both these layers and the part you added.

You already completed the first three steps when doing feature extraction. Let’s proceed with step 4: you’ll unfreeze your conv_base and then freeze individual layers inside it.

As a reminder, this is what your convolutional base looks like:

summary(conv_base)

Layer (type)                     Output Shape          Param #
================================================================
input_1 (InputLayer)             (None, 150, 150, 3)   0
block1_conv1 (Convolution2D)     (None, 150, 150, 64)  1792
block1_conv2 (Convolution2D)     (None, 150, 150, 64)  36928
block1_pool (MaxPooling2D)       (None, 75, 75, 64)    0
block2_conv1 (Convolution2D)     (None, 75, 75, 128)   73856
block2_conv2 (Convolution2D)     (None, 75, 75, 128)   147584
block2_pool (MaxPooling2D)       (None, 37, 37, 128)   0
block3_conv1 (Convolution2D)     (None, 37, 37, 256)   295168
block3_conv2 (Convolution2D)     (None, 37, 37, 256)   590080
block3_conv3 (Convolution2D)     (None, 37, 37, 256)   590080
block3_pool (MaxPooling2D)       (None, 18, 18, 256)   0
block4_conv1 (Convolution2D)     (None, 18, 18, 512)   1180160
block4_conv2 (Convolution2D)     (None, 18, 18, 512)   2359808
block4_conv3 (Convolution2D)     (None, 18, 18, 512)   2359808
block4_pool (MaxPooling2D)       (None, 9, 9, 512)     0
block5_conv1 (Convolution2D)     (None, 9, 9, 512)     2359808
block5_conv2 (Convolution2D)     (None, 9, 9, 512)     2359808
block5_conv3 (Convolution2D)     (None, 9, 9, 512)     2359808
block5_pool (MaxPooling2D)       (None, 4, 4, 512)     0
================================================================
Total params: 14,714,688

You’ll fine-tune the last three convolutional layers, which means all layers up to block4_pool should be frozen, and the layers block5_conv1, block5_conv2, and block5_conv3 should be trainable.

Why not fine-tune more layers? Why not fine-tune the entire convolutional base? You could. But you need to consider the following:

  • Earlier layers in the convolutional base encode more-generic, reusable features, whereas layers higher up encode more-specialized features. It’s more useful to fine-tune the more specialized features, because these are the ones that need to be repurposed on your new problem. There would be fast-decreasing returns in fine-tuning lower layers.
  • The more parameters you’re training, the more you’re at risk of overfitting. The convolutional base has 15 million parameters, so it would be risky to attempt to train it on your small dataset.

Thus, in this situation, it’s a good strategy to fine-tune only the top two or three layers in the convolutional base. Let’s set this up, starting from where you left off in the previous example.

unfreeze_weights(conv_base, from = "block5_conv1")
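If you want to double-check the effect, you can count the trainable weight tensors again. The expected value in the comment below is my own arithmetic, not output from the post:

# Expectation: 2 tensors (kernel + bias) for each of the two dense layers,
# plus 2 for each of the three unfrozen block5 conv layers, i.e. 4 + 6 = 10
length(model$trainable_weights)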

Now you can begin fine-tuning the network. You’ll do this with the RMSProp optimizer, using a very low learning rate. The reason for using a low learning rate is that you want to limit the magnitude of the modifications you make to the representations of the three layers you’re fine-tuning. Updates that are too large may harm these representations.

model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_rmsprop(lr = 1e-5),
  metrics = c("accuracy")
)

history <- model %>% fit_generator(
  train_generator,
  steps_per_epoch = 100,
  epochs = 100,
  validation_data = validation_generator,
  validation_steps = 50
)

Let’s plot our results:

You’re seeing a nice 1% absolute improvement in accuracy, from about 96% to above 97%.

Note that the loss curve doesn’t show any real improvement (in fact, it’s deteriorating). You may wonder, how could accuracy stay stable or improve if the loss isn’t decreasing? The answer is simple: what you display is an average of pointwise loss values; but what matters for accuracy is the distribution of the loss values, not their average, because accuracy is the result of a binary thresholding of the class probability predicted by the model. The model may still be improving even if this isn’t reflected in the average loss.
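A small toy example (mine, not from the post) makes this concrete: the average binary cross-entropy can go up even while thresholded accuracy goes up, because a single very confident mistake dominates the mean.

# Mean binary cross-entropy and thresholded accuracy for a tiny set of predictions
bce <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))
acc <- function(y, p) mean((p > 0.5) == y)

y   <- c(1, 1, 1, 1)
p_a <- c(0.60, 0.60, 0.45, 0.45)    # 2 of 4 correct, modest loss
p_b <- c(0.90, 0.90, 0.55, 0.001)   # 3 of 4 correct, but one confident miss

c(loss = bce(y, p_a), accuracy = acc(y, p_a))   # lower loss, lower accuracy
c(loss = bce(y, p_b), accuracy = acc(y, p_b))   # higher loss, higher accuracy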

You can now finally evaluate this model on the test data:

test_generator <- flow_images_from_directory(
  test_dir,
  test_datagen,
  target_size = c(150, 150),
  batch_size = 20,
  class_mode = "binary"
)

model %>% evaluate_generator(test_generator, steps = 50)

$loss
[1] 0.2840393

$acc
[1] 0.972

Here you get a test accuracy of 97.2%. In the original Kaggle competition around this dataset, this would have been one of the top results. But using modern deep-learning techniques, you managed to reach this result using only a small fraction of the training data available (about 10%). There is a huge difference between training on 20,000 samples and training on only 2,000!

Take-aways: using convnets with small datasets

Here’s what you should take away from the exercises in the past two sections:

  • Convnets are the best type of machine-learning models for computer-vision tasks. It’s possible to train one from scratch even on a very small dataset, with decent results.
  • On a small dataset, overfitting will be the main issue. Data augmentation is a powerful way to fight overfitting when you’re working with image data.
  • It’s easy to reuse an existing convnet on a new dataset via feature extraction. This is a valuable technique for working with small image datasets.
  • As a complement to feature extraction, you can use fine-tuning, which adapts to a new problem some of the representations previously learned by an existing model. This pushes performance a bit further.

Now you have a solid set of tools for dealing with image-classification problems – in particular with small datasets.


Deep Learning with R

This post is an excerpt from Chapter 5 of François Chollet’s and J.J. Allaire’s book, Deep Learning with R (Manning Publications). Use the code fccallaire for a 42% discount on the book at manning.com.



R in the Windows Subsystem for Linux

Wed, 12/13/2017 - 23:42

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R has been available for Windows since the very beginning, but if you have a Windows machine and want to use R within a Linux ecosystem, that's easy to do with the new Fall Creators Update (version 1709). If you need access to the gcc toolchain for building R packages, or simply prefer the bash environment, it's easy to get things up and running.

Once you have things set up, you can launch a bash shell and run R at the terminal like you would in any Linux system. And that's because this is a Linux system: the Windows Subsystem for Linux is a complete Linux distribution running within Windows. This page provides the details on installing Linux on Windows, but here are the basic steps you need and how to get the latest version of R up and running within it.

First, enable the Windows Subsystem for Linux option. Go to Control Panel > Programs > Turn Windows Features on or off (or just type "Windows Features" into the search box), and select the "Windows Subsystem for Linux" option. You'll need to reboot, just this once.

Next, you'll need to install your preferred distribution of Linux from the Microsoft Store. If you search for "Linux" in the store, you'll find an entry "Run Linux on Windows" which will provide you with the available distributions. I'm using "Ubuntu", which as of this writing is Ubuntu 16.04 (Xenial Xerus).

Once that's installed you can launch Ubuntu from the Start menu (just like any other app) to open a new bash shell window. The first time you launch, it will take a few minutes to install various components, and you'll also need to create a username and password. This is your Linux username, different from your Windows username. You'll automatically log in when you launch new Ubuntu sessions, but make sure you remember the password — you'll need it later.

From here you can go ahead and install R, but if you use the default Ubuntu repository you'll get an old version of R (R 3.2.3, from 2015). You probably want the latest version of R, so add CRAN as a new package repository for Ubuntu. You'll need to run these three commands as root, so enter the password you created above if requested:

sudo echo "deb http://cloud.r-project.org/bin/linux/ubuntu xenial/" | sudo tee -a /etc/apt/sources.list

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9

sudo apt-get update

(Don't be surprised by the message key E084DAB9: public key "Michael Rutter" imported. That key is used to sign the R packages in the CRAN repository for Ubuntu.)

Now you're all set to install the latest version of R, which can be done with:

sudo apt-get install r-base

And that's it! (Once all the dependencies install, anyway, which can take a while the first time.) Now you're all ready to run R from the Linux command line:

Note that you can access files on your Windows system from R — you'll find them at /mnt/c/Users/. This FAQ on the WSL provides other useful tips, and for complete details refer to the Windows Subsystem for Linux documentation.
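For example, from an R session inside WSL you can browse the Windows filesystem directly (the username in the example path is a placeholder):

# Windows drives are mounted under /mnt/ inside WSL
list.files("/mnt/c/Users")
# Reading a file that lives on the Windows side (hypothetical path):
# dat <- read.csv("/mnt/c/Users/yourname/Documents/data.csv")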


