The 3rd paperback editions of my books on Cricket, now on Amazon
(This article was first published on R – Giga thoughts …, and kindly contributed to Rbloggers)
The 3rd paperback edition of both my books on cricket is now available on Amazon for $12.99
a) Cricket analytics with cricketr, Third Edition ($12.99). This book is based on my R package ‘cricketr‘, available on CRAN and uses ESPN Cricinfo Statsguru
b) Beaten by sheer pace! Cricket analytics with yorkr, 3rd edition ($12.99). This is based on my R package ‘yorkr‘ on CRAN and uses data from Cricsheet
Pick up your copies today!!
Note: In the 3rd edition of the paperback book, the charts will be in black and white. If you would like the charts to be in color, please check out the 2nd edition of these books
You may also like
1. My book ‘Practical Machine Learning with R and Python’ on Amazon
2. A crime map of India in R: Crimes against women
3. What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
4. Bend it like Bluemix, MongoDB with autoscaling – Part 2
5. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
6. Thinking Web Scale (TWS3): MapReduce – Bring compute to data
To see all posts see Index of posts
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
“Print hello” is not enough. A collection of Hello world functions.
(This article was first published on Rposts.com, and kindly contributed to Rbloggers)
I guess I wrote my R “hello world!” function 7 or 8 years ago while approaching R for the first time. And it is too little to illustrate the basic syntax of a programming language for a working program to a wannabe R programmer. Thus, here follows a collection of basic functions that may help a bit more than the famed piece of code.
That’s all folks!
#R #rstats #maRche #Rbloggers
This post is also shared in LinkedIn and www.rbloggers.com
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Rposts.com. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
NIPS 2017 Summary
(This article was first published on Rstats – bayesianbiologist, and kindly contributed to Rbloggers)
Some (opinionated) themes and highlights from this year’s NIPS conference:
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Rstats – bayesianbiologist. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Interoperate with ‘MQTT’ Message Brokers With R (a.k.a. Live! BBC! Subtitles!)
(This article was first published on R – rud.is, and kindly contributed to Rbloggers)
Most of us see the internet through the lens of browsers and apps on our laptops, desktops, watches, TVs and mobile devices. These displays are showing us — for the most part — content designed for human consumption. Sure, apps handle API interactions, but even most of that communication happens over ports 80 or 443. But, there are lots of ports out there; 0:65535, in fact (at least TCPwise). And, all of them have some kind of data, and most of that is still targeted to something for us.
What if I told you the machines are also talking to each other using a thin/efficient protocol that allows one, tiny sensor to talk to hundreds — if not thousands — of systems without even a drop of siliconlaced sweat? How can a mere, constrained sensor do that? Well, it doesn’t do it alone. Many of them share their data over a fairly new protocol dubbed MQTT (Message Queuing Telemetry Transport).
An MQTT broker watches for devices to publish data under various topics and then also watches for other systems to subscribe to said topics and handles the rest of the interchange. The protocol is lightweight enough that fairly lowpowered (CPU and literal electricusewise) devices can easily send their data chunks up to a broker, and the entire protocol is robust enough to support a plethora of connections and an equal plethora of types of data.
Why am I telling you all this?Devices that publish to MQTT brokers tend to be in the spectrum of what folks sadly call the “Internet of Things”. It’s a terrible, ambiguous name, but it’s all over the media and most folks have some idea what it means. In the context of MQTT, you can think of it as, say, a single temperature sensor publishing it’s data to an MQTT broker so many other things — including programs written by humans to capture, log and analyze that data — can receive it. This is starting to sound like something that might be right up R’s alley.
There are also potential usecases where an online data processing system might want to publish data to many clients without said clients having to poll a poor, singlethreaded R server constantly.
Having MQTT connectivity for R could be really interesting.
And, now we have the beginnings of said connectivity with the mqtt package.
Another Package? Really?Yes, really.
Besides the huge potential for having a direct Rbridge to the MQTT world, I’m workinterested in MQTT since we’ve found over 35,000 of them on the default, plaintext port for MQTT (1883) alone:
There are enough of them that I don’t even need to show a base map.
Some of these servers require authentication and others aren’t doing much of anything. But, there are a number of them hosted by corporations and individuals that are exposing real data. OwnTracks seems to be one of the more popular self/badlyhosted ones.
Then, there are others — like test.mosquitto.org — which deliberately run open MQTT servers for “testing”. There definitely is testing going on there, but there are also real services using it as a production broker. The mqtt package is based on the mosquitto C library, so it’s only fitting that we show a few examples from its own test site here.
For now, there’s really one function: topic_subscribe(). Eventually, R will be able to publish to a broker and do more robust data collection operations (say, to make a live MQTT dashboard in Shiny). The topic_subscribe() function is an allin one tool that enables you to:
 connect to a broker
 subscribe to a topic
 pass in R callback functions which will be executed on connect, disconnect and when new messages come in
That’s plenty of functionality to do some fun things.
What’s the frequencytemperature, Kenneth?The mosquitto test server has one topic — /outbox/croutondemo/temperature — which is a fake temperature sensor that just sends data periodically so you have something to test with. Let’s capture 50 samples and plot them.
Since we’re using a callback we have to use the tricksy << operator to store/update variables outside the callback function. And, we should preallocate space for said data to avoid needlessly growing objects. Here’s a complete codeblock:
library(mqtt) # devtools::install_github("hrbrmstr/mqtt") library(jsonlite) library(hrbrthemes) library(tidyverse) i < 0 # initialize our counter max_recs < 50 # max number of readings to get readings < vector("character", max_recs) # our callback function temp_cb < function(id, topic, payload, qos, retain) { i << i + 1 # update the counter readings[i] << readBin(payload, "character") # update our largerscoped vector return(if (i==max_recs) "quit" else "go") # need to send at least "". "quit" == done } topic_subscribe( topic = "/outbox/croutondemo/temperature", message_callback=temp_cb ) # each reading looks like this: # {"update": {"labels":[4631],"series":[[68]]}} map(readings, fromJSON) %>% map(unlist) %>% map_df(as.list) %>% ggplot(aes(update.labels, update.series)) + geom_line() + geom_point() + labs(x="Reading", y="Temp (F)", title="Temperature via MQTT") + theme_ipsum_rc(grid="XY")We setup temp_cb() to be our callback and topic_subscribe() ensures that the underlying mosquitto library will call it every time a new message is published to that topic. The chart really shows how synthetic the data is.
Subtitles from the EdgeTemperature sensors are just the sort of thing that MQTT was designed for. But, we don’t need to be stodgy about our use of MQTT.
Just about a year ago from this post, the BBC launched live subtitles for iPlayer. Residents of the Colonies may not know what iPlayer is, but it’s the “app” that lets UK citizens watch BBC programmes on glowing rectangles that aren’t proper tellys. Live subtitles are hard to produce well (and get right) and the BBC making the effort to do so also on their digital platform is quite commendable. We U.S. folks will likely be charged $0.99 for each set of digital subtitles now that net neutrality is gone.
Now, some clever person(s) wired up some of these live subtitles to MQTT topics. We can wire up our own code in R to read them live:
bbc_callback < function(id, topic, payload, qos, retain) { cat(crayon::green(readBin(payload, "character")), "\n", sep="") return("") # ctrlc will terminate } mqtt::topic_subscribe(topic = "bbc/subtitles/bbc_news24/raw", connection_callback=mqtt::mqtt_silent_connection_callback, message_callback=bbc_callback)In this case, controlc terminates things (cleanly).
You could easily modify the above code to have a bot that monitors for certain keywords then sends windowed chunks of subtitled text to some other system (Slack, database, etc). Or, create an online tidy text analysis workflow from them.
FINThe code is on GitHub and all input/contributions are welcome and encouraged. Some necessary TBDs are authentication & encryption. But, how would you like the API to look for using it, say, in Shiny apps? What should publishing look like? What helper functions would be useful (ones to slice & dice topic names or another to convert raw message text more safely)? Should there be an R MQTT “DSL”? Lots of things to ponder and so many sites to “test”!
P.S.In case you are concerned about the unusually boring R package name, I wanted to use RIoT (lowercased, of course) but riot is, alas, already taken.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Getting started with seplyr
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working R and big data I think the seplyr package can be a valuable tool.
For how and why, please check out our new introductory article.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Pipes in R Tutorial For Beginners
(This article was first published on Rposts.com, and kindly contributed to Rbloggers)
You might have already seen or used the pipe operator when you're working with packages such as dplyr, magrittr,… But do you know where pipes and the famous %>% operator come from, what they exactly are, or how, when and why you should use them? Can you also come up with some alternatives?
This tutorial will give you an introduction to pipes in R and will cover the following topics:
 Pipe Operator in R: Introduction
 Short History of the Pipe Operator in R
 What Is It?
 Why Use It?
 Additional Pipes
 How To Use Pipes in R
 Basic Piping
 Argument Placeholder
 Reusing Placeholder for Attributes
 Building Unary Functions
 Compound Assignment Pipe Operations
 Tee Operations with the Tee Operator
 Exposing Data Variables with the Exposition Operator
 dplyr and magrittr
 RStudio Keyboard Shortcuts
 When Not To Use the Pipe Operator in R
 Alternatives to Pipes in R
Are you interested in learning more about manipulating data in R with dplyr? Take a look at DataCamp's Data Manipulation in R with dplyr course.
Pipe Operator in R: IntroductionTo understand what the pipe operator in R is and what you can do with it, it's necessary to consider the full picture, to learn the history behind it. Questions such as "where does this weird combination of symbols come from and why was it made like this?" might be on top of your mind. You'll discover the answers to these and more questions in this section.
Now, you can look at the history from three perspectives: from a mathematical point of view, from a holistic point of view of programming languages, and from the point of view of the R language itself. You'll cover all three in what follows!
History of the Pipe Operator in R Mathematical HistoryIf you have two functions, let's say $f : B → C$ and $g : A → B$, you can chain these functions together by taking the output of one function and inserting it into the next. In short, "chaining" means that you pass an intermediate result onto the next function, but you'll see more about that later.
For example, you can say, $f(g(x))$: $g(x)$ serves as an input for $f()$, while $x$, of course, serves as input to $g()$.
If you would want to note this down, you will use the notation $f ◦ g$, which reads as "f follows g". Alternatively, you can visually represent this as:
Image Credit: James Balamuta, "Piping Data" Pipe Operators in Other Programming LanguagesAs mentioned in the introduction to this section, this operator is not new in programming: in the Shell or Terminal, you can pass command from one to the next with the pipeline character . Similarly, F# has a forward pipe operator, which will prove to be important later on! Lastly, it's also good to know that Haskell contains many piping operations that are derived from the Shell or Terminal.
Pipes in RNow that you have seen some history of the pipe operator in other programming languages, it's time to focus on R. The history of this operator in R starts, according to this fantastic blog post written by Adolfo Álvarez, on January 17th, 2012, when an anonymous user asked the following question in this Stack Overflow post:
How can you implement F#'s forward pipe operator in R? The operator makes it possible to easily chain a sequence of calculations. For example, when you have an input data and want to call functions foo and bar in sequence, you can write data > foo > bar?
The answer came from Ben Bolker, professor at McMaster University, who replied:
I don't know how well it would hold up to any real use, but this seems (?) to do what you want, at least for singleargument functions …
"%>%" < function(x,f) do.call(f,list(x)) pi %>% sin [1] 1.224606e16 pi %>% sin %>% cos [1] 1 cos(sin(pi)) [1] 1About nine months later, Hadley Wickham started the dplyr package on GitHub. You might now know Hadley, Chief Scientist at RStudio, as the author of many popular R packages (such as this last package!) and as the instructor for DataCamp's Writing Functions in R course.
Be however it may, it wasn't until 2013 that the first pipe %.% appears in this package. As Adolfo Álvarez rightfully mentions in his blog post, the function was denominated chain(), which had the purpose to simplify the notation for the application of several functions to a single data frame in R.
The %.% pipe would not be around for long, as Stefan Bache proposed an alternative on the 29th of December 2013, that included the operator as you might now know it:
iris %>% subset(Sepal.Length > 5) %>% aggregate(. ~ Species, ., mean)Bache continued to work with this pipe operation and at the end of 2013, the magrittr package came to being. In the meantime, Hadley Wickham continued to work on dplyr and in April 2014, the %.% operator got replaced with the one that you now know, %>%.
Later that year, Kun Ren published the pipeR package on GitHub, which incorporated a different pipe operator, %>>%, which was designed to add more flexibility to the piping process. However, it's safe to say that the %>% is now established in the R language, especially with the recent popularity of the Tidyverse.
What Is It?Knowing the history is one thing, but that still doesn't give you an idea of what F#'s forward pipe operator is nor what it actually does in R.
In F#, the pipeforward operator > is syntactic sugar for chained method calls. Or, stated more simply, it lets you pass an intermediate result onto the next function.
Remember that "chaining" means that you invoke multiple method calls. As each method returns an object, you can actually allow the calls to be chained together in a single statement, without needing variables to store the intermediate results.
In R, the pipe operator is, as you have already seen, %>%. If you're not familiar with F#, you can think of this operator as being similar to the + in a ggplot2 statement. Its function is very similar to that one that you have seen of the F# operator: it takes the output of one statement and makes it the input of the next statement. When describing it, you can think of it as a "THEN".
Take, for example, following code chunk and read it aloud:
class="lang{r}">iris %>% subset(Sepal.Length > 5) %>% aggregate(. ~ Species, ., mean)You're right, the code chunk above will translate to something like "you take the Iris data, then you subset the data and then you aggregate the data".
This is one of the most powerful things about the Tidyverse. In fact, having a standardized chain of processing actions is called "a pipeline". Making pipelines for a data format is great, because you can apply that pipeline to incoming data that has the same formatting and have it output in a ggplot2 friendly format, for example.
Why Use It?R is a functional language, which means that your code often contains a lot of parenthesis, ( and ). When you have complex code, this often will mean that you will have to nest those parentheses together. This makes your R code hard to read and understand. Here's where %>% comes in to the rescue!
Take a look at the following example, which is a typical example of nested code:
class="langR"># Initialize `x` x < c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907) # Compute the logarithm of `x`, return suitably lagged and iterated differences, # compute the exponential function and round the result round(exp(diff(log(x))), 1) 3.3
 1.8
 1.6
 0.5
 0.3
 0.1
 48.8
 1.1
With the help of %<%, you can rewrite the above code as follows:
class="langR"># Import `magrittr` library(magrittr) # Perform the same computations on `x` as above x %>% log() %>% diff() %>% exp() %>% round(1)Does this seem difficult to you? No worries! You'll learn more on how to go about this later on in this tutorial.
Note that you need to import the magrittr library to get the above code to work. That's because the pipe operator is, as you read above, part of the magrittr library and is, since 2014, also a part of dplyr. If you forget to import the library, you'll get an error like Error in eval(expr, envir, enclos): could not find function "%>%".
Also note that it isn't a formal requirement to add the parentheses after log, diff and exp, but that, within the R community, some will use it to increase the readability of the code.
In short, here are four reasons why you should be using pipes in R:
 You'll structure the sequence of your data operations from left to right, as apposed to from inside and out;
 You'll avoid nested function calls;
 You'll minimize the need for local variables and function definitions; And
 You'll make it easy to add steps anywhere in the sequence of operations.
These reasons are taken from the magrittr documentation itself. Implicitly, you see the arguments of readability and flexibility returning.
Additional PipesEven though %>% is the (main) pipe operator of the magrittr package, there are a couple of other operators that you should know and that are part of the same package:
 The compound assignment operator %<>%;
 The tee operator %T>%;
Note that it's good to know for now that the above code chunk is actually a shortcut for:
rnorm(200) %>% matrix(ncol = 2) %T>% { plot(.); . } %>% colSumsBut you'll see more about that later on!
 The exposition pipe operator %$%.
Of course, these three operators work slightly differently than the main %>% operator. You'll see more about their functionalities and their usage later on in this tutorial!
Note that, even though you'll most often see the magrittr pipes, you might also encounter other pipes as you go along! Some examples are wrapr's dot arrow pipe %.>% or to dot pipe %>.%, or the Bizarro pipe >.;.
How to Use Pipes in RNow that you know how the %>% operator originated, what it actually is and why you should use it, it's time for you to discover how you can actually use it to your advantage. You will see that there are quite some ways in which you can use it!
Basic PipingBefore you go into the more advanced usages of the operator, it's good to first take a look at the most basic examples that use the operator. In essence, you'll see that there are 3 rules that you can follow when you're first starting out:
 f(x) can be rewritten as x %>% f
In short, this means that functions that take one argument, function(argument), can be rewritten as follows: argument %>% function(). Take a look at the following, more practical example to understand how these two are equivalent:
class="langR"># Compute the logarithm of `x` log(x) # Compute the logarithm of `x` x %>% log() f(x, y) can be rewritten as x %>% f(y)
Of course, there are a lot of functions that don't just take one argument, but multiple. This is the case here: you see that the function takes two arguments, x and y. Similar to what you have seen in the first example, you can rewrite the function by following the structure argument1 %>% function(argument2), where argument1 is the magrittr placeholder and argument2 the function call.
This all seems quite theoretical. Let's take a look at a more practical example:
class="langR"># Round pi round(pi, 6) # Round pi pi %>% round(6) x %>% f %>% g %>% h can be rewritten as h(g(f(x)))
This might seem complex, but it isn't quite like that when you look at a reallife R example:
class="langR"># Import `babynames` data library(babynames) # Import `dplyr` library library(dplyr) # Load the data data(babynames) # Count how many young boys with the name "Taylor" are born sum(select(filter(babynames,sex=="M",name=="Taylor"),n)) # Do the same but now with `%>%` babynames%>%filter(sex=="M",name=="Taylor")%>% select(n)%>% sumNote how you work from the inside out when you rewrite the nested code: you first put in the babynames, then you use %>% to first filter() the data. After that, you'll select n and lastly, you'll sum() everything.
Remember also that you already saw another example of such a nested code that was converted to more readable code in the beginning of this tutorial, where you used the log(), diff(), exp() and round() functions to perform calculations on x.
Functions that Use the Current EnvironmentUnfortunately, there are some exceptions to the more general rules that were outlined in the previous section. Let's take a look at some of them here.
Consider this example, where you use the assign() function to assign the value 10 to the variable x.
class="langR"># Assign `10` to `x` assign("x", 10) # Assign `100` to `x` "x" %>% assign(100) # Return `x` x10
You see that the second call with the assign() function, in combination with the pipe, doesn't work properly. The value of x is not updated.
Why is this?
That's because the function assigns the new value 100 to a temporary environment used by %>%. So, if you want to use assign() with the pipe, you must be explicit about the environment:
class="langR"># Define your environment env < environment() # Add the environment to `assign()` "x" %>% assign(100, envir = env) # Return `x` x100
Functions with Lazy EvalutionArguments within functions are only computed when the function uses them in R. This means that no arguments are computed before you call your function! That means also that the pipe computes each element of the function in turn.
One place that this is a problem is tryCatch(), which lets you capture and handle errors, like in this example:
class="langR">tryCatch(stop("!"), error = function(e) "An error") stop("!") %>% tryCatch(error = function(e) "An error")'An error'
Error in eval(expr, envir, enclos): ! Traceback: 1. stop("!") %>% tryCatch(error = function(e) "An error") 2. eval(lhs, parent, parent) 3. eval(expr, envir, enclos) 4. stop("!")You'll see that the nested way of writing down this line of code works perfectly, while the piped alternative returns an error. Other functions with the same behavior are try(), suppressMessages(), and suppressWarnings() in base R.
Argument PlaceholderThere are also instances where you can use the pipe operator as an argument placeholder. Take a look at the following examples:
 f(x, y) can be rewritten as y %>% f(x, .)
In some cases, you won't want the value or the magrittr placeholder to the function call at the first position, which has been the case in every example that you have seen up until now. Reconsider this line of code:
% round(6)If you would rewrite this line of code, pi would be the first argument in your round() function. But what if you would want to replace the second, third, … argument and use that one as the magrittr placeholder to your function call?
Take a look at this example, where the value is actually at the third position in the function call:
class="langR">"Ceci n'est pas une pipe" %>% gsub("une", "un", .)'Ceci n\'est pas un pipe'
 f(y, z = x) can be rewritten as x %>% f(y, z = .)
Likewise, you might want to make the value of a specific argument within your function call the magrittr placeholder. Consider the following line of code:
class="langR">6 %>% round(pi, digits=.) Reusing the Placeholder for AttributesIt is straightforward to use the placeholder several times in a righthand side expression. However, when the placeholder only appears in a nested expressions magrittr will still apply the firstargument rule. The reason is that in most cases this results more clean code.
Here are some general "rules" that you can take into account when you're working with argument placeholders in nested function calls:
 f(x, y = nrow(x), z = ncol(x)) can be rewritten as x %>% f(y = nrow(.), z = ncol(.))
12
12
The behavior can be overruled by enclosing the righthand side in braces:
 f(y = nrow(x), z = ncol(x)) can be rewritten as x %>% {f(y = nrow(.), z = ncol(.))}
4
To conclude, also take a look at the following example, where you could possibly want to adjust the workings of the argument placeholder in the nested function call:
class="langR"># The function that you want to rewrite paste(1:5, letters[1:5]) # The nested function call with dot placeholder 1:5 %>% paste(., letters[.]) '1 a'
 '2 b'
 '3 c'
 '4 d'
 '5 e'
 '1 a'
 '2 b'
 '3 c'
 '4 d'
 '5 e'
You see that if the placeholder is only used in a nested function call, the magrittr placeholder will also be placed as the first argument! If you want to avoid this from happening, you can use the curly brackets { and }:
class="langR"># The nested function call with dot placeholder and curly brackets 1:5 %>% { paste(letters[.]) } # Rewrite the above function call paste(letters[1:5]) 'a'
 'b'
 'c'
 'd'
 'e'
 'a'
 'b'
 'c'
 'd'
 'e'
Unary functions are functions that take one argument. Any pipeline that you might make that consists of a dot ., followed by functions and that is chained together with %>% can be used later if you want to apply it to values. Take a look at the following example of such a pipeline:
class="langR">. %>% cos %>% sinThis pipeline would take some input, after which both the cos() and sin() fuctions would be applied to it.
But you're not there yet! If you want this pipeline to do exactly that which you have just read, you need to assign it first to a variable f, for example. After that, you can reuse it later to do the operations that are contained within the pipeline on other values.
class="langR"># Unary function f < . %>% cos %>% sin f structure(function (value) freduce(value, `_function_list`), class = c("fseq", "function" ))Remember also that you could put parentheses after the cos() and sin() functions in the line of code if you want to improve readability. Consider the same example with parentheses: . %>% cos() %>% sin().
You see, building functions in magrittr very similar to building functions with base R! If you're not sure how similar they actually are, check out the line above and compare it with the next line of code; Both lines have the same result!
class="langR"># is equivalent to f < function(.) sin(cos(.)) f function (.) sin(cos(.)) Compound Assignment Pipe OperationsThere are situations where you want to overwrite the value of the lefthand side, just like in the example right below. Intuitively, you will use the assignment operator < to do this.
class="langR"># Load in the Iris data iris < read.csv(url("http://archive.ics.uci.edu/ml/machinelearningdatabases/iris/iris.data"), header = FALSE) # Add column names to the Iris data names(iris) < c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species") # Compute the square root of `iris$Sepal.Length` and assign it to the variable iris$Sepal.Length < iris$Sepal.Length %>% sqrt()However, there is a compound assignment pipe operator, which allows you to use a shorthand notation to assign the result of your pipeline immediately to the lefthand side:
class="langR"># Compute the square root of `iris$Sepal.Length` and assign it to the variable iris$Sepal.Length %<>% sqrt # Return `Sepal.Length` iris$Sepal.LengthNote that the compound assignment operator %<>% needs to be the first pipe operator in the chain for this to work. This is completely in line with what you just read about the operator being a shorthand notation for a longer notation with repetition, where you use the regular < assignment operator.
As a result, this operator will assign a result of a pipeline rather than returning it.
Tee Operations with The Tee OperatorThe tee operator works exactly like %>%, but it returns the lefthand side value rather than the potential result of the righthand side operations.
This means that the tee operator can come in handy in situations where you have included functions that are used for their side effect, such as plotting with plot() or printing to a file.
In other words, functions like plot() typically don't return anything. That means that, after calling plot(), for example, your pipeline would end. However, in the following example, the tee operator %T>% allows you to continue your pipeline even after you have used plot():
class="langR">set.seed(123) rnorm(200) %>% matrix(ncol = 2) %T>% plot %>% colSums Exposing Data Variables with the Exposition OperatorWhen you're working with R, you'll find that many functions take a data argument. Consider, for example, the lm() function or the with() function. These functions are useful in a pipeline where your data is first processed and then passed into the function.
For functions that don't have a data argument, such as the cor() function, it's still handy if you can expose the variables in the data. That's where the %$% operator comes in. Consider the following example:
class="langR">iris %>% subset(Sepal.Length > mean(Sepal.Length)) %$% cor(Sepal.Length, Sepal.Width)0.336696922252551
With the help of %$% you make sure that Sepal.Length and Sepal.Width are exposed to cor(). Likewise, you see that the data in the data.frame() function is passed to the ts.plot() to plot several time series on a common plot:
class="langR">data.frame(z = rnorm(100)) %$% ts.plot(z) dplyr and magrittrIn the introduction to this tutorial, you already learned that the development of dplyr and magrittr occurred around the same time, namely, around 20132014. And, as you have read, the magrittr package is also part of the Tidyverse.
In this section, you will discover how exciting it can be when you combine both packages in your R code.
For those of you who are new to the dplyr package, you should know that this R package was built around five verbs, namely, "select", "filter", "arrange", "mutate" and "summarize". If you have already manipulated data for some data science project, you will know that these verbs make up the majority of the data manipulation tasks that you generally need to perform on your data.
Take an example of some traditional code that makes use of these dplyr functions:
class="langR">library(hflights) grouped_flights < group_by(hflights, Year, Month, DayofMonth) flights_data < select(grouped_flights, Year:DayofMonth, ArrDelay, DepDelay) summarized_flights < summarise(flights_data, arr = mean(ArrDelay, na.rm = TRUE), dep = mean(DepDelay, na.rm = TRUE)) final_result < filter(summarized_flights, arr > 30  dep > 30) final_result Year Month DayofMonth arr dep 2011 2 4 44.08088 47.17216 2011 3 3 35.12898 38.20064 2011 3 14 46.63830 36.13657 2011 4 4 38.71651 27.94915 2011 4 25 37.79845 22.25574 2011 5 12 69.52046 64.52039 2011 5 20 37.02857 26.55090 2011 6 22 65.51852 62.30979 2011 7 29 29.55755 31.86944 2011 9 29 39.19649 32.49528 2011 10 9 61.90172 59.52586 2011 11 15 43.68134 39.23333 2011 12 29 26.30096 30.78855 2011 12 31 46.48465 54.17137When you look at this example, you immediately understand why dplyr and magrittr are able to work so well together:
class="langR">hflights %>% group_by(Year, Month, DayofMonth) %>% select(Year:DayofMonth, ArrDelay, DepDelay) %>% summarise(arr = mean(ArrDelay, na.rm = TRUE), dep = mean(DepDelay, na.rm = TRUE)) %>% filter(arr > 30  dep > 30)Both code chunks are fairly long, but you could argue that the second code chunk is more clear if you want to follow along through all of the operations. With the creation of intermediate variables in the first code chunk, you could possibly lose the "flow" of the code. By using %>%, you gain a more clear overview of the operations that are being performed on the data!
In short, dplyr and magrittr are your dreamteam for manipulating data in R!
RStudio Keyboard Shortcuts for PipesAdding all these pipes to your R code can be a challenging task! To make your life easier, John Mount, cofounder and Principal Consultant at WinVector, LLC and DataCamp instructor, has released a package with some RStudio addins that allow you to create keyboard shortcuts for pipes in R. Addins are actually R functions with a bit of special registration metadata. An example of a simple addin can, for example, be a function that inserts a commonly used snippet of text, but can also get very complex!
With these addins, you'll be able to execute R functions interactively from within the RStudio IDE, either by using keyboard shortcuts or by going through the Addins menu.
Note that this package is actually a fork from RStudio's original addin package, which you can find here. Be careful though, the support for addins is available only within the most recent release of RStudio! If you want to know more on how you can install these RStudio addins, check out this page.
You can download the addins and keyboard shortcuts here.
When Not To Use the Pipe Operator in RIn the above, you have seen that pipes are definitely something that you should be using when you're programming with R. More specifically, you have seen this by covering some cases in which pipes prove to be very useful! However, there are some situations, outlined by Hadley Wickham in "R for Data Science", in which you can best avoid them:
 Your pipes are longer than (say) ten steps.
In cases like these, it's better to create intermediate objects with meaningful names. It will not only be easier for you to debug your code, but you'll also understand your code better and it'll be easier for others to understand your code.
 You have multiple inputs or outputs.
If you aren't transforming one primary object, but two or more objects are combined together, it's better not to use the pipe.
 You are starting to think about a directed graph with a complex dependency structure.
Pipes are fundamentally linear and expressing complex relationships with them will only result in complex code that will be hard to read and understand.
 You're doing internal package development
Using pipes in internal package development is a nogo, as it makes it harder to debug!
For more reflections on this topic, check out this Stack Overflow discussion. Other situations that appear in that discussion are loops, package dependencies, argument order and readability.
In short, you could summarize it all as follows: keep the two things in mind that make this construct so great, namely, readability and flexibility. As soon as one of these two big advantages is compromised, you might consider some alternatives in favor of the pipes.
Alternatives to Pipes in RAfter all that you have read by you might also be interested in some alternatives that exist in the R programming language. Some of the solutions that you have seen in this tutorial were the following:
 Create intermediate variables with meaningful names;
Instead of chaining all operations together and outputting one single result, break up the chain and make sure you save intermediate results in separate variables. Be careful with the naming of these variables: the goal should always be to make your code as understandable as possible!
 Nest your code so that you read it from the inside out;
One of the possible objections that you could have against pipes is the fact that it goes against the "flow" that you have been accustomed to with base R. The solution is then to stick with nesting your code! But what to do then if you don't like pipes but you also think nesting can be quite confusing? The solution here can be to use tabs to highlight the hierarchy.
 … Do you have more suggestions? Make sure to let me know – Drop me a tweet @willems_karlijn
You have covered a lot of ground in this tutorial: you have seen where %>% comes from, what it exactly is, why you should use it and how you should use it. You've seen that the dplyr and magrittr packages work wonderfully together and that there are even more operators out there! Lastly, you have also seen some cases in which you shouldn't use it when you're programming in R and what alternatives you can use in such cases.
If you're interested in learning more about the Tidyverse, consider DataCamp's Introduction to the Tidyverse course.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Rposts.com. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
An introduction to seplyr
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
by John Mount, WinVector LLC
seplyr is an R package
that supplies improved standard evaluation interfaces for many common data wrangling tasks.
The core of seplyr is a
reskinning of dplyr's
functionality to seplyr conventions (similar to how stringr
reskins the implementing package stringi).
"Standard evaluation" is the name we are using for the value oriented calling convention found in many programming languages. The idea is: functions are only allowed to look at the values of their arguments and not how those values arise (i.e., they can not look at source code or variable names). This evaluation principle allows one to transform, optimize, and reason about code.
It is what lets us say the following two snippets of code are equivalent.
 x < 4; sqrt(x)
 x < 4; sqrt(4)
The mantra is:
"variables can be replaced with their values."
Which is called referential transparency.
"Nonstandard evaluation" is the name used for code that more aggressively inspects its environment. It is often used for harmless tasks such as conveniently setting axis labels on plots. For example, notice the following two plots have different yaxis labels (despite plotting identical values).
plot(x = 1:3) plot(x = c(1,2,3)) dplyr and seplyrThe dplyr authors appear to strongly prefer a nonstandard evaluation interface. Many in the dplyr community have come to think a package such as dplyr requires a nonstandard interface. seplyr started as an experiment to show this is not actually the case.
Syntactically the packages are deliberately similar.
We can take a dplyr pipeline:
suppressPackageStartupMessages(library("dplyr")) starwars %>% select(name, height, mass) %>% arrange(desc(height)) %>% head() ## # A tibble: 6 x 3 ## name height mass ## ## 1 Yarael Poof 264 NA ## 2 Tarfful 234 136 ## 3 Lama Su 229 88 ## 4 Chewbacca 228 112 ## 5 Roos Tarpals 224 82 ## 6 Grievous 216 159And rewrite it in seplyr notation:
library("seplyr") starwars %.>% select_se(., c("name", "height", "mass")) %.>% arrange_se(., "desc(height)") %.>% head(.) ## # A tibble: 6 x 3 ## name height mass ## ## 1 Yarael Poof 264 NA ## 2 Tarfful 234 136 ## 3 Lama Su 229 88 ## 4 Chewbacca 228 112 ## 5 Roos Tarpals 224 82 ## 6 Grievous 216 159For the common dplyrverbs (excluding mutate(), which we will discuss next) all the nonstandard evaluation is saving us is a few quote marks and array designations (and we have ways of getting rid of the need for quote marks). In exchange for this small benefit the nonstandard evaluation is needlessly hard to program over. For instance in the seplyr pipeline it is easy to accept the list of columns from an outside source as a simple array of names.
Until you introduce a substitution system such as rlang or wrapr::let() (which we recommend over rlang and publicly predates the public release of rlang) you have some difficulty writing reusable programs that use the dplyr verbs over "to be specified later" column names.
We are presumably not the only ones who considered this a limitation:
seplyr is an attempt to make programming a primary concern by
making the valueoriented (standard) interfaces the primary interfaces.
The earlier "standard evaluation costs just a few quotes" becomes a bit strained when we talk about the dplyr::mutate() operator. It doesn't seem worth the effort unless you get something more in return. In seplyr 0.5.0 we introduced "the something more": planning over and optimizing dplyr::mutate() sequences.
A seplyr mutate looks like the following:
select_se(., c("name", "height", "mass")) %.>% mutate_se(., c( "height" := "height + 1", "mass" := "mass + 1", "height" := "height + 2", "mass" := "mass + 2", "height" := "height + 3", "mass" := "mass + 3" )) %.>% arrange_se(., "name") %.>% head(.) ## # A tibble: 6 x 3 ## name height mass ## ## 1 Ackbar 186 89 ## 2 Adi Gallia 190 56 ## 3 Anakin Skywalker 194 90 ## 4 Arvel Crynyd NA NA ## 5 Ayla Secura 184 61 ## 6 Bail Prestor Organa 197 NAseplyr::mutate_se() always uses ":=" to denote assignment (dplyr::mutate() prefers "=" for assignment, except in cases where ":=" is required).
The advantage is: once we are go to the trouble to capture the mutate expressions we can treat them as data and apply procedures to them. For example we can regroup and optimize the mutate assignments.
plan < partition_mutate_se( c("name" := "tolower(name)", "height" := "height + 0.5", "height" := "floor(height)", "mass" := "mass + 0.5", "mass" := "floor(mass)")) print(plan) ## $group00001 ## name height mass ## "tolower(name)" "height + 0.5" "mass + 0.5" ## ## $group00002 ## height mass ## "floor(height)" "floor(mass)"Notice seplyr::partition_mutate_se() reordered and regrouped the assignments so that:
 In each group each value used is independent of values produced in other assignments.
 All dependencies between assignments are respected by the group order.
The "safe block" assignments can then be used in a pipeline:
starwars %.>% select_se(., c("name", "height", "mass")) %.>% mutate_seb(., plan) %.>% arrange_se(., "name") %.>% head(.) ## # A tibble: 6 x 3 ## name height mass ## ## 1 ackbar 180 83 ## 2 adi gallia 184 50 ## 3 anakin skywalker 188 84 ## 4 arvel crynyd NA NA ## 5 ayla secura 178 55 ## 6 bail prestor organa 191 NAThis may not seem like much. However, when using dplyr with a SQL database (such as PostgreSQL or even Sparklyr) keeping the number of dependencies in a block low is critical for correct calculation (which is why I recommend keeping dependencies low). Furthermore, on Sparklyr sequences of mutates are simulated by nesting of SQL statements, so you must also keep the number of mutates at a moderate level (i.e., you want a minimal number of blocks or groups).
Machine Generated CodeBecause we are representing mutate assignments as user manipulable data we can also enjoy the benefit of machine generated code. seplyr 0.5.* uses this opportunity to introduce a simple function named if_else_device(). This device uses R's ifelse() statement (which conditionally chooses values in a vectorized form) to implement a more powerful blockif/else statement (which conditionally simultaneously controls blocks of values and assignments; SAS has such a feature).
For example: suppose we want to NAout one of height or mass for each row of the starwars data uniformly at random. This can be written naturally using the if_else_device.
if_else_device( testexpr = "runif(n())>=0.5", thenexprs = "height" := "NA", elseexprs = "mass" := "NA") ## ifebtest_30etsitqqutk ## "runif(n())>=0.5" ## height ## "ifelse( ifebtest_30etsitqqutk, NA, height)" ## mass ## "ifelse( !( ifebtest_30etsitqqutk ), NA, mass)"Notice the if_else_device translates the user code into a sequence of dplyr::mutate() expressions (using only the weaker operator ifelse()). Obviously the user could perform this translation, but if_else_device automates the record keeping and can even be nested. Also many such steps can be chained together and broken into a minimal sequence of blocks by partition_mutate_se() (not forcing a new dplyr::mutate() step for each ifblock encountered).
When we combine the device with the partitioned we get performant databasesafe code where the number of blocks is only the level of variable dependence (and not the possibly much larger number of initial value uses that a straightforward nonreordering split would give; note: seplyr::mutate_se() 0.5.1 and later incorporate the partition_mutate_se() in mutate_se()).
starwars %.>% select_se(., c("name", "height", "mass")) %.>% mutate_se(., if_else_device( testexpr = "runif(n())>=0.5", thenexprs = "height" := "NA", elseexprs = "mass" := "NA")) %.>% arrange_se(., "name") %.>% head(.) ## # A tibble: 6 x 4 ## name height mass ifebtest_wwr9k0bq4v04 ## ## 1 Ackbar NA 83 TRUE ## 2 Adi Gallia 184 NA FALSE ## 3 Anakin Skywalker NA 84 TRUE ## 4 Arvel Crynyd NA NA TRUE ## 5 Ayla Secura 178 NA FALSE ## 6 Bail Prestor Organa 191 NA FALSE ConclusionThe value oriented notation is a bit clunkier, but this is offset by it's greater
flexibility in terms of composition and working parametrically.
Our group has been using seplyr::if_else_device() and seplyr::partition_mutate_se() to greatly simplify porting powerful SAS procedures to R/Sparklyr/Apache Spark clusters.
This is new code, but we are striving to supply sufficient initial documentation and examples.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Comparing smooths in factorsmooth interactions II
(This article was first published on From the Bottom of the Heap  R, and kindly contributed to Rbloggers)
In a previous post I looked at an approach for computing the differences between smooths estimated as part of a factorsmooth interaction using s()’s by argument. When a commonorgarden factor variable is passed to by, gam() estimates a separate smooth for each level of the by factor. Using the (Xp) matrix approach, we previously saw that we can postprocess the model to generate estimates for pairwise differences of smooths. However, the by variable approach of estimating a separate smooth for each level of the factor my be quite inefficient in terms of degrees of freedom used by the model. This is especially so in situations where the estimated curves are quite similar but wiggly; why estimate many separate wiggly smooths when one, plus some simple difference smooths, will do the job just as well? In this post I look at an alternative to estimating separate smooths using an ordered factor for the by variable.
When an ordered factor is passed to by, mgcv does something quite different to the model I described previously, although the end results should be similar. What mgcv does in the ordered factor case is to fit (L1) difference smooths, where (l = 1, , L) are the levels of the factor and (L) the number of levels. These smooths model the difference between the smooth estimated for the reference level and the (l)th level of the factor. Additionally, the by variable smooth doesn’t itself estimate the smoother for the reference level; so we are required to add a second smooth to the model that estimates that particular smooth.
In pseudo code our model would be something like, for ordered factor of,
model < gam(y ~ of + s(x) + s(x, by = of), data = df)As with any by factor smooth we are required to include a parametric term for the factor because the individual smooths are centered for identifiability reasons. The first s(x) in the model is the smooth effect of x on the reference level of the ordered factor of. The second smoother, s(x, by = of) is the set of (L1) difference smooths, which model the smooth differences between the reference level smoother and those of the individual levels (excluding the reference one).
Note that this model still estimates a separator smoother for each level of the ordered factor, it just does it in a different way. The smoother for the reference level is estimated via contribution from s(x) only, whilst the smoothers for the other levels are formed from the additive combination of s(x) and the relevant difference smoother from the set created by s(x, by = of). This is analogous to the situation we have when estimating an ANOVA using the default contrasts and lm(); the intercept is then an estimate of the mean response for the reference level of the factor, and the remaining model coefficients estimate the differences between the mean response of the reference level and that of the other factor levels.
This orderedfactorsmooth interaction is most directly applicable to situations where you have a reference category and you are interested in difference between that category and the other levels. If you are interested in pairwise comparison of smooths you could use the ordered factor approach — it may be more parsimonious than estimating separate smoothers for each level — but you will still need to postprocess the results in a manner similar to that described in the previous post1.
To illustrate the ordered factor difference smooths, I’ll reuse the example from the Geochimica paper I wrote with my colleagues at UCL, Neil Rose, Handong Yang, and Simon Turner (Rose et al., 2012), and which formed the basis for the previous post.
Neil, Handong, and Simon had collected sediment cores from several Scottish lochs and measured metal concentrations, especially of lead (Pb) and mercury (Hg), in sediment slices covering the last 200 years. The aim of the study was to investigate sediment profiles of these metals in three regions of Scotland; north east, north west, and south west. A pair of lochs in each region was selected, one in a catchment with visibly eroding peat/soil, and the other in a catchment without erosion. The different regions represented variations in historical deposition levels, whilst the hypothesis was that cores from eroded and noneroded catchments would show differential responses to reductions in emissions of Pb and Hg to the atmosphere. The difference, it was hypothesized, was that the eroding soil acts as a secondary source of pollutants to the lake. You can read more about it in the paper — if you’re interested but don’t have access to the journal, send me an email and I’ll pass on a pdf.
Below I make use of the following packages
 readr
 dplyr
 ggplot2, and
 mgcv
You’ll more than likely have these installed, but if you get errors about missing packages when you run the code chunk below, install any missing packages and run the chunk again
library('readr') library('dplyr') library('ggplot2') theme_set(theme_bw()) library('mgcv')Next, load the data set and convert the SiteCode variable to a factor
uri < 'https://gist.githubusercontent.com/gavinsimpson/eb4ff24fa9924a588e6ee60dfae8746f/raw/geochimicametals.csv' metals < read_csv(uri, skip = 1, col_types = c('ciccd')) metals < mutate(metals, SiteCode = factor(SiteCode))This is a subset of the data used in Rose et al. (2012) — the Hg concentrations in the sediments for just three of the lochs are included here in the interests of simplicity. The data set contains 5 variables
metals # A tibble: 44 x 5 SiteCode Date SoilType Region Hg 1 CHNA 2000 thin NW 3.843399 2 CHNA 1990 thin NW 5.424618 3 CHNA 1980 thin NW 8.819730 4 CHNA 1970 thin NW 11.417457 5 CHNA 1960 thin NW 16.513540 6 CHNA 1950 thin NW 16.512047 7 CHNA 1940 thin NW 11.188840 8 CHNA 1930 thin NW 11.622222 9 CHNA 1920 thin NW 13.645853 10 CHNA 1910 thin NW 11.181711 # ... with 34 more rows SiteCode is a factor indexing the three lochs, with levels CHNA, FION, and NODH,
 Date is a numeric variable of sediment age per sample,
 SoilType and Region are additional factors for the (natural) experimental design, and
 Hg is the response variable of interest, and contains the Hg concentration of each sediment sample.
Neil gave me permission to make these data available openly should you want to try this approach out for yourself. If you make use of the data for other purposes, please cite the source publication (Rose et al., 2012) and recognize the contribution of the data creators; Handong Yang, Simon Turner, and Neil Rose.
To proceed, we need to create an ordered factor. Here I’m going to use the SoilType variable as that is easier to relate to conditions of the soil (rather than the Site Code I used in the previous post). I set the noneroded level to be the reference and as such the GAM will estimate a full smooth for that level and then smooth differences between the noneroded, and each of the eroded and thin lakes.
metals < mutate(metals, oSoilType = ordered(SoilType, levels = c('noneroded','eroded','thin')))The orderedfactor GAM is fitted to the three lochs using the following
m < gam(Hg ~ oSoilType + s(Date) + s(Date, by = oSoilType), data = metals, method = 'REML')and the resulting smooths can be drawn using the plot() method
plot(m, shade = TRUE, pages = 1, scale = 0, seWithMean = TRUE) Estimated smooth trend for the noneroded site (top, left), and difference smooths reflecting estimated differences between the noneroded site and the eroded site (top, right) and thin soil site (bottom, left), respectively.The smooth in the top left is the reference smooth trend for the noneroded site. The other two smooths are the difference smooths between the noneroded and eroded sites (top right).
It is immediately clear that the difference between the noneroded and eroded sites is not significant under this model. The estimated difference is linear, which suggests the trend in the eroded site is stronger than the one estimated for the noneroded site. However, this difference is not so large as to be an identifiably different trend.
The difference smooth for the thin soil site is considerably different to that estimated for the noneroded site; the principal difference being the much reduced trend in the thin soil site, as indicated by the difference smooth acting in opposition to the estimated trend for the noneroded site.
A nice feature of the ordered factor approach is that inference on these difference can be performed formally and directly using the summary() output of the estimated GAM
summary(m) Family: gaussian Link function: identity Formula: Hg ~ oSoilType + s(Date) + s(Date, by = oSoilType) Parametric coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 13.2231 0.6789 19.478 < 2e16 *** oSoilType.L 1.6948 1.1608 1.460 0.15399 oSoilType.Q 4.2847 1.1990 3.573 0.00114 **  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Approximate significance of smooth terms: edf Ref.df F pvalue s(Date) 4.843 5.914 10.862 2.67e07 *** s(Date):oSoilTypeeroded 1.000 1.000 0.471 0.498 s(Date):oSoilTypethin 3.047 3.779 10.091 1.84e05 ***  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Rsq.(adj) = 0.76 Deviance explained = 82.1% REML = 126.5 Scale est. = 20.144 n = 44The impression we formed about the differences in trends are reinforced with actual test statistics; this is a clear advantage of the orderedfactor approach if your problem suits this different from reference situation.
One feature to note, because we used an ordered factor, the parametric term for oSoilType uses polynomial contrasts: the .L and .Q refer to the linear and quadratic terms used to represent the factor. This is not as easy to identify differences in mean Hg concentration. If you want to retain that readily interpreted parameterisation, use the SoilType factor for the parametric part:
m < gam(Hg ~ SoilType + s(Date) + s(Date, by = oSoilType), data = metals, method = 'REML') summary(m) Family: gaussian Link function: identity Formula: Hg ~ SoilType + s(Date) + s(Date, by = oSoilType) Parametric coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 16.722 1.213 13.788 4.88e15 *** SoilTypenoneroded 4.049 1.684 2.405 0.022115 * SoilTypethin 6.446 1.681 3.835 0.000553 ***  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Approximate significance of smooth terms: edf Ref.df F pvalue s(Date) 4.843 5.914 10.862 2.67e07 *** s(Date):oSoilTypeeroded 1.000 1.000 0.471 0.498 s(Date):oSoilTypethin 3.047 3.779 10.091 1.84e05 ***  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Rsq.(adj) = 0.76 Deviance explained = 82.1% REML = 125.95 Scale est. = 20.144 n = 44Now the output in the parametric terms section is easier to interpret yet we retain the behavior of the reference smooth plus difference smooths part of the fitted GAM.
ReferencesRose, N. L., Yang, H., Turner, S. D., and Simpson, G. L. (2012). An assessment of the mechanisms for the transfer of lead and mercury from atmospherically contaminated organic soils to lake sediments with particular reference to scotland, UK. Geochimica et cosmochimica acta 82, 113–135. doi:http://doi.org/10.1016/j.gca.2010.12.026.

Except now you need to be sure to include the right set of basis functions that correspond to the pair of levels you want to compare. You can’t do that with the function I included in that post; it requires something a bit more sophisticated, but the principles are the same.↩
To leave a comment for the author, please follow the link and comment on their blog: From the Bottom of the Heap  R. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
archivist: Boost the reproducibility of your research
(This article was first published on SmarterPoland.pl » English, and kindly contributed to Rbloggers)
A few days ago Journal of Statistical Software has published our article (in collaboration with Marcin Kosiński) archivist: An R Package for Managing, Recording and Restoring Data Analysis Results.
Why should you care? Let’s see.
Starter
Would you want to retrieve a ggplot2 object with the plot on the right?
Just call the following line in your R console.
archivist::aread('pbiecek/Eseje/arepo/65e430c4180e97a704249a56be4a7b88')
Want to check versions of packages loaded when the plot was created?
Just call
archivist::asession('pbiecek/Eseje/arepo/65e430c4180e97a704249a56be4a7b88')
Wishful Thinking?
When people talk about reproducibility, usually they focus on tools like packrat, MRAN, docker or RSuite. These are great tools, that help to manage the execution environment in which analyses are performed. The common belief is that if one is able to replicate the execution environment then the same R source code will produce same results.
And it’s true in most cases, maybe even more than 99% of cases. Except that there are things in the environment that are hard to control or easy to miss. Things like external system libraries or dedicated hardware or user input. No matter what you will copy, you will never know if it was enough to recreate exactly same results in the future. So you can hope that results will be replicated, but do not bet too high.
Even if some result will pop up eventually, how can you check if it’s the same result as previously?
Literate programming is not enough
There are other great tools like knitr, Sweave, Jupiter or others. The advantage of them is that results are rendered as tables or plots in your report. This gives you chance to verify if results obtained now and some time ago are identical.
But what about more complicated results like a random forest with 100k trees created with 100k variables or some deep neural network. It will be hard to verify by eye that results are identical.
So, what can I do?
The safest solution would be to store copies of every object, ever created during the data analysis. All forks, wrong paths, everything. Along with detailed information which functions with what parameters were used to generate each result. Something like the ultimate TimeMachine or GitHub for R objects.
With such detailed information, every analysis would be auditable and replicable.
Right now the full tracking of all created objects is not possible without deep changes in the R interpreter.
The archivist is the lightweight version of such solution.
What can you do with archivist?
Use the saveToRepo() function to store selected R objects in the archivist repository.
Use the addHooksToPrint() function to automatically keep copies of every plot or model or data created with knitr report.
Use the aread() function to share your results with others or with future you. It’s the easiest way to access objects created by a remote shiny application.
Use the asearch() function to browse objects that fit specified search criteria, like class, date of creation, used variables etc.
Use asession() to access session info with detailed information about versions of packages available during the object creation.
Use ahistory() to trace how given object was created.
Lots of function, do you have a cheatsheet?
Yes! It’s here.
If it’s not enough, find more details in the JSS article.
To leave a comment for the author, please follow the link and comment on their blog: SmarterPoland.pl » English. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
#13: (Much) Faster Package (Re)Installation via Binaries
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
Welcome to the thirteenth post in the ridiculously rapid R recommendation series, or R4 for short. A few days ago we riffed on faster installation thanks to ccache. Today we show another way to get equally drastic gains for some (if not most) packages.
In a nutshell, there are two ways to get your R packages off CRAN. Either you install as a binary, or you use source. Most people do not think too much about this as on Windows, binary is the default. So why wouldn’t one? Precisely. (Unless you are on Windows, and you develop, or debug, or test, or … and need source. Another story.) On other operating systems, however, source is the rule, and binary is often unavailable.
Or is it? Exactly how to find out what is available will be left for another post as we do have a tool just for that. But today, just hear me out when I say that binary is often an option even when source is the default. And it matters. See below.
As a (mostlytoalways) Linux user, I sometimes whistle between my teeth that we "lost all those battles" (i.e. for the desktop(s) or laptop(s)) but "won the war". That topic merits a longer post I hope to write one day, and I won’t do it justice today but my main gist that everybody (and here I mean mostly developers/power users) now at least also runs on Linux. And by that I mean that we all test our code in Linux environments such as e.g. Travis CI, and that many of us run deployments on cloud instances (AWS, GCE, Azure, …) which are predominantly based on Linux. Or on local clusters. Or, if one may dream, the top500 And on and on. And frequently these are Ubuntu machines.
So here is an Ubuntu trick: Install from binary, and save loads of time. As an illustration, consider the chart below. It carries over the logic from the ‘cached vs noncached’ compilation post and contrasts two ways of installing: from source, or as a binary. I use pristine and empty Docker containers as the base, and rely of course on the official rbase image which is supplied by Carl Boettiger and yours truly as part of our Rocker Project (and for which we have a forthcoming R Journal piece I might mention). So for example the timings for the ggplot2 installation were obtained via
time docker run rm ti rbase /bin/bash c 'install.r ggplot2'and
time docker run rm ti rbase /bin/bash c 'aptget update && aptget install y rcranggplot2'Here docker run rm ti just means to launch Docker, in ‘remove leftovers at end’ mode, use terminal and interactive mode and invoke a shell. The shell command then is, respectively, to install a CRAN package using install.r from my littler package, or to install the binary via aptget after updating the apt indices (as the Docker container may have been built a few days or more ago).
Let’s not focus on Docker here—it is just a convenient means to an end of efficiently measuring via a simple (wallclock counting) time invocation. The key really is that install.r is just a wrapper to install.packages() meaning source installation on Linux (as used inside the Docker container). And aptget install ... is how one gets a binary. Again, I will try post another piece to determine how one finds if a suitable binary for a CRAN package exists. For now, just allow me to proceed.
So what do we see then? Well have a look:
A few things stick out. RQuantLib really is a monster. And dplyr is also fairly heavy—both rely on Rcpp, BH and lots of templating. At the other end, data.table is still a marvel. No external dependencies, and just plain C code make the source installation essentially the same speed as the binary installation. Amazing. But I digress.
We should add that one of the source installations also required installing additional libries: QuantLib is needed along with Boost for RQuantLib. Similar for another package (not shown) which needed curl and libcurl.
So what is the upshot? If you can, consider binaries. I will try to write another post how I do that e.g. for Travis CI where all my tests us binaries. (Yes, I know. This mattered more in the past when they did not cache. It still matters today as you a) do not need to fill the cache in the first place and b) do not need to worry about details concerning compilation from source which still throws enough people off. But yes, you can of course survive as is.)
The same approach is equally valid on AWS and related instances: I answered many StackOverflow questions where folks were failing to compile "largeenough" pieces from source on minimal installations with minimal RAM, and running out of resources and failed with bizarre errors. In short: Don’t. Consider binaries. It saves time and trouble.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RVowpalWabbit 0.0.10
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A boring little RVowpalWabbit package update to version 0.0.10 came in response to another CRAN request: We were switching directories to run tests (or examples) which is now discouraged, so we no longer do this as it turns that we can of course refer to the files directly as well. Much cleaner.
No new code or features were added.
We should mention once more that is parallel work ongoing in a higherlevel package interfacing the vw binary — rvw — as well as plan to redo this package via the external libraries. In that sounds interesting to you, please get in touch.
More information is on the RVowpalWabbit page. Issues and bugreports should go to the GitHub issue tracker.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Team Rtus wins Munich Re Datathon with mlr
(This article was first published on mlrorg, and kindly contributed to Rbloggers)
Blog Post “Rtus @ Munich Re Datathon”On the weekend of November 17. – 19. five brave dataknights from team “Rtus and the knights of the data.table” took on the challenge to compete in a datathon organized by Munich Re in its Munichbased innovation lab. Team Rtus was formed in April this year by a bunch of statistics students from LMU with the purpose to prove their dataskills in competitions with other teams from various backgrounds. The datathon was centered around the topic “the effects of climate change and hurricane events on modern reinsurance business” and after two days of very intensive battles with databases, webcrawlers and advanced machine learning modelling (using the famous Rlibrary mlr), they managed to prevail against strong professional competitors and won the best overall price. After victories at the Datafest at Mannheim University and the Telefonica data challenge earlier this year, this was the last step to a hattrick for the core members of team Rtus.
Their way to success was the implementation of an interactive webapp that could serve as a decision support system for underwriting and tariffing units at Munich Re that are dealing with natural catastrophies. Munich Re provided the participants with databases on claims exposure in the Floridabay, footprints of past hurricanes and tons of data on climate variable measurements over the past decades. One of the core tasks of the challenge was to define and calculate the maximum foreseeable loss and the probability of such a worstcase event to take place.
To answer the first question, they created a web app that calculates the expected loss of a hurricane in a certain region. To give the decision makers the opportunity to include their expert domain knowledge, they could interact with the app and set the shape and the location of the hurricane, which was modelled as a spatial Gaussian process. This is depicted in the first screenshot. Due to a NDA the true figures and descriptions in the app were altered.
The team recognised the existence of several critical nuclear power plants in this area. The shocking event of Fukushima in 2011 showed the disastrous effects that storm surges, a side effect of hurricanes, can have in combination with nuclear power plants. To account for this, team Rtus implemented the “Nuclear Power Plant” mode in the app. The function of this NPPmode is shown in this figure:
In a next step, the team tried to provide evidence for the plausibility of such a worstcase event. The following image, based on the footprints of past hurricanes, shows that there were indeed hurricanes crossing the locations of the nuclear power plants:
To answer the second part of the question, they also created a simulation of several weather variables to forecast the probability of such heavy category 5 hurricane events. One rule of reliable statistic modelling is the inclusion of uncertainty measures in any prediction, which was integrated via the prediction intervals. Also, the user of the simulation is able to increase or decrease the estimated temperature trend that underlies the model. This screenshot illustrates the simulation app:
The 36 hours of intensive hacking, discussions, reiterations and fantastic team work combined with the consumption of estimated 19.68 litres of Club Mate were finally rewarded with the first place ranking and a Microsoft Surface Pro for each of the knights. “We will apply this augmentation of our weapon arsenal directly in the next data battle”, one of the knights proudly stated during the awards ceremony.
This portrait shows the tired but happy data knights (from left to right: Niklas Klein, Moritz Herrmann, Jann Goschenhofer, Markus Dumke and Daniel Schalk):
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: mlrorg. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Image Classification on Small Datasets with Keras
(This article was first published on TensorFlow for R, and kindly contributed to Rbloggers)
Deep Learning with RThis post is an excerpt from Chapter 5 of François Chollet’s and J.J. Allaire’s book, Deep Learning with R (Manning Publications). Use the code fccallaire for a 42% discount on the book at manning.com.
Training a convnet with a small datasetHaving to train an imageclassification model using very little data is a common situation, which you’ll likely encounter in practice if you ever do computer vision in a professional context. A “few” samples can mean anywhere from a few hundred to a few tens of thousands of images. As a practical example, we’ll focus on classifying images as dogs or cats, in a dataset containing 4,000 pictures of cats and dogs (2,000 cats, 2,000 dogs). We’ll use 2,000 pictures for training – 1,000 for validation, and 1,000 for testing.
In Chapter 5 of the Deep Learning with R book we review three techniques for tackling this problem. The first of these is training a small model from scratch on what little data you have (which achieves an accuracy of 82%). Subsequently we use feature extraction with a pretrained network (resulting in an accuracy of 90% to 96%) and finetuning a pretrained network (with a final accuracy of 97%). In this post we’ll cover only the second and third techniques.
The relevance of deep learning for smalldata problemsYou’ll sometimes hear that deep learning only works when lots of data is available. This is valid in part: one fundamental characteristic of deep learning is that it can find interesting features in the training data on its own, without any need for manual feature engineering, and this can only be achieved when lots of training examples are available. This is especially true for problems where the input samples are very highdimensional, like images.
But what constitutes lots of samples is relative – relative to the size and depth of the network you’re trying to train, for starters. It isn’t possible to train a convnet to solve a complex problem with just a few tens of samples, but a few hundred can potentially suffice if the model is small and well regularized and the task is simple. Because convnets learn local, translationinvariant features, they’re highly data efficient on perceptual problems. Training a convnet from scratch on a very small image dataset will still yield reasonable results despite a relative lack of data, without the need for any custom feature engineering. You’ll see this in action in this section.
What’s more, deeplearning models are by nature highly repurposable: you can take, say, an imageclassification or speechtotext model trained on a largescale dataset and reuse it on a significantly different problem with only minor changes. Specifically, in the case of computer vision, many pretrained models (usually trained on the ImageNet dataset) are now publicly available for download and can be used to bootstrap powerful vision models out of very little data. That’s what you’ll do in the next section. Let’s start by getting your hands on the data.
Downloading the dataThe Dogs vs. Cats dataset that you’ll use isn’t packaged with Keras. It was made available by Kaggle as part of a computervision competition in late 2013, back when convnets weren’t mainstream. You can download the original dataset from https://www.kaggle.com/c/dogsvscats/data (you’ll need to create a Kaggle account if you don’t already have one – don’t worry, the process is painless).
The pictures are mediumresolution color JPEGs. Here are some examples:
Unsurprisingly, the dogsversuscats Kaggle competition in 2013 was won by entrants who used convnets. The best entries achieved up to 95% accuracy. Below you’ll end up with a 97% accuracy, even though you’ll train your models on less than 10% of the data that was available to the competitors.
This dataset contains 25,000 images of dogs and cats (12,500 from each class) and is 543 MB (compressed). After downloading and uncompressing it, you’ll create a new dataset containing three subsets: a training set with 1,000 samples of each class, a validation set with 500 samples of each class, and a test set with 500 samples of each class.
Following is the code to do this:
original_dataset_dir < "~/Downloads/kaggle_original_data" base_dir < "~/Downloads/cats_and_dogs_small" dir.create(base_dir) train_dir < file.path(base_dir, "train") dir.create(train_dir) validation_dir < file.path(base_dir, "validation") dir.create(validation_dir) test_dir < file.path(base_dir, "test") dir.create(test_dir) train_cats_dir < file.path(train_dir, "cats") dir.create(train_cats_dir) train_dogs_dir < file.path(train_dir, "dogs") dir.create(train_dogs_dir) validation_cats_dir < file.path(validation_dir, "cats") dir.create(validation_cats_dir) validation_dogs_dir < file.path(validation_dir, "dogs") dir.create(validation_dogs_dir) test_cats_dir < file.path(test_dir, "cats") dir.create(test_cats_dir) test_dogs_dir < file.path(test_dir, "dogs") dir.create(test_dogs_dir) fnames < paste0("cat.", 1:1000, ".jpg") file.copy(file.path(original_dataset_dir, fnames), file.path(train_cats_dir)) fnames < paste0("cat.", 1001:1500, ".jpg") file.copy(file.path(original_dataset_dir, fnames), file.path(validation_cats_dir)) fnames < paste0("cat.", 1501:2000, ".jpg") file.copy(file.path(original_dataset_dir, fnames), file.path(test_cats_dir)) fnames < paste0("dog.", 1:1000, ".jpg") file.copy(file.path(original_dataset_dir, fnames), file.path(train_dogs_dir)) fnames < paste0("dog.", 1001:1500, ".jpg") file.copy(file.path(original_dataset_dir, fnames), file.path(validation_dogs_dir)) fnames < paste0("dog.", 1501:2000, ".jpg") file.copy(file.path(original_dataset_dir, fnames), file.path(test_dogs_dir)) Using a pretrained convnetA common and highly effective approach to deep learning on small image datasets is to use a pretrained network. A pretrained network is a saved network that was previously trained on a large dataset, typically on a largescale imageclassification task. If this original dataset is large enough and general enough, then the spatial hierarchy of features learned by the pretrained network can effectively act as a generic model of the visual world, and hence its features can prove useful for many different computervision problems, even though these new problems may involve completely different classes than those of the original task. For instance, you might train a network on ImageNet (where classes are mostly animals and everyday objects) and then repurpose this trained network for something as remote as identifying furniture items in images. Such portability of learned features across different problems is a key advantage of deep learning compared to many older, shallowlearning approaches, and it makes deep learning very effective for smalldata problems.
In this case, let’s consider a large convnet trained on the ImageNet dataset (1.4 million labeled images and 1,000 different classes). ImageNet contains many animal classes, including different species of cats and dogs, and you can thus expect to perform well on the dogsversuscats classification problem.
You’ll use the VGG16 architecture, developed by Karen Simonyan and Andrew Zisserman in 2014; it’s a simple and widely used convnet architecture for ImageNet. Although it’s an older model, far from the current state of the art and somewhat heavier than many other recent models, I chose it because its architecture is similar to what you’re already familiar with and is easy to understand without introducing any new concepts. This may be your first encounter with one of these cutesy model names – VGG, ResNet, Inception, InceptionResNet, Xception, and so on; you’ll get used to them, because they will come up frequently if you keep doing deep learning for computer vision.
There are two ways to use a pretrained network: feature extraction and finetuning. We’ll cover both of them. Let’s start with feature extraction.
Feature extractionFeature extraction consists of using the representations learned by a previous network to extract interesting features from new samples. These features are then run through a new classifier, which is trained from scratch.
As you saw previously, convnets used for image classification comprise two parts: they start with a series of pooling and convolution layers, and they end with a densely connected classifier. The first part is called the convolutional base of the model. In the case of convnets, feature extraction consists of taking the convolutional base of a previously trained network, running the new data through it, and training a new classifier on top of the output.
Why only reuse the convolutional base? Could you reuse the densely connected classifier as well? In general, doing so should be avoided. The reason is that the representations learned by the convolutional base are likely to be more generic and therefore more reusable: the feature maps of a convnet are presence maps of generic concepts over a picture, which is likely to be useful regardless of the computervision problem at hand. But the representations learned by the classifier will necessarily be specific to the set of classes on which the model was trained – they will only contain information about the presence probability of this or that class in the entire picture. Additionally, representations found in densely connected layers no longer contain any information about where objects are located in the input image: these layers get rid of the notion of space, whereas the object location is still described by convolutional feature maps. For problems where object location matters, densely connected features are largely useless.
Note that the level of generality (and therefore reusability) of the representations extracted by specific convolution layers depends on the depth of the layer in the model. Layers that come earlier in the model extract local, highly generic feature maps (such as visual edges, colors, and textures), whereas layers that are higher up extract moreabstract concepts (such as “cat ear” or “dog eye”). So if your new dataset differs a lot from the dataset on which the original model was trained, you may be better off using only the first few layers of the model to do feature extraction, rather than using the entire convolutional base.
In this case, because the ImageNet class set contains multiple dog and cat classes, it’s likely to be beneficial to reuse the information contained in the densely connected layers of the original model. But we’ll choose not to, in order to cover the more general case where the class set of the new problem doesn’t overlap the class set of the original model.
Let’s put this in practice by using the convolutional base of the VGG16 network, trained on ImageNet, to extract interesting features from cat and dog images, and then train a dogsversuscats classifier on top of these features.
The VGG16 model, among others, comes prepackaged with Keras. Here’s the list of imageclassification models (all pretrained on the ImageNet dataset) that are available as part of Keras:
 Xception
 Inception V3
 ResNet50
 VGG16
 VGG19
 MobileNet
Let’s instantiate the VGG16 model.
library(keras) conv_base < application_vgg16( weights = "imagenet", include_top = FALSE, input_shape = c(150, 150, 3) )You pass three arguments to the function:
 weights specifies the weight checkpoint from which to initialize the model.
 include_top refers to including (or not) the densely connected classifier on top of the network. By default, this densely connected classifier corresponds to the 1,000 classes from ImageNet. Because you intend to use your own densely connected classifier (with only two classes: cat and dog), you don’t need to include it.
 input_shape is the shape of the image tensors that you’ll feed to the network. This argument is purely optional: if you don’t pass it, the network will be able to process inputs of any size.
Here’s the detail of the architecture of the VGG16 convolutional base. It’s similar to the simple convnets you’re already familiar with:
summary(conv_base) Layer (type) Output Shape Param # ================================================================ input_1 (InputLayer) (None, 150, 150, 3) 0 ________________________________________________________________ block1_conv1 (Convolution2D) (None, 150, 150, 64) 1792 ________________________________________________________________ block1_conv2 (Convolution2D) (None, 150, 150, 64) 36928 ________________________________________________________________ block1_pool (MaxPooling2D) (None, 75, 75, 64) 0 ________________________________________________________________ block2_conv1 (Convolution2D) (None, 75, 75, 128) 73856 ________________________________________________________________ block2_conv2 (Convolution2D) (None, 75, 75, 128) 147584 ________________________________________________________________ block2_pool (MaxPooling2D) (None, 37, 37, 128) 0 ________________________________________________________________ block3_conv1 (Convolution2D) (None, 37, 37, 256) 295168 ________________________________________________________________ block3_conv2 (Convolution2D) (None, 37, 37, 256) 590080 ________________________________________________________________ block3_conv3 (Convolution2D) (None, 37, 37, 256) 590080 ________________________________________________________________ block3_pool (MaxPooling2D) (None, 18, 18, 256) 0 ________________________________________________________________ block4_conv1 (Convolution2D) (None, 18, 18, 512) 1180160 ________________________________________________________________ block4_conv2 (Convolution2D) (None, 18, 18, 512) 2359808 ________________________________________________________________ block4_conv3 (Convolution2D) (None, 18, 18, 512) 2359808 ________________________________________________________________ block4_pool (MaxPooling2D) (None, 9, 9, 512) 0 ________________________________________________________________ block5_conv1 (Convolution2D) (None, 9, 9, 512) 2359808 ________________________________________________________________ block5_conv2 (Convolution2D) (None, 9, 9, 512) 2359808 ________________________________________________________________ block5_conv3 (Convolution2D) (None, 9, 9, 512) 2359808 ________________________________________________________________ block5_pool (MaxPooling2D) (None, 4, 4, 512) 0 ================================================================ Total params: 14,714,688 Trainable params: 14,714,688 Nontrainable params: 0The final feature map has shape (4, 4, 512). That’s the feature on top of which you’ll stick a densely connected classifier.
At this point, there are two ways you could proceed:

Running the convolutional base over your dataset, recording its output to an array on disk, and then using this data as input to a standalone, densely connected classifier similar to those you saw in part 1 of this book. This solution is fast and cheap to run, because it only requires running the convolutional base once for every input image, and the convolutional base is by far the most expensive part of the pipeline. But for the same reason, this technique won’t allow you to use data augmentation.

Extending the model you have (conv_base) by adding dense layers on top, and running the whole thing end to end on the input data. This will allow you to use data augmentation, because every input image goes through the convolutional base every time it’s seen by the model. But for the same reason, this technique is far more expensive than the first.
In this post we’ll cover the second technique in detail (in the book we cover both). Note that this technique is so expensive that you should only attempt it if you have access to a GPU – it’s absolutely intractable on a CPU.
Feature extraction with data augmentationBecause models behave just like layers, you can add a model (like conv_base) to a sequential model just like you would add a layer.
model < keras_model_sequential() %>% conv_base %>% layer_flatten() %>% layer_dense(units = 256, activation = "relu") %>% layer_dense(units = 1, activation = "sigmoid")This is what the model looks like now:
summary(model) Layer (type) Output Shape Param # ================================================================ vgg16 (Model) (None, 4, 4, 512) 14714688 ________________________________________________________________ flatten_1 (Flatten) (None, 8192) 0 ________________________________________________________________ dense_1 (Dense) (None, 256) 2097408 ________________________________________________________________ dense_2 (Dense) (None, 1) 257 ================================================================ Total params: 16,812,353 Trainable params: 16,812,353 Nontrainable params: 0As you can see, the convolutional base of VGG16 has 14,714,688 parameters, which is very large. The classifier you’re adding on top has 2 million parameters.
Before you compile and train the model, it’s very important to freeze the convolutional base. Freezing a layer or set of layers means preventing their weights from being updated during training. If you don’t do this, then the representations that were previously learned by the convolutional base will be modified during training. Because the dense layers on top are randomly initialized, very large weight updates would be propagated through the network, effectively destroying the representations previously learned.
In Keras, you freeze a network using the freeze_weights() function:
length(model$trainable_weights) [1] 30 freeze_weights(conv_base) length(model$trainable_weights) [1] 4With this setup, only the weights from the two dense layers that you added will be trained. That’s a total of four weight tensors: two per layer (the main weight matrix and the bias vector). Note that in order for these changes to take effect, you must first compile the model. If you ever modify weight trainability after compilation, you should then recompile the model, or these changes will be ignored.
Using data augmentationOverfitting is caused by having too few samples to learn from, rendering you unable to train a model that can generalize to new data. Given infinite data, your model would be exposed to every possible aspect of the data distribution at hand: you would never overfit. Data augmentation takes the approach of generating more training data from existing training samples, by augmenting the samples via a number of random transformations that yield believablelooking images. The goal is that at training time, your model will never see the exact same picture twice. This helps expose the model to more aspects of the data and generalize better.
In Keras, this can be done by configuring a number of random transformations to be performed on the images read by an image_data_generator(). For example:
train_datagen = image_data_generator( rescale = 1/255, rotation_range = 40, width_shift_range = 0.2, height_shift_range = 0.2, shear_range = 0.2, zoom_range = 0.2, horizontal_flip = TRUE, fill_mode = "nearest" )These are just a few of the options available (for more, see the Keras documentation). Let’s quickly go over this code:
 rotation_range is a value in degrees (0–180), a range within which to randomly rotate pictures.
 width_shift and height_shift are ranges (as a fraction of total width or height) within which to randomly translate pictures vertically or horizontally.
 shear_range is for randomly applying shearing transformations.
 zoom_range is for randomly zooming inside pictures.
 horizontal_flip is for randomly flipping half the images horizontally – relevant when there are no assumptions of horizontal asymmetry (for example, realworld pictures).
 fill_mode is the strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift.
Now we can train our model using the image data generator:
# Note that the validation data shouldn't be augmented! test_datagen < image_data_generator(rescale = 1/255) train_generator < flow_images_from_directory( train_dir, # Target directory train_datagen, # Data generator target_size = c(150, 150), # Resizes all images to 150 × 150 batch_size = 20, class_mode = "binary" # binary_crossentropy loss for binary labels ) validation_generator < flow_images_from_directory( validation_dir, test_datagen, target_size = c(150, 150), batch_size = 20, class_mode = "binary" ) model %>% compile( loss = "binary_crossentropy", optimizer = optimizer_rmsprop(lr = 2e5), metrics = c("accuracy") ) history < model %>% fit_generator( train_generator, steps_per_epoch = 100, epochs = 30, validation_data = validation_generator, validation_steps = 50 )Let’s plot the results. As you can see, you reach a validation accuracy of about 96%.
FinetuningAnother widely used technique for model reuse, complementary to feature extraction, is finetuning Finetuning consists of unfreezing a few of the top layers of a frozen model base used for feature extraction, and jointly training both the newly added part of the model (in this case, the fully connected classifier) and these top layers. This is called finetuning because it slightly adjusts the more abstract representations of the model being reused, in order to make them more relevant for the problem at hand.
I stated earlier that it’s necessary to freeze the convolution base of VGG16 in order to be able to train a randomly initialized classifier on top. For the same reason, it’s only possible to finetune the top layers of the convolutional base once the classifier on top has already been trained. If the classifier isn’t already trained, then the error signal propagating through the network during training will be too large, and the representations previously learned by the layers being finetuned will be destroyed. Thus the steps for finetuning a network are as follows:
 Add your custom network on top of an alreadytrained base network.
 Freeze the base network.
 Train the part you added.
 Unfreeze some layers in the base network.
 Jointly train both these layers and the part you added.
You already completed the first three steps when doing feature extraction. Let’s proceed with step 4: you’ll unfreeze your conv_base and then freeze individual layers inside it.
As a reminder, this is what your convolutional base looks like:
summary(conv_base) Layer (type) Output Shape Param # ================================================================ input_1 (InputLayer) (None, 150, 150, 3) 0 ________________________________________________________________ block1_conv1 (Convolution2D) (None, 150, 150, 64) 1792 ________________________________________________________________ block1_conv2 (Convolution2D) (None, 150, 150, 64) 36928 ________________________________________________________________ block1_pool (MaxPooling2D) (None, 75, 75, 64) 0 ________________________________________________________________ block2_conv1 (Convolution2D) (None, 75, 75, 128) 73856 ________________________________________________________________ block2_conv2 (Convolution2D) (None, 75, 75, 128) 147584 ________________________________________________________________ block2_pool (MaxPooling2D) (None, 37, 37, 128) 0 ________________________________________________________________ block3_conv1 (Convolution2D) (None, 37, 37, 256) 295168 ________________________________________________________________ block3_conv2 (Convolution2D) (None, 37, 37, 256) 590080 ________________________________________________________________ block3_conv3 (Convolution2D) (None, 37, 37, 256) 590080 ________________________________________________________________ block3_pool (MaxPooling2D) (None, 18, 18, 256) 0 ________________________________________________________________ block4_conv1 (Convolution2D) (None, 18, 18, 512) 1180160 ________________________________________________________________ block4_conv2 (Convolution2D) (None, 18, 18, 512) 2359808 ________________________________________________________________ block4_conv3 (Convolution2D) (None, 18, 18, 512) 2359808 ________________________________________________________________ block4_pool (MaxPooling2D) (None, 9, 9, 512) 0 ________________________________________________________________ block5_conv1 (Convolution2D) (None, 9, 9, 512) 2359808 ________________________________________________________________ block5_conv2 (Convolution2D) (None, 9, 9, 512) 2359808 ________________________________________________________________ block5_conv3 (Convolution2D) (None, 9, 9, 512) 2359808 ________________________________________________________________ block5_pool (MaxPooling2D) (None, 4, 4, 512) 0 ================================================================ Total params: 14714688You’ll finetune the last three convolutional layers, which means all layers up to block4_pool should be frozen, and the layers block5_conv1, block5_conv2, and block5_conv3 should be trainable.
Why not finetune more layers? Why not finetune the entire convolutional base? You could. But you need to consider the following:
 Earlier layers in the convolutional base encode moregeneric, reusable features, whereas layers higher up encode morespecialized features. It’s more useful to finetune the more specialized features, because these are the ones that need to be repurposed on your new problem. There would be fastdecreasing returns in finetuning lower layers.
 The more parameters you’re training, the more you’re at risk of overfitting. The convolutional base has 15 million parameters, so it would be risky to attempt to train it on your small dataset.
Thus, in this situation, it’s a good strategy to finetune only the top two or three layers in the convolutional base. Let’s set this up, starting from where you left off in the previous example.
unfreeze_weights(conv_base, from = "block5_conv1")Now you can begin finetuning the network. You’ll do this with the RMSProp optimizer, using a very low learning rate. The reason for using a low learning rate is that you want to limit the magnitude of the modifications you make to the representations of the three layers you’re finetuning. Updates that are too large may harm these representations.
model %>% compile( loss = "binary_crossentropy", optimizer = optimizer_rmsprop(lr = 1e5), metrics = c("accuracy") ) history < model %>% fit_generator( train_generator, steps_per_epoch = 100, epochs = 100, validation_data = validation_generator, validation_steps = 50 )Let’s plot our results:
You’re seeing a nice 1% absolute improvement in accuracy, from about 96% to above 97%.
Note that the loss curve doesn’t show any real improvement (in fact, it’s deteriorating). You may wonder, how could accuracy stay stable or improve if the loss isn’t decreasing? The answer is simple: what you display is an average of pointwise loss values; but what matters for accuracy is the distribution of the loss values, not their average, because accuracy is the result of a binary thresholding of the class probability predicted by the model. The model may still be improving even if this isn’t reflected in the average loss.
You can now finally evaluate this model on the test data:
test_generator < flow_images_from_directory( test_dir, test_datagen, target_size = c(150, 150), batch_size = 20, class_mode = "binary" ) model %>% evaluate_generator(test_generator, steps = 50) $loss [1] 0.2840393 $acc [1] 0.972Here you get a test accuracy of 97.2%. In the original Kaggle competition around this dataset, this would have been one of the top results. But using modern deeplearning techniques, you managed to reach this result using only a small fraction of the training data available (about 10%). There is a huge difference between being able to train on 20,000 samples compared to 2,000 samples!
Takeaways: using convnets with small datasetsHere’s what you should take away from the exercises in the past two sections:
 Convnets are the best type of machinelearning models for computervision tasks. It’s possible to train one from scratch even on a very small dataset, with decent results.
 On a small dataset, overfitting will be the main issue. Data augmentation is a powerful way to fight overfitting when you’re working with image data.
 It’s easy to reuse an existing convnet on a new dataset via feature extraction. This is a valuable technique for working with small image datasets.
 As a complement to feature extraction, you can use finetuning, which adapts to a new problem some of the representations previously learned by an existing model. This pushes performance a bit further.
Now you have a solid set of tools for dealing with imageclassification problems – in particular with small datasets.
main { hyphens: inherit; } Deep Learning with RThis post is an excerpt from Chapter 5 of François Chollet’s and J.J. Allaire’s book, Deep Learning with R (Manning Publications). Use the code fccallaire for a 42% discount on the book at manning.com.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: TensorFlow for R. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
R in the Windows Subsystem for Linux
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
R has been available for Windows since the very beginning, but if you have a Windows machine and want to use R within a Linux ecosystem, that's easy to do with the new Fall Creator's Update (version 1709). If you need access to the gcc toolchain for building R packages, or simply prefer the bash environment, it's easy to get things up and running.
Once you have things set up, you can launch a bash shell and run R at the terminal like you would in any Linux system. And that's because this is a Linux system: the Windows Subsystem for Linux is a complete Linux distribution running within Windows. This page provides the details on installing Linux on Windows, but here are the basic steps you need and how to get the latest version of R up and running within it.
First, Enable the Windows Subsystem for Linux option. Go to Control Panel > Programs > Turn Windows Features on or off (or just type "Windows Features" into the search box), and select the "Windows Subsystem for Linux" option. You'll need to reboot, just this once.
Next, you'll need to install your preferred distribution of Linux from the Microsoft Store. If you search for "Linux" in the store, you'll find an entry "Run Linux on Windows" which will provide you with the available distributions. I'm using "Ubuntu", which as of this writing is Ubuntu 16.04 (Xenial Xerus).
Once that's installed you can launch Ubuntu from the Start menu (just like any other app) to open a new bash shell window. The first time you launch, it will take a few minutes to install various components, and you'll also need to create a username and password. This is your Linux username, different from your Windows username. You'll automatically log in when you launch new Ubuntu sessions, but make sure you remember the password — you'll need it later.
From here you can go ahead and install R, but if you use the default Ubuntu repository you'll get an old version of R (R 3.2.3, from 2015). You probably want the latest version of R, so add CRAN as a new package repository for Ubuntu. You'll need to run these three commands as root, so enter the password you created above here if requested:
sudo echo "deb http://cloud.rproject.org/bin/linux/ubuntu xenial/"  sudo tee a /etc/apt/sources.list
sudo aptkey adv keyserver keyserver.ubuntu.com recvkeys E084DAB9
sudo aptget update
(Don't be surprised by the message key E084DAB9: public key "Michael Rutter " imported. That's how Ubuntu signs the R packages.)
Now you're all set to install the latest version of R, which can be done with:
sudo aptget install rbaseAnd that's it! (Once all the dependencies install, anyway, which can take a while the first time.) Now you're all ready to run R from the Linux command line:
Note that you can access files on your Windows system from R — you'll find them at /mnt/c/Users/. This FAQ on the WSL provides other useful tips, and for complete details refer to the Windows for Linux Subsystem Documentation.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Color palettes derived from the Dutch masters
(This article was first published on That’s so Random, and kindly contributed to Rbloggers)
Among tulip fields, canals and sampling cheese,
the museums of the Netherlands are one of its biggest tourist attractions. And
for very good reasons! During the seventeenth century, known as the Dutch Golden
Age, there was an abundance of talented painters. If you ever have
the chance to visit the Rijksmuseum you will be in awe by the landscapes,
households and portraits, painted with incredible detail and beautiful colors.
Rembrandt van Rijn and Johannes Vermeer are the most famous of the seventeenth
century Dutch masters. Both are renowned for their use of light and color,
making encounters with their subjects feel as being in the room with them.
Recently, during the OzUnconference, the beautiful ochRe package was
developed. This package contains color palettes of the wonderful Australian
landscape (which my wife got to witness during our honeymoon last
year). Drawing colors from both works of art and photographs of Australia.
I shamelessly stole both the idea and package structure of ochRe to create dutchmasters. This package contains six color
palettes derived from my six favourites by Van Rijn and Vermeer.
Vermeer’s paintings are renowned for their vivid and light colors. The package
offers palettes from Vermeer’s The Milkmaid,
Girl with a Pearl Earring,
View of Delft, and
The Little Street.
Rembrandt’s paintings on the other hand are more atmospheric, with a lot of dark
colors and subjects that stare you right into the soul. From him you find
palettes derived from the The Anatomy Lesson of Dr. Nicolaes Tulp
and The “Staalmeesters”.
Like ochRe, the package comprises a list with the color palettes. I have added
names to each of the color codes, reflecting the color it represents and moreover
where on the painting the color was taken from. As mentioned, I shamelessly
converted the functions from ochRe into dutchmasters functions. Why invent
something that smarter people already figured out?
Grab the package from github with devtools::install_github("EdwinTh/dutchmasters").
Make sure to install ochRe right away for the handy viz_palette function,
with devtools::install_github("ropenscilabs/ochRe/").
This is what the palette “milkmaid” looks like.
library(dutchmasters) ochRe::viz_palette(dutchmasters[["milkmaid"]])Now to put those beautiful Vermeer shades into action on a ggplot.
library(ggplot2) ggplot(diamonds, aes(color, fill = clarity)) + geom_bar() + scale_fill_dutchmasters(palette = "milkmaid")Or maybe the dark “staalmeesters” colors?
ggplot(diamonds, aes(color, fill = clarity)) + geom_bar() + scale_fill_dutchmasters(palette = "staalmeesters")I leave the other four palettes for you to explore. In the future I will
certainly add more paintings, as the well of the Dutch masters seems bottomless.
A big thanks to the ochRe team, for inspiration and groundwork. I hope that
this package motivates you to further explore the wonderful art of the Dutch
Golden Age.
To leave a comment for the author, please follow the link and comment on their blog: That’s so Random. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Mapping oil production by country in R
(This article was first published on rbloggers – SHARP SIGHT LABS, and kindly contributed to Rbloggers)
This week, here’s another map.
As I’ve mentioned, projects like this – a few hundred lines of code, with a variety of functions from the tidyverse – are excellent practice for intermediate data scientists.
But if you’re a beginner, don’t be intimidated. You can actually learn these tools much faster than you think.
To get to a point where you really know how this works, you’ll need a few things. You will need to know a few things about ggplot2, dplyr and the tidyverse. I recommend that you break this down, line by line. Do you know all of the functions? Do you know how it all works? If you don’t, you might have some more practice to do.
Ultimately, to be “fluent” in data science in R, you should be able to write this code (or something like it) from scratch, more or less from memory.
In any case, there’s a lot going on in this code.
In a separate post, I will explain a few details about how this works, and expand on it somewhat.
In the meantime, if you want to learn to do projects like this, go through the code below. Break it down into small pieces that you can study and practice.
And if you want to know more about data visualization and data science in R, sign up for our email list.
Sign up now, and discover how to rapidly master data scienceTo master data visualization and data science, you need to master the essential tools.
Moreover, to make rapid progress, you need to know what to learn, what not to learn, and you need to know how to practice what you learn.
Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible.
Sign up now for our email list, and you’ll receive regular tutorials and lessons.
You’ll learn:
 How to do data visualization in R
 How to practice data science
 How to apply data visualization to more advanced topics (like machine learning)
 … and more
If you sign up for our email list right now, you’ll also get access to our “Data Science Crash Course” for free.
SIGN UP NOW
Code: mapping oil production by country #============== # LOAD PACKAGES #============== library(tidyverse) library(sf) library(rvest) library(stringr) library(scales) #library(viridis) #============ # SCRAPE DATA #============ df.oil < read_html("https://en.wikipedia.org/wiki/List_of_countries_by_oil_production") %>% html_nodes("table") %>% .[[1]] %>% html_table() #==================== # CHANGE COLUMN NAMES #==================== colnames(df.oil) < c('rank', 'country', 'oil_bbl_per_day') #============================= # WRANGLE VARIABLES INTO SHAPE #============================= # # COERCE 'rank' VARIABLE TO INTEGER # df.oil < df.oil %>% mutate(rank = as.integer(rank)) df.oil %>% glimpse() # # WRANGLE FROM CHARACTER TO NUMERIC: oil_bbl_per_day # df.oil < df.oil %>% mutate(oil_bbl_per_day = oil_bbl_per_day %>% str_replace_all(',','') %>% as.integer()) # inspect df.oil %>% glimpse() #=========================== #CREATE VARIABLE: 'opec_ind' #=========================== df.oil < df.oil %>% mutate(opec_ind = if_else(str_detect(country, 'OPEC'), 1, 0)) #========================================================= # CLEAN UP 'country' #  some country names are tagged as being OPEC countries # and this information is in the country name #  we will strip this information out #========================================================= df.oil < df.oil %>% mutate(country = country %>% str_replace(' \\(OPEC\\)', '') %>% str_replace('\\s{2,}',' ')) # inspect df.oil %>% glimpse() # # EXAMINE OPEC COUNTRIES #  here, we'll just visually inspect # to make sure that the names are correct # df.oil %>% filter(opec_ind == 1) %>% select(country) #================== # REORDER VARIABLES #================== df.oil < df.oil %>% select(rank, country, opec_ind, oil_bbl_per_day) df.oil %>% glimpse() #======== # GET MAP #======== map.world < map_data('world') df.oil #========================== # CHECK FOR JOIN MISMATCHES #========================== anti_join(df.oil, map.world, by = c('country' = 'region')) # rank country opec_ind oil_bbl_per_day # 1 67 Congo, Democratic Republic of the 0 20,000 # 2 47 Trinidad and Tobago 0 60,090 # 3 34 Sudan and South Sudan 0 255,000 # 4 30 Congo, Republic of the 0 308,363 # 5 20 United Kingdom 0 939,760 # 6 3 United States 0 8,875,817 #===================== # RECODE COUNTRY NAMES #===================== map.world %>% group_by(region) %>% summarise() %>% print(n = Inf) # UK # USA # Democratic Republic of the Congo # Trinidad # Sudan # South Sudan df.oil < df.oil %>% mutate(country = recode(country, `United States` = 'USA' , `United Kingdom` = 'UK' , `Congo, Democratic Republic of the` = 'Democratic Republic of the Congo' , `Trinidad and Tobago` = 'Trinidad' , `Sudan and South Sudan` = 'Sudan' #, `Sudan and South Sudan` = 'South Sudan' , `Congo, Republic of the` = 'Republic of Congo' ) ) # # JOIN DATASETS TOGETHER # map.oil < left_join( map.world, df.oil, by = c('region' = 'country')) #===== # PLOT #===== # BASIC (this is a first draft) ggplot(map.oil, aes( x = long, y = lat, group = group )) + geom_polygon(aes(fill = oil_bbl_per_day)) #======================= # FINAL, FORMATTED DRAFT #======================= df.oil %>% filter(oil_bbl_per_day > 822675) %>% summarise(mean(oil_bbl_per_day)) # 3190373 df.oil %>% filter(oil_bbl_per_day < 822675) %>% summarise(mean(oil_bbl_per_day)) # 96581.08 ggplot(map.oil, aes( x = long, y = lat, group = group )) + geom_polygon(aes(fill = oil_bbl_per_day)) + scale_fill_gradientn(colours = c('#461863','#404E88','#2A8A8C','#7FD157','#F9E53F') ,values = scales::rescale(c(100,96581,822675,3190373,10000000)) ,labels = comma ,breaks = c(100,96581,822675,3190373,10000000) ) + guides(fill = guide_legend(reverse = T)) + labs(fill = 'bbl/day' ,title = 'Oil Production by Country' ,subtitle = 'Barrels per day, 2016' ,x = NULL ,y = NULL) + theme(text = element_text(family = 'Gill Sans', color = '#EEEEEE') ,plot.title = element_text(size = 28) ,plot.subtitle = element_text(size = 14) ,axis.ticks = element_blank() ,axis.text = element_blank() ,panel.grid = element_blank() ,panel.background = element_rect(fill = '#333333') ,plot.background = element_rect(fill = '#333333') ,legend.position = c(.18,.36) ,legend.background = element_blank() ,legend.key = element_blank() ) + annotate(geom = 'text' ,label = 'Source: U.S. Energy Information Administration\nhttps://en.wikipedia.org/wiki/List_of_countries_by_oil_production' ,x = 18, y = 55 ,size = 3 ,family = 'Gill Sans' ,color = '#CCCCCC' ,hjust = 'left' )
And here’s the finalized map that the code produces:
To get more tutorials that explain how this code works, sign up for our email list.
SIGN UP NOW
The post Mapping oil production by country in R appeared first on SHARP SIGHT LABS.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: rbloggers – SHARP SIGHT LABS. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Dummy Variable for Examining Structural Instability in Regression: An Alternative to Chow Test
(This article was first published on R Programming – DataScience+, and kindly contributed to Rbloggers)
One of the fast growing economies in the era of globalization is the Ethiopian economy. Among the lower income group countries, it has emerged as one of the rare countries to achieve a double digit growth rate in Grows Domestic Product (GDP). However, there is a great deal of debate regarding the double digit growth rate, especially during the recent global recession period. So, it becomes a question of empirical research whether there is a structural change in the relationship between the GDP of Ethiopia and the regressor (time).
How do we find out that a structural change has in fact occurred? To answer this question, we consider the GDP of Ethiopia (measured on constant 2010 US$) over the period of 1981 to 2015. Like many other countries in the world, Ethiopia has adopted the policy of regulated globalization during the early nineties of the last century. So, our aim is to whether the GDP of Ethiopia has undergone any structural changes following the major policy shift due to adoption of globalization policy. To answer this question, we have two options in statistical and econometric research. The most important classes of tests on structural change are the tests from the generalized fluctuation test framework (Kuan and Hornik, 1995) on the one hand and tests based on F statistics (Hansen, 1992; Andrews, 1993; Andrews and Ploberger, 1994) on the other. The first class includes in particular the CUSUM and MOSUM tests and the fluctuation test, while the Chow and the supF test belong to the latter. A topic that gained more interest rather recently is to monitor structural change, i.e., to start after a history phase (without structural changes) to analyze new observations and to be able to detect a structural change as soon after its occurrence as possible.
Let us divide the whole study period in two subperiods
 1. preglobalization (1981 – 1991)
2. postglobalization (19922015)
$$ \text {Pre – globalization period (1981 – 1991):} ~~~ lnGDP=\beta_{01} +β_{11} t +u_1 \\ \text {Post – globalization period (1992 – 2015):}~~~ lnGDP=\beta_{02} +\beta_{12} t+u_2 \\ \text {Whole period (1981 – 2015):} ~~~ lnGDP=\beta_0 +\beta_1 t +u $$
The regression for the whole period assumes that there is no difference between the two time periods and, therefore, estimates the GDP growth rate for the entire time period. In other words, this regression assumes that the intercept, as well as the slope coefficient, remains the same over the entire period; that is, there is no structural change. If this is, in fact, the situation, then
\( \beta_{01} =\beta_{02}=\beta_0 ~~~~and~~~~ \beta_{11} =\beta_{12} =\beta_1 \)
The first two regression lines assume that the regression parameters are different in two subperiod periods, that is, there is structural instability either because of changes in slope parameters or intercept parameters.
Solution by ChowTo apply Chow test to visualize the structural changes empirically, the following assumptions are required:
 i. The error terms for the two subperiod regressions are normally distributed with the same variance;
ii. The two error terms are independently distributed.
iii. The breakpoint at which the structural stability to be examined should be known apriori.
The Chow test examines the following set of hypotheses:
$$ H_0 : \text {There is no structural change} \\ against \\ H_1 : \text {There is a structural change}
$$
Step 1: Estimate third regression assuming that there is no parameter instability, and obtain the Residual Sum Squares \( RSS_R ~~ with ~~~df =(n_1 +n_2−2k ) \\ where~~k=\text {No of parameters estimated}\\ n_1= \text {No of observations in first period}\\n_2=\text {No of observations in second subperiods} \)
So, in present case \(k = 2,\) as there are two parameters (intercept term and slope coefficient).
We call this Restricted Residual Sum Squares as it assumes the restriction of structural stability, that is, \( \beta_{01}=\beta_{02}=\beta_{0} ~~~and~~~\beta_{11}=\beta_{12}=\beta_1 \)
Step 2: Estimate the first and second regressions assuming that there is parameter instability and obtain the respective Residual Sum Squares as
\( RSS_1 = \text {Residual Sum Squares for the first subperiod regression with}~~ df =n_1−k \\RSS_2 = \text {Residual Sum Squares for the second subperiod regression with}~~ df =n_2−k \)
Step 3: As the two sets of samples supposed to be independent, we can add $RSS_1$ and $RSS_2$ and the resultant Residual Sum Squares may be called the Unrestricted Residual Sum of Squares, that is, \( RSS_{UR}=RSS_1 + RSS_2 ~~~with ~~~df = (n_1 + n_2 2k) \)
The test statistic to be used is given as:
\( F= \frac {\left (RSS_R – RSS_{UR} \right)/k}{\left (RSS_{UR}\right)/(n_1 +n_2 2k)} \)
Under the assumption of true null hypothesis, this follows Fdistribution.
Now if this Ftest is significant, we can reject the null hypothesis of no structural instability and conclude that the fluctuations is the GGP is high enough to believe that it leads to structural instability in the GDP growth path.
Importing the Data into RImporting the data saved in csv format gdp_ethiopia
ethio_gdp<read.csv(file.choose(), header = TRUE, sep = ",", skip = 0) # To set the data as time series gdp_ethio<ts(ethio_gdp$GDP, frequency = 1, start = c(1981), end = c(2015))Let us have a look at the scatter plot of GDP against time using ggplot2 package
library(ggplot2) ggplot(ethio_gdp,aes( YEAR, GDP))+geom_line(col="red")+labs(title="GDP of Ethiopia over 1981 to 2015")As it observed from the figure that the growth of GDP gets momentum after the adoption of new policy regime, especially after the 2000. Until 199293, the economy was looking as a stagnant as the GDP curve was more or less parallel to the xaxis. It is only after 1993 that the GDP started to grow and the growth became more prominent after 2000. To confirm this, we will apply the Chow test by considering the breakpoint at the year 1992 when the new economic policy was adopted by the Ethiopian government.
To fit the model of GDP growth path by assuming that there is no structural instability in the model, we need to create the time variable by using the following code to do so:
For estimating the growth of GDP for the whole period
model11< lm(log(GDP)~t , data = ethio_gdp) summary(model11) Call: lm(formula = log(GDP) ~ t, data = ethio_gdp) Residuals: Min 1Q Median 3Q Max 0.28379 0.17673 0.01783 0.16134 0.34789 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 22.495383 0.066477 338.4 <2e16 *** t 0.050230 0.003221 15.6 <2e16 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.1924 on 33 degrees of freedom Multiple Rsquared: 0.8805, Adjusted Rsquared: 0.8769 Fstatistic: 243.2 on 1 and 33 DF, pvalue: < 2.2e16Thus, we see that over the entire period, the GDP is growing at an annual growth rate of 5 per cent.
We’ll use this object ‘model1’ to apply Chow test in R. For this, we need to load the package called ‘strucchange’ by using the following code:
The breakpoint occurs in 1992 which will coincide with 12 in the sequence 1 to 35 corresponding to year 1981 to 2015. The code and results are shown below:
sctest(log(gdp_ethio)~t, type = "Chow", point = 12) Chow test data: log(gdp_ethio) ~ t F = 51.331, pvalue = 1.455e10The above result shows that the Fvalue is 51.331 and the corresponding pvalue is much less than the level of significance (0.05) implying that the test is significant. This confirms that there is structural instability in the GDP growth path of Ethiopia at the breakpoint 1992.
However, we cannot tell whether the difference in the two regressions is because of differences in the intercept terms or the slope coefficients or both. Very often this knowledge itself is very useful. For analyzing such situation, we have the following four possibilities:
I. Coincident Regression where both the intercept and the slope coefficients are the same in the two subperiod regressions;
II. Parallel Regression where only the intercepts in the two regressions are different but the slopes are the same;
III. Concurrent Regression where the intercepts in the two regressions are the same, but the slopes are different; and
IV. Dissimilar Regression where both the intercepts and slopes in the two regressions are different.
Let us examine case by case.
Suppose the economic reforms do not influence the slope parameter but instead simply affect the intercept term.
Following the economic reforms in 1992, we have the following regression models in two subperiods – preglobalization (1981 – 1991) and postglobalization (19922015).
\( \begin {aligned} \text {Preglobalization (1981 – 1991):}~~~& lnGDP=β_{01} +β_{11} t +u_1\\ \text {Post – globalization period (1992 – 2015):}~~~& lnGDP=β_{02} +β_{12} t+u_2\\ \text {Whole period (1981 – 2015):}~~~& lnGDP=β_0 +β_1 t +u \end {aligned} \)
If there is no structural change:
\( \beta_{01}=\beta_{02}=\beta_0 \\and \\ \beta_{11}=\beta_{12}=\beta_1 \)
If there is structural change and that affects only the intercept terms, not the slope coefficients, we have:
\( \beta_{01}\ne\beta_{02}\\and \\ \beta_{11}=\beta_{12} \)
To capture this effect, the dummy variable \(D\) is included in the following manner:
\( \begin {aligned} lGDP&=\beta_0+\beta_1 t+\beta_2 D+u\\where~~D&=1~~\text {for the postreform period (subperiod 2)} \\ &=0~~ \text {for the postreform period (subperiod 2)} \end {aligned} \)
The difference in the logarithm of GDP between two sub periods is given by the coefficient \(\beta_2\).
If \(D = 1\), then for the postreform period, \( \widehat {lGDP} = \hat\beta_0+\hat\beta_1 t+\hat\beta_2 \)
If \(D=0,\) then for the prereform period, \( \widehat {lGDP} = \hat\beta_0+\hat\beta_1 t\)
Using the ‘gdp_ethio.csv’ datafile, this model can be estimated. The R code along with the explanation is produced below:
attach(gdp) gdp$t<seq(1:35) gdp$lGDP<log(GDP) gdp$D=1992,1,0) attach(gdp) gdp$D<factor(D, levels = c(0,1), labels = c("pre","post")) attach(gdp) model5 summary(model5) Call: lm(formula = lGDP ~ t + D, data = gdp) Residuals: Min 1Q Median 3Q Max 0.309808 0.080889 0.006459 0.087419 0.257020 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 22.505310 0.046417 484.847 < 2e16 *** t 0.068431 0.003783 18.089 < 2e16 *** Dpost 0.492244 0.082302 5.981 1.15e06 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.1343 on 32 degrees of freedom Multiple Rsquared: 0.9436, Adjusted Rsquared: 0.9401 Fstatistic: 267.6 on 2 and 32 DF, pvalue: < 2.2e16So, the estimated regression model is given by
\( \widehat {lGDP}= 22.505+0.068t0.492D \)
This would imply that
For the postreform period: $$ \widehat {lGDP}= 22.013+0.068t $$
For the prereform period: $$ \widehat {lGDP}= 22.505+0.068t $$
To graphically visualize this, use the following code to produce the following graphs:
library(ggplot2) ggplot(data=gdp, mapping=aes(x=t, y=lGDP, color=D)) + geom_point() + geom_line(mapping=aes(y=pred)) III. Concurrent Regression where the intercepts in the two regressions are the same, but the slopes are differentNow we assume that the economic reforms do not influence the intercept term \(\beta_0\), but simply affect the slope parameter. That is, in terms of our present example, we can say that
\( β_{01} =β_{02} \\and\\ β_{11} \ne β_{12} \)
To capture the effects of such changes in slope parameters, it is necessary to add the product of time variable \(t\) and the dummy variable \(D\). The new variable \(tD\) is called the interaction variable or slope dummy variable, since it allows for a change in the slope of the relationship.
The modified growth estimation model is:
\( lGDP=\beta_0 +\beta_1t+\beta_2 tD+u \)
Obviously, when \(D = 1,\) then \( \widehat{lGDP}=\hat\beta_0 +\hat\beta_1t+\hat\beta_2 t \\\implies \widehat{lGDP}=\hat\beta_0 +(\hat\beta_1+\hat\beta_2)t \)
when \(D=0\), then \( \widehat{lGDP}=\hat\beta_0+\hat\beta_1t \)
The necessary R codes and the results are produced below:
attach(gdp) gdp$D1=1992,1,0) attach(gdp) gdp$tD attach(gdp) model6<lm(lGDP~t+tD,data = gdp) summary(model6) Call: lm(formula = lGDP ~ t + tD, data = gdp) Residuals: Min 1Q Median 3Q Max 0.26778 0.14726 0.04489 0.13919 0.35627 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 22.40377 0.09125 245.521 < 2e16 *** t 0.07072 0.01458 4.851 3.06e05 *** tD 0.01721 0.01195 1.440 0.16  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.1894 on 32 degrees of freedom Multiple Rsquared: 0.8878, Adjusted Rsquared: 0.8808 Fstatistic: 126.6 on 2 and 32 DF, pvalue: 6.305e16In the above results, it is evident that the interaction term (tD) is not statistically significant as the corresponding pvalue is much higher than the level of significance (5% or 0.05). This would imply that the economic reforms do not affect the slope term. However, for visualizing the results graphically we proceed as follows:
The estimated regression model is:
\( \widehat {lGDP} = 22.404+0.071t0.017tD \)
This would imply the following growth functions for both the periods.
\( \widehat {lGDP} = 22.404+0.054t~~~~~~\text {for the postreform period}\\ \widehat {lGDP} = 22.404+0.071t~~~~~~\text {for the prereform period} \)
Notice that the slope, rather than the intercept of the growth function changed. This can graphically shown below:
curve((22.404+0.054*x), xlab = "t", ylab = "", xlim = c(1,35), col="red", main="Fig.12.2: Concurrent Regression Using Dummy Variable") curve(22.404+0.071*x, xlab = "t", ylab = "", xlim = c(1,35), col="blue", add = TRUE) legend( "bottomright", legend = c("post", "pre"), fill = c("red", "blue") ) IV. Dissimilar Regression (both the intercepts and slopes are different)In this case, both the intercept and the slope of the growth function changed simultaneously. The statistical model to capture such changes can be represented by the following equation.
The modified statistical model to capture such changes can be represented by the following equation.
\( lGDP=\beta_0 +\beta_1 t +\beta_2 D+\beta_3 tD+u \)
Obviously, when \(D=1\), then for the postreform period: \( \widehat{lGDP}=\hat\beta_0 +\hat\beta_1 t+\hat\beta_2 +\hat\beta_3 t +u
\\=(\hat\beta_0 +\hat\beta_2 )+(\hat\beta_1 + \hat\beta_3) t +u \)
when \(D=0\), then for the prereform period: \( \widehat{lGDP}= \hat\beta_0 + \hat\beta_1 t +u \)
The model is estimated using the following codes and the results are produced below:
Thus, our estimated regression model is written as:
\( \widehat {lGDP} = 22.808 + 0.018 t 0.908 D +0.0552 tD \)
\( \widehat {lGDP} = 21.9+0.0732t~~~~~~\text {for the postreform period}\\ \widehat {lGDP} = 22.808+0.018t~~~~~~\text {for the prereform period} \)
To visualize the matter, the following codes are used to produce the figure below.
curve(21.9+0.0732*x, xlab = "t", ylab = "", xlim = c(1,35), col="blue") curve(22.808+0.018*x, xlab = "t", ylab = "", xlim = c(1,35), col="red", add = TRUE) legend( "topleft", legend = c("post", "pre"), fill = c("blue","red") )Thats all, if you have questions post comments below.
Related Post
 Outliers Detection and Intervention Analysis
 Spark DataFrames: Exploring Chicago Crimes
 Image Processing and Manipulation with magick in R
 Analyzing the Bible and the Quran using Spark
 Predict Customer Churn – Logistic Regression, Decision Tree and Random Forest
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Serving shiny apps in the internet with your own server
(This article was first published on R and Finance, and kindly contributed to Rbloggers)
–
In this post I’ll share my experience in setting up my own virtual
server for hosting shiny applications in Digital
Ocean. First, context. I’m working in a
academic project where we build a package for accessing financial data
and corporate events directly from B3, the Brazilian financial exchange.
The objective is to set a reproducible standard and facilite data
acquisition of a large, and very interesting, dataset. The result is
GetDFPData.
Since many researchers and students in Brazil are not knowledgeable in
R, we needed to make it easier for people to use the software. A shiny
app hosted in the internet is perfect for that. The app is available at
http://www.msperlin.com/shiny/GetDFPData/.
You can host your own shiny app for free in
www.shiny.io, but that comes with some
usage limitations. While searching
for alternatives, I’ve found this great
post
by Dean Attali that clearly explains the
steps for setting up a web server in a virtual machine from Digital
Ocean. Despite being a 2015 post, it works perfectly. The best thing is
that it only costs $5 per month, with the first two months for free.
Once the server is up and running, I can control it using ssh
(terminal), send/retrieve files with github and dropbox, and run code
with Rstudio server, which is basically a Rstudio session in a browser.
Now I have my own corner in the internet, where I can server all my
shiny apps with full control. I’m not only using the server for hosting
web applications, but also running CRON jobs for periodically gather
data for another project, which has to run a R script every day. No
longer I have to worry or remember to turn on my computer every day. I’m
sure I’ll find many more uses to it in the future.
I’m very happy in choosing the longer, more difficult path in publishing
a shiny app in the internet. I learned a lot along the way. At first it
felt overwhelming to configure every aspect of the server. But, if you
know a bit of Linux, setting up your own webserver is not that
difficult. I recommend everyone to give it a try.
To leave a comment for the author, please follow the link and comment on their blog: R and Finance. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Introduction to Skewness
(This article was first published on R Views, and kindly contributed to Rbloggers)
In previous posts here, here, and here, we spent quite a bit of time on portfolio volatility, using the standard deviation of returns as a proxy for volatility. Today we will begin to a twopart series on additional statistics that aid our understanding of return dispersion: skewness and kurtosis. Beyond being fancy words and required vocabulary for CFA level 1, these two concepts are both important and fascinating for lovers of returns distributions. For today, we will focus on skewness.
Skewness is the degree to which returns are asymmetric around the mean. Since a normal distribution is symmetric around the mean, skewness can be taken as one measure of how returns are not distributed normally. Why does skewness matter? If portfolio returns are right, or positively, skewed, it implies numerous small negative returns and a few large positive returns. If portfolio returns are left, or negatively, skewed, it implies numerous small positive returns and few large negative returns. The phrase “large negative returns” should trigger Pavlovian sweating for investors, even if it’s preceded by a diminutive modifier like “just a few”. For a portfolio manager, a negatively skewed distribution of returns implies a portfolio at risk of rare but large losses. This makes us nervous and is a bit like saying, “I’m healthy, except for my occasional massive heart attack.”
Let’s get to it.
First, have a look at one equation for skewness:
\[Skew=\sum_{t=1}^n (x_i\overline{x})^3/n \bigg/ (\sum_{t=1}^n (x_i\overline{x})^2/n)^{3/2}\]
Skew has important substantive implications for risk, and is also a concept that lends itself to data visualization. In fact, I find the visualizations of skewness more illuminating than the numbers themselves (though the numbers are what matter in the end). In this section, we will cover how to calculate skewness using xts and tidyverse methods, how to calculate rolling skewness, and how to create several data visualizations as pedagogical aids. We will be working with our usual portfolio consisting of:
+ SPY (S&P500 fund) weighted 25% + EFA (a nonUS equities fund) weighted 25% + IJS (a smallcap value fund) weighted 20% + EEM (an emergingmkts fund) weighted 20% + AGG (a bond fund) weighted 10%Before we can calculate the skewness, we need to find portfolio monthly returns, which was covered in this post.
Building off that previous work, we will be working with two objects of portfolio returns:
+ portfolio_returns_xts_rebalanced_monthly (an xts of monthly returns) + portfolio_returns_tq_rebalanced_monthly (a tibble of monthly returns)Let’s begin in the xts world and make use of the skewness() function from PerformanceAnalytics.
library(PerformanceAnalytics) skew_xts < skewness(portfolio_returns_xts_rebalanced_monthly$returns) skew_xts ## [1] 0.1710568Our portfolio is relatively balanced, and a slight negative skewness of 0.1710568 is unsurprising and unworrisome. However, that final number could be omitting important information and we will resist the temptation to stop there. For example, is that slight negative skew being caused by one very large negative monthly return? If so, what happened? Or is it caused by several mediumsized negative returns? What caused those? Were they consecutive? Are they seasonal? We need to investigate further.
Before doing so and having fun with data visualization, let’s explore the tidyverse methods and confirm consistent results.
We will make use of the same skewness() function, but because we are using a tibble, we use summarise() as well and call summarise(skew = skewness(returns). It’s not necessary, but we are also going to run this calculation by hand, the same as we have done with standard deviation. Feel free to delete the byhand section from your code should this be ported to enterprise scripts, but keep in mind that there is a benefit to forcing ourselves and loved ones to write out equations: it emphasizes what those nice builtin functions are doing under the hood. If a client, customer or risk officer were ever to drill into our skewness calculations, it would be nice to have a superfirm grasp on the equation.
library(tidyverse) library(tidyquant) skew_tidy < portfolio_returns_tq_rebalanced_monthly %>% summarise(skew_builtin = skewness(returns), skew_byhand = (sum((returns  mean(returns))^3)/length(returns))/ ((sum((returns  mean(returns))^2)/length(returns)))^(3/2)) %>% select(skew_builtin, skew_byhand)Let’s confirm that we have consistent calculations.
skew_xts ## [1] 0.1710568 skew_tidy$skew_builtin ## [1] 0.1710568 skew_tidy$skew_byhand ## [1] 0.1710568The results are consistent using xts and our tidyverse, byhand methods. Again, though, that singular number 0.1710568 does not fully illuminate the riskiness or distribution of this portfolio. To dig deeper, let’s first visualize the density of returns with stat_density from ggplot2.
portfolio_density_plot < portfolio_returns_tq_rebalanced_monthly %>% ggplot(aes(x = returns)) + stat_density(geom = "line", alpha = 1, colour = "cornflowerblue") portfolio_density_plotThe slight negative skew is a bit more evident here. It would be nice to shade the area that falls below some threshold again, and let’s go with the mean return. To do that, let’s create an object called shaded_area using ggplot_build(portfolio_density_plot)$data[[1]] %>% filter(x < mean(portfolio_returns_tq_rebalanced_monthly$returns)). That snippet will take our original ggplot object and create a new object filtered for x values less than mean return. Then we use geom_area to add the shaded area to portfolio_density_plot.
shaded_area_data < ggplot_build(portfolio_density_plot)$data[[1]] %>% filter(x < mean(portfolio_returns_tq_rebalanced_monthly$returns)) portfolio_density_plot_shaded < portfolio_density_plot + geom_area(data = shaded_area_data, aes(x = x, y = y), fill="pink", alpha = 0.5) portfolio_density_plot_shadedThe shaded area highlights the mass of returns that fall below the mean. Let’s add a vertical line at the mean and median, and some explanatory labels. This will help to emphasize that negative skew indicates a mean less than the median.
First, create variables for mean and median so that we can add a vertical line.
median < median(portfolio_returns_tq_rebalanced_monthly$returns) mean < mean(portfolio_returns_tq_rebalanced_monthly$returns)We want the vertical lines to just touch the density plot so we once again use a call to ggplot_build(portfolio_density_plot)$data[[1]].
median_line_data < ggplot_build(portfolio_density_plot)$data[[1]] %>% filter(x <= median)Now we can start adding aesthetics to the latest iteration of our graph, which is stored in the object portfolio_density_plot_shaded.
portfolio_density_plot_shaded + geom_segment(aes(x = 0, y = 1.9, xend = .045, yend = 1.9), arrow = arrow(length = unit(0.5, "cm")), size = .05) + annotate(geom = "text", x = .02, y = .1, label = "returns <= mean", fontface = "plain", alpha = .8, vjust = 1) + geom_segment(data = shaded_area_data, aes(x = mean, y = 0, xend = mean, yend = density), color = "red", linetype = "dotted") + annotate(geom = "text", x = mean, y = 5, label = "mean", color = "red", fontface = "plain", angle = 90, alpha = .8, vjust = 1.75) + geom_segment(data = median_line_data, aes(x = median, y = 0, xend = median, yend = density), color = "black", linetype = "dotted") + annotate(geom = "text", x = median, y = 5, label = "median", fontface = "plain", angle = 90, alpha = .8, vjust = 1.75) + ggtitle("Density Plot Illustrating Skewness")We added quite a bit to the chart, possibly too much, but it’s better to be overinclusive now to test different variants. We can delete any of those features when using this chart later, or refer back to these lines of code should we ever want to reuse some of the aesthetics.
At this point, we have calculated the skewness of this portfolio throughout its history, and done so using three methods. We have also created an explanatory visualization.
Similar to the portfolio standard deviation, though, our work is not complete until we look at rolling skewness. Perhaps the first two years of the portfolio were positive skewed, and last two were negative skewed but the overall skewness is slightly negative. We would like to understand how the skewness has changed over time, and in different economic and market regimes. To do so, we calculate and visualize the rolling skewness over time.
In the xts world, calculating rolling skewness is almost identical to calculating rolling standard deviation, except we call the skewness() function instead of StdDev(). Since this is a rolling calculation, we need a window of time for each skewness; here, we will use a sixmonth window.
window < 6 rolling_skew_xts < na.omit(rollapply(portfolio_returns_xts_rebalanced_monthly, window, function(x) skewness(x)))Now we pop that xts object into highcharter for a visualization. Let’s make sure our yaxis range is large enough to capture the nature of the rolling skewness fluctuations by setting the range to between 3 and 3 with hc_yAxis(..., max = 3, min = 3). I find that if we keep the range from 1 to 1, it makes most rolling skews look like a roller coaster.
library(highcharter) highchart(type = "stock") %>% hc_title(text = "Rolling") %>% hc_add_series(rolling_skew_xts, name = "Rolling skewness", color = "cornflowerblue") %>% hc_yAxis(title = list(text = "skewness"), opposite = FALSE, max = 3, min = 3) %>% hc_navigator(enabled = FALSE) %>% hc_scrollbar(enabled = FALSE){"x":{"hc_opts":{"title":{"text":"Rolling"},"yAxis":{"title":{"text":"skewness"},"opposite":false,"max":3,"min":3},"credits":{"enabled":false},"exporting":{"enabled":false},"plotOptions":{"series":{"turboThreshold":0},"treemap":{"layoutAlgorithm":"squarified"},"bubble":{"minSize":5,"maxSize":25}},"annotationsOptions":{"enabledButtons":false},"tooltip":{"delayForDisplay":10},"series":[{"data":[[1375228800000,0.0511810666724179],[1377820800000,0.0965061660700991],[1380499200000,0.163579005279668],[1383177600000,0.0111271137345324],[1385683200000,0.329175858151934],[1388448000000,0.682693636723609],[1391126400000,0.299180325964799],[1393545600000,1.05714468932288],[1396224000000,1.12678024437719],[1398816000000,0.907478344540586],[1401408000000,0.923291364103329],[1404086400000,0.994115098616124],[1406764800000,0.106737545935232],[1409270400000,0.963528231829294],[1412035200000,0.750713278430844],[1414713600000,0.912281062412703],[1417132800000,0.767038118453993],[1419984000000,0.252362341390629],[1422576000000,0.281193932085033],[1424995200000,0.201889248675472],[1427760000000,0.758665508271901],[1430352000000,0.809337297292621],[1432857600000,0.96467960406329],[1435622400000,0.894828618685645],[1438300800000,0.917549882978621],[1440979200000,0.859694106578632],[1443571200000,0.416692593249242],[1446163200000,0.664519170731158],[1448841600000,0.6350949021651],[1451520000000,0.71546907245901],[1454025600000,0.99556931274623],[1456704000000,1.03166733380321],[1459382400000,0.36259225564619],[1461888000000,0.804397321645972],[1464652800000,0.81695627686931],[1467244800000,0.56636688270716],[1469750400000,0.966848135339037],[1472601600000,1.05049406444632],[1475193600000,1.28796539782198],[1477872000000,0.478246331580933],[1480464000000,0.0334553106079278],[1483056000000,0.314054096729488],[1485820800000,1.18653628983962],[1488240000000,1.52309795733004],[1490918400000,1.65501284162553],[1493337600000,0.332506398750578],[1496188800000,0.633713296827805],[1498780800000,0.534386548695677],[1501459200000,0.719857327874752],[1504137600000,0.0682564673333931],[1506643200000,0.122507709915633],[1509408000000,0.340860682326799],[1512000000000,0.444064183403134],[1513036800000,0.394434127262236]],"name":"Rolling skewness","color":"cornflowerblue"}],"navigator":{"enabled":false},"scrollbar":{"enabled":false}},"theme":{"chart":{"backgroundColor":"transparent"}},"conf_opts":{"global":{"Date":null,"VMLRadialGradientURL":"http =//code.highcharts.com/list(version)/gfx/vmlradialgradient.png","canvasToolsURL":"http =//code.highcharts.com/list(version)/modules/canvastools.js","getTimezoneOffset":null,"timezoneOffset":0,"useUTC":true},"lang":{"contextButtonTitle":"Chart context menu","decimalPoint":".","downloadJPEG":"Download JPEG image","downloadPDF":"Download PDF document","downloadPNG":"Download PNG image","downloadSVG":"Download SVG vector image","drillUpText":"Back to {series.name}","invalidDate":null,"loading":"Loading...","months":["January","February","March","April","May","June","July","August","September","October","November","December"],"noData":"No data to display","numericSymbols":["k","M","G","T","P","E"],"printChart":"Print chart","resetZoom":"Reset zoom","resetZoomTitle":"Reset zoom level 1:1","shortMonths":["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"thousandsSep":" ","weekdays":["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]}},"type":"stock","fonts":[],"debug":false},"evals":[],"jsHooks":[]}
For completeness of methods, we can calculate rolling skewness in a tibble and then use ggplot.
We will make use of rollapply() from within tq_mutate in tidyquant.
rolling_skew_tidy < portfolio_returns_tq_rebalanced_monthly %>% tq_mutate(select = returns, mutate_fun = rollapply, width = window, FUN = skewness, col_rename = "skew")rolling_skew_tidy is ready for ggplot. ggplot is not purposebuilt for time series plotting, but we can set aes(x = date, y = skew) to make the xaxis our date values.
library(scales) theme_update(plot.title = element_text(hjust = 0.5)) rolling_skew_tidy %>% ggplot(aes(x = date, y = skew)) + geom_line(color = "cornflowerblue") + ggtitle("Rolling Skew with ggplot") + ylab(paste("Rolling", window, "month skewness", sep = " ")) + scale_y_continuous(limits = c(3, 3), breaks = pretty_breaks(n = 8)) + scale_x_date(breaks = pretty_breaks(n = 8))The rolling charts are quite illuminating and show that the sixmonthinterval skewness has been positive for about half the lifetime of this portfolio. Today, the overall skewness is negative, but the rolling skewness in mid2016 was positive and greater than 1. It took a huge plunge starting at the end of 2016, and the lowest reading was 1.65 in March of 2017, most likely caused by one or two very large negative returns when the market was worried about the US election. We can see those worries start to abate as the rolling skewness becomes more positive throughout 2017.
That’s all for today. Thanks for reading and see you next time when we tackle kurtosis.
_____='https://rviews.rstudio.com/2017/12/13/introductiontoskewness/';
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R Views. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
A chart of Bechdel Test scores
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
A movie is said to satisfy the Bechdel Test if it satisfies the following three criteria:
 The movie has at least two named female characters
 … who have a conversation with each other
 … about something other than a man
The website BechdelTest.com scores movies accordingly, granting one point for each of the criteria above for a maximum of three points. The recent Wonder Woman movie scores the full three points (Diana and her mother discuss war, for example), while Dunkirk, which features no named female characters, gets zero. (It was still a great film, however.)
The website also offers an API, which enabled data and analytics director Austin Wehrwein to create this time series chart of Bechdel scores for movies listed on BechdelTest.com:
This chart only includes ratings for that subset of movies listed on Bechdeltest.com, so it's not clear whether this is representative of movies as a whole. Austin suggests combining these data with the broader data from IDMB, so maybe someone wants to give that a try. Austin's R code for generating the above chart and several others is available at the link below, so click through for the full analysis.
Austin Wehrwein: A quick look at Bechdel test data (& an awtools update) (via Mara Averick)
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...