R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Plotting Deep Learning Model Performance Trajectories

Sat, 12/23/2017 - 20:42

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

I am excited to share a new deep learning model performance trajectory graph.

Here is an example produced based on Keras in R using ggplot2:

The ideas include:

  • We plot model performance as a function of training epoch, data set (training and validation), and metric.
  • For legibility we facet on metric, and facets are adjusted so all facets have the same visual interpretation (“up is better”).
  • The only solid curve is validation performance; training performance is indicated only as the top of a shaded region that depicts the degree of over-fit.

Obviously it is going to take some training and practice to read these graphs quickly, but that is pretty much true for all visualizations.

The methods work with just about any staged machine learning algorithm (neural nets, deep learning, boosting, random forests, and more) and can also be adapted to non-staged but regularized methods (lasso, elastic net, and so on).

The graph is now part of the development version of WVPlots. And we have complete worked examples for Keras and xgboost.
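The WVPlots function does the heavy lifting, but the underlying idea can be sketched directly in ggplot2. The following is a minimal, hypothetical sketch (the columns mimic a flattened Keras history object; it facets on metric and separates training and validation curves, omitting the shaded over-fit region and the "up is better" facet adjustment):

library(ggplot2)
library(tidyr)

# hypothetical per-epoch history, shaped like a flattened Keras fit
h <- data.frame(
  epoch    = 1:10,
  loss     = seq(0.90, 0.30, length.out = 10),
  val_loss = seq(1.00, 0.45, length.out = 10),
  acc      = seq(0.60, 0.92, length.out = 10),
  val_acc  = seq(0.55, 0.84, length.out = 10)
)

# one row per epoch, metric, and data set
long <- gather(h, key, value, -epoch)
long$dataset <- ifelse(grepl("^val_", long$key), "validation", "training")
long$metric  <- sub("^val_", "", long$key)

# facet on metric, one curve per data set
ggplot(long, aes(epoch, value, linetype = dataset)) +
  geom_line() +
  facet_wrap(~ metric, scales = "free_y") +
  ggtitle("Model performance by epoch, data set, and metric")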


Tiny Art in Less Than 280 Characters

Sat, 12/23/2017 - 10:04

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Now that Twitter allows 280 characters, the code of some drawings I have made can fit in a tweet. In this post I have compiled a few of them.

The first one is a cardioid inspired by string art (more info here):

library(ggplot2)
n=300
t1=1:n
t0=seq(3,2*n+1,2)%%n
t2=t0+(t0==0)*n
df=data.frame(x=cos((t1-1)*2*pi/n),
              y=sin((t1-1)*2*pi/n),
              x2=cos((t2-1)*2*pi/n),
              y2=sin((t2-1)*2*pi/n))
ggplot(df,aes(x,y,xend=x2,yend=y2)) +
  geom_segment(alpha=.1)+theme_void()


This other one is based on Fermat’s spiral (more info here):

library(ggplot2)
library(dplyr)
t=seq(from=0, to=100*pi, length.out=500*100)
data.frame(x= t^(1/2)*cos(t), y= t^(1/2)*sin(t)) %>%
  rbind(-.) %>%
  ggplot(aes(x, y)) + geom_polygon() + theme_void()


A recurrence plot of the Gauss error function (more info here):

library(dplyr)
library(ggplot2)
library(pracma)
seq(-5*pi,5*pi,by=.1) %>%
  expand.grid(x=., y=.) %>%
  ggplot(aes(x=x, y=y, fill=erf(sec(x)-sec(y)))) +
  geom_tile() +
  scale_fill_gradientn(colours=c("#000000","#FFFFFF")) +
  theme_void() +
  theme(legend.position="none")


An x-y scatter plot of a trigonometric function on R2 (more info here):

library(dplyr)
library(ggplot2)
seq(from=-10, to=10, by = 0.05) %>%
  expand.grid(x=., y=.) %>%
  ggplot(aes(x=(x+pi*sin(y)), y=(y+pi*sin(x)))) +
  geom_point(alpha=.1, shape=20, size=1, color="black") +
  theme_void()


A turtle graphic (more info here):

library(TurtleGraphics)
turtle_init()
turtle_col("gray25")
turtle_do({
  for (i in 1:150) {
    turtle_forward(dist=1+0.5*i)
    turtle_right(angle=89.5)
  }
})
turtle_hide()


A curve generated by a simulated harmonograph (more info here):

t=seq(1, 100, by=.001)
plot(exp(-0.006*t)*sin(t*3.019+2.677)+
       exp(-0.001*t)*sin(t*2.959+2.719),
     exp(-0.009*t)*sin(t*2.964+0.229)+
       exp(-0.008*t)*sin(t*2.984+1.284),
     type="l", axes=FALSE, xlab="", ylab="")


A chord diagram of a 20×20 1-matrix (more info here):

library(circlize)
chordDiagram(matrix(1, 20, 20), symmetric = TRUE, col="black",
             transparency = 0.85, annotationTrack = NULL)


Most of them are made with the ggplot2 package. I love R and the sense of wonder at how just one or two lines of code can create beautiful and unexpected patterns.

I recently did this project for DataCamp to show how easy it is to do art with R and ggplot. Starting from an extremely simple plot, and following a well-guided path, you can end up making beautiful images like this one:

Furthermore, you can also learn ggplot2 while you do art.

I have done the project together with Rasmus Bååth, instructor at DataCamp and the perfect mate to work with. He is looking for people to build more projects, so if you are interested, here you can find more information. Do not hesitate to ask him for details.

All the best for 2018.

Merry Christmas.


Kindle clippings.txt with Python

Sat, 12/23/2017 - 01:00

(This article was first published on Max Humber, and kindly contributed to R-bloggers)

Exactly a year ago I posted Kindle clippings.txt with R. Since then things have changed… I’m a Pythonista now! Consequently, I thought it would be fun to update that post and parse highlights with Python 3.6+ and pandas. Janky, but it works:

import pandas as pd

txt = """Sourdough (Robin Sloan)
- Your Highlight on page 187 | Location 2853-2855 | Added on Tuesday, October 2, 2017 8:47:09 PM

The world is going to change, I think—slowly at first, then faster than anyone expects.
==========
Sapiens (Yuval Noah Harari)
- Your Highlight on page 196 | Location 2996-2997 | Added on Tuesday, October 3, 2017 8:51:09 PM

Evolution has made Homo sapiens, like other social mammals, a xenophobic creature.
==========
Life 3.0 (Max Tegmark)
- Your Highlight on page 75 | Location 1136-1137 | Added on Wednesday, October 11, 2017 6:00:15 PM

In short, computation is a pattern in the spacetime arrangement of particles
==========
"""

with open('clippings.txt', 'w', encoding='utf-8-sig') as f:
    f.write(txt)

with open('clippings.txt', 'r', encoding='utf-8-sig') as f:
    contents = f.read().replace(u'\ufeff', '')

lines = contents.rsplit('==========')

store = {'author': [], 'title': [], 'quote': []}
for line in lines:
    try:
        meta, quote = line.split(')\n- ', 1)
        title, author = meta.split(' (', 1)
        _, quote = quote.split('\n\n')
        store['author'].append(author.strip())
        store['title'].append(title.strip())
        store['quote'].append(quote.strip())
    except ValueError:
        pass

df = pd.DataFrame(store)
print(df.to_csv(index=False, encoding='utf-8-sig'))
# author,quote,title
# Robin Sloan,"The world is going to change, I think—slowly at first, then faster than anyone expects.",Sourdough
# Yuval Noah Harari,"Evolution has made Homo sapiens, like other social mammals, a xenophobic creature.",Sapiens
# Max Tegmark,"In short, computation is a pattern in the spacetime arrangement of particles",Life 3.0

Right now I’m 49 books deep. It’s crunch time, but I can see the end! Look for my annual 52 Quotes post in a couple of days!


#14: Finding Binary .deb Files for CRAN Packages

Fri, 12/22/2017 - 23:34

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Welcome to the fourteenth post in the rationally rambling R rants series, or R4 for short. The last two posts were concerned with faster installation. First, we showed how ccache can speed up (re-)installation. This was followed by a second post on faster installation via binaries.

This last post immediately sparked some follow-up. Replying to my tweet about it, David Smith wondered how to combine binary and source installation (tl;dr: it is hard, as you need to combine two package managers). Just this week, Max Ogden wondered how to install CRAN packages as binaries on Linux, and Daniel Nuest poked me on GitHub as part of his excellent containerit project, as installation of binaries would of course also make Docker container builds much faster. (tl;dr: Oh yes, see below!)

So can one? Sure. We have a tool. But first the basics.

The Basics

Packages for a particular distribution are indexed by a Packages file for that distribution. This is not unlike CRAN using top-level PACKAGES* files. So in principle you could just fetch those Packages files, parse and index them, and then search them. In practice that is a lot of work, as Debian and Ubuntu now have several tens of thousands of packages.
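For illustration (a rough sketch of that manual route, not from the original post; the mirror path and release name are assumptions to adjust for your system), one could do this in R:

download.file("http://archive.ubuntu.com/ubuntu/dists/xenial/universe/binary-amd64/Packages.gz",
              destfile = "Packages.gz")
pkgs <- readLines(gzfile("Packages.gz"))              # one field per line, one stanza per package
head(grep("^Package: r-cran-", pkgs, value = TRUE))   # the r-cran-* binary packages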

So it is better to use the distro tool. In my use case on .deb-based distros, this is apt-cache. Here is a quick example for the (Ubuntu 17.04) server on which I type this:

$ sudo apt-get update -qq     ## suppress stdout display
$ apt-cache search r-cran- | wc -l
419
$

So a very vanilla Ubuntu installation has "merely" 400+ binary CRAN packages. Nothing to write home about (yet) — but read on.

cran2deb4ubuntu, or c2d4u for short

A decade ago, I was involved in two projects to turn all of CRAN into .deb binaries. We had a first ad-hoc predecessor project, and then a (much better) ‘version 2’ thanks to the excellent Google Summer of Code work by Charles Blundell (mentored by me). I ran with that for a while and carried about 2500 binaries at the peak. And then my controlling db died, just as I visited CRAN to show it off. Very sad. Don Armstrong ran with the code, rebuilt it on better foundations, and for quite some time had all of CRAN and BioC built (peaking at maybe 7k packages). Then his RAID died. The surviving effort is the one by Michael Rutter, who always leaned on the Launchpad PPA system to build his packages. Those still exist and provide a core of over 10k packages (but spread across different Ubuntu flavours, see below).

Using cran2deb4ubuntu

In order to access c2d4u you need an Ubuntu system. For example my Travis runner script does

# Add marutter's c2d4u repository (and rrutter for CRAN builds too)
sudo add-apt-repository -y "ppa:marutter/rrutter"
sudo add-apt-repository -y "ppa:marutter/c2d4u"

After that one can query apt-cache as above, but take advantage of a much larger pool with over 3500 packages (see below). The add-apt-repository command does the Right Thing (TM) in terms of both getting the archive key, and adding the apt source entry to the config directory.

How about from R? Sure, via RcppAPT

Now, all this command-line business is nice. But can we do all this programmatically from R? Sort of.

The RcppAPT package interfaces the libapt library and provides access to a few of its functions. I used this feature when I argued (unsuccessfully, as it turned out) for a particular issue concerning Debian and R upgrades. But that is water under the bridge now, and the main point is that "yes we can".
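For instance, a minimal sketch (assuming the hasPackages() and getPackages() helpers described in the RcppAPT documentation; check the package reference for exact signatures):

library(RcppAPT)
# are binary builds of these CRAN packages available in the configured apt sources?
hasPackages(c("r-cran-rcpp", "r-cran-data.table"))
# list packages matching a regular expression against the apt index
head(getPackages("^r-cran-r"))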

In Docker: r-apt

Building on RcppAPT, within the Rocker Project we provide a particular class of containers for different Ubuntu releases which all contain i) RcppAPT and ii) the required apt source entry for Michael’s repos.

So now we can do this

$ docker run --rm -ti rocker/r-apt:xenial /bin/bash -c \
    'apt-get update -qq; apt-cache search r-cran- | wc -l'
3525
$

This fires up the corresponding Docker container for the xenial (i.e., 16.04 LTS) release, updates the apt indices and then searches for r-cran-* packages. And it seems we have a little over 3500 packages. Not bad at all (especially once you realize that this skews strongly towards the more popular packages).

Example: An rstan container

A little while ago a seemingly very frustrated user came to Carl and myself and claimed that our Rocker Project sucketh because building rstan was all but impossible. I don’t have the time, space or inclination to go into details, but he was just plain wrong. You do need to know a little about C++, package building, and more to do this from scratch. Plus, there was a long-standing issue with rstan and newer Boost (which also included several workarounds).

Be that as it may, it serves as a nice example here. So the first question: is rstan packaged?

$ docker run --rm -ti rocker/r-apt:xenial /bin/bash -c \
    'apt-get update -qq; apt-cache show r-cran-rstan'
Package: r-cran-rstan
Source: rstan
Priority: optional
Section: gnu-r
Installed-Size: 5110
Maintainer: cran2deb4ubuntu <cran2deb4ubuntu@gmail.com>
Architecture: amd64
Version: 2.16.2-1cran1ppa0
Depends: pandoc, r-base-core, r-cran-ggplot2, r-cran-stanheaders, r-cran-inline, r-cran-gridextra, r-cran-rcpp,\
 r-cran-rcppeigen, r-cran-bh, libc6 (>= 2.14), libgcc1 (>= 1:4.0), libstdc++6 (>= 5.2)
Filename: pool/main/r/rstan/r-cran-rstan_2.16.2-1cran1ppa0_amd64.deb
Size: 1481562
MD5sum: 60fe7cfc3e8813a822e477df24b37ccf
SHA1: 75bbab1a4193a5731ed105842725768587b4ec22
SHA256: 08816ea0e62b93511a43850c315880628419f2b817a83f92d8a28f5beb871fe2
Description: GNU R package "R Interface to Stan"
Description-md5: c9fc74a96bfde57f97f9d7c16a218fe5
$

It would seem so. With that, the following very minimal Dockerfile is all we need:

## Emacs, make this -*- mode: sh; -*-

## Start from xenial
FROM rocker/r-apt:xenial

## This handle reaches Carl and Dirk
MAINTAINER "Carl Boettiger and Dirk Eddelbuettel" rocker-maintainers@eddelbuettel.com

## Update and install rstan
RUN apt-get update && apt-get install -y --no-install-recommends r-cran-rstan

## Make R the default
CMD ["R"]

In essence, it executes one command: install rstan, but from binary, taking care of all dependencies. And lo and behold, it works as advertised:

$ docker run --rm -ti rocker/rstan:local Rscript -e 'library(rstan)'
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.16.2, packaged: 2017-07-03 09:24:58 UTC, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
$

So there: installing from binary works, takes care of dependencies, is easy and, as an added bonus, even faster. What’s not to like?

(And yes, a few of us are working on a system to have more packages available as binaries, but it may take another moment…)

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Make your own color palettes with paletti

Fri, 12/22/2017 - 22:52

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

Last week I blogged about the dutchmasters color palettes package, which was
inspired by the wonderful ochRe package. As mentioned, I shamelessly copied
that package: I replaced its list with character vectors containing hex colors
and did a find-and-replace to make it dutchmasters instead of ochRe. This was
pretty ugly. I realized that if we refactored the ochRe functions into
functions that create the functions, there would no longer be a need to
copy-paste and find-and-replace. So that is what I did. I refactored and
expanded ochRe’s core into paletti (a name chosen because I liked the ring of
it). You can grab it from GitHub with devtools::install_github("edwinth/paletti").

Single palettes

paletti takes both single palettes (character vectors with hex codes) and
lists of palettes, like the ochRe and dutchmasters lists. Let’s start with
a single palette; this might be useful when you want your corporate identity
colors translated into R. Here I just pick some colors found on the interweb:

library(paletti)
library(ggplot2)
mycols <- c(
  red    = "#E27D60",
  blue   = "#085DCB",
  orange = "#E8A87C",
  purple = "#C38D9E",
  green  = "#41B3A3"
)
viz_palette(mycols)

Now, ochRe provided us with two functions: one to create a ggplot scale for
colours and one to create a scale for fills. These functions can be created
in the following fashion:

mycols_fill  <- get_scale_fill(get_pal(mycols))
mycols_color <- get_scale_color(get_pal(mycols))

Both can now be used in ggplot:

mtcars$cyl <- as.character(mtcars$cyl)
col_plot <- ggplot(mtcars, aes(mpg, drat, color = cyl)) + geom_point(size = 4)
col_plot + mycols_color()

fill_plot <- ggplot(mtcars, aes(cyl, fill = cyl)) + geom_bar()
fill_plot + mycols_fill()

Now, I said I expanded the ochRe code a bit. The function get_hex will
produce a function in which you can request a hex code directly by typing
its unquoted name. Handy if you want an exact color from your palette.
The prerequisite is that your palette is a named character vector.

mycols_hex <- get_hex(mycols)
fill_plot + scale_fill_manual(values = mycols_hex(blue, purple, red))

Multiple palettes in a list

Both ochRe and dutchmasters offer multiple palettes in a list. The only
difference from a single palette is that in the returned function you have to
specify the name of the palette you want to use. If you don’t, it defaults to
the first palette in the list.

ochRe_col <- get_scale_color(get_pal(ochre_palettes))
dutchmasters_fill <- get_scale_fill(get_pal(dutchmasters))
col_plot + ochRe_col("lorikeet")

fill_plot + dutchmasters_fill("view_of_Delft")

And the same holds for the get_hex function: you can feed it a list of
palettes as well. Note that the palettes you are going to call must have
named elements.

dutchmasters_hex <- get_hex(dutchmasters)
fill_plot + scale_fill_manual(values = dutchmasters_hex("anatomy", black(cloak), red(muscle), brown(table)))

That’s it, off you go! Add your own color palette(s) and start plotting. Once
again a major thanks to the ochRe team for the inspiration and the foundations
on which paletti is built.


Because it’s Friday: Deck the Halls

Fri, 12/22/2017 - 19:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Sure, this is a promo for a movie, but I'd love to have a full-length single of this:

Relatedly, if you want to settle an argument about which pop diva has the greatest vocal range, Giora Simchoni used R to perform frequency analysis of their hits:

That's all from us here at the blog for this week, and in fact for a little while: we're taking a break for the holidays. We'll be back on January 2, and in the meantime, enjoy!


Building a Hacker News scraper with 8 lines of R code using rvest library

Fri, 12/22/2017 - 14:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

There was a time when web scraping was quite a difficult task, requiring knowledge of XML tree parsing and HTTP requests. But with libraries like beautifulsoup (for Python) and rvest (for R), web scraping has become a toy for any beginner to play with.

This post aims to explain how insanely simple it is to build a scraper in R using rvest, and the site we have decided to scrape content from is the Hacker News front page.

Package Installation and Loading

rvest can be installed from CRAN and loaded into R like below:

install.packages('rvest')
library(rvest)

The read_html() function of rvest can be used to extract the HTML content of the URL given as its argument.

content <- read_html('https://news.ycombinator.com/')

For the above read_html() to work without any problems, please make sure you are not behind an organization firewall; if you are, configure your RStudio with a proxy to bypass the firewall, otherwise you might face a connection timed-out error.
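One way to do that (a hedged sketch; the host and port are placeholders for whatever your organization actually uses) is to set the standard proxy environment variables before calling read_html():

Sys.setenv(http_proxy  = "http://proxy.example.com:8080",   # placeholder proxy host/port
           https_proxy = "http://proxy.example.com:8080")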

Below is the screenshot of HN front page layout (with key elements highlighted):

Now, with the HTML content of the Hacker News front page loaded into the R object content, let us extract the data that we need, starting with the title. The most important aspect of making any web scraping assignment successful is identifying the right CSS selector or XPath values of the HTML elements whose values are to be scraped, and the easiest way to get the right element value is to use the inspect tool in the Developer Tools of any browser.

Here’s the screenshot of the CSS selector value highlighted by the Chrome Inspect Tool when hovering over the title of the links present on the Hacker News front page.

title <- content %>% html_nodes('a.storylink') %>% html_text()
title
 [1] "Magic Leap One"
 [2] "Show HN: Termipal – native micro-GUIs for shell scripts and command line apps"
 [3] "Tokio internals: Understanding Rust's async I/O framework"
 [4] "Funding Yourself as a Free Software Developer"
 [5] "US Federal Ban on Making Lethal Viruses Is Lifted"
 [6] "Pass-Thru Income Deduction"
 [7] "Orson Welles' first attempt at movie-making"
 [8] "D’s Newfangled Name Mangling"
 [9] "Apple Plans Combined iPhone, iPad and Mac Apps to Create One User Experience"
[10] "LiteDB – A .NET NoSQL Document Store in a Single Data File"
[11] "Taking a break from Adblock Plus development"
[12] "SpaceX’s Falcon Heavy rocket sets up at Cape Canaveral ahead of launch"
[13] "This is not a new year’s resolution"
[14] "Artists and writers whose works enter the public domain in 2018"
[15] "Open Beta of Texpad 1.8, macOS LaTeX editor with integrated realtime typesetting"
[16] "The triumph and near-tragedy of the first Moon landing"
[17] "Retrotechnology – PC desktop screenshots from 1983-2005"
[18] "Google Maps' Moat"
[19] "Regex Parser in C Using Continuation Passing"
[20] "AT&T giving $1000 bonus to all its employees because of tax reform"
[21] "How a PR Agency Stole Our Kickstarter Money"
[22] "Google Hangouts now on Firefox without plugins via WebRTC"
[23] "Ubuntu 17.10 corrupting BIOS of many Lenovo laptop models"
[24] "I Know What You Download on BitTorrent"
[25] "Carrie Fisher’s Private Philosophy Coach"
[26] "Show HN: Library of API collections for Postman"
[27] "Uber is officially a cab firm, says European court"
[28] "The end of the Iceweasel Age (2016)"
[29] "Google will turn on native ad-blocking in Chrome on February 15"
[30] "Bitcoin Cash deals frozen as insider trading is probed"

Since the rvest package supports the pipe %>% operator, content (the R object containing the content of the HTML page read with read_html) can be piped into html_nodes(), which takes a CSS selector or XPath as its argument and extracts the respective XML tree (or HTML node), whose text value can then be extracted with the html_text() function. The beauty of rvest is that it abstracts the entire XML parsing operation under the hood of functions like html_nodes() and html_text(), making it easier for us to achieve our scraping goal with minimal code.

As with the title, the CSS selector values of the other required elements of the web page can be identified with the Chrome Inspect tool, passed as arguments to the html_nodes() function, and the respective values can be extracted and stored in R objects.

link_domain <- content %>% html_nodes('span.sitestr') %>% html_text()
score <- content %>% html_nodes('span.score') %>% html_text()
age <- content %>% html_nodes('span.age') %>% html_text()

With all the essential pieces of information extracted from the page, an R data frame can be made with the extracted elements to put the extracted data into a structured format.

df <- data.frame(title = title, link_domain = link_domain, score = score, age = age)

Below is the screenshot of the final dataframe in RStudio viewer:

Thus, in just 8 lines of code, we have successfully built a Hacker News scraper in R using the rvest package, and this scraper could be used for a variety of purposes such as a news reader, a summarizer, text analytics and much more. While a lot more could be done with rvest, this post is kept simple to explain how easily a web scraper can be built with rvest. The code used here is available on my GitHub.
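As a small aside (not part of the original post), the scraped columns come back as text, so a typical next step is a little cleaning, for example turning scores like "123 points" into numbers and sorting by them:

df$points <- as.numeric(gsub("[^0-9]", "", as.character(df$score)))   # "123 points" -> 123
head(df[order(-df$points), c("title", "link_domain", "points")])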

    Related Post

    1. Building a Telecom Dictionary scraping web using rvest in R
    2. Scraping Javascript-rendered web content using R
    3. Analysing Cryptocurrency Market in R
    4. Time Series Analysis in R Part 3: Getting Data from Quandl
    5. Pulling Data Out of Census Spreadsheets Using R

    chRistmas tRees

    Fri, 12/22/2017 - 07:45

    (This article was first published on SmarterPoland.pl » English, and kindly contributed to R-bloggers)

    Year over year, in the last classes before Christmas, I ask my students to create a Christmas tree in R.
    The classes are about techniques of data visualisation and usually, at this point, we are discussing interactive graphics and tools like rbokeh, ggiraph, vegalite, googleVis, D3, rCharts or plotly. I like this exercise because with most tools it is easy to create a barchart, but how good must the tool and the craftsman be to handle a Christmas tree?

    Here is what they did this year (having around 1 hour to finish the task). Knitr scripts.
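    For a flavour of the exercise (my own hedged sketch, not one of the students' solutions), a bare-bones tree needs little more than a few stacked triangles in ggplot2:

    library(ggplot2)
    # three stacked triangles for the crown, a small rectangle for the trunk
    crown <- data.frame(
      x     = c(-3, 0, 3,  -2.2, 0, 2.2,  -1.5, 0, 1.5),
      y     = c( 0, 3, 0,   2,   5, 2,     4,   7, 4),
      layer = rep(1:3, each = 3)
    )
    ggplot() +
      geom_polygon(data = crown, aes(x, y, group = layer), fill = "darkgreen") +
      annotate("rect", xmin = -0.4, xmax = 0.4, ymin = -1.5, ymax = 0, fill = "brown") +
      theme_void()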


    Looking Back on 2017

    Fri, 12/22/2017 - 01:00

    (This article was first published on Data Imaginist, and kindly contributed to R-bloggers)

    Say what you will about 2017 – it sure made an impact. In the broader view I’m pretty sure all sensible people can agree that that impact was predominantly negative, but even so, small pockets of good can happen. So, here’s a short post about all (the predominantly good) that happened in my life this year.

    As you can guess this post will be unusually self-centered, so if that is not your thing feel free to give this post a pass…

    Kicking it off

    2016 ended with me getting fired from my first “real” job after my PhD. While that in isolation would appear to be a horrible start to the year, it really was not. A number of issues on the job had made me question the position, and a badly handled firing further cemented that this was really not a place I should waste my time on. One’s ego always takes a hit from the inherent rejection in a firing, but this was smoothed out by me being offered a new, interesting job a couple of days after I announced my re-entry on the job market.

    So, I started 2017 in a rather good mood and with a lot of spare time on my hands. I decided that this was a rather unique opportunity to finally get ggraph ready for release. ggraph is definitely one of the bigger projects I have done, and a lot of stuff was still missing before it was CRAN-ready – a month’s paid leave was exactly what I needed for this. As the start of my new position in SKAT was drawing nearer I did my best to get ggraph done, and it so happened that it got released on CRAN the day before I started the new job. I was rather overwhelmed with the reception, and also very proud. I honestly believe ggraph to be the best way to visualise static networks, and that is a fantastic feeling to have about something you’ve made yourself.

    ggraph also sparked off two other projects that came to take a chunk of my 2017. If you want to get Hadley to use your new shiny ggplot2 extension, you better make sure the input data is tidy. At that point in time graph data was by no means tidy, so following a discussion with Hadley I set out to right this wrong – thus tidygraph was born. Later on Hadley (sense a pattern?) needed a better layout for directed acyclic graphs, and kindly requested a port of the D3 force algorithm. This initiated the development of the (not yet on CRAN) particles package.

    I also got to speak about ggraph publicly, as I had been invited to PlotCon on the day I got fired (strange coincidences), so in the late spring I found myself in San Francisco speaking for the first time about my spare time projects. If you’re interested in the presentation, it has been made available on YouTube…

    Along came a job

    While all of this was going on, I had the fortune of having started in a really cool group at SKAT, handling all the advanced analytics going on in the Danish tax authorities. One of the biggest coming challenges for our work was the new changes to the EU’s directives regarding the use of algorithms in decision making. Going forward, every citizen in the EU has the right to an explanation of algorithmic decisions regarding themselves. This poses a challenge for complex algorithms whose workings are more akin to a black box. Someone in the Python world had it figured out though, with the lime library, and one of my first tasks was to port this code to R. I quickly gained the help of Michaël Benesty, and together we got that package going.

    Over the summer I began looking at my fiery web server framework again. I never had the time for it that I’d hoped for when I released it the summer before, and I had a lot of ideas for improving and expanding the framework. This coincided nicely with our office looking for a new infrastructure for deploying our models internally in SKAT. I could thus devote a part of my work time, along with my spare time, to getting this package to the next level. This resulted in the release of two new packages and an extensive update to fiery itself. reqres was born with the intent of making HTTP messages more convenient to work with, and routr was a way to ease routing of requests.

    Autumn

    I must admit I felt a bit exhausted after this flurry of releases. I had a lot of ideas for improvements to tweenr, ggforce, and ggraph, as well as for expanding on the fiery ecosystem and an as-yet-unannounced internet service I would like to build, but while I began working on new ggforce stuff, that effort somewhat halted. I moved my blog to blogdown, I made some logos, I fixed some issues and created some PRs to ggplot2, but mainly I just did nothing. Whereas before I had cherished the unique times where I had time to put into development of my projects, these times were now filled with rewatching Archer and Rick & Morty. I felt guilty for not spending my time doing the stuff I had set out to do, but every time I sat down to do some coding I was bummed out by the less-than-interesting chores related to maintaining packages. I might be a victim of setting too many things free on the world, but on the other hand I can’t help it, and this autumn I definitely paid the price.

    I think it would be unfair to those that are suffering from acute burnout to call what I was feeling “burnout”, but it was definitely an early sign of it. The best thing would probably be to take a vacation, as many well-meaning twitter friends suggested, but the vacation solution requires the means to take a vacation. Further, I needed a way to handle this in the long term, as I’m unlikely to really slow down (I have a depression-induced fear of leaving this world without setting a mark, somehow).

    One thing I did find joy in during this period was the development of generative art, using the tools I’ve been developing over the last 1 1/2 years. This really started during the summer before my “down period”, but I was allowing myself more time for it as my urge for package development was waning. I have to say that I’m very proud of the pieces I’ve produced…

    unfold02

    Patching myself up

    While I cannot claim that I have found a permanent solution to my problem of staying motivated amidst increasing package maintenance needs, I did end up succeeding with a particularly stupid approach: create a new package… This obviously doesn’t scale well, but for the immediate future it seems to work.

    The package in question is called patchwork and is the fruit of some thoughts that had been rummaging around in my head during the autumn, and a solution to a need that gets expressed quite often despite several solutions already being available: how to combine multiple ggplots in the same graphic. The package is not done yet, but the reception suggests that it has struck a nerve. It will get its own blog post come release, so for now, if you’re interested, have a look at the GitHub repository.

    So, I leave 2017 a bit as I entered it: hopeful and in a good mood. 2018 already seems like a blast of a year, as I’ll be speaking at both rstudio::conf and useR (keynoting the latter, no less – still can’t get over that). There are sure to be some bumps along the way, and I’ll definitely need a better way of coping with all my ideas, but those details will hopefully be sorted out in due time…

    hopefully…

    Merry Christmas!


    Nov 2017: New Package Picks

    Fri, 12/22/2017 - 01:00

    (This article was first published on R Views, and kindly contributed to R-bloggers)

    Two hundred thirty-seven new packages made it to CRAN in November. Here are my picks for the “Top 40”, organized into the categories Computational Methods, Data, Data Science, Science, Social Science, Statistics, Utilities and Visualizations.

    Computational Methods

    CVXR v0.94-4: Implements an object-oriented modeling language for disciplined convex programming (DCP) which allows users to formulate and solve convex optimization problems. The vignette introduces the package. Look here for examples and theory.

    PreciseSums v0.1: Implements the Kahan (1965) sum, Neumaier (1974) sum, pairwise-sum adapted from ‘NumPy’ and arbitrary precision sum.

    Data

    ballr v0.1.1: Provides functions for accessing data/tables from basketball-reference.com. There is a vignette.

    biofiles v1.0.0: Provides functions to parse GenBank/GenPept records into native R objects, access and manipulate the sequence and annotation information. There is an Introduction.

    ipumsr v0.1.1: Enables users to import census, survey and geographic data from IPUMS. There is an Introduction and vignettes on CPS Extraction, Geographic Data, NHDIS Datasets and on Using Value Labels.

    proPubBills v0.1: Implements an API wrapper around the ProPublica API. The brief vignette shows how to use it.

    Rpolyhedra v0.1.0: Contains a database of 142 polyhedra scraped from PHD files as R6 objects, and provides rgl visualization capabilities. The PHD format was created to describe the geometric polyhedra definitions derived mathematically by Andrew Hume and by the Kaleido program of Zvi Har’El. The vignette will get you started.

    voteogram v0.2.0: Provides tools to retrieve United States Congressional voting data from ProPublica, prepare the data for plotting with ggplot2 and create vote cartograms and themes. The vignette provides examples.

    Data Science

    imbalance v0.1.1: Provides algorithms to treat unbalanced datasets. See the vignette for details.

    intrinsicdimension v1.1.0: Implements a variety of methods for estimating intrinsic dimension of data sets (i.e the manifold or Hausdorff dimension of the support of the distribution that generated the data) as reviewed in Johnsson et al.(2015). The vignette provides an Overview.

    ppclust v0.1.0: Implements probabilistic clustering algorithms for partitioning datasets including Fuzzy C-Means (Bezdek, 1974), Possibilistic C-Means (Krishnapuram & Keller, 1993), Possibilistic Fuzzy C-Means (Pal et al, 2005), Possibilistic Clustering Algorithm (Yang et al, 2006), Possibilistic C-Means with Repulsion (Wachs et al, 2006) and the other variants. There are vignettes on Fuzzy C-Means, Probabilistic C-Means, Probabilistic Fuzzy C-Means and Unsupervised Probabilistic Fuzzy C-Means.

    textrank v0.2.0: Implements the textrank algorithm, an extension of the Pagerank algorithm for text. See the paper Mihalcea & Tarau (2004) and the vignette.

    TrajDataMining v0.1.4: Contains a set of methods for trajectory data preparation, such as filtering, compressing and clustering, and for trajectory pattern discovery. The vignette provides examples.

    Science

    benthos v1.3-4: Provides preprocessing tools and biodiversity measures for analyzing marine benthic data. See Van Loon et al. (2015) for an application and the vignette for an introduction to the package.

    nlmixr v0.9.0-1: Provides functions to fit and compare nonlinear mixed-effects models in differential equations with flexible dosing information commonly seen in pharmacokinetics and pharmacodynamics. See Almquist et al. (2015). The vignette shows how to use the package.

    PCRedux v0.2.5-1: Provides functions to extract Polymerase Chain Reactions (qPCR) amplification curve data for machine learning purposes. For details see Pabinger et al.(2014) and the vignette.

    PDN v0.1.0: Provides tools for building patient level networks for predicting medical outcomes based on the paper by Cabrera et al. (2016). See the vignette for an introduction.

    Rraven v1.0.0: Provides a tool to exchange data between R and Raven sound analysis software. The vignette shows how to use the software.

    spew v1.3.0: Provides tools for generating Synthetic Populations and Ecosystems. See Gallagher et al. (2017) for details and the vignette for a brief tour.

    Social Science

    EvolutionaryGames v0.1.0: Provides a set of tools to illustrate the core concepts of evolutionary game theory, such as evolutionary stability or various evolutionary dynamics, for teaching and academic research. See the vignette for details.

    Statistics

    bang (https://CRAN.R-project.org/package=bang) v1.0.0: Provides functions for the Bayesian analysis of some simple common models, without using Markov Chain Monte Carlo (MCMC) methods such as Gibbs sampling. There is an Introduction and vignettes on Hierarchical 1-way ANOVA, Conjugate Hierarchical Models and Posterior Predictive Checking.

    beast v1.0: Provides a method for the Bayesian estimation of change-points in the slope of multivariate time series. See Papastamoulis et al (2017) for a detailed presentation of the method.

    CorShrink v0.1.1: Offers functions to perform adaptive shrinkage of correlation and covariance matrices using a mixture model prior over the Fisher z-transformation of the correlations. See Stephens (2016) for details. The vignette contains examples.

    dvmisc v1.1.1: Provides faster versions of base R functions (e.g. mean, standard deviation, covariance, weighted mean), mostly written in C++, along with other miscellaneous functions.

    inlabru v2.1.2: Facilitates spatial modeling using integrated nested Laplace approximation via the INLA package, and also implements a log Gaussian Cox process likelihood for modeling univariate and spatial point processes. See Yuan et al. (2017).

    outbreaker2 v1.0-0: Allows users to reconstruct disease outbreaks using epidemiological and genetic information. See Jombart et al. (2014) for the details. There is a package Overview as well as an Introduction and vignettes on Using Custom Priors and The Rcpp API.

    probout v1.0: Provides functions to estimate unsupervised outlier probabilities for multivariate numeric data with many observations from a nonparametric outlier statistic. There is a vignette.

    quokar v0.1.0: Provides diagnostics for quantile regression models including detecting influential observations, robust distance methods, generalized Cook’s distance and Q-function distance (see Benites et al. (2015)) and mean posterior probability and Kullback–Leibler divergence methods (see Santos & Bolfarine (2016)). The vignette introduces the package.

     

    tidyposterior v0.0.1: This memorably named package implements a Bayesian approach for examining the differences between models and aims to answer the question: “When looking at resampling results, are the differences between models real?” The methods included are similar to those described in Benavoli et al (2017). There is a Getting Started Guide and a vignette on Bayesian Models.

    trialr v0.0.1: Offers a showcase of Bayesian clinical trial designs, implemented in RStan and R, including some designs implemented in R for the first time (e.g., EffTox by Thall & Cook (2004)). There are vignettes on the BEBOP Design, EffTox and Hierarchical Bayesian Models for Binary Responses.

    tvReg v0.2.1: Provides functions for fitting simultaneous equations with time varying coefficients, for both independent and correlated equations. The vignette contains examples.

    Utilities

    cli v1.0.0: Implements a suite of tools designed to build attractive command line interfaces. Includes tools for drawing rules, boxes, trees, and ‘Unicode’ symbols with ‘ASCII’ alternatives. See README for details.

    float v0.1-1: Extends R’s linear algebra facilities to include 32-bit float (single precision) data. There is a vignette.

    mudata2 v1.0.0: Offers functions and data structures designed to easily organize and visualize spatiotemporal data. There are vignettes for using and creating mudata2 objects.

    rhub v1.0.2: Provides an interface to the R-Hub package build system sponsored by the R Consortium. Run R CMD check on Windows, macOS, Solaris and various Linux distributions.

    Visualizations

    ALEPlot v1.0: Offers functions to visualize the main effects of individual predictor variables and their second-order interaction effects in black-box supervised learning models. The vignette contains several examples.

    dbplot v0.1.1: Leverages dplyr to process the calculations for a plot inside a database. Helper functions abstract the work at three levels: outputs the ggplot, the calculations and the formula needed to calculate bins. See README to get started.

    ggalluvial v0.5.0: Implements ggplot2 stat and geom layers for alluvial diagrams, charts that use x-splines (alluvia and flows), sometimes augmented with stacked bars (lodes or strata), to visualize incidence structures derived from several data types. The vignette provides examples.

    shinyaframe v1.0.1: Enables users to make R data available in Web-based virtual reality experiences for immersive, cross-platform data visualizations. It provides functions to create 3-dimensional data visualizations with Mozilla A-Frame. The vignette shows how.

    tactile v0.1.0: Extends lattice, providing new high-level functions, methods for existing functions, panel functions, and a theme. There are vignettes for New High-Level Functions, New Methods for Lattice and the tactile Theme.


    M4 Forecasting Competition update

    Fri, 12/22/2017 - 01:00

    (This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

    The official guidelines for the M4 competition have now been published, and there have been several developments since my last post on this.
    There is now a prize for prediction interval accuracy, using a scaled version of the Mean Interval Score. If the $100(1-\alpha)$% prediction interval for time $t$ is given by $[L_{t},U_{t}]$, for $t=1,\dots,h$, then the MIS is defined as
    $$\frac{1}{h}\sum_{t=1}^{h} \left[ (U_t-L_t) + \frac{2}{\alpha}(L_t-Y_t)1(Y_t < L_t) + \frac{2}{\alpha}(Y_t-U_t)1(Y_t > U_t) \right],$$
    where $Y_t$ is the observation at time $t$.
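    To make the formula concrete, here is a minimal R sketch of the unscaled quantity inside the prize metric (not from the original post; the competition additionally scales this, as described in the official guidelines):

    mis <- function(lower, upper, actual, alpha = 0.05) {
      # mean interval width plus penalties for observations falling outside the interval
      mean((upper - lower) +
           (2 / alpha) * (lower - actual) * (actual < lower) +
           (2 / alpha) * (actual - upper) * (actual > upper))
    }
    # toy example: 95% intervals over a horizon of h = 3
    mis(lower = c(1, 2, 3), upper = c(5, 6, 7), actual = c(4, 8, 2), alpha = 0.05)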


    (Don’t Get) Contangled Up In Noise

    Thu, 12/21/2017 - 18:06

    (This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

    This post will be about investigating the efficacy of contango as a volatility trading signal.

    For those that trade volatility (like me), a term you may see that’s somewhat ubiquitous is the term “contango”. What does this term mean?

    Well, simple: it just means the ratio of the second month of VIX futures over the first. The idea is that when the second month of futures is higher than the first, people’s outlook for volatility is greater in the future than it is for the present, and therefore the futures are “in contango”, which is most of the time.

    Furthermore, those that try to find decent volatility trading ideas may have often seen that futures in contango implies that holding a short volatility position will be profitable.

    Is this the case?

    Well, there’s an easy way to answer that.

    First off, refer back to my post on obtaining free futures data from the CBOE.

    Using this data, we can obtain our signal (that is, in order to run the code in this post, run the code in that post).

    xivSig <- termStructure$C2 > termStructure$C1

    Now, let’s get our XIV data (again, big thanks to Mr. Helmuth Vollmeier for so kindly providing it).

    require(downloader)
    require(quantmod)
    require(PerformanceAnalytics)
    require(TTR)
    require(Quandl)
    require(data.table)

    download("https://dl.dropboxusercontent.com/s/jk6der1s5lxtcfy/XIVlong.TXT",
             destfile="longXIV.txt")
    download("https://dl.dropboxusercontent.com/s/950x55x7jtm9x2q/VXXlong.TXT",
             destfile="longVXX.txt") #requires downloader package

    xiv <- xts(read.zoo("longXIV.txt", format="%Y-%m-%d", sep=",", header=TRUE))
    xivRets <- Return.calculate(Cl(xiv))

    Now, here’s how this works: as the CBOE doesn’t update its settles until around 9:45 AM EST on the following day (e.g. a Tuesday’s settle data won’t be released until Wednesday at 9:45 AM EST), we have to enter at the close of the day after the signal fires. (For those wondering, my subscription strategy uses this mechanism, giving subscribers ample time to execute throughout the day.)

    So, let’s calculate our backtest returns. Here’s a stratStats function to compute some summary statistics.

    stratStats <- function(rets) {
      stats <- rbind(table.AnnualizedReturns(rets), maxDrawdown(rets))
      stats[5,] <- stats[1,]/stats[4,]
      stats[6,] <- stats[1,]/UlcerIndex(rets)
      rownames(stats)[4] <- "Worst Drawdown"
      rownames(stats)[5] <- "Calmar Ratio"
      rownames(stats)[6] <- "Ulcer Performance Index"
      return(stats)
    }

    stratRets <- lag(xivSig, 2) * xivRets
    charts.PerformanceSummary(stratRets)
    stratStats(stratRets)

    With the following results:

                                     C2
    Annualized Return         0.3749000
    Annualized Std Dev        0.4995000
    Annualized Sharpe (Rf=0%) 0.7505000
    Worst Drawdown            0.7491131
    Calmar Ratio              0.5004585
    Ulcer Performance Index   0.7984454

    So, this is obviously a disaster. Visual inspection will show devastating, multi-year drawdowns. Using the table.Drawdowns command, we can view the worst ones.

    > table.Drawdowns(stratRets, top = 10)
             From     Trough         To   Depth Length To Trough Recovery
    1  2007-02-23 2008-12-15 2010-04-06 -0.7491    785       458      327
    2  2010-04-21 2010-06-30 2010-10-25 -0.5550    131        50       81
    3  2014-07-07 2015-12-11 2017-01-04 -0.5397    631       364      267
    4  2012-03-27 2012-06-01 2012-07-17 -0.3680     78        47       31
    5  2017-07-25 2017-08-17 2017-10-16 -0.3427     59        18       41
    6  2013-09-27 2014-04-11 2014-06-18 -0.3239    182       136       46
    7  2011-02-15 2011-03-16 2011-04-26 -0.3013     49        21       28
    8  2013-02-20 2013-03-01 2013-04-23 -0.2298     44         8       36
    9  2013-05-20 2013-06-20 2013-07-08 -0.2261     34        23       11
    10 2012-12-19 2012-12-28 2013-01-23 -0.2154     23         7       16

    So, the top 3 are horrendous, and then anything above 30% is still pretty awful. A couple of those drawdowns lasted multiple years as well, with a massive length to the trough. 458 trading days is nearly two years, and 364 is about one and a half years. Imagine seeing a strategy be consistently on the wrong side of the trade for nearly two years, and when all is said and done, you’ve lost three-fourths of everything in that strategy.

    There’s no sugar-coating this: such a strategy can only be called utter trash.

    Let’s try one modification: we’ll require both contango (C2 > C1), and that contango be above its 60-day simple moving average, similar to my VXV/VXMT strategy.

    contango <- termStructure$C2/termStructure$C1
    maContango <- SMA(contango, 60)
    xivSig <- contango > 1 & contango > maContango
    stratRets <- lag(xivSig, 2) * xivRets

    With the results:

    > stratStats(stratRets)
                                     C2
    Annualized Return         0.4271000
    Annualized Std Dev        0.3429000
    Annualized Sharpe (Rf=0%) 1.2457000
    Worst Drawdown            0.5401002
    Calmar Ratio              0.7907792
    Ulcer Performance Index   1.7515706

    Drawdowns:

    > table.Drawdowns(stratRets, top = 10)
             From     Trough         To   Depth Length To Trough Recovery
    1  2007-04-17 2008-03-17 2010-01-06 -0.5401    688       232      456
    2  2014-12-08 2014-12-31 2015-04-09 -0.2912     84        17       67
    3  2017-07-25 2017-09-05 2017-12-08 -0.2610     97        30       67
    4  2012-03-27 2012-06-21 2012-07-02 -0.2222     68        61        7
    5  2012-07-20 2012-12-06 2013-02-08 -0.2191    139        96       43
    6  2015-10-20 2015-11-13 2016-03-16 -0.2084    102        19       83
    7  2013-12-27 2014-04-11 2014-05-23 -0.1935    102        73       29
    8  2017-03-21 2017-05-17 2017-06-26 -0.1796     68        41       27
    9  2012-02-07 2012-02-15 2012-03-12 -0.1717     24         7       17
    10 2016-09-08 2016-09-09 2016-12-06 -0.1616     63         2       61

    So, a Calmar ratio still safely below 1, an Ulcer Performance Index still in the basement, a maximum drawdown that’s long past the point at which people would have abandoned the strategy, and so on.

    So, even though it was improved, it’s still safe to say this strategy doesn’t perform too well. Even after the large 2007-2008 drawdown, it still gets some things pretty badly wrong, like being exposed to all of August 2017.

    While I think there are applications of contango in volatility investing, I don’t think its use is in generating the long/short volatility signal on its own. Rather, I think other indices and sources of data do a better job of that, such as the VXV/VXMT ratio, which has since been iterated on to form my subscription strategy.

    Thanks for reading.

    NOTE: I am currently seeking networking opportunities, long-term projects, and full-time positions related to my skill set. My LinkedIn profile can be found here.


    A tour of the data.table package by creator Matt Dowle

    Thu, 12/21/2017 - 18:00

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    The data.table package provides a high-performance interface for querying and manipulating data tables (a close cousin of data frames). In the video below (recorded at the H2OWorld conference), creator Matt Dowle gives a tour of the package and provides several examples of its use.

    If you'd like to see the details on the example Matt presents at the beginning of the talk, you can find the post on how the City of Chicago uses data.table for food safety here.
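    Not from the talk itself, but as a quick illustration of the package's DT[i, j, by] style (using the built-in mtcars data converted to a data.table):

    library(data.table)

    # data.table's DT[i, j, by] form: filter in i, compute in j, group with by
    DT <- as.data.table(mtcars)
    DT[mpg > 20,                       # i: keep rows with mpg above 20
       .(mean_hp = mean(hp), n = .N),  # j: summaries to compute
       by = cyl]                       # by: one result row per value of cyl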


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    How to apply Linear Regression in R

    Thu, 12/21/2017 - 14:00
    Machine Learning and Regression

    Machine Learning (ML) is a field of study that gives a machine the capability to understand data and to learn from it. ML is not only about analytics modeling; it is end-to-end modeling that broadly involves the following steps:

      – Defining the problem statement.
      – Data collection.
      – Exploring, cleaning and transforming the data.
      – Building the analytics model.
      – Dashboard creation and deployment of the model.

    Machine learning has two distinct fields of study – supervised learning and unsupervised learning. A supervised learning technique generates a response based on a set of input features. Unsupervised learning has no response variable; it explores the associations and interactions between the input features. In what follows, I will discuss linear regression, which is an example of a supervised learning technique.

    Supervised Learning & Regression

    Linear Regression is a supervised modeling technique for continuous data. The model fits a line that is closest to all observations in the dataset. The basic assumption here is that the functional form is a line and that it is possible to fit a line that will be close to all observations. Note that if this linearity assumption is far from reality, the model will carry an error (a bias towards linearity), however well one tries to fit it.

    Let’s analyze the basic equation for any supervised learning algorithm:
    Y = F(X) + ε
    The above equation has three terms:
    Y – defines the response variable.
    F(X) – defines the function that depends on the set of input features.
    ε – defines the random error. For an ideal model, this should be random and should not depend on any input.

    In linear regression, we assume that the functional form F(X) is linear, and hence we can write the equation as below. The next step will be to find the coefficients (β0, β1, …) for the model below.
    Y = β0 + β1 X + ε ( for simple regression )
    Y = β0 + β1 X1 + β2 X2 + β3 X3 + …. + βp Xp + ε ( for multiple regression )
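    In R, this functional form is specified with a formula passed to lm(). A minimal sketch of the mapping (the variable names y, x1, x2, x3 and the data frame mydata are placeholders, not the Boston data used below):

    # Simple regression: Y = b0 + b1*X + error
    fit_simple   <- lm(y ~ x, data = mydata)

    # Multiple regression: Y = b0 + b1*X1 + b2*X2 + b3*X3 + error
    fit_multiple <- lm(y ~ x1 + x2 + x3, data = mydata)

    coef(fit_simple)   # estimated b0 (intercept) and b1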

    How to apply linear regression

    The coefficients for linear regression are calculated from the sample data. The basic assumption here is that the sample is not biased, which ensures that the estimates do not systematically overestimate or underestimate the coefficients. A particular sample may overestimate or underestimate them, but if one takes multiple samples and estimates the coefficients repeatedly, the average of those estimates will be very close to the true values.
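    A small simulation (not part of the original post) illustrates this point: individual samples over- or under-estimate the slope, but the average over many samples is close to the true value.

    set.seed(123)
    true_b0 <- 2; true_b1 <- 0.5

    one_estimate <- function() {
      x <- runif(50, 0, 10)                  # one random sample of inputs
      y <- true_b0 + true_b1 * x + rnorm(50) # response with random error
      coef(lm(y ~ x))[2]                     # estimated slope for this sample
    }

    slopes <- replicate(1000, one_estimate())
    mean(slopes)   # close to the true slope of 0.5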

    Extract the data and create the training and testing sample

    For the current model, let’s take the Boston dataset that is part of the MASS package in R. The features available in the Boston dataset are listed below. The problem statement is to predict ‘medv’ based on the set of input features.

    library(MASS)
    library(ggplot2)
    attach(Boston)
    names(Boston)
     [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"     "dis"     "rad"
    [10] "tax"     "ptratio" "black"   "lstat"   "medv"

    Split the sample data and make the model

    Split the input data into training and evaluation set and make the model for the training dataset. It can be seen that training dataset has 404 observations and testing dataset has 102 observations based on 80-20 split.

    ##Sample the dataset. The return for this is row nos.
    set.seed(1)
    row.number <- sample(1:nrow(Boston), 0.8*nrow(Boston))
    train = Boston[row.number,]
    test = Boston[-row.number,]
    dim(train)
    dim(test)
    [1] 404  14
    [1] 102  14

    Explore the response variable

    Let’s check the distribution of the response variable ‘medv’. The following figure shows three distributions of ‘medv’: the original, a log transformation and a square root transformation. We can see that both ‘log’ and ‘sqrt’ do a decent job of transforming the ‘medv’ distribution closer to normal. In the following model, I have selected the ‘log’ transformation, but it is also possible to try the ‘sqrt’ transformation.

    ##Explore the data.
    ggplot(Boston, aes(medv)) + geom_density(fill="blue")
    ggplot(train, aes(log(medv))) + geom_density(fill="blue")
    ggplot(train, aes(sqrt(medv))) + geom_density(fill="blue")

    Gives this plot:

    Model Building – Model 1

    Now, as a first step, we will fit a multiple regression model. We will start by taking all input variables in the multiple regression.

    #Let’s make default model.
    model1 = lm(log(medv)~., data=train)
    summary(model1)
    par(mfrow=c(2,2))
    plot(model1)

    Call:
    lm(formula = log(medv) ~ ., data = train)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.72354 -0.11993 -0.01279  0.10682  0.84791

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  4.2812343  0.2289799  18.697  < 2e-16 ***
    crim        -0.0133166  0.0019722  -6.752 5.30e-11 ***
    zn           0.0012855  0.0006558   1.960 0.050678 .
    indus        0.0032675  0.0029440   1.110 0.267724
    chas         0.1093931  0.0378934   2.887 0.004108 **
    nox         -0.9457575  0.1748322  -5.410 1.10e-07 ***
    rm           0.0651669  0.0186119   3.501 0.000516 ***
    age          0.0010095  0.0006322   1.597 0.111139
    dis         -0.0475650  0.0092928  -5.119 4.85e-07 ***
    rad          0.0176230  0.0030523   5.774 1.59e-08 ***
    tax         -0.0006691  0.0001739  -3.847 0.000140 ***
    ptratio     -0.0364731  0.0059456  -6.134 2.10e-09 ***
    black        0.0003882  0.0001205   3.223 0.001377 **
    lstat       -0.0310961  0.0022960 -13.543  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.195 on 390 degrees of freedom
    Multiple R-squared:  0.7733, Adjusted R-squared:  0.7658
    F-statistic: 102.3 on 13 and 390 DF,  p-value: < 2.2e-16

    Gives this plot:

    Observation from summary (model1)

    Is there a relationship between predictor and response variables?
    We can answer this using the F-statistic, which captures the collective effect of all predictor variables on the response variable. In this model, F = 102.3 is far greater than 1, so it can be concluded that there is a relationship between the predictors and the response variable.

    Which of the predictor variables are significant?
    We can conclude this based on the p-values. The smaller the p-value, the more significant the variable. From the ‘summary’ output we can see that ‘zn’, ‘age’ and ‘indus’ are less significant features, as their p-values are large. In the next model, we can remove these variables.

    Is this model fit?
    We can answer this based on the R2 (Multiple R-squared) value, as it indicates how much of the variation is captured by the model. An R2 close to 1 indicates that the model explains a large share of the variance in the response and is hence a good fit. In this case, the value is 0.7733 (close to 1), so the model is a good fit.

    Observation from the plot

    Fitted vs Residual graph
    The residuals should be randomly scattered, with no pattern in the graph, and their average should be close to zero. From the above plot, we can see that the red trend line is almost at zero except at the starting location.

    Normal Q-Q Plot
    The Q-Q plot shows whether the residuals are normally distributed. Ideally, the points should lie on the dotted line; if they do not, the model needs to be reworked to make the residuals normal. In the above plot, we see that most of the points are on the line except towards the ends.

    Scale-Location
    This shows how the residuals are spread and whether the residuals have an equal variance or not.

    Residuals vs Leverage
    This plot helps to find influential observations. Here we need to check for points outside the dashed line: such a point is an influential point, and removing it will affect the regression coefficients.
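    One way to inspect such points numerically (a sketch, not from the original post) is via Cook's distance and leverage values computed on the fitted model object:

    # Cook's distance and leverage for each training observation (model1 from above)
    cd  <- cooks.distance(model1)
    lev <- hatvalues(model1)

    # Observations that stand out on Cook's distance
    head(sort(cd, decreasing = TRUE))
    which(cd > 4 / nrow(train))   # 4/n is a common rule-of-thumb cutoff, not a strict rule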

    Model Building – Model 2

    As the next step, we can remove the three less significant features (‘zn’, ‘age’ and ‘indus’) and check the model again.

    # remove the less significant feature
    model2 = update(model1, ~.-zn-indus-age)
    summary(model2)

    Call:
    lm(formula = log(medv) ~ crim + chas + nox + rm + dis + rad +
        tax + ptratio + black + lstat, data = train)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.72053 -0.11852 -0.01833  0.10495  0.85353

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  4.2623092  0.2290865  18.606  < 2e-16 ***
    crim        -0.0129937  0.0019640  -6.616 1.21e-10 ***
    chas         0.1178952  0.0378684   3.113 0.001986 **
    nox         -0.8549561  0.1627727  -5.252 2.46e-07 ***
    rm           0.0731284  0.0180930   4.042 6.38e-05 ***
    dis         -0.0465887  0.0073232  -6.362 5.55e-10 ***
    rad          0.0157173  0.0028977   5.424 1.02e-07 ***
    tax         -0.0005108  0.0001494  -3.418 0.000697 ***
    ptratio     -0.0384253  0.0056084  -6.851 2.84e-11 ***
    black        0.0003987  0.0001206   3.307 0.001031 **
    lstat       -0.0295185  0.0021192 -13.929  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.1959 on 393 degrees of freedom
    Multiple R-squared:  0.7696, Adjusted R-squared:  0.7637
    F-statistic: 131.2 on 10 and 393 DF,  p-value: < 2.2e-16

    par(mfrow=c(2,2))
    plot(model2)

    Gives this plot:

    Observation from summary (model2)

    Is there a relationship between predictor and response variable?
    F=131.2 is far greater than 1 and this value is more than the F value of the previous model. It can be concluded that there is a relationship between predictor and response variable.

    Which of the variables are significant?
    Now in this model, all the predictors are significant.

    Is this model fit?
    R2 = 0.7696 is close to 1, so this model is a good fit. Note that this value has decreased slightly from the first model, but that is fine: removing three predictors only caused a drop from 0.7733 to 0.7696. In other words, the contribution of those three predictors towards explaining the variance is only a small value (0.0037), and hence it is better to drop them.
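    A more formal way to check that dropping the three predictors does not significantly hurt the fit (a sketch, not in the original post) is an F-test between the two nested models:

    # Compare the reduced model against the full model (model2 is nested in model1)
    anova(model2, model1)
    # A large p-value suggests the dropped predictors add little explanatory power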

    Observation of the plot
    All the four plots look similar to the previous model and we don’t see any major effect.

    Check for predictor vs Residual Plot

    In the next step, we will check the residual plots for all the significant features from Model 2, looking for any pattern. Ideally, the residual plot should be random, with no visible pattern. In the following plots, we can see some non-linear pattern for features like ‘crim’, ‘rm’ and ‘nox’.

    ##Plot the residual plot with all predictors.
    attach(train)
    require(gridExtra)
    plot1 = ggplot(train, aes(crim, residuals(model2))) + geom_point() + geom_smooth()
    plot2 = ggplot(train, aes(chas, residuals(model2))) + geom_point() + geom_smooth()
    plot3 = ggplot(train, aes(nox, residuals(model2))) + geom_point() + geom_smooth()
    plot4 = ggplot(train, aes(rm, residuals(model2))) + geom_point() + geom_smooth()
    plot5 = ggplot(train, aes(dis, residuals(model2))) + geom_point() + geom_smooth()
    plot6 = ggplot(train, aes(rad, residuals(model2))) + geom_point() + geom_smooth()
    plot7 = ggplot(train, aes(tax, residuals(model2))) + geom_point() + geom_smooth()
    plot8 = ggplot(train, aes(ptratio, residuals(model2))) + geom_point() + geom_smooth()
    plot9 = ggplot(train, aes(black, residuals(model2))) + geom_point() + geom_smooth()
    plot10 = ggplot(train, aes(lstat, residuals(model2))) + geom_point() + geom_smooth()
    grid.arrange(plot1,plot2,plot3,plot4,plot5,plot6,plot7,plot8,plot9,plot10,ncol=5,nrow=2)

    Gives this plot:

    Model Building – Model 3 & Model 4

    We can now enhance the model by adding square terms to check for non-linearity. We first try model3 by introducing square terms for all features (from model 2). In the next iteration, we remove the insignificant features from the model.

    #Lets make default model and add square term in the model.
    model3 = lm(log(medv)~crim+chas+nox+rm+dis+rad+tax+ptratio+black+lstat+
                  I(crim^2)+I(chas^2)+I(nox^2)+I(rm^2)+I(dis^2)+I(rad^2)+
                  I(tax^2)+I(ptratio^2)+I(black^2)+I(lstat^2), data=train)
    summary(model3)

    Call:
    lm(formula = log(medv) ~ crim + chas + nox + rm + dis + rad +
        tax + ptratio + black + lstat + I(crim^2) + I(chas^2) + I(nox^2) +
        I(rm^2) + I(dis^2) + I(rad^2) + I(tax^2) + I(ptratio^2) +
        I(black^2) + I(lstat^2), data = train)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.78263 -0.09843 -0.00799  0.10008  0.76342

    Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)   7.742e+00  9.621e-01   8.048 1.06e-14 ***
    crim         -2.532e-02  5.203e-03  -4.866 1.66e-06 ***
    chas          1.209e-01  3.481e-02   3.474 0.000572 ***
    nox          -3.515e-01  1.136e+00  -0.309 0.757224
    rm           -6.061e-01  1.394e-01  -4.349 1.75e-05 ***
    dis          -1.183e-01  2.563e-02  -4.615 5.36e-06 ***
    rad           1.831e-02  9.843e-03   1.860 0.063675 .
    tax          -4.160e-04  5.687e-04  -0.731 0.464961
    ptratio      -1.783e-01  7.748e-02  -2.301 0.021909 *
    black         1.450e-03  5.379e-04   2.695 0.007340 **
    lstat        -4.860e-02  6.009e-03  -8.088 8.05e-15 ***
    I(crim^2)     1.542e-04  8.700e-05   1.773 0.077031 .
    I(chas^2)            NA         NA      NA       NA
    I(nox^2)     -5.801e-01  8.492e-01  -0.683 0.494947
    I(rm^2)       5.239e-02  1.100e-02   4.762 2.73e-06 ***
    I(dis^2)      6.691e-03  2.077e-03   3.222 0.001383 **
    I(rad^2)      8.069e-05  3.905e-04   0.207 0.836398
    I(tax^2)     -2.715e-07  6.946e-07  -0.391 0.696114
    I(ptratio^2)  4.174e-03  2.203e-03   1.895 0.058860 .
    I(black^2)   -2.664e-06  1.187e-06  -2.244 0.025383 *
    I(lstat^2)    5.741e-04  1.663e-04   3.451 0.000620 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.1766 on 384 degrees of freedom
    Multiple R-squared:  0.8169, Adjusted R-squared:  0.8079
    F-statistic: 90.19 on 19 and 384 DF,  p-value: < 2.2e-16

    ##Removing the insignificant variables.
    model4=update(model3, ~.-nox-rad-tax-I(crim^2)-I(chas^2)-I(rad^2)-
                    I(tax^2)-I(ptratio^2)-I(black^2))
    summary(model4)

    Call:
    lm(formula = log(medv) ~ crim + chas + rm + dis + ptratio + black +
        lstat + I(nox^2) + I(rm^2) + I(dis^2) + I(lstat^2), data = train)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.73918 -0.09787 -0.00723  0.08868  0.82585

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  6.4071124  0.4571101  14.017  < 2e-16 ***
    crim        -0.0125562  0.0016777  -7.484 4.78e-13 ***
    chas         0.1353044  0.0356980   3.790 0.000174 ***
    rm          -0.7248878  0.1428717  -5.074 6.04e-07 ***
    dis         -0.0915153  0.0242616  -3.772 0.000187 ***
    ptratio     -0.0247304  0.0050367  -4.910 1.34e-06 ***
    black        0.0002375  0.0001134   2.094 0.036928 *
    lstat       -0.0461831  0.0061301  -7.534 3.44e-13 ***
    I(nox^2)    -0.6335121  0.1185127  -5.346 1.53e-07 ***
    I(rm^2)      0.0632918  0.0112473   5.627 3.49e-08 ***
    I(dis^2)     0.0049036  0.0020706   2.368 0.018363 *
    I(lstat^2)   0.0004675  0.0001692   2.763 0.006003 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.1852 on 392 degrees of freedom
    Multiple R-squared:  0.7946, Adjusted R-squared:  0.7888
    F-statistic: 137.9 on 11 and 392 DF,  p-value: < 2.2e-16

    par(mfrow=c(2,2))
    plot(model4)

    Gives this plot:

    Observation from summary (model4)

    Is there a relationship between predictor and response variables?
    F-Stat is 137.9 and it is far greater than 1. So there is a relationship between predictor and response variable.

    Which of the predictor variables are significant?
    All predictor variables are significant.

    Is this model fit?
    R2 is 0.7946, which is higher (and better) than in our first and second models.

    Prediction

    Until now we have been checking the training error, but the real goal of the model is to reduce the testing error. Since we already split the sample into training and testing datasets, we will use the test dataset to evaluate the model we have arrived at. We will make predictions based on ‘Model 4’ and compare the predicted responses with the actual responses. RMSE measures, on average, how far the predicted values are from the actual values. Based on RMSE = 3.278, we can conclude that, on average, a predicted value will be off by about 3.278 from the actual value.

    pred1 <- predict(model4, newdata = test)
    rmse <- sqrt(sum((exp(pred1) - test$medv)^2)/length(test$medv))
    c(RMSE = rmse, R2 = summary(model4)$r.squared)

         RMSE        R2
    3.2782608 0.7946003

    par(mfrow=c(1,1))
    plot(test$medv, exp(pred1))

    Gives this plot:

    Conclusion

    The example shows how to approach linear regression modeling. The model that was created still has scope for improvement: we can apply techniques like outlier detection and correlation analysis to further improve prediction accuracy. One can also use advanced techniques like random forests and boosting to check whether accuracy can be improved further. A word of warning: we should refrain from overfitting the model to the training data, as the model's accuracy on the test data will suffer in the case of overfitting.

    Reference
      1. Statistics for Business By Robert Stine, Dean Foster
      2. An Introduction to Statistical Learning, with Applications in R. By G. James, D. Witten, T. Hastie, R. Tibshirani

      Related Post

      1. Linear Regression in Python; Predict The Bay Area’s Home Prices
      2. Building A Logistic Regression in Python, Step by Step
      3. Multicollinearity in R
      4. Scikit-Learn for Text Analysis of Amazon Fine Food Reviews
      5. Survival Analysis – Part I

      Where have you been? Getting my Github activity

      Thu, 12/21/2017 - 01:00

      (This article was first published on Maëlle, and kindly contributed to R-bloggers)

      On my pretty and up-to-date CV, one of the first things one sees is my Github username, linking to my Github profile. What does a potential employer look at there? Hopefully not my non informative commit messages… My imitating a red Ampelmann, my being part of several organizations, my pinned repositories described with emojis… But how would they know where&how I’ve mostly been active without too much effort?

      A considerable part of my Github work happens in organizations: I’m a co-editor at rOpenSci onboarding of packages, I contribute content to the R Weekly newsletter, etc. Although my profile shows the organizations I belong to, one would need to dig into them for a while before seeing how much or how little I’ve done. Which is fine most of the time but less so when trying to profile myself for jobs, right? Let’s try and fetch some Github data to create a custom profile.

      Note: yep I’m looking for a job and ResearchGate’s suggestions are not helpful! Do you need an enthusiastic remote data scientist or research software engineer for your team? I’m available up to 24 hours a week! I care a lot about science, health, open source and community. Ideally I’d like to keep working in something close to public research but we can talk!

      Getting all my activity

      Note: for another look at Github activity, more related to when people work, check out Dean Attali’s cool Shiny app!

      I’m a big fan of the simplistic gh package that I’ve used in my blog post about initial/first commits and in my blog post about random seeds. My workflow is reading the API documentation to find which endpoints I need to answer my question, then writing code that loops/maps over pages of answers from the API and unnests results, all of this inspired by the package README and now my own old posts. If I used this data more often, I imagine I’d need less copy-pasting from my previous uses, but in this situation I’m very happy that my past self published this stuff. I’m also thankful I set a GITHUB_PAT environment variable as explained here.
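      As a rough illustration of that gh workflow (a sketch, assuming a GITHUB_PAT environment variable is set; the username and page count are just for the example):

      library(gh)
      library(purrr)

      # Loop over a few pages of the public events endpoint and stack the results
      events <- map(1:3, function(page) {
        gh("/users/:username/events/public", username = "maelle", page = page)
      })
      length(flatten(events))   # capped at ~300 events in total, hence GitHub archive below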

      The problem with the /users/:username/events/public endpoint of Github API is that it only provides you with the 300 latest events from an user, which is not enough, so I had to branch out and look at another source of data, Github archive via Google BigQuery. I had never touched it and will let Google’s own words define it: “BigQuery is Google’s fully managed, petabyte scale, low cost analytics data warehouse.”. To interact with this tool, there are at least two R packages, bigQueryR and bigrquery. I chose the latter after reading the short comparison from the website of the former. I created a project as explained here.

      After that, here is the code I ran. I mostly adapted the example from Github archive page to:

      • ask for all my events in the last two years but not at once because of my quota for query bytes;

      • ask for my events either with my current username “maelle” or the one I had before that, “masalmon” (by the way I’m so happy I could claim “maelle” because that user was not active! I love having people use my first name to ping me!);

      • get all the events, not only issues opened, and their creation date.

      It took me less time than expected to get this working; I was helped by my familiarity with Github data and with SQL queries, or at least my capacity to update existing queries to serve my needs, ahahah.

      I started out using the basic free tier, but it turned out I was reaching the per-query quota all the time, so that was not going to work. I tried changing my query but finally took the plunge and used the free trial. You get $300, and one TB of querying costs $5. One day queried by the code below processes something in the magnitude of 10 gigabytes, so I was going to be fine (see the pricing information).
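      Back-of-the-envelope, with the figures above (roughly 10 GB processed per daily query, $5 per TB, two years of days):

      days_queried <- as.numeric(anytime::anydate("2017-12-19") - anytime::anydate("2015-12-20"))
      gb_per_day   <- 10                                    # rough size of one daily query
      cost_dollars <- days_queried * gb_per_day / 1000 * 5  # $5 per TB queried
      cost_dollars                                          # well under the $300 trial credit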

      # using the very handy anytime package!
      dates <- seq(from = anytime::anydate("2015-12-20"),
                   to = anytime::anydate("2017-12-19"),
                   by = "1 day")

      library("bigrquery")

      get_one_day_events <- function(day){
        project <- #projectid here
        sql <- paste0("/* count of issues opened, closed, and reopened between 1/1/2015 and 2/1/2015 */
          SELECT type, repo.name, actor.login, created_at,
            JSON_EXTRACT(payload, '$.action') as event,
          FROM (TABLE_DATE_RANGE([githubarchive:day.],
            TIMESTAMP('", as.character(day), "'),
            TIMESTAMP('", as.character(day + 1), "')
          ))
          WHERE (actor.login = 'maelle' OR actor.login = 'masalmon')
          ;")
        results <- try(query_exec(sql, project = project), silent = TRUE)
        if(is(results, "try-error")){
          return(NULL)
        }else{
          return(results)
        }
      }

      my_events <- purrr::map_df(dates, get_one_day_events)
      readr::write_csv(my_events, path = "data/2017-12-20-my_events.csv")

      I realized after the fact that I had done overlapping queries, well they overlapped in time so I got many events twice… Stupid me.

      Now we just need to wrangle the data a little bit. In particular, we’ll use the snakecase package to convert the upper camel case of the type column into something prettier (in my opinion).

      # no duplicate
      my_events <- unique(my_events)

      # Better format for the type
      my_events <- dplyr::mutate(my_events, type = stringr::str_replace(type, "Event", ""))
      my_events <- dplyr::mutate(my_events, type = snakecase::to_parsed_case(type))
      my_events <- dplyr::mutate(my_events, type = stringr::str_replace_all(type, "_", " "))
      my_events <- dplyr::mutate(my_events, type = tolower(type))

      # get repo owner
      my_events <- dplyr::mutate(my_events, owner = stringr::str_replace_all(repo_name, "/.*", ""))

      # save with a fantastic filename... sigh
      readr::write_csv(my_events, path = "data/2017-12-20-my_events_clean.csv")

      We got 4991 events from 2015-12-28 09:53:23 to 2017-12-20 06:43:44!

      set.seed(42)
      knitr::kable(dplyr::sample_n(my_events, size = 10))

      type            repo_name                   actor_login   created_at            event       owner
      watch           lindbrook/cholera           maelle        2017-08-11 15:25:04   “started”   lindbrook
      push            maelle/maelle.github.io     maelle        2017-10-04 17:15:57   NA          maelle
      issues          masalmon/monkeylearn        masalmon      2016-05-19 08:34:51   “closed”    masalmon
      issue comment   ropensci/onboarding         maelle        2017-06-19 06:57:32   “created”   ropensci
      issue comment   osmdatar/osmdata            masalmon      2017-02-15 09:09:59   “created”   osmdatar
      issues          ropensci/pdftools           masalmon      2016-11-23 14:59:16   “opened”    ropensci
      issue comment   lockedata/PackageReviewR    maelle        2017-04-20 12:08:29   “created”   lockedata
      issue comment   leeper/colourlovers         masalmon      2016-03-21 16:30:39   “created”   leeper
      issue comment   osmdatar/osmdata            masalmon      2017-02-27 10:42:37   “created”   osmdatar
      push            maelle/eml.tools            maelle        2017-03-23 11:07:45   NA          maelle

      Analysing my activity

      What kind of events are there?

      Events as defined in this Github data can be for instance

      • commenting in issues (main job as an rOpenSci editor, ahah),

      • pushing stuff (sadly this count as one event no matter the size of the commit, which is a limitation of this dataset, and I’m not willing to complete it).

      my_events %>%
        dplyr::group_by(type) %>%
        dplyr::summarize(n = n()) %>%
        dplyr::arrange(- n) %>%
        knitr::kable()

      type                             n
      push                          1872
      issue comment                 1491
      issues                         705
      watch                          544
      create                         151
      pull request                    99
      fork                            47
      commit comment                  34
      release                         17
      member                          10
      pull request review comment      9
      public                           7
      delete                           3
      gollum                           2

      All event types are defined here. I had no idea what a Gollum event was, but it’s not surprising given I almost never updated or created a Wiki. Member events are the ones corresponding to that highly exciting moment when you’re added (or well, removed, less exciting) as a collaborator to a repository, or when your permissions are changed.

      Regarding the difference between issues events and issue comments event, according to the API docs,

      • an issue event is “Triggered when an issue is assigned, unassigned, labeled, unlabeled, opened, edited, milestoned, demilestoned, closed, or reopened.”(at rOpenSci onboarding, we use assignments and labelling a lot)

      • while an issue comment event is “Triggered when an issue comment is created, edited, or deleted.”.

      How many events over time?

      Although that’s clearly not my goal here, I was curious to have a look at time series of counts of events! You can’t blame me, time series of counts were the subject of my dissertation, in which I luckily did a bit more than plotting some.

      We shall concentrate on most frequent events.

      library("ggplot2") library("hrbrthemes") library("lubridate") my_events %>% dplyr::filter(type %in% c("push", "issue comment", "issues", "watch", "create", "pull request")) %>% dplyr::mutate(day = as.Date(created_at), week = update(day, wday = 1)) %>% dplyr::group_by(week, type) %>% dplyr::summarise(n = n()) %>% ggplot() + geom_line(aes(week, n)) + facet_grid(type ~ ., scales = "free_y") + theme_ipsum(base_size = 20, axis_title_size = 12) + theme(legend.position = "none", strip.text.y = element_text(angle = 0)) + ggtitle("My Github contributions in the two last years", subtitle = "Data queried from Github archive on Google BigQuery")

      What I notice is:

      • there’s a decrease in my activity, as seen by issue comments and issues created or closed, since the autumn. In September, we moved and had no internet connection for a while, and then we had a baby, with whom I’m now on maternity leave, so this is not a surprise!

      • the number of pushes seems constant which might be surprising given I do code less in my experience… but my pushes, and of pull requests by the way, contain many contributions to R Weekly, where I might update the draft each time I see something interesting to be shared! Now that I’m a member of the organization, I do not need pull requests for that so the number of pull requests is going down, which might be the reason why they added me to the organization… no need to merge my contributions anymore!

      • Speaking of pull requests, this was something I was intimidated by (and I haven’t used branches a lot in my own projects) until Jim Hester helped me create my first PR.

      • What Github calls watch events are most often starring events in my case.

      A look at the places where I’ve been active

      Let’s look where most of my events took place.

      my_events %>%
        dplyr::group_by(owner) %>%
        dplyr::summarize(n = n()) %>%
        dplyr::arrange(- n) %>%
        dplyr::filter(n > 40) %>%
        knitr::kable()

      owner             n
      masalmon       1663
      ropensci        974
      maelle          421
      ropenscilabs    364
      openaq          135
      osmdatar        135
      rweekly         113
      hrbrmstr         67
      openjournals     46
      cvitolo          44
      rladies          43

      My events are scattered among 284 owners and 712 repos (I have starred more than 500 repos). I recognize well my two usernames, rOpenSci organizations ropensci and ropenscilabs, R Weekly, etc. I’m surprised I have many events in OpenAQ organization, I thought my contributions were nearly all in the repo of my R package accessing their data! But thinking of it I did participate in conversations in issues. R-Ladies organization is also not one where I felt I had done that much, because my R-Ladies involvement is much more related to the co-founding and organization of the Barcelona chapter, and these days tweeting for R-Ladies Global. I’m also glad to see “hrbrmstr” a.k.a Bob Rudis, in the complete list there were other such owners: people that have fantastic packages and make it a pleasure to open issues, contribute code, etc.

      Now I’d like to look at the breakdown of event types by owner of the repo. We’ll change the owner variable a bit to get something more interpretable.

      orgs <- dplyr::mutate(my_events,
                            blog = repo_name == "maelle/maelle.github.io",
                            my_code = owner %in% c("maelle", "masalmon") & !blog,
                            ropensci_onboarding = repo_name %in% c("ropensci/onboarding", "ropensci/onboarding-meta"),
                            ropensci_misc = owner %in% c("ropensci", "ropenscilabs") & !ropensci_onboarding,
                            rweekly = owner == "rweekly",
                            work_from_others = !blog & !my_code & !ropensci_onboarding & !ropensci_misc & !rweekly) %>%
        tidyr::gather("organisation", "value", blog:work_from_others) %>%
        dplyr::filter(value)

      Again we filter event types, even removing watch/starring events this time.

      orgs %>%
        dplyr::filter(type %in% c("push", "issue comment", "issues", "create", "pull request")) %>%
        ggplot() +
        geom_bar(aes(organisation, fill = type)) +
        theme_ipsum(base_size = 20, axis_title_size = 16) +
        theme(strip.text.y = element_text(angle = 0),
              axis.text.x = element_text(angle = 90, hjust = 1)) +
        ggtitle("My Github contributions in the two last years",
                subtitle = "Data queried from Github archive on Google BigQuery") +
        viridis::scale_fill_viridis(discrete = TRUE)

      orgs %>%
        dplyr::filter(type %in% c("push", "issue comment", "issues", "create", "pull request")) %>%
        ggplot() +
        geom_bar(aes(organisation, fill = type), position = "fill") +
        theme_ipsum(base_size = 20, axis_title_size = 16) +
        theme(strip.text.y = element_text(angle = 0),
              axis.text.x = element_text(angle = 90, hjust = 1)) +
        ggtitle("My Github contributions in the two last years",
                subtitle = "Data queried from Github archive on Google BigQuery") +
        viridis::scale_fill_viridis(discrete = TRUE)

      It looks as if blogging were a very small part of my work in my own repos but I think it’s due to the fact that I mostly push all material related to one blog post at once. Interestingly I seem to only create repositories under my own brand, which I think is true in general, although I did end up transferring a few packages to rOpenSci. Then, clearly in rOpenSci onboarding, I seem to have done a lot of issue commenting/labelling, etc. This is no surprise. I have also done that for miscellaneous repositories that do not belong to me: I have contributed a bit of code but have talked a lot. As mentioned earlier, my R Weekly contributions are mostly my committing one new entry each time I read something good.

      What about you, dear readers?

      I think I have only scraped the surface of how one could use Github data to profile themselves (or others, of course). Not to mention all the things one could do with Github archive… See this list for inspiration; and do not hesitate to add suggestions or links to your work in the comments! Where have you been?

      Last but not least, if you’d like to contribute more to R, for instance on Github, you should check out these slides by Charlotte Wickham, they’re full of great tips!


      To leave a comment for the author, please follow the link and comment on their blog: Maëlle. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      Let it flow, let it flow, let it flow……

      Thu, 12/21/2017 - 01:00

      (This article was first published on HighlandR, and kindly contributed to R-bloggers)

      Animating dot plots to demonstrate patient flow –

      Dots..

      This is not the blog post I’d originally intended to write. But I’m glad – because this one is so much better.

      Some background. I’m one of the few Scotland based members of AphA – the Association of Professional Healthcare Analysts. They’ve had a couple of events recently that I was keeping my eye on via Twitter and it became apparent that a session demonstrating R had made some waves – in a good way.

      I’d been having a wee exchange with Neil Pettinger regarding R and took the opportunity to ask permission to use one of his Excel files. This featured a dot plot chart that demonstrated patient flow.
      I wanted to show an alternative way of creating the plot using R.

      Neil is an “information training consultant” – at least that’s what it says on his website but for me that is shorthand for guru of statistics, information and analysis with a huge breadth of experience working in and alongside analytical teams in the UK NHS. There can’t be many boards/ trusts within the NHS that Neil has not worked with at some point.

      Neil pointed out that instead of the 2014 version I was referring to, there was a new updated version, and he kindly forwarded me that instead, along with a guide on how to produce the plot using Excel.

      The plot in question shows admissions / discharges and transfers in/out on a fictional “good day” in a fictional hospital. I’m not going to try and explain Neil’s philosophy and vision regarding patient flow – (he sent me a brilliant presentation featuring works of art & sculpture -which if you attend one of his training sessions you may be fortunate to see) but suffice to say this plot is merely a starting point.

      Here it is:

      We have red dots for admissions (extra work/resource required), green for discharges (reduction in workload/ resource requirements).

      Transfers (moving from one area/ward within the hospital to another) are represented by grey dots.

      Finally, the above are split out across 3 “staging posts” – Accident & Emergency (A&E), Assessment areas and Wards.
      For example, if a patient comes to A&E, gets sent for assessment, and is then sent to a ward for further care, then they have 5 moves recorded in the dataset (into A&E, out of A&E, into the Assessment area, out of the Assessment area, and then into the ward).

      If you want to follow along (go on) then you should head over to Neil’s site, download the excel file and take a look at the “how to” guide on the same page. Existing R users are already likely to be shuddering at all the manual manipulation required.

      For the first attempt, I followed Neil’s approach pretty closely, resulting in a lot of code to sort and group, although ggplot2 made the actual plotting much simpler.
      I shared my very first attempt (produced with barely any ggplot2 code), which was quite good, but there were a few issues – the ins/outs being coloured blue instead of grey, and overplotting of several points.

      Here’s the code:

      # Set limits for plotting
      lims <- as.POSIXct(strptime(c("2014-09-03 00:00","2014-09-03 24:00"),
                                  format = "%Y-%m-%d %H:%M"))

      ggplot(plot_data, aes(Movement15, Movement_15_SEQNO, colour = Movement_Type)) +
        geom_point() +
        facet_grid(Staging_Post~.) +
        scale_x_datetime(date_labels = "%H:%M", date_breaks = "3 hours",
                         limits = lims, timezone = "CET", expand = c(0,0)) +
        theme_minimal() +
        theme(legend.position = "bottom") +
        theme(panel.grid.minor = element_blank(),
              panel.grid.major = element_blank())

      And here’s what you get

      If you try and run that code as is – it won’t work, because we need to do some data prep first..

      The next day, I came back to the task with a view to coming up with a “slicker” process.
      I realised the bulk of the prep work required in Excel was actually straightforward in SQL.
      If you don’t know SQL, that’s OK – because this is R, and we have dplyr.

      Here’s how to do it:

      data <- read_xlsx("RedGreenGreyDots.xlsx", sheet = 1) # read raw data from Excel

      plot_data <- data %>%
        mutate(Movement15 = lubridate::floor_date(MovementDateTime, "15 minutes")) %>%
        group_by(IN_OUT, Movement_Type, Staging_Post, Movement15) %>%
        mutate(counter = case_when(IN_OUT == 'IN' ~ 1,
                                   IN_OUT == 'OUT' ~ -1)) %>%
        mutate(Movement_15_SEQNO = cumsum(counter)) %>%
        ungroup()

      # Change "Transfer In" or "Transfer Out" to "Transfer"
      plot_data$Movement_Type <- gsub("Transfer.*", "Transfer", x = plot_data$Movement_Type)

      NOW you can try running the ggplot2 code above and you should get a very similar looking plot.

      If you’re new to R / dplyr and wondering what the heck’s going on:

      • First we read the data in using the read_xlsx function from the readxl package

      • (Not shown) check the structure of the dataframe and make sure the data types all check out (always an especially good idea when importing from Excel)

      • use dplyr’s mutate function to create a new variable that groups all the movements into 15 minute intervals, which is a piece of cake with lubridate’s floor_date function.

      • Next we group the data by the IN_OUT, Movement_Type, Staging_Post and our new Movement15 variable

      • We then create another new column, this time to create a counter field, with a value of 1 when the IN_OUT column = “IN” (so that these points appear above the x axis horizon) and -1 when the value is “OUT” (so the points display below the horizon)

      • We create yet another variable, giving the cumulative sum of the counter field, so that we have dots that stack on top of each other at each 15 minute interval (rather than just having one point representing the maximum / minimum values at each stage)

      The last thing to do is to tidy up the Movement_Type field – we don’t want 4 movement types on our final plot, so we just change the values to “Transfer” – they are colour coded grey regardless of whether they are a transfer in or transfer out.

      Plots..

      Here is how the plot looks once we work some ggplot2 magic:

      ggplot(plot_data, aes(Movement15, Movement_15_SEQNO, colour = Movement_Type)) +
        geom_jitter(width = 0.10) +
        scale_colour_manual(values = c("#D7100D", "#40B578", "grey60")) +
        facet_grid(Staging_Post~., switch = "y") +
        scale_x_datetime(date_labels = "%H:%M", date_breaks = "3 hours",
                         limits = lims, timezone = "CET", expand = c(0,0)) +
        ggtitle(label = "Anytown General Hospital | Wednesday 3rd September 2014 00:00 to 23:59\n",
                subtitle = "A&E AND INPATIENT ARRIVALS, DEPARTURES AND TRANSFERS") +
        labs(x = NULL, y = NULL) +
        theme_ipsum(base_family = "Arial Narrow") +
        theme(axis.text.y = element_blank(),
              axis.ticks.y = element_blank()) +
        theme(axis.text.x = element_text(size = 7)) +
        theme(axis.ticks.x = element_blank()) +
        theme(legend.position = "bottom") +
        theme(panel.grid.minor = element_blank(),
              panel.grid.major = element_blank()) +
        theme(strip.text.y = element_text(angle = 0)) +
        guides(color = guide_legend("Movement Type"))

      Not bad. Being really picky – I haven’t replicated the colour coding behind each of the facet labels that Neil had on his original plot (A&E should have a yellow background, for instance). Similarly, the clever colour coding of the title text on the original chart does away with the need for a legend.
      This is kind of annoying because it’s very easy to do in Excel but a real faff to emulate in ggplot2 (probably for good reason). I have found a couple of solutions to resolve both problems, but they involve a lot of mucking about with grid settings and I’m not sure it’s worth the aggro.

      A key change to the original ggplot2 code, apart from manually specifying the colours, is that I’m using geom_jitter instead of geom_point. This helps get rid of over-plotting issues when 2 different Movement Types occur at exactly the same 15 minute interval.

      During our initial discussion, Neil asked if it was possible to animate plots in R. I reassured him it was, but had to be honest and told him it was not something I had experience of. There was a jokey aside along the lines of “ lets get the static plots working, we can make movies later”.

      Well, here is the animated version of the above plot:

      This was produced using very similar ggplot2 code to the static version.
      The key difference was the use of the gganimate package, and the use of the “frame” and cumulative arguments, which basically instruct ggplot2 to make a plot showing the cumulative picture of movements from midnight up to each 15 minute segment.
      For example, the first image shows midnight to quarter past, the second midnight to half past, then midnight to quarter to one, then midnight to 1 AM, and so on.
      The resulting plots are then magically stitched together and saved to the desired output – in this case as a gif.

      There are a couple of ways of setting the animation options for the timescale (interval- basically how fast you want the plot to cycle through from start to finish).

      You can simply tag on an “ interval = x” argument when saving the plot, or you can use the animation package, and set the options, plot size etc up front.
      This is the approach I used, as you’ll see in the accompanying code, but either should work – you can uncomment /comment out the relevant lines in the code and try both. There’s a link to the repo at the top of the post.
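      For reference, here is a condensed sketch of that approach using the original gganimate API (the frame/cumulative aesthetics described above, since superseded in later versions of the package); the file name and interval are just examples, and the full code is in the linked repo:

      library(gganimate)   # original API, with gganimate() and frame/cumulative aesthetics

      p <- ggplot(plot_data,
                  aes(Movement15, Movement_15_SEQNO, colour = Movement_Type,
                      frame = Movement15, cumulative = TRUE)) +  # build up 15-minute slices
        geom_jitter(width = 0.10) +
        facet_grid(Staging_Post ~ .)

      # either set the speed when saving...
      gganimate(p, "patient_flow.gif", interval = 0.2)

      # ...or set it up front via the animation package
      # animation::ani.options(interval = 0.2)
      # gganimate(p, "patient_flow.gif")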

      ..& Dr Cox

      To finish up – I let Neil have a sneak peak at this post and he came up with the seasonally inspired title.
      As it’s that time of year, and this is likely to be my last post before then – seasons greetings to all / both my readers, hope Santa is good to you, stay safe and well.

      All the best,

      John


      To leave a comment for the author, please follow the link and comment on their blog: HighlandR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      How to Greatly Speed Up Your Spark Queries

      Wed, 12/20/2017 - 17:52

      (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

      For some time we have been teaching R users "when working with wide tables on Spark or on databases: narrow to the columns you really want to work with early in your analysis."

      The idea behind the advice is: working with fewer columns makes for quicker queries.


      photo: Jacques Henri Lartigue 1912

      The issue arises because wide tables (200 to 1000 columns) are quite common in big-data analytics projects. Often these are "denormalized marts" that are used to drive many different projects. For any one project only a small subset of the columns may be relevant in a calculation.

      Some wonder whether this is really an issue, or something one can ignore in the hope that the downstream query optimizer fixes the problem. In this note we will show the effect is real.

      Let’s set up our experiment. The data is a larger version of the problem from "Let’s Have Some Sympathy For The Part-time R User". We have expanded the number of subjects to 100000 and added 1000 irrelevant columns to the example. We define a new function that uses dplyr and sparklyr to compute the diagnoses. We vary whether the table is first limited to the columns of interest and whether the results are brought back to R.
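      The code below assumes a Spark connection and a wide Spark table dT already exist. A minimal sketch of that setup (at a much smaller scale than the post's 100000 subjects and 1000 extra columns; the names and sizes here are illustrative assumptions):

      library(sparklyr)
      library(dplyr)

      sc <- spark_connect(master = "local")

      # a wide example table: the relevant columns plus many irrelevant ones
      d <- data.frame(subjectID = rep(seq_len(1000), each = 2),
                      surveyCategory = c("withdrawal behavior", "positive re-framing"),
                      assessmentTotal = sample(0:10, 2000, replace = TRUE))
      for(i in seq_len(50)) d[[paste0("irrelevantCol_", i)]] <- runif(nrow(d))

      dT <- copy_to(sc, d, name = "dT", overwrite = TRUE)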

      scale <- 0.237

      dplyr_run <- function(narrow, collect = FALSE) {
        dR <- dT
        if(narrow) {
          dR <- dR %>%
            select(subjectID, surveyCategory, assessmentTotal)
        }
        dR <- dR %>%
          group_by(subjectID) %>%
          mutate(probability = exp(assessmentTotal * scale)/
                   sum(exp(assessmentTotal * scale))) %>%
          arrange(probability, surveyCategory) %>%
          filter(row_number() == n()) %>%
          ungroup() %>%
          rename(diagnosis = surveyCategory) %>%
          select(subjectID, diagnosis, probability) %>%
          arrange(subjectID)
        if(collect) {
          dR <- collect(dR)
        } else {
          dR <- compute(dR)
        }
        dR
      }

      head(dplyr_run(narrow=FALSE))

      ## # Source: lazy query [?? x 3]
      ## # Database: spark_connection
      ## # Ordered by: probability, surveyCategory, subjectID
      ##   subjectID           diagnosis probability
      ##
      ## 1         1 withdrawal behavior       0.559
      ## 2         2 withdrawal behavior       0.500
      ## 3         3 positive re-framing       0.616
      ## 4         4 positive re-framing       0.559
      ## 5         5 withdrawal behavior       0.616
      ## 6         6 positive re-framing       0.869

      head(dplyr_run(narrow=TRUE))

      ## # Source: lazy query [?? x 3]
      ## # Database: spark_connection
      ## # Ordered by: probability, surveyCategory, subjectID
      ##   subjectID           diagnosis probability
      ##
      ## 1         1 withdrawal behavior       0.559
      ## 2         2 withdrawal behavior       0.500
      ## 3         3 positive re-framing       0.616
      ## 4         4 positive re-framing       0.559
      ## 5         5 withdrawal behavior       0.616
      ## 6         6 positive re-framing       0.869

      We can get timings for variations of the function:

      library("microbenchmark") timings <- microbenchmark(dplyr_run(narrow=FALSE), dplyr_run(narrow=TRUE), times = 20)

      And then present the results:

      print(timings)
      ## Unit: milliseconds
      ##                       expr       min        lq     mean   median       uq
      ##  dplyr_run(narrow = FALSE) 2371.2845 2432.6255 2545.488 2479.240 2526.328
      ##   dplyr_run(narrow = TRUE)  937.8512  974.4722 1128.068 1010.134 1080.414
      ##       max neval
      ##  4023.869    20
      ##  2968.788    20

      tdf <- as.data.frame(timings)

      # order the data
      tdf <- tdf %>%
        group_by(., expr) %>%
        mutate(., mtime = median(time)) %>%
        ungroup(.)

      tdf$expr <- reorder(tdf$expr, tdf$mtime)

      WVPlots::ScatterBoxPlotH(tdf, "time", "expr",
                               pt_alpha=0.2,
                               title="Execution times in NS")

      Notice the runs where we have not pre-narrowed the table are indeed much slower.

      The advice is confirmed: narrow to the columns of interest early in your analysis.

      Of course, narrowing to the exact columns used can be difficult: it can involve inspecting an arbitrarily long pipeline for column uses. That is part of why we are developing a new R query generator that automates that procedure: rquery.


      To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      How to highlight countries on a map

      Wed, 12/20/2017 - 17:30

      (This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)



      Last week, we created a map of world oil production, by country.

      This week, I’ll show you how to make a slight modification. I’ll show you how to highlight specific countries according to a variable in your data frame.

      In this code, we will re-use our data from the last tutorial. You can find the code to create the data there.

      Before we do any plotting, let’s just inspect the data:

      map.oil %>% glimpse()

      # Observations: 99,338
      # Variables: 9
      # $ long             -69.89912, -69.89571, -69.94219, -70.00415, -70.06612,...
      # $ lat              12.45200, 12.42300, 12.43853, 12.50049, 12.54697, 12.5...
      # $ group            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, ...
      # $ order            1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17,...
      # $ region           "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", ...
      # $ subregion        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
      # $ rank             NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
      # $ opec_ind         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
      # $ oil_bbl_per_day  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

      So what do we have here?

      The data contains variables that enable us to plot countries as polygons: long, lat, region, and group.

      The data also contains a variable called oil_bbl_per_day, which is essentially the amount of oil produced by the country per day.

      Let’s make a basic map that plots this data, with each country “filled in” according to the amount of oil it produces.

      #=====
      # PLOT
      #=====

      # BASIC (this is a first draft)
      ggplot(map.oil, aes(x = long, y = lat, group = group)) +
        geom_polygon(aes(fill = oil_bbl_per_day))



      Next, let’s make a slightly different map. Here, we’re going to remove the mapping to the fill aesthetic, and we’re going to map a different variable – opec_ind – to the color aesthetic.

      #----------------------------------------------
      # PLOT with red highlight around OPEC countries
      #----------------------------------------------
      ggplot(map.oil, aes(x = long, y = lat, group = group)) +
        geom_polygon(aes(color = as.factor(opec_ind))) +
        scale_color_manual(values = c('1' = 'red', '0' = NA))



      Essentially, what we’ve done here, is used the color aesthetic in combination with scale_color_manual() to manipulate the border color of the countries. Specifically, we have just highlighted OPEC countries with the color red.

      Now, let’s combine the two techniques: we will fill in the color of the countries using the fill aesthetic, and we will highlight the OPEC countries by mapping a variable to the color aesthetic.

      #-------------------------------------------
      # PLOT
      # - red highlight for OPEC
      # - fill value corresponds to oil production
      #-------------------------------------------
      ggplot(map.oil, aes(x = long, y = lat, group = group)) +
        geom_polygon(aes(color = as.factor(opec_ind), fill = oil_bbl_per_day)) +
        scale_color_manual(values = c('1' = 'red', '0' = NA))



      There’s more that we’ll need to do to create the finalized version, but all things considered, this is pretty good. It essentially shows the information we want to display … it just needs some formatting.

      So, now let’s create the final, formatted map.

      #=====================
      # FINAL, FORMATTED MAP
      #=====================
      ggplot(map.oil, aes(x = long, y = lat, group = group)) +
        geom_polygon(aes(fill = oil_bbl_per_day, color = as.factor(opec_ind))) +
        scale_fill_gradientn(colours = c('#461863','#404E88','#2A8A8C','#7FD157','#F9E53F')
                             ,values = scales::rescale(c(100,96581,822675,3190373,10000000))
                             ,labels = comma
                             ,breaks = c(100,96581,822675,3190373,10000000)
                             ) +
        guides(fill = guide_legend(reverse = T)) +
        labs(fill = 'Barrels per day\n2016'
             ,color = 'OPEC Countries'
             ,title = 'OPEC countries produce roughly 44% of world oil'
             ,x = NULL
             ,y = NULL) +
        theme(text = element_text(family = 'Gill Sans', color = '#EEEEEE')
              ,plot.title = element_text(size = 28)
              ,plot.subtitle = element_text(size = 14)
              ,axis.ticks = element_blank()
              ,axis.text = element_blank()
              ,panel.grid = element_blank()
              ,panel.background = element_rect(fill = '#333333')
              ,plot.background = element_rect(fill = '#333333')
              ,legend.position = c(.18,.36)
              ,legend.background = element_blank()
              ,legend.key = element_blank()
              ) +
        annotate(geom = 'text'
                 ,label = 'Source: U.S. Energy Information Administration\nhttps://en.wikipedia.org/wiki/List_of_countries_by_oil_production\nhttps://en.wikipedia.org/wiki/OPEC'
                 ,x = 18, y = -55
                 ,size = 3
                 ,family = 'Gill Sans'
                 ,color = '#CCCCCC'
                 ,hjust = 'left'
                 ) +
        scale_color_manual(values = c('1' = 'orange', '0' = NA),
                           labels = c('1' = 'OPEC'),
                           breaks = c('1'))



      Let’s point out a few things.

      First, the fill color scale has been carefully crafted to optimally show differences between countries.

      Second, we are simultaneously using the highlighting technique to highlight the OPEC countries.

      Finally, notice that we’re using the title to “tell a story” about the highlighted data.

      All told, there is a lot going on in this example.

      To create something like this, you need to understand basic ggplot2 syntax, dplyr, visual design, data storytelling, and more.

      Although it could take you a very long time to learn how to do this, if you know how to learn and how to practice data science, you could learn to do this within a few months (or faster).

      Sign up now, and discover how to rapidly master data science

      It’s possible to learn and master data science tools faster than you thought possible.

      Even though the example in this blog post is complicated, it is very easy to learn to create visualizations like this, if you know what tools to learn, how to practice those tools, and how to put those tools together.

      Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible. We teach data science, but we also teach you how to learn.

      Do you want to know more?

      Sign up now for our email list, and you’ll receive regular tutorials and lessons.

      You’ll learn:

      • How to do data visualization in R
      • How to practice data science
      • Learning hacks to accelerate your progress
      • … and more

      If you sign up for our email list right now, you’ll also get access to our “Data Science Crash Course” for free.

      SIGN UP NOW

       

      The post How to highlight countries on a map appeared first on SHARP SIGHT LABS.


      To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      Surprising stories hide in seemingly mundane data

      Wed, 12/20/2017 - 16:37

      (This article was first published on R – thinkr, and kindly contributed to R-bloggers)

      Geospatial experimentation

      Recently I experimented with geospatial mapping techniques in R.  I looked at both static and interactive maps. Embedding the media into a WordPress blog would be simple enough with a static map. The latter would require (for me) a new technique to retain the interactivity inside a blog post.

      My web-site visitor log, combined with longitude and latitude data from MaxMind’s GeoLite2, offered a basis for analysis. Although less precise than the GeoIP2 database, this would be more than adequate for my purpose of getting to country and city level.  I settled on the Leaflet package for visualisation given the interactivity and pleasing choice of aesthetics.

      The results however were a little puzzling.


      Whiling away the hours in Kansas

The concentration of page views in central London came as no immediate surprise, as this was likely to be my own site building, testing, and blogging. What did strike me as odd was the high concentration of page views in the centre of the US. More curious still, when I zoomed in on Kansas I found myself in the middle of the Cheney Reservoir.

      Non-interactive image of the Cheney Reservoir in Kansas, US

      I imagined someone drifting in the expanse of water with laptop, flask of coffee and box of sandwiches, whiling away the hours absorbed in my blog.  Perhaps not. How could such a small number of blog pages generate in excess of 2,000 page views in less than two months?

      Then I chanced upon a BBC news article from August 2016. When unable to locate IPs, MaxMind chose the geographical centre of the US as a default. This initially turned out to be a rented house in Kansas, which was rather unfortunate for the occupants, and brought upon them all kinds of unwanted attention.

      MaxMind subsequently changed its default centre points to be the middle of bodies of water. And this solved another puzzle. Some of the page views in London appeared to be in the middle of the River Thames.

R tools used

• purrr: map_df
• readr: read_csv
• rgeolocate: maxmind
• rgdal: readOGR
• dplyr: inner_join; mutate; arrange; if_else
• stringr: str_c
• leaflet: colorFactor; addProviderTiles; setView; addPolygons; addCircleMarkers; addLegend
• htmlwidgets: saveWidget

      The code may be viewed here.
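To give a feel for how these pieces fit together, here is a minimal Leaflet sketch rather than the author's actual code. It assumes a data frame views with lng, lat and country columns (for example, the visitor log joined to GeoLite2 coordinates); the tile provider and palette are illustrative choices.

library(leaflet)

pal <- colorFactor("Set1", domain = views$country)

m <- leaflet(views) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%   # illustrative tile provider
  setView(lng = 0, lat = 30, zoom = 2) %>%
  addCircleMarkers(lng = ~lng, lat = ~lat, color = ~pal(country),
                   radius = 4, stroke = FALSE, fillOpacity = 0.6) %>%
  addLegend(pal = pal, values = ~country, title = "Country")

m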

      WordPress integration
      1. Install the WordPress plugin iframe.
      2. Upload the htmlwidget (created in R) to the WordPress media library.
      3. Embed the following shortcode in the WordPress post (ensuring it’s wrapped in square brackets, and replacing xxx with the path of the uploaded media file): iframe src=”xxx” width=”100%” height=”370″.
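As a rough illustration of steps 2 and 3 (the file name and the map object m are assumptions carried over from the sketch above, not the post's actual commands):

library(htmlwidgets)
# save the interactive widget created in R to a self-contained HTML file
saveWidget(m, file = "views_map.html", selfcontained = TRUE)
# upload views_map.html to the WordPress media library, then embed it with the
# iframe shortcode from step 3, pointing src at the uploaded file's path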
      Citations / Attributions

      Includes GeoLite2 data created by MaxMind, available from http://www.maxmind.com.

      Map tiles by Stamen Design, CC BY 3.0 — Map data © OpenStreetMap

      OpenStreetMap © CartoDB

      World borders dataset provided by thematic mapping.org.

      The post Surprising stories hide in seemingly mundane data appeared first on thinkr.


To leave a comment for the author, please follow the link and comment on their blog: R – thinkr.

      Sub-Gaussian property for the Beta distribution (part 2)

      Wed, 12/20/2017 - 14:39

       

      Left: What makes the Beta optimal proxy variance (red) so special? Right: The difference function has a double zero (black dot).

      As a follow-up on my previous post on the sub-Gaussian property for the Beta distribution [1], I’ll give here a visual illustration of the proof.

A random variable $X$ with finite mean $\mu=\mathbb{E}[X]$ is $\sigma^2$-sub-Gaussian if there is a positive number $\sigma^2$ such that:

$\mathbb{E}\left[e^{\lambda(X-\mu)}\right]\le e^{\frac{\sigma^2\lambda^2}{2}}\quad\text{for all }\lambda\in\mathbb{R}.$

We focus on $X$ being a Beta$(\alpha,\beta)$ random variable. Its moment generating function is known as the Kummer function, or confluent hypergeometric function ${}_1F_1(\alpha;\alpha+\beta;\lambda)$. So $X$ is $\sigma^2$-sub-Gaussian as soon as the difference function

$u_{\sigma^2}(\lambda)=e^{\frac{\sigma^2\lambda^2}{2}}-\mathbb{E}\left[e^{\lambda(X-\mu)}\right]$

remains positive on $\mathbb{R}$. This difference function is plotted on the right panel above for a fixed choice of parameters $(\alpha,\beta)$. In the plot, $\sigma^2$ varies from green for the variance $\mathrm{Var}(X)$ (which is a lower bound to the optimal proxy variance) to blue for the value $\tfrac{1}{4(\alpha+\beta+1)}$, a simple upper bound given by Elder (2016), [2]. The idea of the proof is simple: the optimal proxy variance corresponds to the value of $\sigma^2$ for which $u_{\sigma^2}$ admits a double zero, as illustrated with the red curve (black dot). The left panel shows the curves $u_{\sigma^2}$ with $\sigma^2$ varying, interpolating from green for $\mathrm{Var}(X)$ to blue for $\tfrac{1}{4(\alpha+\beta+1)}$, with only one curve qualifying as the optimal proxy variance in red.
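To reproduce the flavour of these curves, here is a minimal R sketch; it is not the authors' code (which is linked below), the parameter values are illustrative assumptions rather than those of the figure, and the centred moment generating function is evaluated by numerical integration instead of the Kummer function.

library(ggplot2)

a <- 1; b <- 1.3                 # illustrative Beta parameters (an assumption)
mu <- a / (a + b)

# centred MGF E[exp(lambda * (X - mu))] of Beta(a, b), by numerical integration
mgf_centred <- function(lambda) {
  sapply(lambda, function(l)
    integrate(function(x) exp(l * (x - mu)) * dbeta(x, a, b), 0, 1)$value)
}

var_beta <- a * b / ((a + b)^2 * (a + b + 1))  # variance: lower bound on the proxy variance
elder    <- 1 / (4 * (a + b + 1))              # Elder (2016) upper bound

lambda <- seq(-20, 20, length.out = 400)
curves <- do.call(rbind, lapply(c(var_beta, elder), function(s2)
  data.frame(lambda = lambda,
             u      = exp(s2 * lambda^2 / 2) - mgf_centred(lambda),
             sigma2 = sprintf("sigma^2 = %.3f", s2))))

# sigma^2 is an admissible proxy variance when its curve stays non-negative everywhere
ggplot(curves, aes(lambda, u, colour = sigma2)) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed")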

      References

      [1] Marchal and Arbel (2017), On the sub-Gaussianity of the Beta and Dirichlet distributions. Electronic Communications in Probability, 22:1–14, 2017. Code on GitHub.
      [2] Elder (2016), Bayesian Adaptive Data Analysis Guarantees from Subgaussianity, https://arxiv.org/abs/1611.00065

