R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Configuring Azure and RStudio for text analysis

Mon, 07/02/2018 - 00:58

(This article was first published on R – conjugateprior, and kindly contributed to R-bloggers)

I just finished teaching Computer-Assisted Content Analysis at the IQMR summer school at Syracuse. With three lectures and three labs, the problem every year is getting the right R packages onto people’s machines. In particular, anything that involves compilation – and when you’re using quanteda, readtext, and stm, that’s lots of things – is going to be trouble. Over the years, R and the various operating systems it has to live in have got a lot better about this, but ultimately the best solution is… not to do it at all. That is, to run everything on somebody else’s computers, excuse me, ‘in the cloud’. When students access an appropriately provisioned RStudio Server through their browsers they’re good to go from Lab one.

This post is about how to set all that up.

For the last few years, I’ve been using Amazon’s AWS infrastructure – big shout to James Lo, who helped me set it up the first time around. It was a bit involved, but it definitely worked. Meanwhile Microsoft was polishing its competitor, Azure, so this year, after confirming I didn’t have to actually run Windows anywhere, I decided I’d give that a try instead.

tl;dr It’s pretty good. You should try it too.

It took about a weekend to get all the details right, probably because, while I’m pretty familiar with Unix, I’m no devops expert. Also, some text-related R packages need quite specific stuff on the server before they’ll work. Hopefully these instructions will save you some fraction of that time. They borrow shamelessly from Colin Gillespie’s excellent blog post from last year, but are pitched at less knowledgeable folk, i.e. previous me.

We’ll start by configuring an Azure virtual machine, set up R and RStudio, install some useful text analysis packages, and finally add users.

Configuring an Azure VM

First you have to sign up for Azure. This is a bit tedious, but your Skype / MS Live login credentials should work.

Say “yes I’d like to do the 30 day trial of Azure”. Eventually you’ll be presented with a ‘portal’ web page with a Dashboard. There’s an icon on the left side called Virtual Machines. Press it and the ‘Create virtual machine’ button.

I’m going to choose Ubuntu 17.10 (Artful Aardvark) from the Ubuntu Server collection, because… who knows what all these distributions are? The last time I seriously ran Linux it was all Redhat 9 (apart from that fateful Gentoo episode that we don’t talk about). Anyway, press ‘Create’ in the panel that appears on the right.

Let’s start with a ‘Basics’ tab. I’ll call it rstudio, set my username to conjugateprior, switch the authentication to ‘Password’ (don’t @ me, I’m trying to minimize the steps).

Now to invent an enormous and fiendishly hard-to-guess password that’s so witty it’s a tragedy no one can ever know it, and add it to your password manager and the password box, in that order. You do have a password manager, don’t you?

We’ll need a ‘Resource group’ though I’m not entirely sure why. In the screen shot I’m using one I made earlier. You can provide some arbitrary name and leave the button on ‘Create new’.

Next we’ll pick a machine size. We can pick a smallish one at first because it’s easy to supersize it. Right now, I’ll just use something about the size of my desktop.

Now to the settings. Most of the defaults seem OK, but we should open some ports: SSH so we can log in and HTTP for regular web access. A bit later we’ll also open 8787 because that’s where RStudio Server listens.

Apparently we’ll need a ‘Diagnostics storage account’. Just give it a name. I’ve no idea what one of these is, but now we’ve got one. That should be enough to get going.

The ‘Summary’ tells us that the VM itself will be about 20c per hour, although there will be storage costs on top of that. Still, it’s a free trial with $200 of stuff thrown in, so we’ll probably survive.

Here comes the machine. It takes about 5 minutes to get going.

When it arrives there’s an ‘Overview’ screen with graphs and suchlike representing the state of the virtual machine. So let’s talk to it. Over in the top right is the IP address we need (where it says ‘Public IP address’).

Now pull up a terminal window. That is: open Terminal if you’re on a Mac, and do the ‘CMD’ thing to launch the weird black box if you’re on Windows. Now to log in. If the IP address is 23.56.19.11 (it isn’t) then that would be

ssh conjugateprior@23.56.19.11

 

and we’re in. Time to update the system:

sudo apt-get update
sudo apt-get upgrade

That was the system. Now for R.

Set up R

This page describes how to get R onto an Ubuntu system, but for our purposes the instructions are in slightly the wrong order. The first thing we need to read is the part about ‘Secure APT’. We need the public key of the person who signs the Ubuntu R distribution before we can fetch it. This key-grabbing incantation should work first time:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9

Now we tell APT where to get R from by opening up the apt/sources.list file. Let’s do it in nano because that’s built in and I hate vi (keep not @-ing me, folks).

sudo nano /etc/apt/sources.list

At the bottom of this file we add the line

deb https://cloud.r-project.org/bin/linux/ubuntu artful/

to get R 3.4, then save and exit (Control-O to write out, then Control-X to exit).

Back at the command line, it’s time to update again

sudo apt-get update

and install base R

sudo apt-get install r-base

Normally this is the point that we install R’s development tools, but it seems that on this version of Ubuntu we don’t have to. Nice. I do want some other system stuff though because some of our text packages will depend on them. The comments tell you what they do

sudo apt-get install libxml2 libxml2-dev              # XML
sudo apt-get install libcairo2-dev                    # graphics device
sudo apt-get install libssl-dev libcurl4-openssl-dev  # web stuff
sudo apt-get install libapparmor-dev                  # needed by sys package apparently
sudo apt-get install libpoppler-cpp-dev               # text conversion, needed by readtext

Set up RStudio

Right, now back to the Portal to open up a port for RStudio. On the left side there is an icon for ‘Network settings’; click on that and then ‘Add inbound port rule’.

Now our inbound ports should be 80 (HTTP), 22 (SSH), and 8787 (RStudio’s port) plus three Azure thingies that we’ll leave well alone.

Back to the terminal. I want to install an RStudio Server. The instructions for that are here but here’s what they currently amount to:

sudo apt-get install gdebi-core
wget https://download2.rstudio.org/rstudio-server-1.1.453-amd64.deb
sudo gdebi rstudio-server-1.1.453-amd64.deb

Don’t forget to type ‘y’ when asked (N is the default).

RStudio should install and then start, so we’ll give it a moment before opening up a browser at the IP address shown in the portal, on port 8787. That would be

http://23.56.19.11:8787

Is there an RStudio login page there? There is.

Now, since nobody can remember IP addresses, we better give this thing a name. Once more into the Portal dear friends.

In the ‘Overview’, over on the right there is a ‘DNS name’ link called ‘configure’, press it and provide a name.

Now the RStudio login address is:

http://iqmr.eastus.cloudapp.azure.com:8787

which is still a bit of a mouthful but at least won’t change whenever we resize the machine. In my limited experimentation it seems like this address won’t always work for ssh though, so it’s as well to have the Portal open showing the current IP if you, like me, have the digit span of a chipmunk.

The RStudio server should now be running. If you want to stop it from the command line

sudo rstudio-server stop

and replace stop with start to get it going again.

Now for the hard part: installing things system-wide and adding users.

Installing R packages

Back on the command line:

sudo R

to get into R, then at the R prompt type

.libPaths()

There should be a system path first on the list, probably /usr/local/lib/R/site-packages. If that’s not there, add it as a lib argument to all lines below. But it should be there.
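Just in case it isn't, here is what the explicit form would look like; the path is only an example and should be whatever .libPaths() actually reported on your machine:

## Only needed if the site-wide library wasn't first in .libPaths()
install.packages("quanteda", lib = "/usr/local/lib/R/site-packages")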
Now to install a bunch of useful packages. We could do them all at once, but it’s better to do them one by one to check they all actually go through

install.packages("quanteda") # all things text install.packages("stm") # topic models install.packages("rvest") # scraping the web install.packages("dplyr") # needs no introduction install.packages("devtools") # to install stuff from github, 'remotes' might be lighter install.packages("rmarkdown") # create vignettes and suchlike install.packages("xtable") # needed so quanteda's overridden View function works install.packages("caTools") # because RStudio seems to needs this to compile Rmd files

That should keep us going for the show. Come on, it’s time to add users.

Add Users

On Ubuntu one can add users in bulk from a file with lines in the right format. A quick peek at /etc/passwd shows what that format looks like (there’s an x where the password would be in our file). So we can easily construct such a file from a list of names, with lines of the kind:

boo:boo_and_her_secretpassword:::boo,,,:/home/boo:/bin/bash
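If you would rather build that file from R, here is a rough sketch (not from the original post; the usernames, passwords, and the users.txt filename are made up for illustration):

## Hypothetical example: write newusers-format lines for a vector of usernames
users <- c("boo", "rey", "finn")
passwords <- c("boo_pw_1", "rey_pw_2", "finn_pw_3")  # generate real random passwords instead
lines <- sprintf("%s:%s:::%s,,,:/home/%s:/bin/bash", users, passwords, users, users)
writeLines(lines, "users.txt")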

Now, in theory, the newusers command can just be given this file. In practice it tends to segfault for longish files, for reasons known only to the Ubuntu devs. I worked around that by having a Python script repeatedly write one-line files and call newusers on each in a system call. This is annoying, but if you’ve ever had to configure a laptop soundcard you barely feel it. If you’re not expecting a lot of users you can add them one by one, e.g. by typing

sudo adduser boo

and filling in her details. Either way, each user will be able to log in to the RStudio server as long as the virtual machine is running. They will also be able to install packages into their local R directories, but maybe encourage them to ask you for packages instead, because it’s better to install them system-wide.

OK. We now have RStudio in the cloud, a bunch of users, and an R setup suitable for doing quite a lot of fun things with text. Maybe log in to RStudio as boo to check it all works. When you get tired, or you decide it’s Bedtime for Bank Balance, just roll over to the Portal and press the big stop button.


To leave a comment for the author, please follow the link and comment on their blog: R – conjugateprior.

nanotime 0.2.1

Mon, 07/02/2018 - 00:45

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new minor version of the nanotime package for working with nanosecond timestamps just arrived on CRAN.

nanotime uses the RcppCCTZ package for (efficient) high(er) resolution time parsing and formatting up to nanosecond resolution, and the bit64 package for the actual integer64 arithmetic. Initially implemented using the S3 system, it now uses a more rigorous S4-based approach thanks to a rewrite by Leonardo Silvestri.
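As a quick, unofficial illustration (not from the release announcement; the exact print format may differ between versions), constructing a nanotime from an ISO-style timestamp and doing integer arithmetic at nanosecond resolution looks roughly like this:

library(nanotime)
x <- nanotime("1970-01-01T00:00:00.000000001+00:00")  # one nanosecond past the epoch
x + 1                                                 # arithmetic works in nanoseconds via integer64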

This release brings three different enhancements / fixes that robustify usage. No new features were added.

Changes in version 0.2.1 (2018-07-01)
  • Added attribute-preserving comparison (Leonardo in #33).

  • Added two integer64 casts in constructors (Dirk in #36).

  • Added two checks for empty arguments (Dirk in #37).

We also have a diff to the previous version thanks to CRANberries. More details and examples are at the nanotime page; code, issue tickets etc at the GitHub repository.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Factfulness: Building Gapminder Income Mountains

Mon, 07/02/2018 - 00:00

(This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)

Abstract:

We work out the math behind the so-called income mountain plots used in the book "Factfulness" by Hans Rosling and use these insights to generate such plots using tidyverse code. The trip includes a mixture of log-normals, the density transformation theorem, histogram vs. density, and then skipping all those details again to make nice moving mountain plots.



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.

Introduction

Reading the book Factfulness by Hans Rosling seemed like a good thing to do during the summer months. The ‘possibilistic‘ writing style is contagious, and his TEDx presentations and media interviews are legendary teaching material on how to support your arguments with data. What a shame he passed away in 2017.

What is really enjoyable about the book is that the Gapminder web page allows you to study many of the graphs from the book interactively and contains the data for download. Being a fan of transparency and reproducibility, I got interested in the so-called income mountain plots, which show how incomes are distributed across the individuals of a population:




Screenshot of the 2010 income mountain plot. Free material from www.gapminder.org.

One notices that the "mountains" are plotted on a log-base-2 x-axis and without any y-axis annotation. Why? Furthermore, world income data usually involve mean income per country, so I got curious how, or even whether, these plots can be made without access to data at a finer level of granularity. The aim of this blog post is to answer these questions by using Gapminder data freely available from their webpage. The answer ended up as a nice tidyverse exercise and could serve as a motivating application for basic probability course content.

Data Munging Gapminder

Data on income, population and the Gini coefficient were needed to analyse the questions formulated above. I have done this previously in order to visualize the Olympic Medal Table Gapminder Style. We start by downloading the GDP data, which is the annual gross domestic product per capita by Purchasing Power Parities (PPP), measured in international dollars at fixed 2011 prices. Hence, inflation over the years and differences in the cost of living between countries are accounted for, so values can be compared across countries and years – see the Gapminder documentation for further details. We download the data from Gapminder, where they are available in wide format as an Excel file. For tidyverse handling we reshape them into long format.

## Download gdp data from gapminder - available under a CC BY-4 license.
if (!file.exists(file.path(fullFigPath, "gapminder-gdp.xlsx"))) {
  download.file("https://github.com/Gapminder-Indicators/gdppc_cppp/raw/master/gdppc_cppp-by-gapminder.xlsx",
                destfile = file.path(fullFigPath, "gapminder-gdp.xlsx"))
}

gdp_long <- readxl::read_xlsx(file.path(fullFigPath, "gapminder-gdp.xlsx"), sheet = 2) %>%
  rename(country = `geo.name`) %>%
  select(-geo, -indicator, -indicator.name) %>%
  gather(key = "year", value = "gdp", -country) %>%
  filter(!is.na(gdp))

Furthermore, we rescale GDP per year to daily income, because this is the unit used in the book.

gdp_long %<>% mutate(gdp = gdp / 365.25)

Similar code segments are written for (see the code on github for details)

  • the gini (gini_long) and population (pop_long) data
  • the regional group (=continent) each country belongs to (group)

The four data sources are then joined into one long tibble gm. For each year we also compute the fraction a country’s population makes up of the world population that year (column w) as well as the fraction within the year and region the population makes up (column w_region) :

## # A tibble: 15,552 x 9
##   country     region code   year  gini   gdp population          w  w_region
## 1 Afghanistan Asia   AFG    1800 0.305  1.65    3280000 0.00347    0.00518
## 2 Albania     Europe ALB    1800 0.389  1.83     410445 0.000434   0.00192
## 3 Algeria     Africa DZA    1800 0.562  1.96    2503218 0.00264    0.0342
## 4 Andorra     Europe AND    1800 0.4    3.28       2654 0.00000280 0.0000124
## 5 Angola      Africa AGO    1800 0.477  1.69    1567028 0.00166    0.0214
## # ... with 1.555e+04 more rows

Income Mountain Plots

The construction of the income mountain plots is thoroughly described on the Gapminder webpage, but without mathematical detail. With respect to the math it says: "Bas van Leeuwen shared his formulas with us and explained how to the math from ginis and mean income, to accumulated distribution shapes on a logarithmic scale." Unfortunately, the formulas are not shared with the reader. It’s not black magic though: The income distribution of a country is assumed to be log-normal with a given mean \(\mu\) and standard deviation \(\sigma\) on the log-scale, i.e. \(X \sim \operatorname{LogN}(\mu,\sigma^2)\). From knowing the mean income \(\overline{x}\) of the distribution as well as the Gini index \(G\) of the distribution, one can show that it’s possible to directly infer \((\mu, \sigma)\) of the log-normal distribution.

Because the Gini index of the log-normal distribution is given by \[
G = 2\Phi\left(\frac{\sigma}{\sqrt{2}}\right)-1,
\] where \(\Phi\) denotes the CDF of the standard normal distribution, and by knowing that the expectation of the log-normal is \(E(X) = \exp(\mu + \frac{1}{2}\sigma^2)\), it is possible to determine \((\mu,\sigma)\) as:

\[
\sigma = \frac{2}{\sqrt{2}} \Phi^{-1}\left(\frac{G+1}{2}\right)
\quad\text{and}\quad
\mu = \log(\overline{x}) - \frac{1}{2} \sigma^2.
\]

We can use this to determine the parameters of the log-normal for every country in each year.
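In R this boils down to two mutate() lines; here is a minimal sketch, assuming the joined tibble gm with columns gdp (mean daily income) and gini as shown above (the author's actual code lives on github):

## Log-normal parameters per country and year from mean income and Gini
gm <- gm %>%
  mutate(sdlog   = sqrt(2) * qnorm((gini + 1) / 2),
         meanlog = log(gdp) - 1/2 * sdlog^2)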

Mixture distribution

The income distribution of a set of countries is now given as a mixture distribution of log-normals, i.e. one component for each of the countries in the set, with a weight proportional to the population of the country. As an example, the world income distribution would be a mixture of the 192 countries in the Gapminder dataset, i.e.

\[
f_{\text{mix}}(x) = \sum_{i=1}^{192} w_i \>\cdot
\>f_{\operatorname{LogN}}(x; \mu_i, \sigma_i^2), \quad\text{where}
\quad w_i = \frac{\text{population}_i}{\sum_{j=1}^{192} \text{population}_j},
\] and \(f_{\operatorname{LogN}}(x; \mu_i, \sigma_i^2)\) is the density of the log-normal distribution with country specific parameters. Note that we could have equally used the mixture approach to define the income of, e.g., a continent region. With the above definition we define standard R-functions for computing the PDF (dmix), CDF (pmix), quantile function (qmix) and a function for sampling from the distribution (rmix) – see the github code for details.
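For concreteness, here is one way such helper functions could be written; this is a sketch following the definition above, not necessarily the author's github version:

## Density, CDF, quantile function and sampler for a mixture of log-normals
dmix <- function(x, meanlog, sdlog, w) {
  sapply(x, function(xi) sum(w * dlnorm(xi, meanlog = meanlog, sdlog = sdlog)))
}
pmix <- function(q, meanlog, sdlog, w) {
  sapply(q, function(qi) sum(w * plnorm(qi, meanlog = meanlog, sdlog = sdlog)))
}
qmix <- function(p, meanlog, sdlog, w) {
  ## invert pmix numerically on a generous income range
  sapply(p, function(pi) uniroot(function(q) pmix(q, meanlog, sdlog, w) - pi,
                                 interval = c(1e-8, 1e8))$root)
}
rmix <- function(n, meanlog, sdlog, w) {
  idx <- sample(seq_along(w), size = n, replace = TRUE, prob = w)
  rlnorm(n, meanlog = meanlog[idx], sdlog = sdlog[idx])
}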

We use the mixture approach to compute the density of the world income distribution obtained by "mixing" all 192 log-normal distributions. This is shown below for the world income distribution of the year 2015. Note the \(\log_2\) x-axis. This presentation is Factfulness‘ preferred way of illustrating the skewed income distribution.

## Restrict to year 2015
gm_recent <- gm %>% filter(year == 2015) %>% ungroup

## Make a data frame containing the densities of each region for the gm_recent dataset
df_pdf <- data.frame(log2x = seq(-2, 9, by = 0.05)) %>% mutate(x = 2^log2x)

pdf_region <- gm_recent %>% group_by(region) %>% do({
  pdf <- dmix(df_pdf$x, meanlog = .$meanlog, sdlog = .$sdlog, w = .$w_region)
  data.frame(x = df_pdf$x, pdf = pdf, w = sum(.$w), population = sum(.$population),
             w_pdf = pdf * sum(.$w))
})

## Total is the sum over all regions - note the summation is done on the original
## income scale and NOT the log_2 scale. However, one can show that in the special
## case the result on the log-base-2-scale is the same as summing the individual
## log-base-2 transformed densities (see hidden CHECKMIXTUREPROPERTIES chunk).
pdf_total <- pdf_region %>% group_by(x) %>%
  summarise(region = "Total", w = sum(w), pdf = sum(w_pdf))

## Expectation of the distribution
mean_mix <- gm_recent %>%
  summarise(mean = sum(w * exp(meanlog + 1/2 * sdlog^2))) %$% mean

## Median of the distribution
median_mix <- qmix(0.5, gm_recent$meanlog, gm_recent$sdlog, gm_recent$w)

## Mode of the distribution on the log2-scale (not transformation invariant!)
mode_mix <- pdf_total %>%
  mutate(pdf_log2x = log(2) * x * pdf) %>%
  filter(pdf_log2x == max(pdf_log2x)) %$% x

For illustration we compute a mixture distribution for each region using all countries within region. This is shown in the left pane. Note: because a log-base-2-transformation is used for the x-axis, we need to perform a change of variables, i.e. we compute the density for \(Y=\log_2(X)=g(X)\) where \(X\sim f_{\text{mix}}\), i.e. \[
f_Y(y) = \left| \frac{d}{dy}(g^{-1}(y)) \right| f_X(g^{-1}(y)) = \log(2) \cdot 2^y \cdot f_{\text{mix}}( 2^y) = \log(2) \cdot x \cdot f_{\text{mix}}(x), \text{ where } x=2^y.
\]

In the right pane we then show the region-specific densities, each weighted by their population fraction. These are then summed up to yield the world income distribution, shown as a thick blue line. The median of the resulting world income distribution is at 20.0 $/day, whereas the mean of the mixture is at an income of 39.9 $/day and the mode (on the log-base-2 scale) is 17.1 $/day. Note that the latter is not transformation invariant, i.e. the value is not the mode of the income distribution, but of \(\log_2(X)\).

To get the income mountain plots as shown in Factfulness, we additionally need to obtain number of people on the \(y\)-axis and not density. We do this by partitioning the x-axis into non-overlapping intervals and then compute the number of individuals expected to fall into a given interval with limits \([l, u]\). Under our model this expectation is

\[n \cdot (F_{\text{mix}}(u)-F_{\text{mix}}(l)),\]

where \(F_{\text{mix}}\) is the CDF of the mixture distribution and \(n\) is the total world population. The mountain plot below shows this for a given partition with \(n=7,305,116,647\). Note that \(2.5\cdot 10^8\) corresponds to 250 mio people. Also note the \(\log_2\) x-axis, and hence (on the linear scale) unequally wide intervals of the partitioning. Contrary to Factfulness‘, I prefer to make this more explicit by indicating the intervals explicitly on the x-axis of the mountain plot, because it is about number of people in certain income brackets.
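As a small illustration of the formula, using the pmix() mixture CDF and the 2015 data from above, the expected number of people earning between 8 and 16 $/day could be computed like this (a sketch, not part of the original code):

n_world <- sum(gm_recent$population)
n_world * (pmix(16, gm_recent$meanlog, gm_recent$sdlog, gm_recent$w) -
           pmix( 8, gm_recent$meanlog, gm_recent$sdlog, gm_recent$w))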

## Function to prepare the data.frame to be used in a mountain plot
make_mountain_df <- function(gm_df, log2x = seq(-2, 9, by = 0.25)) {
  ## Make a data.frame containing the intervals with appropriate annotation
  df <- data.frame(log2x = log2x) %>%
    mutate(x = 2^log2x) %>%
    mutate(xm1 = lag(x), log2xm1 = lag(log2x)) %>%
    mutate(xm1 = if_else(is.na(xm1), 0, xm1),
           log2xm1 = if_else(is.na(log2xm1), 0, log2xm1),
           mid_log2 = (log2x + log2xm1) / 2,
           width = (x - xm1),
           width_log2 = (log2x - log2xm1)) %>%
    ## Format the interval character representation
    mutate(interval = if_else(xm1 < 2,
                              sprintf("[%6.1f-%6.1f]", xm1, x),
                              sprintf("[%4.0f-%4.0f]", xm1, x)),
           interval_log2x = sprintf("[2^(%4.1f)-2^(%4.1f)]", log2xm1, log2x))

  ## Compute expected number of individuals in each bin.
  people <- gm_df %>% group_by(region) %>% do({
    countries <- .
    temp <- df %>% slice(-1) %>% rowwise %>%
      mutate(prob_mass = diff(pmix(c(xm1, x), meanlog = countries$meanlog,
                                   sdlog = countries$sdlog, w = countries$w_region)),
             people = prob_mass * sum(countries$population))
    temp %>% mutate(year = max(gm_df$year))
  })

  ## Done
  return(people)
}

## Create mountain plot data set for gm_recent with default spacing.
(people <- make_mountain_df(gm_recent))

## # A tibble: 176 x 13
## # Groups: region [4]
##   region log2x     x   xm1 log2xm1 mid_log2  width width_log2 interval    interval_log2x      prob_mass   people  year
## 1 Africa -1.75 0.297 0.25    -2      -1.88  0.0473       0.25 [ 0.2- 0.3] [2^(-2.0)-2^(-1.8)]   0.00134 1586808.  2015
## 2 Africa -1.5  0.354 0.297   -1.75   -1.62  0.0563       0.25 [ 0.3- 0.4] [2^(-1.8)-2^(-1.5)]   0.00205 2432998.  2015
## 3 Africa -1.25 0.420 0.354   -1.5    -1.38  0.0669       0.25 [ 0.4- 0.4] [2^(-1.5)-2^(-1.2)]   0.00307 3639365.  2015
## 4 Africa -1    0.5   0.420   -1.25   -1.12  0.0796       0.25 [ 0.4- 0.5] [2^(-1.2)-2^(-1.0)]   0.00448 5305674.  2015
## 5 Africa -0.75 0.595 0.5     -1      -0.875 0.0946       0.25 [ 0.5- 0.6] [2^(-1.0)-2^(-0.8)]   0.00636 7537067.  2015
## # ... with 171 more rows

This can then be plotted with ggplot2:

In light of all the talk about gaps, it can also be healthy to plot the income distribution on the linear scale. From this it becomes obvious that linearly there are indeed larger absolute differences in income, but, as argued in the book, the log-base-2 scale incorporates people’s perception of the worth of additional income.

Because the intervals are not equally wide, only the height of the bars should be interpreted in this plot. However, the eye perceives area, which in this case is misleading. Showing histograms with unequal bin widths is a constant dilemma between area, height, density and perception. The recommendation would be that if one wants to use the linear scale, then one should use equally wide linear intervals or directly plot the density. As a consequence, plots like the above are not recommended, but they do make the tail behaviour of the income distribution obvious – a feature which is somewhat hidden by the log-base-2-scale plots.

Of course none of the above plots looks as nice as the Gapminder plots, but they have proper x and y-axes annotation and, IMHO, are clearer to interpret, because they do not mix the concept of density with the concept of individuals falling into income bins. As the bin-width converges to zero, one gets the density multiplied by \(n\), but this complication of infinitesimal width bins is impossible to communicate. In the end this was the talent of Hans Rosling and Gapminder – to make the complicated easy and intuitive! We honor this by skipping the math1 and celebrate the result as the art it is!

## Make mountain plot with smaller intervals than in previous plot.
ggplot_oneyear_mountain <- function(people, ymax = NA) {
  ## Make the ggplot
  p <- ggplot(people %>% rename(Region = region),
              aes(x = mid_log2, y = people, fill = Region)) +
    geom_col(width = min(people$width_log2)) +
    ylab("Number of individuals") +
    xlab("Income [$/day]") +
    scale_x_continuous(minor_breaks = NULL, trans = "identity",
                       breaks = trans_breaks("identity", function(x) x, n = 11),
                       labels = trans_format(trans = "identity",
                                             format = function(x) ifelse(x < 0,
                                                                         sprintf("%.1f", 2^x),
                                                                         sprintf("%.0f", 2^x)))) +
    theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) +
    scale_y_continuous(minor_breaks = NULL, breaks = NULL, limits = c(0, ymax)) +
    ggtitle(paste0("World Income Mountain ", max(people$year))) +
    NULL

  ## Show it and return it.
  print(p)
  invisible(p)
}

## Create the mountain plot for 2015
gm_recent %>%
  make_mountain_df(log2x = seq(-2, 9, by = 0.01)) %>%
  ggplot_oneyear_mountain()

Discussion

Our replicated mountain plots do not exactly match those made by Gapminder (cf. the screenshot). It appears as if our distributions are located slightly more to the right. It is not entirely clear why there is a deviation, but one possible problem could be that we do the translation into income per day differently. I’m not an econometrician, so this could be a trivial blunder on my side; however, the values in this post are roughly of the same magnitude as the graph on p. 46 in van Zanden et al. (2011) mentioned on the Gapminder documentation page, whereas the Gapminder curves appear too far to the left. It might be worthwhile to check individual country results underlying the graphs to see where the difference is.

We end the post by animating the dynamics of the income mountains since 1950 using gganimate. To put it in possibilistic terms: Let the world move forward! It is not as bad as it seems. Facts matter.



Literature

van Zanden, J. L., J. Baten, P. Foldvari, and B. van Leeuwen. 2011. “The Changing Shape of Global Inequality – exploring a new dataset.” Working Papers 0001. Utrecht University, Centre for Global Economic History. https://ideas.repec.org/p/ucg/wpaper/0001.html.

  1. Shown is the expected number of individuals in thin bins of size 0.01 on the log-base-2-scale. As done in Factfulness we also skip the interval annotation on the x-axis and, as a consequence, do without y-axis tick marks, which would require one to explain the interval widths.


To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice....

Exploring The Quick, Draw! Dataset With R: The Mona Lisa

Sun, 07/01/2018 - 20:36

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

All that noise, and all that sound, all those places I have found (Speed of Sound, Coldplay)

Some days ago, my friend Jorge showed me one of the coolest datasets I’ve ever seen: the Google Quick, Draw! dataset. On its GitHub site you can see a detailed description of the data. Briefly, it contains around 50 million drawings made by people around the world, in .ndjson format. In this experiment, I used the simplified version of the drawings, where strokes are simplified and resampled with a 1 pixel spacing. Drawings are also aligned to the top-left corner and scaled to have a maximum value of 255. All these things make the data easier to manage and to represent in a plot.

Since .ndjson files may be very large, I used the LaF package to access random lines of the file rather than reading it completely. I wrote a script to explore The Mona Lisa.ndjson file, which contains more than 120,000 drawings that the TensorFlow engine from Google recognized as being The Mona Lisa. It is quite funny to see them. With this script you can:

  • Reproduce a random single drawing
  • Create a 9×9 mosaic of random drawings
  • Create an animation simulating the way the drawing was created

I use the ggplot2 package to render the drawings and David Robinson’s gganimate package to create the animations.
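For a flavour of the approach, here is a rough, self-contained sketch (not the author's script: the file path is assumed, and the LaF and jsonlite calls plus the drawing field of the simplified format are used as I understand them):

library(LaF)        # read individual lines of a huge file
library(jsonlite)   # parse one ndjson record
library(ggplot2)
library(purrr)
library(dplyr)

file <- "The Mona Lisa.ndjson"              # path to the downloaded file (assumed)
line <- sample(determine_nlines(file), 1)   # pick one random drawing
doodle <- fromJSON(get_lines(file, line))

## Each stroke is a 2 x n matrix of x and y coordinates; stack them with a stroke id
strokes <- imap_dfr(doodle$drawing,
                    ~ tibble(x = .x[1, ], y = .x[2, ], stroke = .y))

ggplot(strokes, aes(x, -y, group = stroke)) +   # flip y so the drawing isn't upside down
  geom_path() +
  coord_equal() +
  theme_void()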

This is an example of a single drawing:

This is an example of a 3×3 mosaic:

This is an example of animation:

If you want to try by yourself, you can find the code here.

Note: to work with gganimate, I downloaded the portable version and pointed to it with Sys.setenv command as explained here.


To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin.

Time Series Deep Learning, Part 2: Predicting Sunspot Frequency with Keras LSTM In R

Sun, 07/01/2018 - 09:45

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

One of the ways Deep Learning can be used in business is to improve the accuracy of time series forecasts (prediction). We recently showed how Long Short-Term Memory (LSTM) models developed with the Keras library in R could be used to take advantage of autocorrelation to predict the next 10 years of monthly Sunspots (a solar phenomenon that’s tracked by NASA). In this article, we teamed up with RStudio to take another look at the Sunspots data set, this time implementing some really advanced Deep Learning functionality available with TensorFlow for R. Sigrid Keydana, TF Developer Advocate at RStudio, put together an amazing Deep Learning tutorial using keras for implementing Keras in R and tfruns, a suite of tools for tracking, visualizing, and managing TensorFlow training runs and experiments from R. Sounds amazing, right? It is! Let’s get started with this Deep Learning Tutorial!

Related Articles In This Series

Learning Trajectory

In this DEEP LEARNING TUTORIAL, you will learn how to apply deep learning to time series forecasting and how to properly apply cross validation in this domain.

In fact, one of the coolest things you’ll develop is this plot of backtested LSTM forecasts.

Backtested LSTM Forecasts

Time Series Deep Learning In Business

Introduction by Matt Dancho, Founder of Business Science

Time Series Forecasting is a key area that can lead to Return On Investment (ROI) in a business. Think about this: A 10% improvement in forecast accuracy can save an organization millions of dollars. How is this possible? Let’s find out.

We’ll take NVIDIA, a semiconductor manufacturer that manufactures state-of-the-art chips for Artificial Intelligence (AI) and Deep Learning (DL), as an example. NVIDIA builds Graphics Processing Units or GPUs, which are necessary for the computational intensity resulting from the massive number of numerical calculations required in high-performance Deep Learning. The chips look like this.

Source: NVIDIA USA

Like all manufacturers, NVIDIA needs to forecast demand for their products. Why? So they can build the right amount of chips to supply their customers. This forecast is critical and it takes a lot of skill and some luck to get this right.

What we are talking about is the Sales Forecast, which drives every manufacturing decision that NVIDIA makes. This includes how much raw material to purchase, how many people to have on staff to build the chips, and how much machining and assembly operations to budget. The more error in the sales forecast, the more unnecessary cost NVIDIA experiences because all of these activities (supply chain, inventory management, financial planning, etc) get thrown off!

Time Series Deep Learning For Business

Time Series Deep Learning is amazingly accurate on data that has a high presence of autocorrelation because algorithms including LSTMs and GRUs can learn information from sequences regardless of when the pattern occurred. These special RNNs are designed to have a long memory, meaning that they excel at learning patterns from both recent observations and observations that occurred long ago. This makes them perfect for time series! But are they great for sales data? Maybe. Let’s discuss.

Sales data comes in all flavors, but often it has seasonal patterns and trend. The trend can be flat, linear, exponential, etc. This is typically not where the LSTM will have success, but other traditional forecasting methods can detect trend. However, seasonality is different. Seasonality in sales data is a pattern that can happen at multiple frequencies (annual, quarterly, monthly, weekly, and even daily). The LSTM is great at detecting seasonality because it often has autocorrelation. Therefore, LSTMs and GRUs can be great options to help improve seasonality detection, and therefore reduce overall forecast error in the Sales Forecast.

About The Authors

This DEEP LEARNING TUTORIAL was a combined effort: Sigrid Keydana and Matt Dancho.

Sigrid is the TensorFlow Developer Advocate at RStudio, where she develops amazing deep learning applications using the R TensorFlow API. She’s passionate about deep learning, machine learning and statistics, R, Linux and functional programming. In a short period of time, I’ve developed a tremendous respect for Sigrid. You’ll see why when you walk through the R TensorFlow code in this tutorial.

You know me (Matt). I’m the Founder of Business Science where I strive for one mission: To empower Data scientists interested in Business and Finance.

Deep Learning for Time Series Forecasting: Predicting Sunspot Frequency with Keras

By Sigrid Keydana, TensorFlow Developer Advocate at RStudio,
And Matt Dancho, Founder of Business Science

Forecasting sunspots with deep learning

In this post we will examine making time series predictions using the sunspots dataset that ships with base R. Sunspots are dark spots on the sun, associated with lower temperature. Here’s an image from NASA showing the solar phenomenon.

Source: NASA

We’re using the monthly version of the dataset, sunspot.month (there is a yearly version, too).
It contains 265 years’ worth of data (from 1749 through 2013) on the number of sunspots per month.

Forecasting this dataset is challenging because of high short term variability as well as long-term irregularities evident in the cycles. For example, maximum amplitudes reached by the low frequency cycle differ a lot, as does the number of high frequency cycle steps needed to reach that maximum low frequency cycle height.

Our post will focus on two dominant aspects: how to apply deep learning to time series forecasting, and how to properly apply cross validation in this domain.
For the latter, we will use the rsample package, which allows us to do resampling on time series data.
As to the former, our goal is not to reach utmost performance but to show the general course of action when using recurrent neural networks to model this kind of data.

Recurrent neural networks

When our data has a sequential structure, it is recurrent neural networks (RNNs) we use to model it.

As of today, among RNNs, the best established architectures are the GRU (Gated Recurrent Unit) and the LSTM (Long Short Term Memory). For today, let’s not zoom in on what makes them special, but on what they have in common with the most stripped-down RNN: the basic recurrence structure.

In contrast to the prototype of a neural network, often called Multilayer Perceptron (MLP), the RNN has a state that is carried on over time. This is nicely seen in this diagram from Goodfellow et al., a.k.a. the “bible of deep learning”:

At each time, the state is a combination of the current input and the previous hidden state. This is reminiscent of autoregressive models, but with neural networks, there has to be some point where we halt the dependence.

That’s because in order to determine the weights, we keep calculating how our loss changes as the input changes.
Now if the input we have to consider, at an arbitrary timestep, ranges back indefinitely – then we will not be able to calculate all those gradients.
In practice, then, our hidden state will, at every iteration, be carried forward through a fixed number of steps.

We’ll come back to that as soon as we’ve loaded and pre-processed the data.

Setup, pre-processing, and exploration

Libraries

Here, first, are the libraries needed for this tutorial.

# Core Tidyverse
library(tidyverse)
library(glue)
library(forcats)

# Time Series
library(timetk)
library(tidyquant)
library(tibbletime)

# Visualization
library(cowplot)

# Preprocessing
library(recipes)

# Sampling / Accuracy
library(rsample)
library(yardstick)

# Modeling
library(keras)
library(tfruns)

If you have not previously run Keras in R, you will need to install Keras using the install_keras() function.

# Install Keras if you have not installed before
install_keras()

Data

sunspot.month is a ts class (not tidy), so we’ll convert to a tidy data set using the tk_tbl() function from timetk. We use this instead of as.tibble() from tibble to automatically preserve the time series index as a zoo yearmon index. Last, we’ll convert the zoo index to date using lubridate::as_date() (loaded with tidyquant) and then change to a tbl_time object to make time series operations easier.

sun_spots <- datasets::sunspot.month %>%
  tk_tbl() %>%
  mutate(index = as_date(index)) %>%
  as_tbl_time(index = index)

sun_spots

# A time tibble: 3,177 x 2
# Index: index
   index      value
 1 1749-01-01  58
 2 1749-02-01  62.6
 3 1749-03-01  70
 4 1749-04-01  55.7
 5 1749-05-01  85
 6 1749-06-01  83.5
 7 1749-07-01  94.8
 8 1749-08-01  66.3
 9 1749-09-01  75.9
10 1749-10-01  75.5
# ... with 3,167 more rows

Exploratory data analysis

The time series is long (265 years!). We can visualize the time series both in full, and zoomed in on the first 10 years to get a feel for the series.

Visualizing sunspot data with cowplot

We’ll make two ggplots and combine them using cowplot::plot_grid(). Note that for the zoomed in plot, we make use of tibbletime::time_filter(), which is an easy way to perform time-based filtering.

p1 <- sun_spots %>%
  ggplot(aes(index, value)) +
  geom_point(color = palette_light()[[1]], alpha = 0.5) +
  theme_tq() +
  labs(
    title = "From 1749 to 2013 (Full Data Set)"
  )

p2 <- sun_spots %>%
  filter_time("start" ~ "1800") %>%
  ggplot(aes(index, value)) +
  geom_line(color = palette_light()[[1]], alpha = 0.5) +
  geom_point(color = palette_light()[[1]]) +
  geom_smooth(method = "loess", span = 0.2, se = FALSE) +
  theme_tq() +
  labs(
    title = "1749 to 1759 (Zoomed In To Show Changes over the Year)",
    caption = "datasets::sunspot.month"
  )

p_title <- ggdraw() +
  draw_label("Sunspots", size = 18, fontface = "bold", colour = palette_light()[[1]])

plot_grid(p_title, p1, p2, ncol = 1, rel_heights = c(0.1, 1, 1))

Backtesting: time series cross validation

When doing cross validation on sequential data, the time dependencies on preceding samples must be preserved. We can create a cross validation sampling plan by offsetting the window used to select sequential sub-samples. In essence, we’re creatively dealing with the fact that there’s no future test data available by creating multiple synthetic “futures” – a process often, esp. in finance, called “backtesting”.

As mentioned in the introduction, the rsample package includes facilities for backtesting on time series. The vignette, “Time Series Analysis Example”, describes a procedure that uses the rolling_origin() function to create samples designed for time series cross validation. We’ll use this approach.

Developing a backtesting strategy

The sampling plan we create uses 100 years (initial = 12 x 100 samples) for the training set and 50 years (assess = 12 x 50) for the testing (validation) set. We select a skip span of about 22 years (skip = 12 x 22 - 1) to approximately evenly distribute the samples into 6 sets that span the entire 265 years of sunspots history. Last, we select cumulative = FALSE to allow the origin to shift, which ensures that models on more recent data are not given an unfair advantage (more observations) over those operating on less recent data. The returned tibble contains the rolling_origin_resamples.

periods_train <- 12 * 100
periods_test  <- 12 * 50
skip_span     <- 12 * 22 - 1

rolling_origin_resamples <- rolling_origin(
  sun_spots,
  initial    = periods_train,
  assess     = periods_test,
  cumulative = FALSE,
  skip       = skip_span
)

rolling_origin_resamples

# Rolling origin forecast resampling
# A tibble: 6 x 2
  splits id
1        Slice1
2        Slice2
3        Slice3
4        Slice4
5        Slice5
6        Slice6

Visualizing the backtesting strategy

We can visualize the resamples with two custom functions. The first, plot_split(), plots one of the resampling splits using ggplot2. Note that an expand_y_axis argument is added to expand the date range to the full sun_spots dataset date range. This will become useful when we visualize all plots together.

# Plotting function for a single split
plot_split <- function(split, expand_y_axis = TRUE,
                       alpha = 1, size = 1, base_size = 14) {

  # Manipulate data
  train_tbl <- training(split) %>%
    add_column(key = "training")

  test_tbl <- testing(split) %>%
    add_column(key = "testing")

  data_manipulated <- bind_rows(train_tbl, test_tbl) %>%
    as_tbl_time(index = index) %>%
    mutate(key = fct_relevel(key, "training", "testing"))

  # Collect attributes
  train_time_summary <- train_tbl %>%
    tk_index() %>%
    tk_get_timeseries_summary()

  test_time_summary <- test_tbl %>%
    tk_index() %>%
    tk_get_timeseries_summary()

  # Visualize
  g <- data_manipulated %>%
    ggplot(aes(x = index, y = value, color = key)) +
    geom_line(size = size, alpha = alpha) +
    theme_tq(base_size = base_size) +
    scale_color_tq() +
    labs(
      title    = glue("Split: {split$id}"),
      subtitle = glue("{train_time_summary$start} to {test_time_summary$end}"),
      y = "", x = ""
    ) +
    theme(legend.position = "none")

  if (expand_y_axis) {

    sun_spots_time_summary <- sun_spots %>%
      tk_index() %>%
      tk_get_timeseries_summary()

    g <- g +
      scale_x_date(limits = c(sun_spots_time_summary$start,
                              sun_spots_time_summary$end))
  }

  return(g)
}

The plot_split() function takes one split (in this case Slice1), and returns a visual of the sampling strategy. We expand the axis to the range of the full dataset using expand_y_axis = TRUE.

rolling_origin_resamples$splits[[1]] %>%
  plot_split(expand_y_axis = TRUE) +
  theme(legend.position = "bottom")

The second function, plot_sampling_plan(), scales the plot_split() function to all of the samples using purrr and cowplot.

# Plotting function that scales to all splits
plot_sampling_plan <- function(sampling_tbl, expand_y_axis = TRUE,
                               ncol = 3, alpha = 1, size = 1, base_size = 14,
                               title = "Sampling Plan") {

  # Map plot_split() to sampling_tbl
  sampling_tbl_with_plots <- sampling_tbl %>%
    mutate(gg_plots = map(splits, plot_split,
                          expand_y_axis = expand_y_axis,
                          alpha = alpha, base_size = base_size))

  # Make plots with cowplot
  plot_list <- sampling_tbl_with_plots$gg_plots

  p_temp <- plot_list[[1]] + theme(legend.position = "bottom")
  legend <- get_legend(p_temp)

  p_body  <- plot_grid(plotlist = plot_list, ncol = ncol)
  p_title <- ggdraw() +
    draw_label(title, size = 14, fontface = "bold", colour = palette_light()[[1]])

  g <- plot_grid(p_title, p_body, legend,
                 ncol = 1, rel_heights = c(0.05, 1, 0.05))

  return(g)
}

We can now visualize the entire backtesting strategy with plot_sampling_plan(). We can see how the sampling plan shifts the sampling window with each progressive slice of the train/test splits.

rolling_origin_resamples %>%
  plot_sampling_plan(expand_y_axis = T, ncol = 3, alpha = 1, size = 1, base_size = 10,
                     title = "Backtesting Strategy: Rolling Origin Sampling Plan")

And, we can set expand_y_axis = FALSE to zoom in on the samples.

rolling_origin_resamples %>%
  plot_sampling_plan(expand_y_axis = F, ncol = 3, alpha = 1, size = 1, base_size = 10,
                     title = "Backtesting Strategy: Zoomed In")

We’ll use this backtesting strategy (6 samples from one time series, each with a 100/50 split in years and a ~22 year offset) when testing the veracity of the LSTM model on the sunspots dataset.

The LSTM model

To begin, we’ll develop an LSTM model on a single sample from the backtesting strategy, namely, the most recent slice. We’ll then apply the model to all samples to investigate modeling performance.

example_split    <- rolling_origin_resamples$splits[[6]]
example_split_id <- rolling_origin_resamples$id[[6]]

We can reuse the plot_split() function to visualize the split. Set expand_y_axis = FALSE to zoom in on the subsample.

plot_split(example_split, expand_y_axis = FALSE, size = 0.5) +
  theme(legend.position = "bottom") +
  ggtitle(glue("Split: {example_split_id}"))

Data setup

To aid hyperparameter tuning, besides the training set we also need a validation set.
For example, we will use a callback, callback_early_stopping, that stops training when no significant performance improvement is seen on the validation set (what’s considered significant is up to you).

We will dedicate 2 thirds of the analysis set to training, and 1 third to validation.

df_trn <- analysis(example_split)[1:800, , drop = FALSE]
df_val <- analysis(example_split)[801:1200, , drop = FALSE]
df_tst <- assessment(example_split)

First, let’s combine the training, validation, and testing data sets into a single data set with a column key that specifies where they came from (“training”, “validation”, or “testing”). Note that the tbl_time object will need to have the index respecified during the bind_rows() step, but this issue should be corrected in dplyr soon.

df <- bind_rows(
  df_trn %>% add_column(key = "training"),
  df_val %>% add_column(key = "validation"),
  df_tst %>% add_column(key = "testing")
) %>%
  as_tbl_time(index = index)

df

# A time tibble: 1,800 x 3
# Index: index
   index      value key
 1 1849-06-01  81.1 training
 2 1849-07-01  78   training
 3 1849-08-01  67.7 training
 4 1849-09-01  93.7 training
 5 1849-10-01  71.5 training
 6 1849-11-01  99   training
 7 1849-12-01  97   training
 8 1850-01-01  78   training
 9 1850-02-01  89.4 training
10 1850-03-01  82.6 training
# ... with 1,790 more rows

Preprocessing with recipes

The LSTM algorithm will usually work better if the input data has been centered and scaled. We can conveniently accomplish this using the recipes package. In addition to step_center and step_scale, we’re using step_sqrt to reduce variance and remove outliers. The actual transformations are executed when we bake the data according to the recipe:

rec_obj <- recipe(value ~ ., df) %>%
  step_sqrt(value) %>%
  step_center(value) %>%
  step_scale(value) %>%
  prep()

df_processed_tbl <- bake(rec_obj, df)

df_processed_tbl

# A tibble: 1,800 x 3
   index      value key
 1 1849-06-01 0.714 training
 2 1849-07-01 0.660 training
 3 1849-08-01 0.473 training
 4 1849-09-01 0.922 training
 5 1849-10-01 0.544 training
 6 1849-11-01 1.01  training
 7 1849-12-01 0.974 training
 8 1850-01-01 0.660 training
 9 1850-02-01 0.852 training
10 1850-03-01 0.739 training
# ... with 1,790 more rows

Next, let’s capture the original center and scale so we can invert the steps after modeling. The square root step can then simply be undone by squaring the back-transformed data.

center_history <- rec_obj$steps[[2]]$means["value"]
scale_history  <- rec_obj$steps[[3]]$sds["value"]

c("center" = center_history, "scale" = scale_history)

center.value  scale.value
    6.694468     3.238935

Reshaping the data

Keras LSTM expects the input as well as the target data to be in a specific shape.
The input has to be a 3-d array of size num_samples, num_timesteps, num_features.

Here, num_samples is the number of observations in the set. This will get fed to the model in portions of batch_size. The second dimension, num_timesteps, is the length of the hidden state we were talking about above. Finally, the third dimension is the number of predictors we’re using. For univariate time series, this is 1.

How long should we choose the hidden state to be? This generally depends on the dataset and our goal.
If we did one-step-ahead forecasts – thus, forecasting the following month only – our main concern would be choosing a state length that allows to learn any patterns present in the data.

Now say we wanted to forecast 12 months instead, as does SILSO, the World Data Center for the production, preservation and dissemination of the international sunspot number.
The way we can do this, with Keras, is by wiring the LSTM hidden states to sets of consecutive outputs of the same length. Thus, if we want to produce predictions for 12 months, our LSTM should have a hidden state length of 12.

These 12 time steps will then get wired to 12 linear predictor units using a time_distributed() wrapper.
That wrapper’s task is to apply the same calculation (i.e., the same weight matrix) to every state input it receives.

Now, what’s the target array’s format supposed to be? As we’re forecasting several timesteps here, the target data again needs to be 3-dimensional. Dimension 1 again is the batch dimension, dimension 2 again corresponds to the number of timesteps (the forecasted ones), and dimension 3 is the size of the wrapped layer.
In our case, the wrapped layer is a layer_dense() of a single unit, as we want exactly one prediction per point in time.

So, let’s reshape the data. The main action here is creating the sliding windows of 12 steps of input, followed by 12 steps of output each. This is easiest to understand with a shorter and simpler example. Say our input were the numbers from 1 to 10, and our chosen sequence length (state size) were 4. This is how we would want our training input to look:

1,2,3,4
2,3,4,5
3,4,5,6

And our target data, correspondingly:

5,6,7,8
6,7,8,9
7,8,9,10

We’ll define a short function that does this reshaping on a given dataset.
Then finally, we add the third axis that is formally needed (even though that axis is of size 1 in our case).

# these variables are being defined just because of the order in which
# we present things in this post (first the data, then the model)
# they will be superseded by FLAGS$n_timesteps, FLAGS$batch_size and n_predictions
# in the following snippet
n_timesteps <- 12
n_predictions <- n_timesteps
batch_size <- 10

# functions used
build_matrix <- function(tseries, overall_timesteps) {
  t(sapply(1:(length(tseries) - overall_timesteps + 1),
           function(x) tseries[x:(x + overall_timesteps - 1)]))
}

reshape_X_3d <- function(X) {
  dim(X) <- c(dim(X)[1], dim(X)[2], 1)
  X
}

# extract values from data frame
train_vals <- df_processed_tbl %>%
  filter(key == "training") %>%
  select(value) %>%
  pull()
valid_vals <- df_processed_tbl %>%
  filter(key == "validation") %>%
  select(value) %>%
  pull()
test_vals <- df_processed_tbl %>%
  filter(key == "testing") %>%
  select(value) %>%
  pull()

# build the windowed matrices
train_matrix <- build_matrix(train_vals, n_timesteps + n_predictions)
valid_matrix <- build_matrix(valid_vals, n_timesteps + n_predictions)
test_matrix  <- build_matrix(test_vals, n_timesteps + n_predictions)

# separate matrices into training and testing parts
# also, discard last batch if there are fewer than batch_size samples
# (a purely technical requirement)
X_train <- train_matrix[, 1:n_timesteps]
y_train <- train_matrix[, (n_timesteps + 1):(n_timesteps * 2)]
X_train <- X_train[1:(nrow(X_train) %/% batch_size * batch_size), ]
y_train <- y_train[1:(nrow(y_train) %/% batch_size * batch_size), ]

X_valid <- valid_matrix[, 1:n_timesteps]
y_valid <- valid_matrix[, (n_timesteps + 1):(n_timesteps * 2)]
X_valid <- X_valid[1:(nrow(X_valid) %/% batch_size * batch_size), ]
y_valid <- y_valid[1:(nrow(y_valid) %/% batch_size * batch_size), ]

X_test <- test_matrix[, 1:n_timesteps]
y_test <- test_matrix[, (n_timesteps + 1):(n_timesteps * 2)]
X_test <- X_test[1:(nrow(X_test) %/% batch_size * batch_size), ]
y_test <- y_test[1:(nrow(y_test) %/% batch_size * batch_size), ]

# add on the required third axis
X_train <- reshape_X_3d(X_train)
X_valid <- reshape_X_3d(X_valid)
X_test  <- reshape_X_3d(X_test)

y_train <- reshape_X_3d(y_train)
y_valid <- reshape_X_3d(y_valid)
y_test  <- reshape_X_3d(y_test)

Building the LSTM model
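As a quick sanity check (not in the original post), applying build_matrix() to the toy example above, the numbers 1 to 10 with a window of 4 + 4, reproduces exactly the input and target rows we listed:

toy <- build_matrix(1:10, overall_timesteps = 8)
toy[, 1:4]  # inputs:  1-4, 2-5, 3-6
toy[, 5:8]  # targets: 5-8, 6-9, 7-10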

Now that we have our data in the required form, let’s finally build the model.
As always in deep learning, an important, and often time-consuming, part of the job is tuning hyperparameters. To keep this post self-contained, and considering this is primarily a tutorial on how to use LSTM in R, let’s assume the following settings were found after extensive experimentation (in reality experimentation did take place, but not to a degree that performance couldn’t possibly be improved).

Instead of hard coding the hyperparameters, we’ll use tfruns to set up an environment where we could easily perform grid search.

We’ll quickly comment on what these parameters do but mainly leave those topics to further posts.

FLAGS <- flags(
  # There is a so-called "stateful LSTM" in Keras. While LSTM is stateful per se,
  # this adds a further tweak where the hidden states get initialized with values
  # from the item at same position in the previous batch.
  # This is helpful just under specific circumstances, or if you want to create an
  # "infinite stream" of states, in which case you'd use 1 as the batch size.
  # Below, we show how the code would have to be changed to use this, but it won't be
  # further discussed here.
  flag_boolean("stateful", FALSE),
  # Should we use several layers of LSTM?
  # Again, just included for completeness, it did not yield any superior performance
  # on this task. This will actually stack exactly one additional layer of LSTM units.
  flag_boolean("stack_layers", FALSE),
  # number of samples fed to the model in one go
  flag_integer("batch_size", 10),
  # size of the hidden state, equals size of predictions
  flag_integer("n_timesteps", 12),
  # how many epochs to train for
  flag_integer("n_epochs", 100),
  # fraction of the units to drop for the linear transformation of the inputs
  flag_numeric("dropout", 0.2),
  # fraction of the units to drop for the linear transformation of the recurrent state
  flag_numeric("recurrent_dropout", 0.2),
  # loss function. Found to work better for this specific case than mean squared error
  flag_string("loss", "logcosh"),
  # optimizer = stochastic gradient descent. Seemed to work better than adam or rmsprop
  # here (as indicated by limited testing)
  flag_string("optimizer_type", "sgd"),
  # size of the LSTM layer
  flag_integer("n_units", 128),
  # learning rate
  flag_numeric("lr", 0.003),
  # momentum, an additional parameter to the SGD optimizer
  flag_numeric("momentum", 0.9),
  # parameter to the early stopping callback
  flag_integer("patience", 10)
)

# the number of predictions we'll make equals the length of the hidden state
n_predictions <- FLAGS$n_timesteps
# how many features = predictors we have
n_features <- 1

# just in case we wanted to try different optimizers, we could add here
optimizer <- switch(FLAGS$optimizer_type,
                    sgd = optimizer_sgd(lr = FLAGS$lr, momentum = FLAGS$momentum))

# callbacks to be passed to the fit() function
# We just use one here: we may stop before n_epochs if the loss on the validation set
# does not decrease (by a configurable amount, over a configurable time)
callbacks <- list(
  callback_early_stopping(patience = FLAGS$patience)
)

After all these preparations, the code for constructing and training the model is rather short!
Let’s first quickly view the “long version”, which would allow you to test stacking several LSTMs or using a stateful LSTM, and then go through the final short version (which does neither) and comment on it.

This, just for reference, is the complete code.

model <- keras_model_sequential() model %>% layer_lstm( units = FLAGS$n_units, batch_input_shape = c(FLAGS$batch_size, FLAGS$n_timesteps, n_features), dropout = FLAGS$dropout, recurrent_dropout = FLAGS$recurrent_dropout, return_sequences = TRUE, stateful = FLAGS$stateful ) if (FLAGS$stack_layers) { model %>% layer_lstm( units = FLAGS$n_units, dropout = FLAGS$dropout, recurrent_dropout = FLAGS$recurrent_dropout, return_sequences = TRUE, stateful = FLAGS$stateful ) } model %>% time_distributed(layer_dense(units = 1)) model %>% compile( loss = FLAGS$loss, optimizer = optimizer, metrics = list("mean_squared_error") ) if (!FLAGS$stateful) { model %>% fit( x = X_train, y = y_train, validation_data = list(X_valid, y_valid), batch_size = FLAGS$batch_size, epochs = FLAGS$n_epochs, callbacks = callbacks ) } else { for (i in 1:FLAGS$n_epochs) { model %>% fit( x = X_train, y = y_train, validation_data = list(X_valid, y_valid), callbacks = callbacks, batch_size = FLAGS$batch_size, epochs = 1, shuffle = FALSE ) model %>% reset_states() } } if (FLAGS$stateful) model %>% reset_states()

Now let’s step through the simpler configuration below, which performs equally well or better.

# create the model model <- keras_model_sequential() # add layers # we have just two, the LSTM and the time_distributed model %>% layer_lstm( units = FLAGS$n_units, # the first layer in a model needs to know the shape of the input data batch_input_shape = c(FLAGS$batch_size, FLAGS$n_timesteps, n_features), dropout = FLAGS$dropout, recurrent_dropout = FLAGS$recurrent_dropout, # by default, an LSTM just returns the final state return_sequences = TRUE ) %>% time_distributed(layer_dense(units = 1)) model %>% compile( loss = FLAGS$loss, optimizer = optimizer, # in addition to the loss, Keras will inform us about current MSE while training metrics = list("mean_squared_error") ) history <- model %>% fit( x = X_train, y = y_train, validation_data = list(X_valid, y_valid), batch_size = FLAGS$batch_size, epochs = FLAGS$n_epochs, callbacks = callbacks )

As we see, training was stopped after ~55 epochs as validation loss did not decrease any more.
We also see that performance on the validation set is way worse than performance on the training set – normally indicating overfitting.

This topic too, we’ll leave to a separate discussion another time, but interestingly regularization using higher values of dropout and recurrent_dropout (combined with increasing model capacity) did not yield better generalization performance. This is probably related to the characteristics of this specific time series we mentioned in the introduction.

plot(history, metrics = "loss")

Now let’s see how well the model was able to capture the characteristics of the training set.

pred_train <- model %>% predict(X_train, batch_size = FLAGS$batch_size) %>% .[, , 1] # Retransform values to original scale pred_train <- (pred_train * scale_history + center_history) ^2 compare_train <- df %>% filter(key == "training") # build a dataframe that has both actual and predicted values for (i in 1:nrow(pred_train)) { varname <- paste0("pred_train", i) compare_train <- mutate(compare_train,!!varname := c( rep(NA, FLAGS$n_timesteps + i - 1), pred_train[i,], rep(NA, nrow(compare_train) - FLAGS$n_timesteps * 2 - i + 1) )) }

We compute the average RMSE over all sequences of predictions.

coln <- colnames(compare_train)[4:ncol(compare_train)] cols <- map(coln, quo(sym(.))) rsme_train <- map_dbl(cols, function(col) rmse( compare_train, truth = value, estimate = !!col, na.rm = TRUE )) %>% mean() rsme_train 21.01495

How do these predictions really look? As a visualization of all predicted sequences would look pretty crowded, we arbitrarily pick start points at regular intervals.

ggplot(compare_train, aes(x = index, y = value)) + geom_line() + geom_line(aes(y = pred_train1), color = "cyan") + geom_line(aes(y = pred_train50), color = "red") + geom_line(aes(y = pred_train100), color = "green") + geom_line(aes(y = pred_train150), color = "violet") + geom_line(aes(y = pred_train200), color = "cyan") + geom_line(aes(y = pred_train250), color = "red") + geom_line(aes(y = pred_train300), color = "red") + geom_line(aes(y = pred_train350), color = "green") + geom_line(aes(y = pred_train400), color = "cyan") + geom_line(aes(y = pred_train450), color = "red") + geom_line(aes(y = pred_train500), color = "green") + geom_line(aes(y = pred_train550), color = "violet") + geom_line(aes(y = pred_train600), color = "cyan") + geom_line(aes(y = pred_train650), color = "red") + geom_line(aes(y = pred_train700), color = "red") + geom_line(aes(y = pred_train750), color = "green") + ggtitle("Predictions on the training set")

This looks pretty good. From the validation loss, we don’t quite expect the same from the test set, though.

Let’s see.

pred_test <- model %>% predict(X_test, batch_size = FLAGS$batch_size) %>% .[, , 1] # Retransform values to original scale pred_test <- (pred_test * scale_history + center_history) ^2 pred_test[1:10, 1:5] %>% print() compare_test <- df %>% filter(key == "testing") # build a dataframe that has both actual and predicted values for (i in 1:nrow(pred_test)) { varname <- paste0("pred_test", i) compare_test <- mutate(compare_test,!!varname := c( rep(NA, FLAGS$n_timesteps + i - 1), pred_test[i,], rep(NA, nrow(compare_test) - FLAGS$n_timesteps * 2 - i + 1) )) } compare_test %>% write_csv(str_replace(model_path, ".hdf5", ".test.csv")) compare_test[FLAGS$n_timesteps:(FLAGS$n_timesteps + 10), c(2, 4:8)] %>% print() coln <- colnames(compare_test)[4:ncol(compare_test)] cols <- map(coln, quo(sym(.))) rsme_test <- map_dbl(cols, function(col) rmse( compare_test, truth = value, estimate = !!col, na.rm = TRUE )) %>% mean() rsme_test 31.31616 ggplot(compare_test, aes(x = index, y = value)) + geom_line() + geom_line(aes(y = pred_test1), color = "cyan") + geom_line(aes(y = pred_test50), color = "red") + geom_line(aes(y = pred_test100), color = "green") + geom_line(aes(y = pred_test150), color = "violet") + geom_line(aes(y = pred_test200), color = "cyan") + geom_line(aes(y = pred_test250), color = "red") + geom_line(aes(y = pred_test300), color = "green") + geom_line(aes(y = pred_test350), color = "cyan") + geom_line(aes(y = pred_test400), color = "red") + geom_line(aes(y = pred_test450), color = "green") + geom_line(aes(y = pred_test500), color = "cyan") + geom_line(aes(y = pred_test550), color = "violet") + ggtitle("Predictions on test set")

That’s not as good as on the training set, but not bad either, given this time series is quite challenging.

Having defined and run our model on a manually chosen example split, let’s now revert to our overall re-sampling frame.

Backtesting the model on all splits

To obtain predictions on all splits, we move the above code into a function and apply it to all splits.
First, here’s the function. It returns a list of two data frames, one each for the training and test sets, containing the model’s predictions together with the actual values.

obtain_predictions <- function(split) { df_trn <- analysis(split)[1:800, , drop = FALSE] df_val <- analysis(split)[801:1200, , drop = FALSE] df_tst <- assessment(split) df <- bind_rows( df_trn %>% add_column(key = "training"), df_val %>% add_column(key = "validation"), df_tst %>% add_column(key = "testing") ) %>% as_tbl_time(index = index) rec_obj <- recipe(value ~ ., df) %>% step_sqrt(value) %>% step_center(value) %>% step_scale(value) %>% prep() df_processed_tbl <- bake(rec_obj, df) center_history <- rec_obj$steps[[2]]$means["value"] scale_history <- rec_obj$steps[[3]]$sds["value"] FLAGS <- flags( flag_boolean("stateful", FALSE), flag_boolean("stack_layers", FALSE), flag_integer("batch_size", 10), flag_integer("n_timesteps", 12), flag_integer("n_epochs", 100), flag_numeric("dropout", 0.2), flag_numeric("recurrent_dropout", 0.2), flag_string("loss", "logcosh"), flag_string("optimizer_type", "sgd"), flag_integer("n_units", 128), flag_numeric("lr", 0.003), flag_numeric("momentum", 0.9), flag_integer("patience", 10) ) n_predictions <- FLAGS$n_timesteps n_features <- 1 optimizer <- switch(FLAGS$optimizer_type, sgd = optimizer_sgd(lr = FLAGS$lr, momentum = FLAGS$momentum)) callbacks <- list( callback_early_stopping(patience = FLAGS$patience) ) train_vals <- df_processed_tbl %>% filter(key == "training") %>% select(value) %>% pull() valid_vals <- df_processed_tbl %>% filter(key == "validation") %>% select(value) %>% pull() test_vals <- df_processed_tbl %>% filter(key == "testing") %>% select(value) %>% pull() train_matrix <- build_matrix(train_vals, FLAGS$n_timesteps + n_predictions) valid_matrix <- build_matrix(valid_vals, FLAGS$n_timesteps + n_predictions) test_matrix <- build_matrix(test_vals, FLAGS$n_timesteps + n_predictions) X_train <- train_matrix[, 1:FLAGS$n_timesteps] y_train <- train_matrix[, (FLAGS$n_timesteps + 1):(FLAGS$n_timesteps * 2)] X_train <- X_train[1:(nrow(X_train) %/% FLAGS$batch_size * FLAGS$batch_size),] y_train <- y_train[1:(nrow(y_train) %/% FLAGS$batch_size * FLAGS$batch_size),] X_valid <- valid_matrix[, 1:FLAGS$n_timesteps] y_valid <- valid_matrix[, (FLAGS$n_timesteps + 1):(FLAGS$n_timesteps * 2)] X_valid <- X_valid[1:(nrow(X_valid) %/% FLAGS$batch_size * FLAGS$batch_size),] y_valid <- y_valid[1:(nrow(y_valid) %/% FLAGS$batch_size * FLAGS$batch_size),] X_test <- test_matrix[, 1:FLAGS$n_timesteps] y_test <- test_matrix[, (FLAGS$n_timesteps + 1):(FLAGS$n_timesteps * 2)] X_test <- X_test[1:(nrow(X_test) %/% FLAGS$batch_size * FLAGS$batch_size),] y_test <- y_test[1:(nrow(y_test) %/% FLAGS$batch_size * FLAGS$batch_size),] X_train <- reshape_X_3d(X_train) X_valid <- reshape_X_3d(X_valid) X_test <- reshape_X_3d(X_test) y_train <- reshape_X_3d(y_train) y_valid <- reshape_X_3d(y_valid) y_test <- reshape_X_3d(y_test) model <- keras_model_sequential() model %>% layer_lstm( units = FLAGS$n_units, batch_input_shape = c(FLAGS$batch_size, FLAGS$n_timesteps, n_features), dropout = FLAGS$dropout, recurrent_dropout = FLAGS$recurrent_dropout, return_sequences = TRUE ) %>% time_distributed(layer_dense(units = 1)) model %>% compile( loss = FLAGS$loss, optimizer = optimizer, metrics = list("mean_squared_error") ) model %>% fit( x = X_train, y = y_train, validation_data = list(X_valid, y_valid), batch_size = FLAGS$batch_size, epochs = FLAGS$n_epochs, callbacks = callbacks ) pred_train <- model %>% predict(X_train, batch_size = FLAGS$batch_size) %>% .[, , 1] # Retransform values pred_train <- (pred_train * scale_history + center_history) ^ 2 compare_train <- df %>% filter(key == "training") 
for (i in 1:nrow(pred_train)) { varname <- paste0("pred_train", i) compare_train <- mutate(compare_train, !!varname := c( rep(NA, FLAGS$n_timesteps + i - 1), pred_train[i, ], rep(NA, nrow(compare_train) - FLAGS$n_timesteps * 2 - i + 1) )) } pred_test <- model %>% predict(X_test, batch_size = FLAGS$batch_size) %>% .[, , 1] # Retransform values pred_test <- (pred_test * scale_history + center_history) ^ 2 compare_test <- df %>% filter(key == "testing") for (i in 1:nrow(pred_test)) { varname <- paste0("pred_test", i) compare_test <- mutate(compare_test, !!varname := c( rep(NA, FLAGS$n_timesteps + i - 1), pred_test[i, ], rep(NA, nrow(compare_test) - FLAGS$n_timesteps * 2 - i + 1) )) } list(train = compare_train, test = compare_test) }

Mapping the function over all splits yields a list of predictions.

all_split_preds <- rolling_origin_resamples %>% mutate(predict = map(splits, obtain_predictions))

Calculate RMSE on all splits:

calc_rmse <- function(df) { coln <- colnames(df)[4:ncol(df)] cols <- map(coln, quo(sym(.))) map_dbl(cols, function(col) rmse( df, truth = value, estimate = !!col, na.rm = TRUE )) %>% mean() } all_split_preds <- all_split_preds %>% unnest(predict) all_split_preds_train <- all_split_preds[seq(1, 11, by = 2), ] all_split_preds_test <- all_split_preds[seq(2, 12, by = 2), ] all_split_rmses_train <- all_split_preds_train %>% mutate(rmse = map_dbl(predict, calc_rmse)) %>% select(id, rmse) all_split_rmses_test <- all_split_preds_test %>% mutate(rmse = map_dbl(predict, calc_rmse)) %>% select(id, rmse)

How does it look? Here’s the RMSE for the 6 splits, first on the training sets and then on the test sets.

all_split_rmses_train
# A tibble: 6 x 2
  id      rmse
1 Slice1  22.2
2 Slice2  20.9
3 Slice3  18.8
4 Slice4  23.5
5 Slice5  22.1
6 Slice6  21.1

all_split_rmses_test
# A tibble: 6 x 2
  id      rmse
1 Slice1  21.6
2 Slice2  20.6
3 Slice3  21.3
4 Slice4  31.4
5 Slice5  35.2
6 Slice6  31.4

Looking at these numbers, we see something interesting: Generalization performance is much better for the first three slices of the time series than for the later ones. This confirms our impression, stated above, that there seems to be some hidden development going on, rendering forecasting more difficult.
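To see that gap at a glance, the two tibbles above can be joined per slice (a small additional sketch, not part of the code above):

library(dplyr)

# put the training and test RMSE for each slice side by side
all_split_rmses_train %>%
  rename(rmse_train = rmse) %>%
  left_join(rename(all_split_rmses_test, rmse_test = rmse), by = "id") %>%
  mutate(gap = rmse_test - rmse_train)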

And here are visualizations of the predictions on the respective training and test sets.

First, the training sets:

plot_train <- function(slice, name) { ggplot(slice, aes(x = index, y = value)) + geom_line() + geom_line(aes(y = pred_train1), color = "cyan") + geom_line(aes(y = pred_train50), color = "red") + geom_line(aes(y = pred_train100), color = "green") + geom_line(aes(y = pred_train150), color = "violet") + geom_line(aes(y = pred_train200), color = "cyan") + geom_line(aes(y = pred_train250), color = "red") + geom_line(aes(y = pred_train300), color = "red") + geom_line(aes(y = pred_train350), color = "green") + geom_line(aes(y = pred_train400), color = "cyan") + geom_line(aes(y = pred_train450), color = "red") + geom_line(aes(y = pred_train500), color = "green") + geom_line(aes(y = pred_train550), color = "violet") + geom_line(aes(y = pred_train600), color = "cyan") + geom_line(aes(y = pred_train650), color = "red") + geom_line(aes(y = pred_train700), color = "red") + geom_line(aes(y = pred_train750), color = "green") + ggtitle(name) } train_plots <- map2(all_split_preds_train$predict, all_split_preds_train$id, plot_train) p_body_train <- plot_grid(plotlist = train_plots, ncol = 3) p_title_train <- ggdraw() + draw_label("Backtested Predictions: Training Sets", size = 18, fontface = "bold") plot_grid(p_title_train, p_body_train, ncol = 1, rel_heights = c(0.05, 1, 0.05))

And the test sets:

plot_test <- function(slice, name) { ggplot(slice, aes(x = index, y = value)) + geom_line() + geom_line(aes(y = pred_test1), color = "cyan") + geom_line(aes(y = pred_test50), color = "red") + geom_line(aes(y = pred_test100), color = "green") + geom_line(aes(y = pred_test150), color = "violet") + geom_line(aes(y = pred_test200), color = "cyan") + geom_line(aes(y = pred_test250), color = "red") + geom_line(aes(y = pred_test300), color = "green") + geom_line(aes(y = pred_test350), color = "cyan") + geom_line(aes(y = pred_test400), color = "red") + geom_line(aes(y = pred_test450), color = "green") + geom_line(aes(y = pred_test500), color = "cyan") + geom_line(aes(y = pred_test550), color = "violet") + ggtitle(name) } test_plots <- map2(all_split_preds_test$predict, all_split_preds_test$id, plot_test) p_body_test <- plot_grid(plotlist = test_plots, ncol = 3) p_title_test <- ggdraw() + draw_label("Backtested Predictions: Test Sets", size = 18, fontface = "bold") plot_grid(p_title_test, p_body_test, ncol = 1, rel_heights = c(0.05, 1, 0.05))

This has been a long post, and necessarily will have left a lot of questions open, first and foremost: How do we obtain good settings for the hyperparameters (learning rate, number of epochs, dropout)?
How do we choose the length of the hidden state? Or even, can we develop an intuition about how well an LSTM will perform on a given dataset (with its specific characteristics)?
We will tackle questions like the above in upcoming posts.

Learning More

Check out our other articles on Time Series!

Business Science University

Business Science University is a revolutionary new online platform that gets you results fast.

Why learn from Business Science University? You could spend years trying to learn all of the skills required to confidently apply Data Science For Business (DS4B). Or you can take the first course in our Virtual Workshop, Data Science For Business (DS4B 201). In 10 weeks, you’ll learn:

  • A 100% ROI-driven Methodology – Everything we teach is to maximize ROI.

  • A clear, systematic plan that we’ve successfully used with clients

  • Critical thinking skills necessary to solve problems

  • Advanced technology: H2O Automated Machine Learning

  • 95% of the skills you will need when wrangling data, investigating data, building high-performance models, explaining the models, evaluating the models, and building tools with the models

You can spend years learning this information, or you can learn it in 10 weeks (at a one-chapter-per-week pace). Get started today!


Sign Up Now!

DS4B Virtual Workshop: Predicting Employee Attrition

Did you know that an organization that loses 200 high-performing employees per year is essentially losing $15M/year in lost productivity? Many organizations don’t realize this because it’s an indirect cost. It goes unnoticed. What if you could use data science to predict and explain turnover in a way that lets managers make better decisions and shows executives results? You will learn the tools to do so in our Virtual Workshop. Here’s an example of a Shiny app you will create.


Get Started Today!

Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in HR 301

Our first Data Science For Business Virtual Workshop teaches you how to solve this employee attrition problem in four courses that are fully integrated:

The Virtual Workshop is intended for intermediate and advanced R users. It’s code intensive (like these articles), but also teaches you fundamentals of data science consulting including CRISP-DM and the Business Science Problem Framework. The content bridges the gap between data science and the business, making you even more effective and improving your organization in the process.


Get Started Today!

Don’t Miss A Beat

Connect With Business Science

If you like our software (anomalize, tidyquant, tibbletime, timetk, and sweep), our courses, and our company, you can connect with us:



Missing data imputation and instrumental variables regression: the tidy approach

Sun, 07/01/2018 - 02:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)


In this blog post I will discuss missing data imputation and instrumental variables regression. This
is based on a short presentation I will give at my job. You can find the data used here on this
website: http://eclr.humanities.manchester.ac.uk/index.php/IV_in_R

The data used is from Wooldridge’s book, Econometrics: A Modern Approach.
You can download the data by clicking here.

This is the variable description:

1. inlf =1 if in labor force, 1975
2. hours hours worked, 1975
3. kidslt6 # kids < 6 years
4. kidsge6 # kids 6-18
5. age woman's age in yrs
6. educ years of schooling
7. wage estimated wage from earns., hours
8. repwage reported wage at interview in 1976
9. hushrs hours worked by husband, 1975
10. husage husband's age
11. huseduc husband's years of schooling
12. huswage husband's hourly wage, 1975
13. faminc family income, 1975
14. mtr fed. marginal tax rate facing woman
15. motheduc mother's years of schooling
16. fatheduc father's years of schooling
17. unem unem. rate in county of resid.
18. city =1 if live in SMSA
19. exper actual labor mkt exper
20. nwifeinc (faminc - wage*hours)/1000
21. lwage log(wage)
22. expersq exper^2

The goal is to first impute missing data in the data set, and then determine the impact of one added
year of education on wages. If one simply ignores missing values, bias can be introduced depending on
the missingness mechanism. The second problem here is that education is likely to be endogenous
(and thus correlated with the error term), as it is not randomly assigned. This causes biased estimates
and may lead to seriously wrong conclusions. So missingness and endogeneity should be dealt with, but
dealing with both issues is more of a programming challenge than an econometrics challenge.
Thankfully, the packages contained in the {tidyverse} as well as {mice} will save the day!

If you inspect the data, you will see that there are no missing values. So I will use the {mice}
package to first ampute the data (which means adding missing values). This, of course, is done
for educational purposes. If you’re lucky enough not to have missing values in your data, you shouldn’t
add them!

Let’s load all the packages needed:

library(tidyverse) library(AER) library(naniar) library(mice)

So first, let’s read in the data, and ampute it:

wages_data <- read_csv("http://eclr.humanities.manchester.ac.uk/images/5/5f/Mroz.csv") ## Parsed with column specification: ## cols( ## .default = col_integer(), ## wage = col_character(), ## repwage = col_double(), ## huswage = col_double(), ## mtr = col_double(), ## unem = col_double(), ## nwifeinc = col_double(), ## lwage = col_character() ## ) ## See spec(...) for full column specifications.

First, I only select the variables I want to use and convert them to the correct class:

wages_data <- wages_data %>% select(wage, educ, fatheduc, motheduc, inlf, hours, kidslt6, kidsge6, age, huswage, mtr, unem, city, exper) %>% mutate_at(vars(kidslt6, kidsge6, hours, educ, age, wage, huswage, mtr, motheduc, fatheduc, unem, exper), as.numeric) %>% mutate_at(vars(city, inlf), as.character) ## Warning in evalq(as.numeric(wage), ): NAs introduced by ## coercion

In the data, some women are not in the labour force, and thus do not have any wages; meaning they
should have a 0 there. Instead, this is represented with the following symbol: “.”. So I convert
these dots to 0. One could argue that the wages should not be 0, but that they’re truly missing.
This is true, and there are ways to deal with such questions (Heckman’s selection model for instance),
but this is not the point here.

wages_data <- wages_data %>% mutate(wage = ifelse(is.na(wage), 0, wage))

Let’s double check if there are any missing values in the data, using naniar::vis_miss():

vis_miss(wages_data)

Nope! Let’s ampute it:

wages_mis <- ampute(wages_data)$amp ## Warning: Data is made numeric because the calculation of weights requires ## numeric data

ampute() returns an object where the amp element is the amputed data. This is what I save into
the new variable wages_mis.
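As a quick sanity check (a small sketch that is not in the original post), one could count how many missing values ampute() introduced per column before moving on:

# number of missing values per column in the amputed data
colSums(is.na(wages_mis))

# overall proportion of cells that are now missing
mean(is.na(wages_mis))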

Let’s take a look:

vis_miss(wages_mis)

Ok, so now we have missing values. Let’s use the recently added mice::parlmice() function to
impute the dataset, in parallel:

imp_wages <- parlmice(data = wages_mis, m = 10, maxit = 20, cl.type = "FORK")

For reproducibility, I save these objects to disk:

write_csv(wages_mis, "wages_mis.csv") saveRDS(imp_wages, "imp_wages.rds")

As a sanity check, let’s look at the missingness pattern for the first completed dataset:

vis_miss(complete(imp_wages))

mice::parlmice() was able to impute the dataset. I imputed it 10 times, so now I have 10 imputed
datasets. If I want to estimate a model using this data, I will need to do so 10 times.
This is where the tidyverse comes into play. First, let’s combine all the 10 imputed datasets into
one long dataset, with an index to differentiate them. This is done easily with mice::complete():

imp_wages_df <- mice::complete(imp_wages, "long")

Let’s take a look at the data:

head(imp_wages_df) ## .imp .id wage educ fatheduc motheduc inlf hours kidslt6 kidsge6 age ## 1 1 1 3.3540 12 7 12 1 1610 1 0 32 ## 2 1 2 1.3889 12 7 7 1 1656 0 2 30 ## 3 1 3 4.5455 12 7 12 1 1980 0 3 35 ## 4 1 4 1.0965 12 7 7 1 456 0 3 34 ## 5 1 5 4.5918 14 14 12 1 1568 1 2 31 ## 6 1 6 4.7421 12 7 14 1 2032 0 0 54 ## huswage mtr unem city exper ## 1 4.0288 0.7215 5.0 0 14 ## 2 8.4416 0.6615 11.0 1 5 ## 3 3.5807 0.6915 5.0 0 15 ## 4 3.5417 0.7815 5.0 0 6 ## 5 10.0000 0.6215 9.5 1 14 ## 6 4.7364 0.6915 7.5 1 33

As you can see, there are two new columns, .id and .imp. A value of i in .imp means that the row comes from the
ith imputed dataset.

Because I have 0’s in my dependent variable, I will not log the wages but instead use the Inverse
Hyperbolic Sine transformation. Marc F. Bellemare wrote a nice post about
it here.

ihs <- function(x){ log(x + sqrt(x**2 + 1)) }

I can now apply this function, but first I have to group by .imp. Remember, these are 10 separate
datasets. I also create the squared experience term:

imp_wages_df <- imp_wages_df %>% group_by(.imp) %>% mutate(ihs_wage = ihs(wage), exper2 = exper**2)

Now comes some tidyverse magic. I will create a new dataset by using the nest() function from tidyr.

(imp_wages <- imp_wages_df %>% group_by(.imp) %>% nest()) ## # A tibble: 10 x 2 ## .imp data ## ## 1 1 ## 2 2 ## 3 3 ## 4 4 ## 5 5 ## 6 6 ## 7 7 ## 8 8 ## 9 9 ## 10 10

As you can see, imp_wages is now a dataset with two columns: .imp, indexing the imputed datasets,
and a column called data, where each element is itself a tibble! data is a so-called list-column.
You can read more about it on the
purrr tutorial written by
Jenny Bryan.

Estimating a model now is easy, if you’re familiar with purrr. This is how you do it:

imp_wages_reg = imp_wages %>% mutate(lin_reg = map(data, ~lm(ihs_wage ~ educ + inlf + hours + kidslt6 + kidsge6 + age + huswage + mtr + unem + city + exper + exper2, data = .)))

Ok, so what happened here? imp_wages is a data frame, so it’s possible to add a column to it
with mutate(). I call that column lin_reg and use map() on the column called data (remember,
this column is actually a list of data frame objects, and map() takes a list as an argument, and then a
function or formula) with the following formula:

~lm(ihs_wage ~ educ + inlf + hours + kidslt6 + kidsge6 + age + huswage + mtr + unem + city + exper + exper2, data = .)

This formula is nothing more than a good old linear regression. The final argument, data = ., means that
the data to be used inside lm() should be coming from the list called data, which is the second
column of imp_wages. As I’m writing these lines, I realize it is confusing as hell. But I promise
you that learning to use purrr is a bit like learning how to use a bicycle. Very difficult to explain,
but once you know how to do it, it feels super natural. Take some time to play with the lines above
to really understand what happened.
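If it helps, here is a tiny self-contained sketch of the same nest() + map() pattern on the built-in mtcars data (nothing to do with the wages data, just the mechanics):

library(tidyverse)

mtcars_nested <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%                                          # one row per group, the rest in a list-column called data
  mutate(model = map(data, ~lm(mpg ~ wt, data = .)))  # one lm() per nested data frame

mtcars_nested

# each element of the model column is an ordinary lm object
map(mtcars_nested$model, coef)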

Now, let’s take a look at the result:

imp_wages_reg ## # A tibble: 10 x 3 ## .imp data lin_reg ## ## 1 1 ## 2 2 ## 3 3 ## 4 4 ## 5 5 ## 6 6 ## 7 7 ## 8 8 ## 9 9 ## 10 10

imp_wages_reg now has a third column called lin_reg where each element is a linear model, estimated
on the data from the data column! We can now pool the results of these 10 regressions using
mice::pool():

pool_lin_reg <- pool(imp_wages_reg$lin_reg) summary(pool_lin_reg) ## estimate std.error statistic df p.value ## (Intercept) 1.2868701172 3.214473e-01 4.0033628 737.9337 6.876133e-05 ## educ 0.0385310276 8.231906e-03 4.6806931 737.9337 3.401935e-06 ## inlf 1.8845418354 5.078235e-02 37.1101707 737.9337 0.000000e+00 ## hours -0.0001164143 3.011378e-05 -3.8658143 737.9337 1.204773e-04 ## kidslt6 -0.0438925013 3.793152e-02 -1.1571510 737.9337 2.475851e-01 ## kidsge6 -0.0117978229 1.405226e-02 -0.8395678 737.9337 4.014227e-01 ## age -0.0030084595 2.666614e-03 -1.1281946 737.9337 2.596044e-01 ## huswage -0.0231736955 5.607364e-03 -4.1327255 737.9337 3.995866e-05 ## mtr -2.2109176781 3.188827e-01 -6.9333267 737.9337 8.982592e-12 ## unem 0.0028775444 5.462973e-03 0.5267360 737.9337 5.985352e-01 ## city 0.0157414671 3.633755e-02 0.4332011 737.9337 6.649953e-01 ## exper 0.0164364027 6.118875e-03 2.6861806 737.9337 7.389936e-03 ## exper2 -0.0002022602 1.916146e-04 -1.0555575 737.9337 2.915159e-01

This function averages the results from the 10 regressions and computes correct standard errors. This
is based on Rubin’s rules (Rubin, 1987, p. 76). As you can see, the linear regression indicates that
one year of added education has a positive, significant effect on log wages (they’re not log wages,
I used the IHS transformation, but log wages just sounds better than inverted hyperbolic sined wages).
This effect is almost 4%.
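For intuition, Rubin’s rules can also be written out by hand for a single coefficient. The sketch below is not part of the original analysis (and mice::pool() additionally applies a degrees-of-freedom correction skipped here); it pools the educ coefficient across the 10 models: the pooled estimate is the mean of the estimates, and the total variance adds the between-imputation variance, inflated by (1 + 1/m), to the average within-imputation variance.

# manual Rubin's rules for the educ coefficient (illustration only)
m <- length(imp_wages_reg$lin_reg)

estimates <- map_dbl(imp_wages_reg$lin_reg, ~coef(.)["educ"])
variances <- map_dbl(imp_wages_reg$lin_reg, ~vcov(.)["educ", "educ"])

q_bar <- mean(estimates)              # pooled point estimate
u_bar <- mean(variances)              # average within-imputation variance
b <- var(estimates)                   # between-imputation variance
t_total <- u_bar + (1 + 1 / m) * b    # total variance

c(estimate = q_bar, std.error = sqrt(t_total))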

But education is not randomly assigned, and as such might be endogenous. This is where instrumental
variables come into play. An instrument is a variable that impacts the dependent variable only through
the endogenous variable (here, education). For example, the parents’ education does not have
a direct impact on one’s wage, but having college-educated parents means that you are likely
college-educated yourself, and thus have a higher wage than if you only have a high school diploma.

I am thus going to instrument education with both parents’ education:

imp_wages_reg = imp_wages_reg %>% mutate(iv_reg = map(data, ~ivreg(ihs_wage ~ educ + inlf + hours + kidslt6 + kidsge6 + age + huswage + mtr + unem + city + exper + exper2 |.-educ + fatheduc + motheduc, data = .)))

The only difference from before is the formula:

~ivreg(ihs_wage ~ educ + inlf + hours + kidslt6 + kidsge6 + age + huswage + mtr + unem + city + exper + exper2 |.-educ + fatheduc + motheduc, data = .) ## ~ivreg(ihs_wage ~ educ + inlf + hours + kidslt6 + kidsge6 + age + ## huswage + mtr + unem + city + exper + exper2 | . - educ + ## fatheduc + motheduc, data = .)

Instead of lm() I use AER::ivreg() and the formula has a second part, after the | symbol. This
is where I specify that I instrument education with the parents’ education.

imp_wages_reg now looks like this:

imp_wages_reg ## # A tibble: 10 x 4 ## .imp data lin_reg iv_reg ## ## 1 1 ## 2 2 ## 3 3 ## 4 4 ## 5 5 ## 6 6 ## 7 7 ## 8 8 ## 9 9 ## 10 10

Let’s take a look at the results:

pool_iv_reg <- pool(imp_wages_reg$iv_reg) summary(pool_iv_reg) ## estimate std.error statistic df p.value ## (Intercept) 2.0091904157 5.146812e-01 3.9037568 737.9337 1.033832e-04 ## educ 0.0038859137 2.086592e-02 0.1862326 737.9337 8.523136e-01 ## inlf 1.9200207113 5.499457e-02 34.9129122 737.9337 0.000000e+00 ## hours -0.0001313866 3.157375e-05 -4.1612608 737.9337 3.537881e-05 ## kidslt6 -0.0234593391 4.000689e-02 -0.5863824 737.9337 5.577979e-01 ## kidsge6 -0.0123239220 1.422241e-02 -0.8665145 737.9337 3.864897e-01 ## age -0.0040874625 2.763340e-03 -1.4791748 737.9337 1.395203e-01 ## huswage -0.0242737100 5.706497e-03 -4.2536970 737.9337 2.373189e-05 ## mtr -2.6385172445 3.998419e-01 -6.5989008 737.9337 7.907430e-11 ## unem 0.0047331976 5.622137e-03 0.8418859 737.9337 4.001246e-01 ## city 0.0255647706 3.716783e-02 0.6878197 737.9337 4.917824e-01 ## exper 0.0180917073 6.258779e-03 2.8906127 737.9337 3.957817e-03 ## exper2 -0.0002291007 1.944599e-04 -1.1781381 737.9337 2.391213e-01

As you can see, education is not statistically significant anymore! This is why it is quite important
to think about endogeneity issues. However, it is not always very easy to find suitable instruments.
A series of tests exist to determine if you have relevant and strong instruments, but this blog post
is already long enough. I will leave this for a future blog post.
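As a teaser, though, summary() on an ivreg object accepts a diagnostics argument that reports the weak-instruments, Wu-Hausman and Sargan tests. Here is a sketch, for the model fitted on the first imputed dataset only (so not pooled across imputations):

# instrument diagnostics for the first imputed dataset (illustration only)
summary(imp_wages_reg$iv_reg[[1]], diagnostics = TRUE)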

If you found this blog post useful, you might want to follow me on twitter
for blog post updates.



Interacting with AWS from R

Sat, 06/30/2018 - 15:30

(This article was first published on Digital Age Economist on Digital Age Economist, and kindly contributed to R-bloggers)

Getting set up

If there is one realisation in life, it is the fact that you will never have enough CPU or RAM available for your analytics. Luckily for us, cloud computing is becoming cheaper and cheaper each year. One of the more established providers of cloud services is AWS. If you don’t know yet, they provide a free, yes free, option. Their t2.micro instance is a 1 CPU, 500MB machine, which doesn’t sound like much, but I am running a Rstudio and Docker instance on one of these for a small project.

The management console has the following interface:

So, how cool would it be if you could start up one of these instances from R? Well, the cloudyr project makes R a lot better at interacting with cloud-based computing infrastructure. With this in mind, I have been playing with the aws.ec2 package, which is a simple client package for the Amazon Web Services (‘AWS’) Elastic Compute Cloud (EC2) API. There is some irritating setup that has to be done, so if you want to use this package, you need to follow the instructions on the GitHub page to create the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION parameters in the ENV. But once you have figured out this step, the fun starts.

I always enjoy getting the development version of a package, so I am going to install the package straight from github:

devtools::install_github("cloudyr/aws.ec2")

Next we are going to use an Amazon Machine Image (AMI), which is a pre-built image that already contains all the necessary installations such as R and RStudio. You can build your own AMI, and I suggest you do so if you are comfortable with the Linux CLI.

Release the beast

library(aws.ec2)

# Describe the AMI (from: http://www.louisaslett.com/RStudio_AMI/)
aws.signature::locate_credentials()
image <- "ami-3b0c205e"
describe_images(image)

In the code snippet above you will notice I call the function aws.signature::locate_credentials(). I use this function to confirm my credentials. You will need to populate your own credentials after creating a user profile on the IAM management console and generating an ACCESS_KEY for use of the API. My preferred method of implementing the credentials is to add the information to the environment using usethis::edit_r_environ().

Here is my (fake) .Renviron:

AWS_ACCESS_KEY_ID=F8D6E9131F0E0CE508126 AWS_SECRET_ACCESS_KEY=AAK53148eb87db04754+f1f2c8b8cae222a2 AWS_DEFAULT_REGION=us-east-2

Now we are almost ready to test out the package and its functions, but first, I recommend you source a handy function I wrote that helps to tidy the outputs from selected functions from the aws.ec2 package.

source("https://bit.ly/2KnkdzV")

I found the list object returned from functions such as describe_images(), describe_instance() and instance_status() very verbose and difficult to work with. The tidy_describe() function is there to clean up the outputs and only return the most important information. The function also implements a pretty_print option which
will cat the output in a table to the screen for a quick overview of the information contained in the object.

Let’s use this function to see the output from describe_images() as a pretty_print. Print the aws_describe object without this handy function at your own peril.

image <- "ami-3b0c205e" aws_describe <- describe_images(image) aws_describe %>% tidy_describe(.) -------------------------------------- Summary -------------------------------------- imageId : ami-3b0c205e imageOwnerId : 732690581533 creationDate : 2017-10-17T09:28:45.000Z name : RStudio-1.1.383_R-3.4.2_Julia-0.6.0_CUDA-8_cuDNN-6_ubuntu-16.04-LTS-64bit description : Ready to run RStudio + Julia/Python server for statistical computation (www.louisaslett.com). Connect to instance public DNS in web brower (standard port 80), username rstudio and password rstudio To return as tibble: pretty_print = FALSE

Once we have confirmed that we are happy with the image, we need to save the subnet information as well as the security group information.

s <- describe_subnets() g <- describe_sgroups()

Now that you have specified those two things, you have all the pieces to spin up the machine of your choice. To have a look at what machines are available, visit the instance type webpage to choose your machine. Warning: choosing big machines with lots of CPU and a ton of RAM can be addictive. Winners know when to stop

In this example I spin up a t2.micro instance, which is part of the free tier that Amazon provides.

# Launch the instance using appropriate settings i <- run_instances(image = image, type = "t2.micro", # <- you might want to change this to something like x1e.32xlarge ($26.688 p/h) if you feeling adventurous subnet = s[[1]], sgroup = g[[1]])

Once I have executed the code above, I can check on the instance using instance_status() to see if the machine is ready, or describe_instances() to get the meta information on the machine, such as its IP address. Again, I use the custom tidy_describe() function:

aws_instance <- describe_instances(i) aws_instance %>% tidy_describe() -------------------------------------- Summary -------------------------------------- ownerId : 748485365675 instanceId : i-007fd9116488691fe imageId : ami-3b0c205e instanceType : t2.micro launchTime : 2018-06-30T13:15:50.000Z availabilityZone : us-east-2b privateIpAddress : 172.31.16.198 ipAddress : 18.222.174.186 coreCount : 1 threadsPerCore : 1 To return as tibble: pretty_print = FALSE aws_status <- instance_status(i) aws_status %>% tidy_describe() -------------------------------------- Summary -------------------------------------- instanceId : i-007fd9116488691fe availabilityZone : us-east-2b code : 16 name : running To return as tibble: pretty_print = FALSE

The final bit of code (which is VERY important when running large instances) is to stop the instance and confirm that it has been terminated:

# Stop and terminate the instances
stop_instances(i[[1]])
terminate_instances(i[[1]])

Final comments

Working with AWS instances for a while now has really been a game changer in the way I approach any analytical project. Having the capability to switch on large machines on demand and quickly run any of my analytical scripts has opened up new opportunities for what I can do as a consultant with a very limited budget to spend on hardware – also, where would I keep my 96-core, 500GB-RAM machine once I had scraped together enough cash to actually build such a machine?



Spend on petrol by income by @ellis2013nz

Sat, 06/30/2018 - 14:00

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

Fuel tax debates

So, there’s currently a vibrant debate on a small New Zealandish corner of Twitter about a petrol tax coming into effect in Auckland today, and the different impacts of such taxes on richer and poorer households.

The Government has released analysis from the Stats NZ Household Expenditure Survey showing higher petrol consumption per household for higher income households (and hence paying more of the new tax). Sam Warburton, an economist with the New Zealand Institute, argues in response that poorer households have older, often larger, less efficient vehicles, leading to higher fuel costs per kilometre. This means a fuel tax will not only result in poor people paying more as a percentage of their income (as any sales tax on basic commodities will do), but paying more per kilometre. Further, poor people are more likely to live out of the big urban centres, and when in an urban area are less likely to be well serviced by public transport.

Mr Warburton also argues that the results from the Household Expenditure Survey are misleading because they include “people who can’t afford or otherwise don’t own cars”. His own analysis from the Ministry of Transport’s Household Travel Survey shows that households including two or more children, with Māori, with unemployed people, or in the poorer regions all pay more tax per kilometre.

I’m totally convinced by Mr Warburton on the argument that poorer people are paying more tax per kilometre, which makes a fuel tax a particularly regressive tax. Even paying the same rate per kilometre would be regressive because transport costs are a higher proportion of income for poorer people; so more tax per kilometre is really rubbing salt into the wound. I’m not convinced though by the suggestion that they will pay more of the new tax even in absolute terms; I’m inclined to trust the Household Economic Survey on this one.

The USA Consumer Expenditure Survey

One good thing about this debate was it motivated me to do something that’s been on my list for a while, which is to get my toes in the water with the USA Bureau of Labor Statistics’ impressive Consumer Expenditure Survey. Huge amounts of confidentialised microdata from this mammoth program are freely available for download – without even filling in any forms! This makes it suitable to use in a blog post in a way I couldn’t with the New Zealand or Australian Household Expenditure Surveys (both of which have Confidentialised Unit Record Files for researchers, but with restrictions that get in the way of quick usage in public like this).

Big caveat for what follows – literally I looked at this survey for the first time today, and it is very complex. Just for starters, the Consumer Expenditure Survey really comprises two surveys – an Interview Survey for “major and/or recurring items” and the Diary Survey for “minor or frequently purchased items”. It is very possible that I am using not-the-best variables. Feedback welcomed.

Densities of income and fuel spend

Let’s get started by downloading the data. Here’s a couple of graphs of what I think are the main variables of interest here. These draw on:

  • gasmocq “Gasoline and motor oil this quarter”
  • fincbtxm “Total amount of family income – imputed or collected” (in the past 12 months? although the data dictionary only implies this)
  • fam_size “Number of Members in CU” (ie in responding household)
  • bls_urbn “Is this CU located in an urban or a rural area”

Both these graphics are showing a quantity divided by household size to get a simple estimate of amount per person. A better approach would be to use equivalised figures, taking into account economies of scale for larger households and different cost structures for different age groups, but it probably would have taken me all morning just to work out a safe way of doing that, so I’ve stuck with the simpler method.
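For the record, a minimal version of that better approach, using the simple square-root equivalence scale (household income divided by the square root of household size), would only be a small change to the code further down. This is just a sketch, relying on the fmli171 data frame created in the code chunk below; the graphs here keep the plain per-person division:

# sketch only: equivalised income and fuel spend using the square-root scale
d_equiv <- fmli171 %>%
  mutate(inc_equiv = fincbtxm / sqrt(fam_size),
         gas_equiv = gasmocq * 4 / sqrt(fam_size))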

Both these graphics – and most of those that follow – also use John and Draper’s modulus transform that I wrote about in some of my first ever posts on this blog. It’s a great way to effectively visualise heavily skewed data that has zero and negative values as well as positive, a common occurrence with economic variables. The helper functions for the transformation are available in my bag-of-misc R package frs available from GitHub.
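For reference, the John and Draper modulus transformation itself is simple to write down; a minimal stand-alone version (the frs package wraps this up as a ggplot2 scale transform with matching breaks) looks like this:

# minimal sketch of the modulus transformation with parameter lambda
modulus <- function(x, lambda) {
  if (lambda == 0) {
    sign(x) * log(abs(x) + 1)
  } else {
    sign(x) * ((abs(x) + 1) ^ lambda - 1) / lambda
  }
}

modulus(c(-1000, 0, 1000, 100000), lambda = 0.2)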

Here’s code to do that download, prepare the data for later and draw graphs:

# mandatory reading: https://www.bls.gov/cex/pumd_novice_guide.pdf library(haven) library(tidyverse) library(scales) library(frs) # for modulus transforms of scales library(MASS) # for rlm. Watch out for dplyr::select and MASS::select clash. library(ggExtra) library(survey) # download the latest year's data (caution - 71MB): dir.create("bls-cd") download.file("https://www.bls.gov/cex/pumd/data/stata/intrvw16.zip", destfile = "bls-cd/intrvw16.zip", mode = "wb") # download the data dictionary download.file("https://www.bls.gov/cex/pumd/ce_pumd_interview_diary_dictionary.xlsx", destfile = "bls-cd/ce_pumd_interview_diary_dictionary.xlsx", mode = "wb") unzip("bls-cd/intrvw16.zip", exdir = "bls-cd") # The FMLI files have characteristics, income, weights, and summary expenditure # the numbers in file names refer to year and quarter. So intrvw16/fmli162.dta is # 2016, second quarter # Import 2017 first quarter: fmli171 <- read_dta("bls-cd/intrvw16/fmli171.dta") # looks like no attributes associated with the various columns; have to use the Excel data dictionary str(attributes(fmli171)) attr(fmli171, "labels") attr(fmli171, "label") # which columns are numeric, candidates to hold income and expenditure non-binned data: formats <- sapply(fmli171, function(x){attr(x, "format.stata")}) formats[grepl("[0-9]g", formats)] # Bit of basic processing for convenience. Make a new data frame called d with some labels we need later, # and the per person variables calculated in one place. # First, the code labels for income classes: inclasses <- c("Less than $5,000", "$5,000 to $9,999", "$10,000 to $14,999", "$15,000 to $19,999", "$20,000 to $29,999", "$30,000 to $39,999", "$40,000 to $49,999", "$50,000 to $69,999", "$70,000 and over") inclass_lu <- data_frame( inclass = paste0("0", 1:9), inclass_ch = ordered(inclasses, levels = inclasses) ) # create the working version of the data frame: d <- fmli171 %>% mutate(gas_pp = gasmocq / fam_size * 4, inc_pp = fincbtxm / fam_size, gas_p_inc = gasmocq * 4 / fincbtxm, no_gas = as.integer(gas_pp == 0), bls_urbn_ch = ifelse(bls_urbn == 1, "Urban", "Rural"), ) %>% left_join(inclass_lu, by = "inclass") # parameter for how much to transform the dollar scales: lambda <- 0.2 ggplot(d, aes(x = gas_pp, colour = bls_urbn_ch)) + geom_density() + scale_x_continuous(label = dollar, trans = modulus_trans(lambda), breaks = modulus_breaks(lambda)) + geom_rug() + ggtitle("Spend on petrol has a spike at $0 and a skewed distribution beyond that", "Unweighted, 2017 quarter 1. 
Horizontal scale has been transformed.") + labs(caption = "Source: USA Bureau of Labor Statistics Consumer Expenditure Survey", colour = "", x = "Expenditure per person on gasoline and motor oil this quarter x 4") # some looks at various variables as described in the Excel data dictionary: summary(fmli171$cuincome) # total income; not present summary(fmli171$earnincx) # earnings before tax; not present table(fmli171$earncomp) # composition of earners table(fmli171$cutenure) # housing tenure (6 categories) table(fmli171$inclass) # income bracket summary(fmli171$inc_rank) # Weighted cumulative percent ranking based on total current income before taxes (for complete income reporters) summary(fmli171$othrincm) # other income summary(fmli171$ffrmincx) # farm income, not present summary(fmli171$fincbtax) # total amount of family income before taxes in the past 12 months summary(fmli171$finlwt21) # final calibrated weight ggplot(d, aes(x = inc_pp, colour = bls_urbn_ch)) + geom_density() + geom_rug() + ggtitle("After imputation, income has a skewed distribution with no spike at zero", "Unweighted, 2017 quarter 1. Horizontal scale has been transformed.") + scale_x_continuous(label = dollar, trans = modulus_trans(lambda), breaks = modulus_breaks(lambda)) + labs(caption = "Source: USA Bureau of Labor Statistics Consumer Expenditure Survey", colour = "", x = "Family income per person before taxes in the past 12 months") Relationship of income and fuel spend

Here are some different ways of looking at the relationship between household income and the amount spent on fuel. They all show a lot of variation between individual households, but significant evidence of a material positive relationship between the two.

First, here’s a graph that tries to combine the binned “income classification” with the ranking of the household on income (i.e. its quantile, if you like). The categories aren’t as neat as might be expected, I’m pretty sure because these variables represent different states of imputation:

The collection of people who don’t spend any money on petrol drags the regression line downwards, but it’s the slope that counts; definitely upwards. The higher ranked a household is on income, the more they spend per person on gasoline and motor oil.

BTW, note that the points in these plots are different sizes. The size is mapped to the calibrated survey weight indicating how many people in the US population each sample point represents; this is a good starting point for trying to represent weighted data in a scatter plot.

I’m wary of using quantiles or rankings in this sort of analysis; I don’t see much or any gain over other transformations and new risks and interpretability problems are introduced. Perhaps more usefully, here are some straightforward scatterplots of income per person and vehicle fuel expenditure per person:

No doubt about that strong relationship; poorer households spend less on fuel (and nearly everything else, of course, although that’s not shown) than do richer households.

On the other hand, there’s equally no doubt that poorer households spend more on fuel as a proportion of their income:

Here’s the code for those four graphics:

#-----------income quantile v fuel scatter plot------------------- ggplot(d, aes(y = gas_pp, x = inc_rank, size = finlwt21)) + geom_point(aes(colour = inclass_ch), alpha = 0.2) + geom_smooth(method = "rlm", aes(weight = finlwt21)) + scale_y_continuous(label = dollar, trans = modulus_trans(lambda), breaks = modulus_breaks(lambda)) + labs(x = "Income quantile") + ggtitle("Spend on petrol increases as income quantile of the household increases", "Blue line shows robust regression using M estimator. Vertical axis is transformed.") + labs(caption = "Source: USA Bureau of Labor Statistics Consumer Expenditure Survey", colour = str_wrap("Income class of household based on income before taxes", 20), y = "Expenditure per person on gasoline and motor oil this quarter") + theme(legend.position = "right") + guides(colour = guide_legend(override.aes = list(size = 4, alpha = 1))) + scale_size_area(guide = "none") #-----------scatter plots---------------- p <- ggplot(d, aes(x = inc_pp, y = gas_pp, size = finlwt21)) + geom_point(alpha = 0.3) + geom_smooth(method = "rlm") + scale_x_continuous("Income per person in household in past 12 months", label = dollar, trans = modulus_trans(lambda), breaks = modulus_breaks(lambda)) + scale_y_continuous(label = dollar, trans = modulus_trans(lambda), breaks = modulus_breaks(lambda)) + theme(legend.position = "none") + ggtitle("Spend on petrol increases as income of the household increases", "Blue line shows robust regression using M estimator. Both axes are transformed.") + labs(caption = "Source: USA Bureau of Labor Statistics Consumer Expenditure Survey", y = "Expenditure per person on gasoline and motor oil this quarter x 4") ggMarginal(p, type = "density", fill = "grey", colour = NA) ggMarginal(p %+% filter(d, gas_pp > 0 & inc_pp > 0), type = "density", fill = "grey", colour = NA) #-----------proportion spent on fuel------------- d %>% filter(inc_pp > 1000) %>% ggplot(aes(x = inc_pp, y = gas_p_inc, size = finlwt21)) + geom_point(alpha = 0.3) + scale_x_continuous("Income per person in household in past 12 months", label = dollar, trans = modulus_trans(lambda), breaks = modulus_breaks(lambda)) + coord_cartesian(ylim = c(0, 1)) + scale_y_continuous() + geom_smooth(se = FALSE) + theme(legend.position = "none") + ggtitle("Poorer households spend more on petrol proportionately than do richer households") + labs(caption = "Source: USA Bureau of Labor Statistics Consumer Expenditure Survey", y = "Expenditure per person on gasoline and motor oil\nas a proportion of income") Who is likely to spend nothing on petrol at all?

Finally, I was interested in who spends nothing on petrol at all. This relates to Mr Warburton’s argument that the New Zealand Household Economic Survey is flawed for these purposes because the average spend on petrol includes people who have been priced out of vehicles altogether. In fact, with the US data, there is a very strong negative relationship between household income and the probability of spending zero on gasoline and motor oil:


However, as the previous scatterplots showed, removing either or both of the zero income and zero fuel spend cases from the US Consumer Expenditure survey doesn’t serve to remove the relationship between income and gasoline spend.

Finally, here’s the code for this last bit of analysis:

p1 <- d %>% ggplot(aes(x = inc_pp, y = no_gas, size = finlwt21)) + geom_point(alpha = 0.1) + geom_smooth(method = "glm", method.args = list(family = "binomial")) + scale_x_continuous("Income per person in household in past 12 months", label = dollar, trans = modulus_trans(lambda), breaks = modulus_breaks(lambda)) + labs(y = "Probability of spending $0 on gasoline and motor oil this quarter\n") + theme(legend.position = "none") + labs(caption = "Source: USA Bureau of Labor Statistics Consumer Expenditure Survey 2017 Q1") + ggtitle("Poorer households are more likely to spend nothing on petrol at all", "Blue line shows logistic regression; points show American households") print(p1) p2 <- p1 %+% filter(d, inc_pp > 0) + ggtitle("", "Effect holds even if restricted to households with positive income") print(p2) # Check out the relationship with slightly more formal modelling than just on the fly in the graphic. # crude approximation of the survey design, that takes just the primary sampling units and weights # into account but ignores the stratification and post-stratification calibration (this will be # conservative for inference so that's ok). # Also, better would be to transform `inc_pp`, or use a gam, or something. This will do for now! dd <- svydesign(~psu, data = d, weights = ~finlwt21) mod <- svyglm(no_gas ~ inc_pp, design = dd, family = "quasibinomial") summary(mod)

Just as I was finalising this post, new discussion started suggesting the American Time Use Survey as a good source of microdata to directly analyse the question of whether poorer households travel more or less. That looks like a good topic to come back to at some point.



RcppArmadillo 0.8.600.0.0

Sat, 06/30/2018 - 04:11

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new RcppArmadillo release 0.8.600.0.0, based on the new Armadillo release 8.600.0 from this week, just arrived on CRAN.

It follows our (and Conrad’s) bi-monthly release schedule. We have made interim and release candidate versions available via the GitHub repo (and as usual thoroughly tested them) but this is the real release cycle. A matching Debian release will be prepared in due course.

Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 479 other packages on CRAN.
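If you have not used it before, a minimal illustrative example of calling Armadillo from R via RcppArmadillo (not taken from the release notes) could look like this:

library(Rcpp)

# compile a small C++ function that solves A x = b with Armadillo
cppFunction(depends = "RcppArmadillo", code = '
  arma::vec arma_solve(const arma::mat& A, const arma::vec& b) {
    return arma::solve(A, b);
  }
')

A <- matrix(c(2, 1, 1, 3), nrow = 2)
b <- c(1, 2)
arma_solve(A, b)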

A high-level summary of changes follows (which omits the two rc releases leading up to 8.600.0). Conrad did his usual impressive load of upstream changes, but we are also grateful for the RcppArmadillo fixes added by Keith O’Hara and Santiago Olivella.

Changes in RcppArmadillo version 0.8.600.0.0 (2018-06-28)
  • Upgraded to Armadillo release 8.600.0 (Sabretooth Rugrat)

    • added hess() for Hessenberg decomposition

    • added .row(), .rows(), .col(), .cols() to subcube views

    • expanded .shed_rows() and .shed_cols() to handle cubes

    • expanded .insert_rows() and .insert_cols() to handle cubes

    • expanded subcube views to allow non-contiguous access to slices

    • improved tuning of sparse matrix element access operators

    • faster handling of tridiagonal matrices by solve()

    • faster multiplication of matrices with differing element types when using OpenMP

Changes in RcppArmadillo version 0.8.500.1.1 (2018-05-17) [GH only]
  • Upgraded to Armadillo release 8.500.1 (Caffeine Raider)

    • bug fix for banded matrices
  • Added slam to Suggests: as it is used in two unit test functions [CRAN requests]

  • The RcppArmadillo.package.skeleton() function now works with example_code=FALSE when pkgKitten is present (Santiago Olivella in #231 fixing #229)

  • The LAPACK tests now cover band matrix solvers (Keith O’Hara in #230).

Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Punctuation in literature

Sat, 06/30/2018 - 02:00

(This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

This morning I was scrolling through Twitter and noticed Alberto Cairo sharing this lovely data visualization piece by Adam J. Calhoun about the varying prevalence of punctuation in literature. I thought, “I want to do that!” It also offers me the opportunity to chat about a few of the new options available for tokenizing in tidytext via updates to the tokenizers package.
Adam’s original piece explores how punctuation is used in nine novels, including my favorite Pride and Prejudice.
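
As a rough sketch of the kind of tokenizer option involved (this is not Adam's or Julia's actual code), unnest_tokens() can now pass arguments such as strip_punct through to the tokenizers package, so punctuation marks survive as tokens and can simply be counted:

# Hedged sketch: keep punctuation as tokens and count it, assuming
# unnest_tokens() passes strip_punct through to tokenizers::tokenize_words().
library(dplyr)
library(tidytext)
library(janeaustenr)

austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  unnest_tokens(token, text, token = "words", strip_punct = FALSE) %>%
  filter(token %in% c(",", ".", ";", ":", "!", "?")) %>%
  count(token, sort = TRUE)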


To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.

Global Migration, animated with R

Fri, 06/29/2018 - 23:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The animation below, by Shanghai University professor Guy Abel, shows migration within and between regions of the world from 1960 to 2015. The data and the methodology behind the chart are described in this paper. The curved bars around the outside represent the peak migrant flows for each region; globally, migration peaked during the 2005-2010 period and then declined in 2010-2015, the latest period for which data are available.

This animated chord chart was created entirely using the R language. The chord plot showing the flows between regions was created using the circlize package; the tweenr package created the smooth transitions between time periods, and the magick package created the animated GIF you see above. You can find a tutorial on making this animation, including the complete R code, at the link below.
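
To get a feel for the main ingredient, here is a minimal chord-diagram sketch with circlize; the three regions and flow values below are invented purely for illustration, and the real code (plus the tweenr and magick animation steps) is in the tutorial linked below.

# Toy sketch with made-up flows -- see the linked tutorial for the real thing.
library(circlize)

flows <- matrix(c(0, 5, 2,
                  3, 0, 4,
                  1, 6, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("Africa", "Asia", "Europe"),
                                c("Africa", "Asia", "Europe")))

chordDiagram(flows, directional = 1, transparency = 0.4)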

Guy Abel: Animated Directional Chord Diagrams (via Cal Carrie)


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

modelDown: a website generator for your predictive models

Fri, 06/29/2018 - 10:19

(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)

I love the pkgdown package. With a single line of code you can create a complete website with examples, vignettes and documentation for your package. Brilliant!

So what about a website generator for predictive models?
Imagine that you can take a set of predictive models (generated with caret, mlr, glm, xgboost or randomForest, anything) and automagically generate a website with an exploration/documentation for these models. Documentation with archivist hooks to the models, with tables and graphs for model performance explainers, conditional model response explainers or explainers for particular predictions.

During the summer semester, three students from Warsaw University of Technology (Kamil Romaszko, Magda Tatarynowicz, Mateusz Urbański) developed the modelDown package for R as a team project assignment. You can find the package here. Visit an example website created with this package for four example models (instructions). And read more about this package at its pkgdown website or below.

BTW: If you want to learn more about model explainers, please come to our DALEX workshops at WhyR? 2018 conference in Wroclaw or UseR! 2018 conference in Brisbane.

Getting started with modelDown
by Kamil Romaszko, Magda Tatarynowicz, Mateusz Urbański

Introduction

Did you ever want to have one place where you can find information explaining your model? Or maybe you were missing a tool that can show differences between multiple models for the same dataset? Well, here comes the modelDown package. Using the DALEX package, it creates one html page with plots and information related to the model(s) you want to analyze.

If you want to see an example website generated with modelDown, check out this link (along with the script that was used to create the html). Read on to see how to use the package for your own models and what features it provides.

The examples presented here were generated for the dataset HR_data from the breakDown package (available on CRAN). The dataset contains various information about employees (for example their job satisfaction or their salary). The information we predict is whether they left the company.
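
If you want to follow along, the dataset ships with breakDown, so (assuming the package is installed) something like this should get you the data used in the examples:

# HR_data comes with the breakDown package (available on CRAN)
library(breakDown)
data("HR_data", package = "breakDown")
str(HR_data)          # employee-level features plus the 'left' outcome
table(HR_data$left)   # how many employees left the company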

Installation
First things first – how can you use this package? Install it from GitHub:

devtools::install_github("MI2DataLab/modelDown")

When you have the package successfully installed, you need to create DALEX explainers for your models. Here is a simple example. Please refer to the DALEX package documentation in order to learn more.

# assuming you have two models: glm_model and ranger_model for HR_data
explainer_glm <- DALEX::explain(glm_model, data = HR_data, y = HR_data$left)
explainer_ranger <- DALEX::explain(ranger_model, data = HR_data, y = HR_data$left)

Next, just pass all created explainers to the modelDown function. For example:

modelDown::modelDown(explainer_ranger, explainer_glm)

That’s it! Now you should have your html page generated with default options.

Features

Let’s quickly describe the sections of your page. If you want to know more about how the plots are generated, again, check out DALEX package documentation.

Index page

Always know your data before you analyze the model – the index page helps you do exactly that.

You can see basic information about your data, like dimensions and a summary of all variables. For numerical variables some summary statistics are presented; for categorical ones you see how many observations fall into each category.

Model Performance

The most general information about how accurate the predictions were.

For our two models, the ranger model clearly has lower residual values, which suggests better performance for this dataset.

Variable Importance

The variable importance plot is extremely useful when you want to see how removing a single variable impacts the response – basically, how important each variable is.

Here, it is clear that for the linear model there are two most important variables – number_project and satisfaction_level. For the ranger model, there are four most important variables. Also, each model picked a different variable as the most important one.

Variable Response

In the variable response plot you can see how one variable impacts the response.

For example, for the variable average_monthly_hours and the glm model, there is a linear dependency – the more hours someone works, the greater the chance they will leave the company. For the ranger model, this is not so clear – the chance of leaving increases drastically for people working more than 270 hours a month. By default the plots are generated for every variable, so you can draw similar conclusions for all variables in the model.

Prediction BreakDown

Prediction breakdown shows detailed information for particular observations in a model. By default, for each model the observation with the worst predicted value is presented.

In the example, for the ranger model the value of satisfaction_level contributed the most to the final response. So even though this particular employee’s satisfaction level was below the midpoint of the scale used to measure it, they still didn’t leave the company. The model prediction was not correct in this case.

Prediction breakdown makes it easier to understand how the model acted. It can be useful for tuning your model and improving its capabilities.

Summary

The idea of the package was to help you understand your models in a condensed and easy way. We hope that using this package will make your models’ performance clear to you. Feel free to use it and provide your feedback.


To leave a comment for the author, please follow the link and comment on their blog: English – SmarterPoland.pl.

Benchmarking a SSD drive in reading and writing files with R

Fri, 06/29/2018 - 09:00

(This article was first published on Marcelo S. Perlin, and kindly contributed to R-bloggers)

I recently bought a new computer for home and it came with two drives,
one HDD and the other SSD. The latter is used for the OS and the former for
all of my files. Of all the computers I have had, both at home and at work,
this is definitely the fastest. While some of the merit is due to the newer
CPU and RAM, the SSD drive can make all the difference in file
operations.

My research usually deals with large files from financial markets. Being
efficient in reading those files is key to my productivity. Given that,
I was very curious to understand how much speed I would gain by
reading/writing files on my SSD drive instead of the HDD. For that,
I wrote a simple function that times a particular operation. The
function takes as input the number of rows in the data (1..Inf), the
type of function used to save the file (rds, csv, fst) and the
type of drive (HDD or SSD). See next.

bench.fct <- function(N = 2500000, type.file = 'rds', type.hd = 'HDD') {
  # Function for timing read and write operations
  #
  # INPUT: N - Number of rows in dataframe to be read and write
  #        type.file - format of output file (rds, csv, fst)
  #        type.hd - where to save (hdd or ssd)
  #
  # OUTPUT: A dataframe with results

  require(tidyverse)
  require(fst)

  my.df <- data_frame(x = runif(N),
                      char.vec = sample(letters, size = N, replace = TRUE))

  path.file <- switch(type.hd,
                      'SSD' = '~',
                      'HDD' = '/mnt/HDD/')

  my.file <- file.path(path.file,
                       switch(type.file,
                              'rds-base' = 'temp_rds.rds',
                              'rds-readr' = 'temp_rds.rds',
                              'fst' = 'temp_fst.fst',
                              'csv-readr' = 'temp_csv.csv',
                              'csv-base' = 'temp_csv.csv'))

  if (type.file == 'rds-base') {
    time.write <- system.time(saveRDS(my.df, my.file, compress = FALSE))
    time.read <- system.time(readRDS(my.file))
  } else if (type.file == 'rds-readr') {
    time.write <- system.time(write_rds(x = my.df, path = my.file, compress = 'none'))
    time.read <- system.time(read_rds(path = my.file))
  } else if (type.file == 'fst') {
    time.write <- system.time(write.fst(x = my.df, path = my.file))
    time.read <- system.time(read_fst(my.file))
  } else if (type.file == 'csv-readr') {
    time.write <- system.time(write_csv(x = my.df, path = my.file))
    time.read <- system.time(read_csv(file = my.file,
                                      col_types = cols(x = col_double(),
                                                       char.vec = col_character())))
  } else if (type.file == 'csv-base') {
    time.write <- system.time(write.csv(x = my.df, file = my.file))
    time.read <- system.time(read.csv(file = my.file))
  }

  # clean up
  file.remove(my.file)

  # save output
  df.out <- data_frame(type.file = type.file,
                       type.hd = type.hd,
                       N = N,
                       type.time = c('write', 'read'),
                       times = c(time.write[3], time.read[3]))

  return(df.out)
}

Now that we have the function, it’s time to use it for all combinations
of number of rows, file format and type of drive:

library(purrr)

df.grid <- expand.grid(N = seq(1, 500000, by = 50000),
                       type.file = c('rds-readr', 'rds-base', 'fst', 'csv-readr', 'csv-base'),
                       type.hd = c('HDD', 'SSD'),
                       stringsAsFactors = F)

l.out <- pmap(list(N = df.grid$N,
                   type.file = df.grid$type.file,
                   type.hd = df.grid$type.hd),
              .f = bench.fct)

df.res <- do.call(what = bind_rows, args = l.out)

Let’s check the result in a nice plot:

library(ggplot2)

p <- ggplot(df.res, aes(x = N, y = times, linetype = type.hd)) +
  geom_line() +
  facet_grid(type.file ~ type.time)

print(p)

As you can see, the csv-base format is messing with the y axis. Let’s
remove it for better visualization:

library(ggplot2)

p <- ggplot(filter(df.res, !(type.file %in% c('csv-base'))),
            aes(x = N, y = times, linetype = type.hd)) +
  geom_line() +
  facet_grid(type.file ~ type.time)

print(p)

When it comes to the file format, we learn:

  • By far, the fst format is the best. It takes less time to read
    and write than the others. However, it’s probably unfair to compare
    it to csv and rds as it uses many of the 16 cores of my
    computer (see the quick thread check after this list).

  • readr is a great package for writing and reading csv files.
    You can see a large time difference compared to the base
    functions. This is likely due to the use of low level functions to
    write and read the text files.

  • When using the rds format, the base functions do not differ much
    from the readr functions.
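
If you want to check (or level the playing field on) how many threads fst is using, the package exposes a small helper for that; a quick sketch, assuming fst's threads_fst() function:

# Sketch: inspect and restrict the number of threads used by fst
library(fst)
threads_fst()    # how many threads fst is currently allowed to use
threads_fst(1)   # restrict to a single thread for a fairer comparison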

As for the effect of using the SSD, it’s clear that it DOES NOT affect
the time of reading and writing. The differences between using the HDD and
the SSD look like noise. Seeking to provide a more robust analysis, let’s
formally test this hypothesis using a simple t-test for the means:

tab <- df.res %>%
  group_by(type.file, type.time) %>%
  summarise(mean.HDD = mean(times[type.hd == 'HDD']),
            mean.SSD = mean(times[type.hd == 'SSD']),
            p.value = t.test(times[type.hd == 'SSD'],
                             times[type.hd == 'HDD'])$p.value)

print(tab)
## # A tibble: 10 x 5
## # Groups:   type.file [?]
##    type.file type.time mean.HDD mean.SSD p.value
##  1 csv-base  read       0.554    0.463    0.605
##  2 csv-base  write      0.405    0.405    0.997
##  3 csv-readr read       0.142    0.126    0.687
##  4 csv-readr write      0.0711   0.0706   0.982
##  5 fst       read       0.015    0.0084   0.0584
##  6 fst       write      0.00900  0.00910  0.964
##  7 rds-base  read       0.0321   0.0303   0.848
##  8 rds-base  write      0.0253   0.025    0.969
##  9 rds-readr read       0.0323   0.0304   0.845
## 10 rds-readr write      0.0251   0.0247   0.957

As we can see, the null hypothesis of equal means easily fails to be
rejected for almost all types of files and operations at 10%. The
exception was the fst format in a reading operation. In other
words, statistically, it makes no difference in time whether you use the
SSD or the HDD to read or write files in these formats.

I am very surprised by this result. Independently of the type of format,
I expected a large difference, as SSD drives are much faster within an
OS. Am I missing something? Is this due to the OS being on the SSD? What
do you guys think?


To leave a comment for the author, please follow the link and comment on their blog: Marcelo S. Perlin.

Code for Workshop: Introduction to Machine Learning with R

Fri, 06/29/2018 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

These are the slides from my workshop: Introduction to Machine Learning with R which I gave at the University of Heidelberg, Germany on June 28th 2018. The entire code accompanying the workshop can be found below the video.

The workshop covered the basics of machine learning. With an example dataset I went through a standard machine learning workflow in R with the packages caret and h2o:

  • reading in data
  • exploratory data analysis
  • missingness
  • feature engineering
  • training and test split
  • model training with Random Forests, Gradient Boosting, Neural Nets, etc.
  • hyperparameter tuning


Workshop – Introduction to Machine Learning with R from Shirin Glander

Setup

All analyses are done in R using RStudio. For detailed session information including R version, operating system and package versions, see the sessionInfo() output at the end of this document.

All figures are produced with ggplot2.

  • libraries
library(tidyverse)  # for tidy data analysis
library(readr)      # for fast reading of input files
library(mice)       # mice package for Multivariate Imputation by Chained Equations (MICE)

Data preparation

The dataset

The dataset I am using in these example analyses is the Breast Cancer Wisconsin (Diagnostic) Dataset. The data was downloaded from the UC Irvine Machine Learning Repository.

The first dataset looks at the predictor classes:

  • malignant or
  • benign breast mass.

The features characterise cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses:

  • Sample ID (code number)
  • Clump thickness
  • Uniformity of cell size
  • Uniformity of cell shape
  • Marginal adhesion
  • Single epithelial cell size
  • Number of bare nuclei
  • Bland chromatin
  • Number of normal nuclei
  • Mitosis
  • Classes, i.e. diagnosis
bc_data <- read_delim("/Users/shiringlander/Documents/Github/intro_to_ml_workshop/intro_to_ml_uni_heidelberg/datasets/breast-cancer-wisconsin.data.txt", delim = ",", col_names = c("sample_code_number", "clump_thickness", "uniformity_of_cell_size", "uniformity_of_cell_shape", "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitosis", "classes")) %>% mutate(bare_nuclei = as.numeric(bare_nuclei), classes = ifelse(classes == "2", "benign", ifelse(classes == "4", "malignant", NA))) summary(bc_data) ## sample_code_number clump_thickness uniformity_of_cell_size ## Min. : 61634 Min. : 1.000 Min. : 1.000 ## 1st Qu.: 870688 1st Qu.: 2.000 1st Qu.: 1.000 ## Median : 1171710 Median : 4.000 Median : 1.000 ## Mean : 1071704 Mean : 4.418 Mean : 3.134 ## 3rd Qu.: 1238298 3rd Qu.: 6.000 3rd Qu.: 5.000 ## Max. :13454352 Max. :10.000 Max. :10.000 ## ## uniformity_of_cell_shape marginal_adhesion single_epithelial_cell_size ## Min. : 1.000 Min. : 1.000 Min. : 1.000 ## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 2.000 ## Median : 1.000 Median : 1.000 Median : 2.000 ## Mean : 3.207 Mean : 2.807 Mean : 3.216 ## 3rd Qu.: 5.000 3rd Qu.: 4.000 3rd Qu.: 4.000 ## Max. :10.000 Max. :10.000 Max. :10.000 ## ## bare_nuclei bland_chromatin normal_nucleoli mitosis ## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000 ## 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 ## Median : 1.000 Median : 3.000 Median : 1.000 Median : 1.000 ## Mean : 3.545 Mean : 3.438 Mean : 2.867 Mean : 1.589 ## 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 4.000 3rd Qu.: 1.000 ## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000 ## NA's :16 ## classes ## Length:699 ## Class :character ## Mode :character ## ## ## ##

Missing data # how many NAs are in the data md.pattern(bc_data, plot = FALSE) ## sample_code_number clump_thickness uniformity_of_cell_size ## 683 1 1 1 ## 16 1 1 1 ## 0 0 0 ## uniformity_of_cell_shape marginal_adhesion single_epithelial_cell_size ## 683 1 1 1 ## 16 1 1 1 ## 0 0 0 ## bland_chromatin normal_nucleoli mitosis classes bare_nuclei ## 683 1 1 1 1 1 0 ## 16 1 1 1 1 0 1 ## 0 0 0 0 16 16 bc_data <- bc_data %>% drop_na() %>% select(classes, everything(), -sample_code_number) head(bc_data) ## # A tibble: 6 x 10 ## classes clump_thickness uniformity_of_cell_si… uniformity_of_cell_sha… ## ## 1 benign 5 1 1 ## 2 benign 5 4 4 ## 3 benign 3 1 1 ## 4 benign 6 8 8 ## 5 benign 4 1 1 ## 6 malignant 8 10 10 ## # ... with 6 more variables: marginal_adhesion , ## # single_epithelial_cell_size , bare_nuclei , ## # bland_chromatin , normal_nucleoli , mitosis

Missing values can be imputed with the mice package.

More info and tutorial with code: https://shirinsplayground.netlify.com/2018/04/flu_prediction/
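
As a hedged sketch of that alternative, imputation would need to happen before the drop_na() step above, i.e. on the dataset as read in, while the 16 missing bare_nuclei values are still present:

# Sketch: impute the missing bare_nuclei values with mice instead of dropping
# them; run this on the data *before* the drop_na() call above.
# With defaults, mice picks predictive mean matching (pmm) for numeric columns.
library(mice)
imp <- mice(bc_data, m = 5, seed = 42, printFlag = FALSE)
bc_data_imputed <- complete(imp)   # first of the five completed datasets
summary(bc_data_imputed$bare_nuclei)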

Data exploration
  • Response variable for classification
ggplot(bc_data, aes(x = classes, fill = classes)) + geom_bar()

More info on dealing with unbalanced classes: https://shiring.github.io/machine_learning/2017/04/02/unbalanced

  • Response variable for regression
ggplot(bc_data, aes(x = clump_thickness)) + geom_histogram(bins = 10)

  • Features
gather(bc_data, x, y, clump_thickness:mitosis) %>% ggplot(aes(x = y, color = classes, fill = classes)) + geom_density(alpha = 0.3) + facet_wrap( ~ x, scales = "free", ncol = 3)

  • Correlation graphs
co_mat_benign <- filter(bc_data, classes == "benign") %>%
  select(-1) %>%
  cor()

co_mat_malignant <- filter(bc_data, classes == "malignant") %>%
  select(-1) %>%
  cor()

library(igraph)

g_benign <- graph.adjacency(co_mat_benign, weighted = TRUE, diag = FALSE, mode = "upper")
g_malignant <- graph.adjacency(co_mat_malignant, weighted = TRUE, diag = FALSE, mode = "upper")

# http://kateto.net/networks-r-igraph
cut.off_b <- mean(E(g_benign)$weight)
cut.off_m <- mean(E(g_malignant)$weight)

g_benign_2 <- delete_edges(g_benign, E(g_benign)[weight < cut.off_b])
g_malignant_2 <- delete_edges(g_malignant, E(g_malignant)[weight < cut.off_m])

c_g_benign_2 <- cluster_fast_greedy(g_benign_2)
c_g_malignant_2 <- cluster_fast_greedy(g_malignant_2)

par(mfrow = c(1, 2))

plot(c_g_benign_2, g_benign_2,
     vertex.size = colSums(co_mat_benign) * 10,
     vertex.frame.color = NA,
     vertex.label.color = "black",
     vertex.label.cex = 0.8,
     edge.width = E(g_benign_2)$weight * 15,
     layout = layout_with_fr(g_benign_2),
     main = "Benign tumors")

plot(c_g_malignant_2, g_malignant_2,
     vertex.size = colSums(co_mat_malignant) * 10,
     vertex.frame.color = NA,
     vertex.label.color = "black",
     vertex.label.cex = 0.8,
     edge.width = E(g_malignant_2)$weight * 15,
     layout = layout_with_fr(g_malignant_2),
     main = "Malignant tumors")

Principal Component Analysis

library(ellipse)

# perform pca and extract scores
pcaOutput <- prcomp(as.matrix(bc_data[, -1]), scale = TRUE, center = TRUE)
pcaOutput2 <- as.data.frame(pcaOutput$x)

# define groups for plotting
pcaOutput2$groups <- bc_data$classes

centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean)

conf.rgn <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t)
  data.frame(groups = as.character(t),
             ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]),
                     centre = as.matrix(centroids[centroids$groups == t, 2:3]),
                     level = 0.95),
             stringsAsFactors = FALSE)))

ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) +
  geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) +
  geom_point(size = 2, alpha = 0.6) +
  labs(color = "", fill = "")

Multidimensional Scaling

select(bc_data, -1) %>%
  dist() %>%
  cmdscale %>%
  as.data.frame() %>%
  mutate(group = bc_data$classes) %>%
  ggplot(aes(x = V1, y = V2, color = group)) +
    geom_point()

t-SNE dimensionality reduction

library(tsne)

select(bc_data, -1) %>%
  dist() %>%
  tsne() %>%
  as.data.frame() %>%
  mutate(group = bc_data$classes) %>%
  ggplot(aes(x = V1, y = V2, color = group)) +
    geom_point()

Machine Learning packages for R

caret

# configure multicore
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

library(caret)

Training, validation and test data

set.seed(42)
index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
train_data <- bc_data[index, ]
test_data <- bc_data[-index, ]

bind_rows(data.frame(group = "train", train_data),
          data.frame(group = "test", test_data)) %>%
  gather(x, y, clump_thickness:mitosis) %>%
  ggplot(aes(x = y, color = group, fill = group)) +
    geom_density(alpha = 0.3) +
    facet_wrap( ~ x, scales = "free", ncol = 3)

Regression set.seed(42) model_glm <- caret::train(clump_thickness ~ ., data = train_data, method = "glm", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE, verboseIter = FALSE)) model_glm ## Generalized Linear Model ## ## 479 samples ## 9 predictor ## ## Pre-processing: scaled (9), centered (9) ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 432, 431, 432, 431, 431, 431, ... ## Resampling results: ## ## RMSE Rsquared MAE ## 1.972314 0.5254215 1.648832 predictions <- predict(model_glm, test_data) # model_glm$finalModel$linear.predictors == model_glm$finalModel$fitted.values data.frame(residuals = resid(model_glm), predictors = model_glm$finalModel$linear.predictors) %>% ggplot(aes(x = predictors, y = residuals)) + geom_jitter() + geom_smooth(method = "lm")

# y == train_data$clump_thickness data.frame(residuals = resid(model_glm), y = model_glm$finalModel$y) %>% ggplot(aes(x = y, y = residuals)) + geom_jitter() + geom_smooth(method = "lm")

data.frame(actual = test_data$clump_thickness, predicted = predictions) %>% ggplot(aes(x = actual, y = predicted)) + geom_jitter() + geom_smooth(method = "lm")

Classification

Decision trees

rpart

library(rpart)
library(rpart.plot)

set.seed(42)
fit <- rpart(classes ~ .,
             data = train_data,
             method = "class",
             control = rpart.control(xval = 10, minbucket = 2, cp = 0),
             parms = list(split = "information"))

rpart.plot(fit, extra = 100)

Random Forests

Random Forests predictions are based on the generation of multiple classification trees. They can be used for both classification and regression tasks. Here, I show a classification task.

set.seed(42)
model_rf <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 5,
                                                  repeats = 3,
                                                  savePredictions = TRUE,
                                                  verboseIter = FALSE))

When you specify savePredictions = TRUE, you can access the cross-validation results with model_rf$pred.
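
For example, a quick way to peek at those held-out predictions and recompute accuracy per tuning value (this snippet is mine, not part of the original slides):

# Peek at the cross-validation predictions stored by caret
head(model_rf$pred)

# Accuracy recomputed from the held-out folds, per mtry value
model_rf$pred %>%
  group_by(mtry) %>%
  summarise(accuracy = mean(pred == obs))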

model_rf ## Random Forest ## ## 479 samples ## 9 predictor ## 2 classes: 'benign', 'malignant' ## ## Pre-processing: scaled (9), centered (9) ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 432, 431, 431, 431, 431, 431, ... ## Resampling results across tuning parameters: ## ## mtry Accuracy Kappa ## 2 0.9776753 0.9513499 ## 5 0.9757957 0.9469999 ## 9 0.9714200 0.9370285 ## ## Accuracy was used to select the optimal model using the largest value. ## The final value used for the model was mtry = 2. model_rf$finalModel$confusion ## benign malignant class.error ## benign 304 7 0.02250804 ## malignant 5 163 0.02976190 Dealing with unbalanced data

Luckily, caret makes it very easy to incorporate over- and under-sampling techniques with cross-validation resampling. We can simply add the sampling option to our trainControl and choose down for under- (also called down-) sampling. The rest stays the same as with our original model.

set.seed(42) model_rf_down <- caret::train(classes ~ ., data = train_data, method = "rf", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE, verboseIter = FALSE, sampling = "down")) model_rf_down ## Random Forest ## ## 479 samples ## 9 predictor ## 2 classes: 'benign', 'malignant' ## ## Pre-processing: scaled (9), centered (9) ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 432, 431, 431, 431, 431, 431, ... ## Addtional sampling using down-sampling prior to pre-processing ## ## Resampling results across tuning parameters: ## ## mtry Accuracy Kappa ## 2 0.9797503 0.9563138 ## 5 0.9741198 0.9438326 ## 9 0.9699578 0.9346310 ## ## Accuracy was used to select the optimal model using the largest value. ## The final value used for the model was mtry = 2.

Feature Importance imp <- model_rf$finalModel$importance imp[order(imp, decreasing = TRUE), ] ## uniformity_of_cell_size uniformity_of_cell_shape ## 43.936945 39.840595 ## bare_nuclei bland_chromatin ## 33.820345 31.984813 ## normal_nucleoli single_epithelial_cell_size ## 21.686039 17.761202 ## clump_thickness marginal_adhesion ## 16.318817 9.518437 ## mitosis ## 2.220633 # estimate variable importance importance <- varImp(model_rf, scale = TRUE) plot(importance)

  • predicting test data
confusionMatrix(predict(model_rf, test_data), as.factor(test_data$classes)) ## Confusion Matrix and Statistics ## ## Reference ## Prediction benign malignant ## benign 128 4 ## malignant 5 67 ## ## Accuracy : 0.9559 ## 95% CI : (0.9179, 0.9796) ## No Information Rate : 0.652 ## P-Value [Acc > NIR] : <2e-16 ## ## Kappa : 0.9031 ## Mcnemar's Test P-Value : 1 ## ## Sensitivity : 0.9624 ## Specificity : 0.9437 ## Pos Pred Value : 0.9697 ## Neg Pred Value : 0.9306 ## Prevalence : 0.6520 ## Detection Rate : 0.6275 ## Detection Prevalence : 0.6471 ## Balanced Accuracy : 0.9530 ## ## 'Positive' Class : benign ## results <- data.frame(actual = test_data$classes, predict(model_rf, test_data, type = "prob")) results$prediction <- ifelse(results$benign > 0.5, "benign", ifelse(results$malignant > 0.5, "malignant", NA)) results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE) ggplot(results, aes(x = prediction, fill = correct)) + geom_bar(position = "dodge")

ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) + geom_jitter(size = 3, alpha = 0.6)

Extreme gradient boosting trees

Extreme gradient boosting (XGBoost) is a faster and improved implementation of gradient boosting for supervised learning.

“XGBoost uses a more regularized model formalization to control over-fitting, which gives it better performance.” Tianqi Chen, developer of xgboost

XGBoost is a tree ensemble model, which means its predictions are the sum of predictions from a set of classification and regression trees (CART). In that, XGBoost is similar to Random Forests, but it uses a different approach to model training. It can be used for classification and regression tasks. Here, I show a classification task.

set.seed(42) model_xgb <- caret::train(classes ~ ., data = train_data, method = "xgbTree", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3, savePredictions = TRUE, verboseIter = FALSE)) model_xgb ## eXtreme Gradient Boosting ## ## 479 samples ## 9 predictor ## 2 classes: 'benign', 'malignant' ## ## Pre-processing: scaled (9), centered (9) ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 432, 431, 431, 431, 431, 431, ... ## Resampling results across tuning parameters: ## ## eta max_depth colsample_bytree subsample nrounds Accuracy ## 0.3 1 0.6 0.50 50 0.9567788 ## 0.3 1 0.6 0.50 100 0.9544912 ## 0.3 1 0.6 0.50 150 0.9513572 ## 0.3 1 0.6 0.75 50 0.9576164 ## 0.3 1 0.6 0.75 100 0.9536448 ## 0.3 1 0.6 0.75 150 0.9525987 ## 0.3 1 0.6 1.00 50 0.9559409 ## 0.3 1 0.6 1.00 100 0.9555242 ## 0.3 1 0.6 1.00 150 0.9551031 ## 0.3 1 0.8 0.50 50 0.9718588 ## 0.3 1 0.8 0.50 100 0.9720583 ## 0.3 1 0.8 0.50 150 0.9699879 ## 0.3 1 0.8 0.75 50 0.9726964 ## 0.3 1 0.8 0.75 100 0.9724664 ## 0.3 1 0.8 0.75 150 0.9705868 ## 0.3 1 0.8 1.00 50 0.9714202 ## 0.3 1 0.8 1.00 100 0.9710035 ## 0.3 1 0.8 1.00 150 0.9705866 ## 0.3 2 0.6 0.50 50 0.9559448 ## 0.3 2 0.6 0.50 100 0.9565397 ## 0.3 2 0.6 0.50 150 0.9555063 ## 0.3 2 0.6 0.75 50 0.9530150 ## 0.3 2 0.6 0.75 100 0.9550985 ## 0.3 2 0.6 0.75 150 0.9551070 ## 0.3 2 0.6 1.00 50 0.9532320 ## 0.3 2 0.6 1.00 100 0.9551072 ## 0.3 2 0.6 1.00 150 0.9557237 ## 0.3 2 0.8 0.50 50 0.9720583 ## 0.3 2 0.8 0.50 100 0.9735166 ## 0.3 2 0.8 0.50 150 0.9720540 ## 0.3 2 0.8 0.75 50 0.9722494 ## 0.3 2 0.8 0.75 100 0.9726703 ## 0.3 2 0.8 0.75 150 0.9716374 ## 0.3 2 0.8 1.00 50 0.9716327 ## 0.3 2 0.8 1.00 100 0.9724622 ## 0.3 2 0.8 1.00 150 0.9718416 ## 0.3 3 0.6 0.50 50 0.9548905 ## 0.3 3 0.6 0.50 100 0.9557237 ## 0.3 3 0.6 0.50 150 0.9555198 ## 0.3 3 0.6 0.75 50 0.9561404 ## 0.3 3 0.6 0.75 100 0.9546820 ## 0.3 3 0.6 0.75 150 0.9552982 ## 0.3 3 0.6 1.00 50 0.9577983 ## 0.3 3 0.6 1.00 100 0.9573819 ## 0.3 3 0.6 1.00 150 0.9567655 ## 0.3 3 0.8 0.50 50 0.9733131 ## 0.3 3 0.8 0.50 100 0.9728829 ## 0.3 3 0.8 0.50 150 0.9718499 ## 0.3 3 0.8 0.75 50 0.9751879 ## 0.3 3 0.8 0.75 100 0.9743546 ## 0.3 3 0.8 0.75 150 0.9735212 ## 0.3 3 0.8 1.00 50 0.9743372 ## 0.3 3 0.8 1.00 100 0.9737122 ## 0.3 3 0.8 1.00 150 0.9743461 ## 0.4 1 0.6 0.50 50 0.9548861 ## 0.4 1 0.6 0.50 100 0.9528290 ## 0.4 1 0.6 0.50 150 0.9498772 ## 0.4 1 0.6 0.75 50 0.9557239 ## 0.4 1 0.6 0.75 100 0.9513529 ## 0.4 1 0.6 0.75 150 0.9492779 ## 0.4 1 0.6 1.00 50 0.9559365 ## 0.4 1 0.6 1.00 100 0.9551031 ## 0.4 1 0.6 1.00 150 0.9536361 ## 0.4 1 0.8 0.50 50 0.9710164 ## 0.4 1 0.8 0.50 100 0.9697577 ## 0.4 1 0.8 0.50 150 0.9687074 ## 0.4 1 0.8 0.75 50 0.9710122 ## 0.4 1 0.8 0.75 100 0.9707996 ## 0.4 1 0.8 0.75 150 0.9691455 ## 0.4 1 0.8 1.00 50 0.9705911 ## 0.4 1 0.8 1.00 100 0.9697446 ## 0.4 1 0.8 1.00 150 0.9697576 ## 0.4 2 0.6 0.50 50 0.9544866 ## 0.4 2 0.6 0.50 100 0.9542694 ## 0.4 2 0.6 0.50 150 0.9536357 ## 0.4 2 0.6 0.75 50 0.9540611 ## 0.4 2 0.6 0.75 100 0.9542694 ## 0.4 2 0.6 0.75 150 0.9549033 ## 0.4 2 0.6 1.00 50 0.9540653 ## 0.4 2 0.6 1.00 100 0.9555239 ## 0.4 2 0.6 1.00 150 0.9546818 ## 0.4 2 0.8 0.50 50 0.9720670 ## 0.4 2 0.8 0.50 100 0.9695629 ## 0.4 2 0.8 0.50 150 0.9702006 ## 0.4 2 0.8 0.75 50 0.9722627 ## 0.4 2 0.8 0.75 100 0.9720500 ## 0.4 2 0.8 0.75 150 0.9716289 ## 0.4 2 0.8 1.00 50 0.9726705 ## 0.4 2 0.8 1.00 100 0.9708042 ## 0.4 2 0.8 1.00 150 0.9708129 ## 0.4 3 0.6 0.50 50 0.9555150 ## 0.4 3 0.6 0.50 100 0.9553021 ## 0.4 
3 0.6 0.50 150 0.9548943 ## 0.4 3 0.6 0.75 50 0.9555281 ## 0.4 3 0.6 0.75 100 0.9563662 ## 0.4 3 0.6 0.75 150 0.9555324 ## 0.4 3 0.6 1.00 50 0.9575900 ## 0.4 3 0.6 1.00 100 0.9571735 ## 0.4 3 0.6 1.00 150 0.9559104 ## 0.4 3 0.8 0.50 50 0.9737255 ## 0.4 3 0.8 0.50 100 0.9745501 ## 0.4 3 0.8 0.50 150 0.9730874 ## 0.4 3 0.8 0.75 50 0.9747539 ## 0.4 3 0.8 0.75 100 0.9724664 ## 0.4 3 0.8 0.75 150 0.9720498 ## 0.4 3 0.8 1.00 50 0.9747539 ## 0.4 3 0.8 1.00 100 0.9749624 ## 0.4 3 0.8 1.00 150 0.9734996 ## Kappa ## 0.9050828 ## 0.8999999 ## 0.8930637 ## 0.9067208 ## 0.8982284 ## 0.8959903 ## 0.9028825 ## 0.9022543 ## 0.9014018 ## 0.9382467 ## 0.9386326 ## 0.9340573 ## 0.9400323 ## 0.9395968 ## 0.9353783 ## 0.9372262 ## 0.9362148 ## 0.9353247 ## 0.9032270 ## 0.9047203 ## 0.9024465 ## 0.8968511 ## 0.9015282 ## 0.9016169 ## 0.8971329 ## 0.9015111 ## 0.9028614 ## 0.9387022 ## 0.9419143 ## 0.9387792 ## 0.9391933 ## 0.9401872 ## 0.9379714 ## 0.9377309 ## 0.9397601 ## 0.9384827 ## 0.9008861 ## 0.9029797 ## 0.9024531 ## 0.9037859 ## 0.9004226 ## 0.9019909 ## 0.9074584 ## 0.9064701 ## 0.9051441 ## 0.9414031 ## 0.9405025 ## 0.9380734 ## 0.9456856 ## 0.9438986 ## 0.9419994 ## 0.9438642 ## 0.9426000 ## 0.9439780 ## 0.9007223 ## 0.8964381 ## 0.8897615 ## 0.9027951 ## 0.8931520 ## 0.8886910 ## 0.9030461 ## 0.9014362 ## 0.8982364 ## 0.9363059 ## 0.9334254 ## 0.9311383 ## 0.9361883 ## 0.9357131 ## 0.9320657 ## 0.9353688 ## 0.9333607 ## 0.9334467 ## 0.8999756 ## 0.8997888 ## 0.8983861 ## 0.8991356 ## 0.8998960 ## 0.9013529 ## 0.8990428 ## 0.9023340 ## 0.9004889 ## 0.9387165 ## 0.9332663 ## 0.9345567 ## 0.9393855 ## 0.9389455 ## 0.9380863 ## 0.9401366 ## 0.9361847 ## 0.9361724 ## 0.9021263 ## 0.9017938 ## 0.9010613 ## 0.9025263 ## 0.9043436 ## 0.9024744 ## 0.9069828 ## 0.9059579 ## 0.9031829 ## 0.9424523 ## 0.9442537 ## 0.9410193 ## 0.9447486 ## 0.9397683 ## 0.9388701 ## 0.9449064 ## 0.9454375 ## 0.9422358 ## ## Tuning parameter 'gamma' was held constant at a value of 0 ## ## Tuning parameter 'min_child_weight' was held constant at a value of 1 ## Accuracy was used to select the optimal model using the largest value. ## The final values used for the model were nrounds = 50, max_depth = 3, ## eta = 0.3, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 ## and subsample = 0.75.

  • Feature Importance
importance <- varImp(model_xgb, scale = TRUE) plot(importance)

  • predicting test data
confusionMatrix(predict(model_xgb, test_data), as.factor(test_data$classes)) ## Confusion Matrix and Statistics ## ## Reference ## Prediction benign malignant ## benign 128 3 ## malignant 5 68 ## ## Accuracy : 0.9608 ## 95% CI : (0.9242, 0.9829) ## No Information Rate : 0.652 ## P-Value [Acc > NIR] : <2e-16 ## ## Kappa : 0.9142 ## Mcnemar's Test P-Value : 0.7237 ## ## Sensitivity : 0.9624 ## Specificity : 0.9577 ## Pos Pred Value : 0.9771 ## Neg Pred Value : 0.9315 ## Prevalence : 0.6520 ## Detection Rate : 0.6275 ## Detection Prevalence : 0.6422 ## Balanced Accuracy : 0.9601 ## ## 'Positive' Class : benign ## results <- data.frame(actual = test_data$classes, predict(model_xgb, test_data, type = "prob")) results$prediction <- ifelse(results$benign > 0.5, "benign", ifelse(results$malignant > 0.5, "malignant", NA)) results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE) ggplot(results, aes(x = prediction, fill = correct)) + geom_bar(position = "dodge")

ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) + geom_jitter(size = 3, alpha = 0.6)

Available models in caret

https://topepo.github.io/caret/available-models.html

Feature Selection

Performing feature selection on the whole dataset would lead to prediction bias; we therefore need to run the whole modeling process on the training data alone!

  • Correlation

Correlations between all features are calculated and visualised with the corrplot package. I am then removing all features with a correlation higher than 0.7, keeping the feature with the lower mean.

library(corrplot)

# calculate correlation matrix
corMatMy <- cor(train_data[, -1])
corrplot(corMatMy, order = "hclust")

# Apply correlation filter at 0.70
highlyCor <- colnames(train_data[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
## Compare row 2 and column 3 with corr 0.908
## Means: 0.709 vs 0.594 so flagging column 2
## Compare row 3 and column 7 with corr 0.749
## Means: 0.67 vs 0.569 so flagging column 3
## All correlations <= 0.7

# which variables are flagged for removal?
highlyCor
## [1] "uniformity_of_cell_size"  "uniformity_of_cell_shape"

# then we remove these variables
train_data_cor <- train_data[, which(!colnames(train_data) %in% highlyCor)]

  • Recursive Feature Elimination (RFE)

Another way to choose features is with Recursive Feature Elimination. RFE uses a Random Forest algorithm to test combinations of features and rates each combination with an accuracy score. The combination with the highest score is usually preferred.

set.seed(7)
results_rfe <- rfe(x = train_data[, -1],
                   y = as.factor(train_data$classes),
                   sizes = c(1:9),
                   rfeControl = rfeControl(functions = rfFuncs, method = "cv", number = 10))

# chosen features
predictors(results_rfe)
## [1] "bare_nuclei"                  "clump_thickness"
## [3] "uniformity_of_cell_size"      "uniformity_of_cell_shape"
## [5] "bland_chromatin"              "normal_nucleoli"
## [7] "marginal_adhesion"            "single_epithelial_cell_size"

train_data_rfe <- train_data[, c(1, which(colnames(train_data) %in% predictors(results_rfe)))]

  • Genetic Algorithm (GA)

The Genetic Algorithm (GA) has been developed based on evolutionary principles of natural selection: It aims to optimize a population of individuals with a given set of genotypes by modeling selection over time. In each generation (i.e. iteration), each individual’s fitness is calculated based on their genotypes. Then, the fittest individuals are chosen to produce the next generation. This subsequent generation of individuals will have genotypes resulting from (re-) combinations of the parental alleles. These new genotypes will again determine each individual’s fitness. This selection process is iterated for a specified number of generations and (ideally) leads to fixation of the fittest alleles in the gene pool.

This concept of optimization can be applied to non-evolutionary models as well, like feature selection processes in machine learning.

set.seed(27)
model_ga <- gafs(x = train_data[, -1],
                 y = as.factor(train_data$classes),
                 iters = 10,      # generations of algorithm
                 popSize = 10,    # population size for each generation
                 levels = c("malignant", "benign"),
                 gafsControl = gafsControl(functions = rfGA,       # Assess fitness with RF
                                           method = "cv",          # 10 fold cross validation
                                           genParallel = TRUE,     # Use parallel programming
                                           allowParallel = TRUE))

plot(model_ga)   # Plot mean fitness (AUC) by generation

train_data_ga <- train_data[, c(1, which(colnames(train_data) %in% model_ga$ga$final))]

Hyperparameter tuning with caret
  • Cartesian Grid

  • mtry: Number of variables randomly sampled as candidates at each split.

set.seed(42) grid <- expand.grid(mtry = c(1:10)) model_rf_tune_man <- caret::train(classes ~ ., data = train_data, method = "rf", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE, verboseIter = FALSE), tuneGrid = grid) model_rf_tune_man ## Random Forest ## ## 479 samples ## 9 predictor ## 2 classes: 'benign', 'malignant' ## ## Pre-processing: scaled (9), centered (9) ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 432, 431, 431, 431, 431, 431, ... ## Resampling results across tuning parameters: ## ## mtry Accuracy Kappa ## 1 0.9785044 0.9532161 ## 2 0.9772586 0.9504377 ## 3 0.9774625 0.9508246 ## 4 0.9766333 0.9488778 ## 5 0.9753789 0.9460274 ## 6 0.9737078 0.9422613 ## 7 0.9730957 0.9408547 ## 8 0.9714155 0.9371611 ## 9 0.9718280 0.9380578 ## 10 0.9718280 0.9380135 ## ## Accuracy was used to select the optimal model using the largest value. ## The final value used for the model was mtry = 1. plot(model_rf_tune_man)

  • Random Search
set.seed(42) model_rf_tune_auto <- caret::train(classes ~ ., data = train_data, method = "rf", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE, verboseIter = FALSE, search = "random"), tuneGrid = grid, tuneLength = 15) model_rf_tune_auto ## Random Forest ## ## 479 samples ## 9 predictor ## 2 classes: 'benign', 'malignant' ## ## Pre-processing: scaled (9), centered (9) ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 432, 431, 431, 431, 431, 431, ... ## Resampling results across tuning parameters: ## ## mtry Accuracy Kappa ## 1 0.9785044 0.9532161 ## 2 0.9772586 0.9504377 ## 3 0.9774625 0.9508246 ## 4 0.9766333 0.9488778 ## 5 0.9753789 0.9460274 ## 6 0.9737078 0.9422613 ## 7 0.9730957 0.9408547 ## 8 0.9714155 0.9371611 ## 9 0.9718280 0.9380578 ## 10 0.9718280 0.9380135 ## ## Accuracy was used to select the optimal model using the largest value. ## The final value used for the model was mtry = 1. plot(model_rf_tune_auto)

Grid search with h2o

The R package h2o provides a convenient interface to H2O, which is an open-source machine learning and deep learning platform. H2O distributes a wide range of common machine learning algorithms for classification, regression and deep learning.

library(h2o) h2o.init(nthreads = -1) ## Connection successful! ## ## R is connected to the H2O cluster: ## H2O cluster uptime: 26 minutes 45 seconds ## H2O cluster timezone: Europe/Berlin ## H2O data parsing timezone: UTC ## H2O cluster version: 3.20.0.2 ## H2O cluster version age: 13 days ## H2O cluster name: H2O_started_from_R_shiringlander_jrj894 ## H2O cluster total nodes: 1 ## H2O cluster total memory: 3.24 GB ## H2O cluster total cores: 8 ## H2O cluster allowed cores: 8 ## H2O cluster healthy: TRUE ## H2O Connection ip: localhost ## H2O Connection port: 54321 ## H2O Connection proxy: NA ## H2O Internal Security: FALSE ## H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4 ## R Version: R version 3.5.0 (2018-04-23) h2o.no_progress() bc_data_hf <- as.h2o(bc_data) h2o.describe(bc_data_hf) %>% gather(x, y, Zeros:Sigma) %>% mutate(group = ifelse(x %in% c("Min", "Max", "Mean"), "min, mean, max", ifelse(x %in% c("NegInf", "PosInf"), "Inf", "sigma, zeros"))) %>% ggplot(aes(x = Label, y = as.numeric(y), color = x)) + geom_point(size = 4, alpha = 0.6) + scale_color_brewer(palette = "Set1") + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + facet_grid(group ~ ., scales = "free") + labs(x = "Feature", y = "Value", color = "")

library(reshape2) # for melting bc_data_hf[, 1] <- h2o.asfactor(bc_data_hf[, 1]) cor <- h2o.cor(bc_data_hf) rownames(cor) <- colnames(cor) melt(cor) %>% mutate(Var2 = rep(rownames(cor), nrow(cor))) %>% mutate(Var2 = factor(Var2, levels = colnames(cor))) %>% mutate(variable = factor(variable, levels = colnames(cor))) %>% ggplot(aes(x = variable, y = Var2, fill = value)) + geom_tile(width = 0.9, height = 0.9) + scale_fill_gradient2(low = "white", high = "red", name = "Cor.") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) + labs(x = "", y = "")

Training, validation and test data splits <- h2o.splitFrame(bc_data_hf, ratios = c(0.7, 0.15), seed = 1) train <- splits[[1]] valid <- splits[[2]] test <- splits[[3]] response <- "classes" features <- setdiff(colnames(train), response) summary(as.factor(train$classes), exact_quantiles = TRUE) ## classes ## benign :313 ## malignant:167 summary(as.factor(valid$classes), exact_quantiles = TRUE) ## classes ## benign :64 ## malignant:38 summary(as.factor(test$classes), exact_quantiles = TRUE) ## classes ## benign :67 ## malignant:34 pca <- h2o.prcomp(training_frame = train, x = features, validation_frame = valid, transform = "NORMALIZE", impute_missing = TRUE, k = 3, seed = 42) eigenvec <- as.data.frame(pca@model$eigenvectors) eigenvec$label <- features library(ggrepel) ggplot(eigenvec, aes(x = pc1, y = pc2, label = label)) + geom_point(color = "navy", alpha = 0.7) + geom_text_repel()

Classification Random Forest hyper_params <- list( ntrees = c(25, 50, 75, 100), max_depth = c(10, 20, 30), min_rows = c(1, 3, 5) ) search_criteria <- list( strategy = "RandomDiscrete", max_models = 50, max_runtime_secs = 360, stopping_rounds = 5, stopping_metric = "AUC", stopping_tolerance = 0.0005, seed = 42 ) rf_grid <- h2o.grid(algorithm = "randomForest", # h2o.randomForest, # alternatively h2o.gbm # for Gradient boosting trees x = features, y = response, grid_id = "rf_grid", training_frame = train, validation_frame = valid, nfolds = 25, fold_assignment = "Stratified", hyper_params = hyper_params, search_criteria = search_criteria, seed = 42 ) # performance metrics where smaller is better -> order with decreasing = FALSE sort_options_1 <- c("mean_per_class_error", "mse", "err", "logloss") for (sort_by_1 in sort_options_1) { grid <- h2o.getGrid("rf_grid", sort_by = sort_by_1, decreasing = FALSE) model_ids <- grid@model_ids best_model <- h2o.getModel(model_ids[[1]]) h2o.saveModel(best_model, path="models", force = TRUE) } # performance metrics where bigger is better -> order with decreasing = TRUE sort_options_2 <- c("auc", "precision", "accuracy", "recall", "specificity") for (sort_by_2 in sort_options_2) { grid <- h2o.getGrid("rf_grid", sort_by = sort_by_2, decreasing = TRUE) model_ids <- grid@model_ids best_model <- h2o.getModel(model_ids[[1]]) h2o.saveModel(best_model, path = "models", force = TRUE) } files <- list.files(path = "/Users/shiringlander/Documents/Github/intro_to_ml_workshop/intro_to_ml_uni_heidelberg/models") rf_models <- files[grep("rf_grid_model", files)] for (model_id in rf_models) { path <- paste0("/Users/shiringlander/Documents/Github/intro_to_ml_workshop/intro_to_ml_uni_heidelberg", "/models/", model_id) best_model <- h2o.loadModel(path) mse_auc_test <- data.frame(model_id = model_id, mse = h2o.mse(h2o.performance(best_model, test)), auc = h2o.auc(h2o.performance(best_model, test))) if (model_id == rf_models[[1]]) { mse_auc_test_comb <- mse_auc_test } else { mse_auc_test_comb <- rbind(mse_auc_test_comb, mse_auc_test) } } mse_auc_test_comb %>% gather(x, y, mse:auc) %>% ggplot(aes(x = model_id, y = y, fill = model_id)) + facet_grid(x ~ ., scales = "free") + geom_bar(stat = "identity", alpha = 0.8, position = "dodge") + scale_fill_brewer(palette = "Set1") + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1), plot.margin = unit(c(0.5, 0, 0, 1.5), "cm")) + labs(x = "", y = "value", fill = "")

for (model_id in rf_models) { best_model <- h2o.getModel(model_id) finalRf_predictions <- data.frame(model_id = rep(best_model@model_id, nrow(test)), actual = as.vector(test$classes), as.data.frame(h2o.predict(object = best_model, newdata = test))) finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict, "yes", "no") finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, "benign", ifelse(finalRf_predictions$malignant > 0.8, "malignant", "uncertain")) finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes", ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no")) if (model_id == rf_models[[1]]) { finalRf_predictions_comb <- finalRf_predictions } else { finalRf_predictions_comb <- rbind(finalRf_predictions_comb, finalRf_predictions) } } finalRf_predictions_comb %>% ggplot(aes(x = actual, fill = accurate)) + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Set1") + facet_wrap(~ model_id, ncol = 2) + labs(fill = "Were\npredictions\naccurate?", title = "Default predictions")

finalRf_predictions_comb %>% subset(accurate_stringent != "na") %>% ggplot(aes(x = actual, fill = accurate_stringent)) + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Set1") + facet_wrap(~ model_id, ncol = 2) + labs(fill = "Were\npredictions\naccurate?", title = "Stringent predictions")

rf_model <- h2o.loadModel("/Users/shiringlander/Documents/Github/intro_to_ml_workshop/intro_to_ml_uni_heidelberg/models/rf_grid_model_0") h2o.varimp_plot(rf_model)

#h2o.varimp(rf_model) h2o.mean_per_class_error(rf_model, train = TRUE, valid = TRUE, xval = TRUE) ## train valid xval ## 0.02196246 0.02343750 0.02515735 h2o.confusionMatrix(rf_model, valid = TRUE) ## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.533333333333333: ## benign malignant Error Rate ## benign 61 3 0.046875 =3/64 ## malignant 0 38 0.000000 =0/38 ## Totals 61 41 0.029412 =3/102 plot(rf_model, timestep = "number_of_trees", metric = "classification_error")

plot(rf_model, timestep = "number_of_trees", metric = "logloss")

plot(rf_model, timestep = "number_of_trees", metric = "AUC")

plot(rf_model, timestep = "number_of_trees", metric = "rmse")

h2o.auc(rf_model, train = TRUE) ## [1] 0.9907214 h2o.auc(rf_model, valid = TRUE) ## [1] 0.9829359 h2o.auc(rf_model, xval = TRUE) ## [1] 0.9903005 perf <- h2o.performance(rf_model, test) perf ## H2OBinomialMetrics: drf ## ## MSE: 0.03258482 ## RMSE: 0.1805127 ## LogLoss: 0.1072519 ## Mean Per-Class Error: 0.02985075 ## AUC: 0.9916594 ## Gini: 0.9833187 ## ## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold: ## benign malignant Error Rate ## benign 63 4 0.059701 =4/67 ## malignant 0 34 0.000000 =0/34 ## Totals 63 38 0.039604 =4/101 ## ## Maximum Metrics: Maximum metrics at their respective thresholds ## metric threshold value idx ## 1 max f1 0.306667 0.944444 18 ## 2 max f2 0.306667 0.977011 18 ## 3 max f0point5 0.720000 0.933735 13 ## 4 max accuracy 0.533333 0.960396 16 ## 5 max precision 1.000000 1.000000 0 ## 6 max recall 0.306667 1.000000 18 ## 7 max specificity 1.000000 1.000000 0 ## 8 max absolute_mcc 0.306667 0.917235 18 ## 9 max min_per_class_accuracy 0.533333 0.955224 16 ## 10 max mean_per_class_accuracy 0.306667 0.970149 18 ## ## Gains/Lift Table: Extract with `h2o.gainsLift(, )` or `h2o.gainsLift(, valid=, xval=)` plot(perf)

perf@metrics$thresholds_and_metric_scores %>% ggplot(aes(x = fpr, y = tpr)) + geom_point() + geom_line() + geom_abline(slope = 1, intercept = 0) + labs(x = "False Positive Rate", y = "True Positive Rate")

h2o.logloss(perf) ## [1] 0.1072519 h2o.mse(perf) ## [1] 0.03258482 h2o.auc(perf) ## [1] 0.9916594 head(h2o.metric(perf)) ## Metrics for Thresholds: Binomial metrics as a function of classification thresholds ## threshold f1 f2 f0point5 accuracy precision recall ## 1 1.000000 0.583333 0.466667 0.777778 0.801980 1.000000 0.411765 ## 2 0.986667 0.666667 0.555556 0.833333 0.831683 1.000000 0.500000 ## 3 0.973333 0.716981 0.612903 0.863636 0.851485 1.000000 0.558824 ## 4 0.960000 0.740741 0.641026 0.877193 0.861386 1.000000 0.588235 ## 5 0.946667 0.763636 0.668790 0.889831 0.871287 1.000000 0.617647 ## 6 0.920000 0.807018 0.723270 0.912698 0.891089 1.000000 0.676471 ## specificity absolute_mcc min_per_class_accuracy mean_per_class_accuracy ## 1 1.000000 0.563122 0.411765 0.705882 ## 2 1.000000 0.631514 0.500000 0.750000 ## 3 1.000000 0.675722 0.558824 0.779412 ## 4 1.000000 0.697542 0.588235 0.794118 ## 5 1.000000 0.719221 0.617647 0.808824 ## 6 1.000000 0.762280 0.676471 0.838235 ## tns fns fps tps tnr fnr fpr tpr idx ## 1 67 20 0 14 1.000000 0.588235 0.000000 0.411765 0 ## 2 67 17 0 17 1.000000 0.500000 0.000000 0.500000 1 ## 3 67 15 0 19 1.000000 0.441176 0.000000 0.558824 2 ## 4 67 14 0 20 1.000000 0.411765 0.000000 0.588235 3 ## 5 67 13 0 21 1.000000 0.382353 0.000000 0.617647 4 ## 6 67 11 0 23 1.000000 0.323529 0.000000 0.676471 5 finalRf_predictions <- data.frame(actual = as.vector(test$classes), as.data.frame(h2o.predict(object = rf_model, newdata = test))) finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict, "yes", "no") finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, "benign", ifelse(finalRf_predictions$malignant > 0.8, "malignant", "uncertain")) finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes", ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no")) finalRf_predictions %>% group_by(actual, predict) %>% dplyr::summarise(n = n()) ## # A tibble: 4 x 3 ## # Groups: actual [?] ## actual predict n ## ## 1 benign benign 64 ## 2 benign malignant 3 ## 3 malignant benign 1 ## 4 malignant malignant 33 finalRf_predictions %>% group_by(actual, predict_stringent) %>% dplyr::summarise(n = n()) ## # A tibble: 5 x 3 ## # Groups: actual [?] ## actual predict_stringent n ## ## 1 benign benign 62 ## 2 benign malignant 2 ## 3 benign uncertain 3 ## 4 malignant malignant 29 ## 5 malignant uncertain 5 finalRf_predictions %>% ggplot(aes(x = actual, fill = accurate)) + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Set1") + labs(fill = "Were\npredictions\naccurate?", title = "Default predictions")

finalRf_predictions %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Stringent predictions")

df <- finalRf_predictions[, c(1, 3, 4)]

thresholds <- seq(from = 0, to = 1, by = 0.1)

prop_table <- data.frame(threshold = thresholds,
                         prop_true_b = NA,
                         prop_true_m = NA)

for (threshold in thresholds) {
  pred <- ifelse(df$benign > threshold, "benign", "malignant")
  pred_t <- ifelse(pred == df$actual, TRUE, FALSE)

  group <- data.frame(df, "pred" = pred_t) %>%
    group_by(actual, pred) %>%
    dplyr::summarise(n = n())

  group_b <- filter(group, actual == "benign")
  prop_b <- sum(filter(group_b, pred == TRUE)$n) / sum(group_b$n)
  prop_table[prop_table$threshold == threshold, "prop_true_b"] <- prop_b

  group_m <- filter(group, actual == "malignant")
  prop_m <- sum(filter(group_m, pred == TRUE)$n) / sum(group_m$n)
  prop_table[prop_table$threshold == threshold, "prop_true_m"] <- prop_m
}

prop_table %>%
  gather(x, y, prop_true_b:prop_true_m) %>%
  ggplot(aes(x = threshold, y = y, color = x)) +
    geom_point() +
    geom_line() +
    scale_color_brewer(palette = "Set1") +
    labs(y = "proportion of true predictions",
         color = "b: benign cases\nm: malignant cases")

If you are interested in more machine learning posts, check out the category listings for machine_learning on my blogs: https://shirinsplayground.netlify.com/categories/#posts-list-machine-learning and https://shiring.github.io/categories.html#machine_learning-ref

stopCluster(cl)

h2o.shutdown()
## Are you sure you want to shutdown the H2O instance running at http://localhost:54321/ (Y/N)?

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
##
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
##  [1] ggrepel_0.8.0     reshape2_1.4.3    h2o_3.20.0.2
##  [4] corrplot_0.84     caret_6.0-80      doParallel_1.0.11
##  [7] iterators_1.0.9   foreach_1.4.4     ellipse_0.4.1
## [10] igraph_1.2.1      bindrcpp_0.2.2    mice_3.1.0
## [13] lattice_0.20-35   forcats_0.3.0     stringr_1.3.1
## [16] dplyr_0.7.5       purrr_0.2.5       readr_1.1.1
## [19] tidyr_0.8.1       tibble_1.4.2      ggplot2_2.2.1
## [22] tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
##  [1] minqa_1.2.4        colorspace_1.3-2   class_7.3-14
##  [4] rprojroot_1.3-2    pls_2.6-0          rstudioapi_0.7
##  [7] DRR_0.0.3          prodlim_2018.04.18 lubridate_1.7.4
## [10] xml2_1.2.0         codetools_0.2-15   splines_3.5.0
## [13] mnormt_1.5-5       robustbase_0.93-1  knitr_1.20
## [16] RcppRoll_0.3.0     jsonlite_1.5       nloptr_1.0.4
## [19] broom_0.4.4        ddalpha_1.3.4      kernlab_0.9-26
## [22] sfsmisc_1.1-2      compiler_3.5.0     httr_1.3.1
## [25] backports_1.1.2    assertthat_0.2.0   Matrix_1.2-14
## [28] lazyeval_0.2.1     cli_1.0.0          htmltools_0.3.6
## [31] tools_3.5.0        gtable_0.2.0       glue_1.2.0
## [34] Rcpp_0.12.17       cellranger_1.1.0   nlme_3.1-137
## [37] blogdown_0.6       psych_1.8.4        timeDate_3043.102
## [40] xfun_0.2           gower_0.1.2        lme4_1.1-17
## [43] rvest_0.3.2        pan_1.4            DEoptimR_1.0-8
## [46] MASS_7.3-50        scales_0.5.0       ipred_0.9-6
## [49] hms_0.4.2          RColorBrewer_1.1-2 yaml_2.1.19
## [52] rpart_4.1-13       stringi_1.2.3      randomForest_4.6-14
## [55] e1071_1.6-8        lava_1.6.1         geometry_0.3-6
## [58] bitops_1.0-6       rlang_0.2.1        pkgconfig_2.0.1
## [61] evaluate_0.10.1    bindr_0.1.1        recipes_0.1.3
## [64] labeling_0.3       CVST_0.2-2         tidyselect_0.2.4
## [67] plyr_1.8.4         magrittr_1.5       bookdown_0.7
## [70] R6_2.2.2           mitml_0.3-5        dimRed_0.1.0
## [73] pillar_1.2.3       haven_1.1.1        foreign_0.8-70
## [76] withr_2.1.2        RCurl_1.95-4.10    survival_2.42-3
## [79] abind_1.4-5        nnet_7.3-12        modelr_0.1.2
## [82] crayon_1.3.4       jomo_2.6-2         xgboost_0.71.2
## [85] utf8_1.1.4         rmarkdown_1.10     grid_3.5.0
## [88] readxl_1.1.0       data.table_1.11.4  ModelMetrics_1.1.0
## [91] digest_0.6.15      stats4_3.5.0       munsell_0.5.0
## [94] magic_1.5-8

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

Beeswarms instead of histograms

Thu, 06/28/2018 - 22:19

(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

Histograms are good, and so are density plots; violin and bean plots too. Recently someone asked me for a plot where you can see each individual point along a continuum, with the points given specific colours based on a second variable (similar to the figure), which deviates somewhat from the typical density-type plots. Apparently, they're called bee plots or beeswarms. And there's a way to make them in R (of course, there's probably more than one… ggplot2??).

Here's one way (slightly modified from the package's help files)…

library(beeswarm)  # assuming you've installed it ;)

data(breast)

beeswarm(time_survival ~ ER, data = breast,
         pch = 16, pwcol = 1 + as.numeric(event_survival),
         xlab = "", ylab = "Follow-up time (months)",
         horizontal = TRUE,
         labels = c("ER neg", "ER pos"),
         method = "c")

legend("topright", legend = c("Yes", "No"),
       title = "Censored", pch = 16, col = 1:2)

Horizontal is optional of course…

Feel free to comment if you know of other approaches.
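For the ggplot2-curious, one route (not the one used for the figure above) is the ggbeeswarm package, which provides geom_beeswarm(). A minimal sketch, assuming ggbeeswarm is installed:

library(ggplot2)
library(beeswarm)    # only loaded here for the breast example data
library(ggbeeswarm)  # assumed installed; provides geom_beeswarm()

data(breast)

ggplot(breast, aes(x = factor(ER), y = time_survival,
                   colour = factor(event_survival))) +
  geom_beeswarm() +
  coord_flip() +  # horizontal, as above
  labs(x = "", y = "Follow-up time (months)", colour = "event_survival")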

See here for more examples.

Hope that helps someone.


To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD.

Choroplethr v3.6.2 is now on CRAN

Thu, 06/28/2018 - 00:33

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

A new version of Choroplethr is now on CRAN! You can get it by typing the following from the R Console:

install.packages("choroplethr")

I encourage all Choroplethr users to install this new version!

Overview

A new version of ggplot2 will be released shortly. It introduces many new features, but those features come at a cost: some packages that use ggplot2 (such as Choroplethr) are not compatible with the new version.

In fact, if you try to draw a map with the new version of ggplot2 and the old version of Choroplethr, you might get an error. This new version of Choroplethr (v3.6.2) fixes those issues.

Technical Details

Choroplethr was one of the first (if not the first) R packages to create maps of the US where Alaska and Hawaii are rendered as insets. Here's an example map, along with code:

library(choroplethr)

data(df_pop_state)

state_choropleth(df_pop_state)

The new version of ggplot2 was causing Choroplethr to crash whenever it attempted to render the above map. In order to understand the underlying issue, it is helpful to first understand how Choroplethr creates the above map.

Insets and ?annotation_custom

The most important thing to notice about the above map is this: it is inaccurate.

Alaska and Hawaii are not really located where they appear. And they are not drawn to scale, either.

Here is how the 50 US states “really” look:

library(choroplethrMaps)
library(ggplot2)

data(state.map)

ggplot(state.map, aes(long, lat, group = group)) +
  geom_polygon()

In order to render Alaska and Hawaii as insets, Choroplethr:

  1. First pre-renders the Continental US, Alaska and Hawaii as separate ggplot objects.
  2. Then affixes the Alaska and Hawaii ggplot objects to the continental US object using ggplot2’s ggplotGrob and annotation_custom functions.
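To make the mechanics concrete, here is a minimal, self-contained sketch of the ggplotGrob / annotation_custom pattern (generic example data, not Choroplethr's actual code):

library(ggplot2)
library(grid)  # for grobTree()

# A main plot plus a second plot that will become the inset
main_plot  <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
inset_plot <- ggplot(mtcars, aes(factor(cyl))) + geom_bar() + theme_void()

# Convert the inset to a grob and pin it to a rectangle given in the
# main plot's data coordinates, as Choroplethr does for Alaska and Hawaii
main_plot +
  annotation_custom(grobTree(ggplotGrob(inset_plot)),
                    xmin = 4, xmax = 5.5, ymin = 25, ymax = 34)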
Insets and Themes

The two images differ in more than just the placement of Alaska and Hawaii, though. The first map also has a clear background, whereas the second map has a grey background and tick marks for longitude and latitude. Also, Choroplethr uses a single legend for all three maps (the legends for Alaska and Hawaii are suppressed).

Traditionally Choroplethr has implemented these aesthetic tweaks in the functions theme_clean (for the Continental US) and theme_inset (for Alaska and Hawaii). The code for these functions originally came from Section 13.19 (“Making a Map with a Clean Background”) of Winston Chang’s R Graphics Cookbook. Here’s what that code used to look like:

theme_clean = function() {
  theme(
    axis.title = element_blank(),
    axis.text = element_blank(),
    panel.background = element_blank(),
    panel.grid = element_blank(),
    axis.ticks.length = unit(0, "cm"),
    panel.spacing = unit(0, "lines"),
    plot.margin = unit(c(0, 0, 0, 0), "lines"),
    complete = TRUE
  )
}

theme_inset = function() {
  theme(
    legend.position = "none",
    axis.title = element_blank(),
    axis.text = element_blank(),
    panel.background = element_blank(),
    panel.grid = element_blank(),
    axis.ticks.length = unit(0, "cm"),
    panel.spacing = unit(0, "lines"),
    plot.margin = unit(c(0, 0, 0, 0), "lines"),
    complete = TRUE
  )
}

The Bug

The bug was happening when Choroplethr was attempting to affix the map of Alaska, with theme_inset applied to it, to the map of the Continental US:

# omit code for creating continental.ggplot
ret = continental.ggplot

# omit code for creating alaska.ggplot

# add alaska.ggplot to ret using ggplotGrob and annotation_custom
alaska.grob = ggplotGrob(alaska.ggplot)
ret = ret + annotation_custom(grobTree(alaska.grob),
                              xmin = -125, xmax = -110,
                              ymin = 22.5, ymax = 30)

The last line in the above snippet was causing Choroplethr to crash. It appears that the combination of theme_inset, ggplotGrob and annotation_custom was causing the problem.

The Fix

It appears that ggplot2 has a new function for creating a clear background on a map that does not cause these problems. That function is theme_void. Modifying theme_clean and theme_inset to use theme_void solved the problem. Here is the new version of those functions:

theme_clean = function() {
  ggplot2::theme_void()
}

theme_inset = function() {
  ggplot2::theme_void() %+replace%
    ggplot2::theme(legend.position = "none")
}

The post Choroplethr v3.6.2 is now on CRAN appeared first on AriLamstein.com.


To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.

The Financial Times and BBC use R for publication graphics

Wed, 06/27/2018 - 23:56

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

While graphics guru Edward Tufte recently claimed that "R coders and users just can't do words on graphics and typography" and need additional tools to make graphics that aren't "clunky", data journalists at major publications beg to differ. The BBC has been creating graphics "purely in R" for some time, with a typography style matching that of the BBC website. Senior BBC Data Journalist Christine Jeavans offers several examples, including this chart of life expectancy differences between men and women:

… and this chart on gender pay gaps at large British banks:

Meanwhile, the chart below was made for the Financial Times using just R and the ggplot2 package, "down to the custom FT font and the white bar in the top left", according to data journalist John Burn-Murdoch.

There are also entire collections devoted to recreating Tufte's own visualizations in R, presumably meeting his typography standards. Tufte later clarified, saying "Problem is not code, problem is published practice", but that is true of any programming environment, which makes it strange that he called out R in particular.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Finalfit now in CRAN

Wed, 06/27/2018 - 21:56

(This article was first published on R – DataSurg, and kindly contributed to R-bloggers)

Your favourite package for getting model outputs directly into publication-ready tables is now available on CRAN. They make you work for it! Thank you to all who helped. The development version will continue to be available from GitHub.
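For anyone who hasn't tried it yet, the basic call looks roughly like this (a minimal sketch, assuming the colon_s example data and its *.factor variables that ship with the package):

library(finalfit)
library(dplyr)

# Summarise explanatory variables against a dependent variable and get a
# publication-style table (plus the underlying model) in one step
explanatory <- c("age.factor", "sex.factor", "obstruct.factor")
dependent   <- "mort_5yr"

colon_s %>%
  finalfit(dependent, explanatory)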


To leave a comment for the author, please follow the link and comment on their blog: R – DataSurg.

New Course: Marketing Analytics in R

Wed, 06/27/2018 - 18:00

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Course Description

Here is a link to our new R course.

This is your chance to dive into the worlds of marketing and business analytics using R. Every day, companies face a multitude of decisions. With the help of statistical models, you'll be able to support business decision-making based on data rather than gut feeling. Let us show you what an impact statistical modeling can have on business performance. You'll also learn and apply strategies for communicating your results so that they make a difference.

Chapter 1: Modeling Customer Lifetime Value with Linear Regression

How can you decide which customers are most valuable for your business? Learn how to model the customer lifetime value using linear regression.

Chapter 2: Logistic Regression for Churn Prevention

Predicting if a customer will leave your business, or churn, is important for targeting valuable customers and retaining those who are at risk. Learn how to model customer churn using logistic regression.

Chapter 3: Modeling Time to Reorder with Survival Analysis

Learn how to model the time to an event using survival analysis. This could be the time until next order or until a person churns.

Chapter 4: Reducing Dimensionality with Principal Component Analysis

Learn how to reduce the number of variables in your data using principal component analysis. Not only does this help you get a better understanding of your data; PCA also enables you to condense information into single indices and to address multicollinearity problems in regression analyses with many intercorrelated variables.
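As a quick illustration of the idea (generic base R, not course material), principal components can be extracted with prcomp:

# Minimal PCA sketch on a built-in dataset
pca <- prcomp(mtcars, scale. = TRUE)  # scale variables before extracting components
summary(pca)                          # proportion of variance explained per component
head(pca$x[, 1:2])                    # first two component scores ("indices") per car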

If you are interested in learning more about marketing and data science, check out this Python tutorial: Data Science for Search Engine Marketing.

The following R course is a prerequisite for taking Marketing Analytics in R: Statistical Modeling.


To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming.

New Project: A Visual History of Nobel Prize Winners

Wed, 06/27/2018 - 18:00

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Project Description

The Nobel Prize is perhaps the world's most well-known scientific award. Every year it is given to scientists and scholars in chemistry, literature, physics, medicine, economics, and peace. The first Nobel Prize was handed out in 1901, and at that time the prize was Eurocentric and male-focused, but nowadays it's not biased in any way. Surely, right?

Well, let's find out! In this Project, you get to explore patterns and trends in over 100 years' worth of Nobel Prize winners. What characteristics do the prize winners have? Which country gets it most often? And has anybody gotten it twice? It's up to you to figure this out.

Before taking on this Project, we recommend that you have completed Introduction to the Tidyverse.

The dataset used in this Project was retrieved from The Nobel Foundation on Kaggle.


To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming.
