
Data validation with the assertr package

Tue, 04/11/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

This is cross-posted from Tony’s blog onthelambda.com

Version 2.0 of my data set validation package assertr hit CRAN just this weekend. It has some pretty great improvements over version 1. For those new to the package, what follows is a short and new introduction. For those who are already using assertr, the text below will point out the improvements.

I can go on and on (and have) about the treachery of messy/bad datasets. Though it's substantially less exciting than… pretty much everything else, I believe (proportional to the heartache and stress it causes) we don't spend enough time talking about it or building solutions around it. No matter how new and fancy your ML algorithm is, its success is predicated upon a properly sanitized dataset. If you are using bad data, your approach will fail—either flagrantly (best case), or unnoticeably (considerably more probable and considerably more pernicious).

assertr is an R package to help you identify common dataset errors. More specifically, it helps you easily spell out your assumptions about how the data should look and alerts you to any deviation from those assumptions.

I'll return to this point later in the post when we have more background, but I want to be up front about the goals of the package; assertr is not (and can never be) a "one-stop shop" for all of your data validation needs. The specific kinds of checks individuals or teams have to perform on any particular dataset are often far too idiosyncratic to ever be exhaustively addressed by a single package (although the assertive meta-package may come very close!) But all of these checks will reuse motifs and follow the same patterns. So, instead, I'm trying to sell assertr as a way of thinking about dataset validations—a set of common dataset validation actions. If we think of these actions as verbs, you could say that assertr attempts to impose a grammar of error checking for datasets.

In my experience, the overwhelming majority of data validation tasks fall into only five different patterns:

  • For every element in a column, you want to make sure it fits certain criteria. Examples of this strain of error checking would be to make sure every element is a valid credit card number, or fits a certain regex pattern, or represents a date between two other dates. assertr calls this verb assert.
  • For every element in a column, you want to make sure certain criteria are met, but the criteria can be decided only after looking at the entire column as a whole. For example, testing whether each element is within n standard deviations of the mean requires computing on the whole column first in order to know the criteria to check against. assertr calls this verb insist.
  • For every row of a dataset, you want to make sure certain assumptions hold. Examples include ensuring that no row has more than n number of missing values, or that a group of columns are jointly unique and never duplicated. assertr calls this verb assert_rows.
  • For every row of a dataset, you want to make sure certain assumptions hold, but the criteria can be decided only after looking at the entire dataset as a whole. This closely mirrors the distinction between assert and insist, but for entire rows (not individual elements). An example of using this would be checking to make sure that the Mahalanobis distance between each row and all other rows is within n standard deviations of the mean distance. assertr calls this verb insist_rows.
  • You want to check some property of the dataset as a whole object. Examples include making sure the dataset has more than n columns, making sure the dataset has some specified column names, etc… assertr calls this last verb verify.

Some of this might sound a little complicated, but I promise this is a worthwhile way to look at dataset validation. Now we can begin with an example of what can be achieved with these verbs. The following example is borrowed from the package vignette and README…

Pretend that, before finding the average miles per gallon for each number of engine cylinders in the mtcars dataset, we wanted to confirm the following dataset assumptions…

  • that it has the columns mpg, vs, and am
  • that the dataset contains more than 10 observations
  • that the column for 'miles per gallon' (mpg) is a positive number
  • that the column for ‘miles per gallon’ (mpg) does not contain a datum that is outside 4 standard deviations from its mean
  • that the am and vs columns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only
  • each row contains at most 2 NAs
  • each row is unique jointly between the mpg, am, and wt columns
  • each row's Mahalanobis distance is within 10 median absolute deviations of all the distances (for outlier detection)

library(assertr)
library(magrittr)
library(dplyr)

mtcars %>%
  verify(has_all_names("mpg", "vs", "am", "wt")) %>%
  verify(nrow(.) > 10) %>%
  verify(mpg > 0) %>%
  insist(within_n_sds(4), mpg) %>%
  assert(in_set(0, 1), am, vs) %>%
  assert_rows(num_row_NAs, within_bounds(0, 2), everything()) %>%
  assert_rows(col_concat, is_uniq, mpg, am, wt) %>%
  insist_rows(maha_dist, within_n_mads(10), everything()) %>%
  group_by(cyl) %>%
  summarise(avg.mpg = mean(mpg))

Before assertr version 2, the pipeline would immediately terminate at the first failure. Sometimes this is a good thing. However, sometimes we’d like to run a dataset through our entire suite of checks and record all failures. The latest version includes the chain_start and chain_end functions; all assumptions within a chain (below a call to chain_start and above chain_end) will run from beginning to end and accumulate errors along the way. At the end of the chain, a specific action can be taken but the default is to halt execution and display a comprehensive report of what failed including line numbers and the offending datum, where applicable.
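As a rough sketch (reusing checks from the example above), a chained version might look like this:

library(assertr)
library(magrittr)

mtcars %>%
  chain_start() %>%
  verify(nrow(.) > 10) %>%
  assert(in_set(0, 1), am, vs) %>%
  insist(within_n_sds(4), mpg) %>%
  chain_end()   # accumulated failures (if any) are reported here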

Another major improvement since the last version of assertr on CRAN is that assertr errors are now S3 classes (instead of dumb strings). Additionally, the behavior of each assertion statement on success (no error) and failure can now be flexibly customized. For example, you can now tell assertr to just return TRUE or FALSE instead of returning the data passed in or halting execution, respectively. Alternatively, you can instruct assertr to just give a warning instead of throwing a fatal error. For more information on this, see help("success_and_error_functions").
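As a rough sketch of that customization, assuming the success_logical, error_logical, and just_warn helpers described on that help page:

library(assertr)
library(magrittr)

# return a logical instead of the data (or an error)
mtcars %>%
  verify(nrow(.) > 10,
         success_fun = success_logical,
         error_fun   = error_logical)

# downgrade a failing assertion to a warning and keep going
mtcars %>%
  assert(in_set(0, 1), am, vs, error_fun = just_warn)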

Beyond these examples

Since the package was initially published on CRAN (almost exactly two years ago) many people have asked me how they should go about using assertr to test a particular assumption (and I'm very happy to help if you have one of your own, dear reader!) In every single one of these cases, I've been able to express it as an incantation using one of these 5 verbs. It also underscored, to me, that creating specialized functions for every need is a pipe dream. There are, however, two pieces of good news.

The first is that there is another package, assertive, that greatly enhances the assertr experience. The predicates (functions that start with is_) from this (meta)package can be used in assertr pipelines just as easily as assertr's own predicates. And assertive has an enormous number of them! Some specialized and exciting examples include is_hex_color, is_ip_address, and is_isbn_code!
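A hedged sketch of mixing the two packages — the swatches data frame is made up for illustration, and I'm assuming is_hex_color behaves as a vectorised logical predicate inside assert():

library(assertr)
library(assertive)
library(magrittr)

swatches <- data.frame(hex = c("#FF0000", "#00FF00", "not-a-colour"),
                       stringsAsFactors = FALSE)

swatches %>%
  assert(is_hex_color, hex)   # should flag the third row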

The second is that if assertive doesn't have what you're looking for, then with just a little bit of studying of the assertr grammar, you can whip up your own predicates with relative ease. Using some of these basic constructs and a little effort, I'm confident that the grammar is expressive enough to completely adapt to your needs.
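For instance, a predicate is just a function that takes a value (or vector of values) and returns TRUE or FALSE; a hypothetical sketch:

library(assertr)
library(magrittr)

# illustrative only: TRUE for strings shaped like a US ZIP code
is_zip_code <- function(x) grepl("^[0-9]{5}(-[0-9]{4})?$", x)

addresses <- data.frame(zip = c("84112", "84112-0090", "ABCDE"),
                        stringsAsFactors = FALSE)

addresses %>% assert(is_zip_code, zip)   # halts on the malformed value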

If this package interests you, I urge you to read the most recent package vignette here. If you're an assertr old-timer, I point you to this NEWS file that lists the changes from the previous version.


Sakura blossoms in Japan

Tue, 04/11/2017 - 07:00

(This article was first published on Opiate for the masses, and kindly contributed to R-bloggers)

Springtime and cherry blossoms

It is spring, and the flowers bloom. One tree in particular symbolises this: the Japanese cherry (sakura) tree. Every year, millions of Japanese enjoy hanami, jointly watching the cherry blossoms, philosophising about the transient nature of beauty, and the inevitability of death.

There are regular “sakura forecasts” on TV, and elsewhere. The Japanese National Tourist Office has its own map and details available on their website as well.

The data and graph section of The Economist recently published a beautiful chart showing 1200 years of data on the cherry blossom – as well as one of their customary great puns: “the elephant in the bloom”. I wondered if I could replicate the graph in R.

I remembered an emoji font I had stumbled across, and that there is a sakura (cherry blossom) emoji. I also felt a bit competitive – with a Japanese author on this blog's roll, and another author being a certified expert on emojis, I felt that I needed to show off that I, too, can poach in their areas of expertise!

Data

The Economist's article mentions a professor Yasuyuki Aono, who wrote several papers using data on Japanese cherry blossom dates. A quick googling brought up his homepage. He also linked to a page detailing the cherry blossom data. Fantastic! A pretty straightforward data grab, then.

if(!file.exists("sakura.xls")){
  # data is available on AONO's university homepage:
  # http://atmenv.envi.osakafu-u.ac.jp/aono/kyophenotemp4/
  aono.dta <- "http://atmenv.envi.osakafu-u.ac.jp/osakafu-content/uploads/sites/251/2015/10/KyotoFullFlower7.xls"
  # make sure to read as binary, because windows is stupid
  download.file(aono.dta, "sakura.xls", mode = "wb")
}

# read in data
dta <- readxl::read_excel("sakura.xls", skip = 25) %>%
  # make "good" column names
  setNames(make.names(names(.), unique = TRUE)) %>%
  mutate(
    # convert the blossom date to R date format
    blossom.date = as.Date(paste0(sprintf("%04d", AD), "0", Full.flowering.date),
                           format = "%Y%m%d")
  )

We download the data if not available on disk, do some cleaning of column names, and generate a proper date column of the full flowering date, since this is the data with the most observations.

It’s a fascinating dataset: 1200 years of human records on a yearly event! It’s not often that one can work with these kinds of data. For this, I would like to thank professor Aono. Generating such a historical dataset is a tremendous amount of work (he probably had to find and read through hundreds of historical records), and providing the data on his homepage is not something I see often in academia.

Graphing

The original emoji graphing system I had in mind, emoGG, worked, but I could not change the size of the individual emojis. It turns out, emoGG is deprecated, and I should look at emojifont instead.

emojifont uses geom_text() to draw the emojis onto a ggplot() graph. After a quick install of the font I was ready to go. Two things one needs to keep in mind when working with emojifont: you will need to load the font to start with, and if you’re using RStudio, you will need to initialise a new graphics environment. The standard RStudio environment is not able to preview/show the fonts. Both can be accomplished with these commands:

# need this to load the font
load.emojifont()

# only needed for RStudio
windows()

One other thing was problematic: emojifont populates the graph with a font, using ggplot2::aes(label = "labelname") to specify the emoji. The label needs to be pulled from the original dataset, not generated within the aes() call.

# plot as sakuras
dta <- dta %>%
  # include the sakura as emoji character in dataset
  mutate(
    sakura.emoji = emoji("cherry_blossom"),
    # join to a common year for axis label (2000 is a leap year)
    common.date = as.Date(paste0("2000-", format(blossom.date, "%m-%d")), "%Y-%m-%d")
  )

I created the common.date variable to scale the y-axis of the graph to a month-day format.

dta %>%
  # plot in ggplot
  ggplot(aes(x = AD, y = common.date)) +
  # alternatively, with geom_emoji:
  # geom_emoji("cherry_blossom", x=dta$AD, y=dta$Full.flowering.date..DOY., size = 3) +
  # geom_point(size = 0.5, colour = "red") +
  # include emoji as text, h-/vjust to center; strange it needs vjust 0.25 -- why? 975572 BD77A4
  geom_text(aes(label = sakura.emoji, hjust = 0.5, vjust = 0.25),
            family = "EmojiOne", size = 4, colour = "#FF506E") +
  # trend line
  geom_smooth(method = "loess", span = 0.1, fill = "#D2A5C2",
              colour = "grey", size = 0.5)

To copy the Economist's graph as closely as possible, I manually set the y-axis breaks. The x-axis breaks are more interesting, because the original graph has major ticks every 200 years, labelled, and minor ticks every 100 years in between, which are not labelled. This is unfortunately not currently possible in ggplot, so I had to resort to a workaround.

I create the sequence of years, with ticks every 100 years (so 800, 900, 1000, ...). I then “overwrite” every even element (900, 1100, 1300, ...) of the vector, resulting in a “blank” label for those years. The axis label would thus look like this: 800, "", 1000, "", 1200, ....

scale_y_date(
  name = "Date of cherry-blossom peak-bloom",
  breaks = c("2000-03-27", "2000-04-01", "2000-04-10", "2000-04-20",
             "2000-05-01", "2000-05-04") %>% as.Date(),  # Apr-01
  labels = scales::date_format("%b-%d")
) +
scale_x_continuous(
  limits = c(800, 2020),
  # axis ticks every 200 years
  breaks = seq(800, 2000, 100),
  # no minor axis ticks in ggplot2::theme(): http://stackoverflow.com/questions/14490071/adding-minor-tick-marks-to-the-x-axis-in-ggplot2-with-no-labels
  # length(breaks): 13; replace every even element with empty string, to remove from axis labels
  labels = replace(seq(800, 2000, 100), seq(2, 12, 2), ""),
  name = "Year"
)

All that remains then is some theme() hacking to make the graph pretty.

Behold, 1200 years of cherry blossom history!

I'm not super happy with the graph colours; they seem too "light" and pastel, and not really very readable. The Economist's version is in purple; much more readable and much prettier to look at. I did want to use the sakura pink tones, though, and thus the graph colour scheme seems as fragile and transient as the actual sakura blossoms.

Concluding remarks

The code is available on github, as always.

Aono uses the data to predict March temperatures, and in various papers he manages to do this to an astonishing degree of accuracy: 0.1°C.

It's very disheartening to see the blossom dates dramatically moving earlier into the year in the last 50 years. Global warming is to blame, and it always saddens me that people choose to ignore these facts… As a cold-loving northern German, I will need to find a retirement location in northern Norway, if things continue the way they are going!

Sakura blossoms in Japan was originally published by Kirill Pomogajko at Opiate for the masses on April 11, 2017.


R Plot Function — The Options

Tue, 04/11/2017 - 00:02

(This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to R-bloggers)

R's plot function is probably the most used visualization function in R. It's simple, easy and gets the job done. It's also highly customizable. Adding unnecessary styling and information to a visualization/plot is not really recommended because it can take away from what's being portrayed, but there are times when you just have to. Whether it's for pure aesthetics, to convey multiple things in one plot, or any other reason, here are the options you can use in R's base plot() function.

The Data Points

We're going to be using the cars dataset that is built into R. To follow along with real code, here's an interactive R Notebook. Feel free to copy it and play around with the code as you read along.

So if we were to simply plot the dataset using just the data as the only parameter, it’d look like this:

dataset <- cars  # the built-in cars data
plot(dataset)
Plot using no options.

Dot Style

The default data points are circles with an empty fill. To change the style of the dots (the data points), we use the option ‘pch’:

plot(dataset,pch=19)
Plot using the ‘pch’ option.

The 'pch' option accepts several codes; here is a grid with all the different data point styles (the default is 1):
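If you'd rather draw that grid yourself than rely on the linked image, a quick sketch:

# preview the standard pch symbol codes 0 through 25
plot(0:25, rep(1, 26), pch = 0:25, cex = 2,
     yaxt = "n", ylab = "", xlab = "pch code")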


https://greggilbertlab.sites.ucsc.edu/wp-content/uploads/sites/276/2015/10/symbols.jpg

Data Point Size

To change the size of the data point, we use the ‘cex’ option:

plot(dataset,pch=19,cex=2)
Plot with the data point style and size options used.

Data Point Color

The default color for the data points is black, to change that we use the ‘col’ option:

plot(dataset,pch=19,cex=2,col="pink")
Plot with the data point, size and color options used.

The ‘col’ option takes in both words and integers to identify the color. So you can use words like ‘green’, ‘wheat’, ‘red’ etc… and color codes. To get all the colors, just run colors() and it will return all the colors available. So if you know the location of your favorite color in the return value of the colors() function, you can just run plot(dataset,col=colors()[45]) and you’ll have a plot with your favorite color (or just save the color in a variable somewhere in the document if you want to keep re-using it). There are 657 colors in the colors() function.
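For instance, a small sketch of stashing a colour from colors() in a variable for reuse (index 45 is arbitrary):

# pick a colour by its index in colors() and reuse it
my_col <- colors()[45]
plot(dataset, pch = 19, cex = 2, col = my_col)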

Axis Labels

If you work in a team, or even if you work alone (but especially if you work with a group), always label your axes. If the dataset you're using has headers or column titles, the plot function will automatically use those as the labels for the plot axes. To add your own, use the xlab and ylab options:

plot(dataset, xlab = "Speed (km/h)", ylab = "Distance (km)")
Plot option with custom axes labels.

Plot Legend

Plot legends can be called using the legend() function. The first two parameters are the x-position and y-position of the legend on the plot. You call this function right after you plot:

plot(dataset,col="blue",pch=4)
legend(20,110,c("Cars"),col="blue",pch=4)
Plot legend.

You want the legend symbol to match the symbol used in the plot. The legend takes in the same pch codes used in the plot() function for the symbols. In addition, you should of course use the same color for the symbols in the legend and the symbols in the plot. Here are some of the options you can play around with in the legend() function:

legend(xPosition: int, yPosition: int, labels: array, col :int|string, cex: int, pch: int)

These are just what I call the essentials; there are a lot more in the documentation (see below).

And that's it. Like I said before, there are several other options you can use, like regression/trend lines, plot sizing, etc. These are just the essentials when you want a little something extra on your visualization. In particular stages of the data analysis process, the less you add to your plots, the better.
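As a taste of one of those extras, a hedged sketch of adding a simple trend line with lm() and abline(), using the speed and dist columns of the cars data:

# fit a linear model and overlay it as a trend line
plot(dataset, pch = 19, col = "grey40")
abline(lm(dist ~ speed, data = dataset), col = "red", lwd = 2)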

Reference and Documentation


R Plot Function — The Options was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.


8th MilanoR Meeting – Presentations, photos and video!

Mon, 04/10/2017 - 23:45

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Hello R-users <3

The 8th MilanoR Meeting was great! Thank you very much for your interest and participation!

A short recap for those who weren’t there: the meeting was last Wednesday evening at the Microsoft House (a very nice location, thanks @dr_mattia for your support!).

We had two exceptional speakers: Stefano Iacus, member of the R Foundation and founder of Voices from the Blogs, and Romain François, long-time author of the Rcpp package and data scientist at ThinkR, who came all the way from Paris for us.

The event started with my quick presentation about our MilanoR community and our upcoming events: if you want to take a look, here it is.


Welcome Presentation 

by Mariachiara Fortuna

 

Then it was Stefano's turn; he introduced us to the yuima package, a powerful S4 framework for the simulation and inference of several classes of time series. On top of the yuima package, Stefano and his collaborators built a great interactive dashboard, yuimaGUI, that makes data I/O, exploratory data analysis, and model fitting accessible to a wider audience. The yuima package and the related app look very powerful: if you want an introduction to their capabilities, the presentation is wide-ranging and deep.

yuimaGUI a Shiny app for the analysis and simulation of financial time series

by Stefano Iacus

 

Then Romain François gave us the chance to look under the R surface at the deep C++ layer. Romain told us about the past and the state of the art of the R–C++ relationship, in a surprisingly clear and accessible way. He then shared some insights on how he imagines the future of R and C++ integration and the challenges it represents: you can find more or less everything in the presentation here, or you may follow this link.

R and C++: past, present and future

by Romain François

 

The meeting ended with free refreshments and some networking time – you can see everything in the pictures.

This time we were able to stream the meeting on Facebook, so the full recording is available on our Facebook channel. Thanks @Akiro for the perfect quality!

Meeting full video

 

 

We are grateful to our sponsors, Quantide and Microsoft, who made all of this possible, to Stefano and Romain, who shared their knowledge with us, and to all of you who were there. Thank you so much, and we hope to see you soon at R-Lab #2 (today)!

The post 8th MilanoR Meeting – Presentations, photos and video! appeared first on MilanoR.


Linear splines with convenient parametrisations

Mon, 04/10/2017 - 23:38

(This article was first published on R – Brokering Closure, and kindly contributed to R-bloggers)

I have just published a small package lspline on CRAN that implements linear splines using convenient parametrisations such that

  • coefficients are slopes of consecutive segments
  • coefficients capture slope change at consecutive knots

Knot locations can be specified

  • manually (with lspline())
  • at breaks dividing the range of x into q equal-frequency intervals (with qlspline())
  • at breaks dividing the range of x into n equal-width intervals (with elspline())

The implementation follows Greene (2003, chapter 7.5.2).

The package sources are on GitHub here.

Examples

We will use the following artificial data with knots at x=5 and x=10:

set.seed(666)
n <- 200
d <- data.frame(
  x = scales::rescale(rchisq(n, 6), c(0, 20))
)
d$interval <- findInterval(d$x, c(5, 10), rightmost.closed = TRUE) + 1
d$slope <- c(2, -3, 0)[d$interval]
d$intercept <- c(0, 25, -5)[d$interval]
d$y <- with(d, intercept + slope * x + rnorm(n, 0, 1))

Plotting y against x:

library(ggplot2)
fig <- ggplot(d, aes(x = x, y = y)) +
  geom_point(aes(shape = as.character(slope))) +
  scale_shape_discrete(name = "Slope") +
  theme_bw()
fig

The slopes of the consecutive segments are 2, -3, and 0.

Setting knot locations manually

We can parametrize the spline with slopes of individual segments (default marginal=FALSE):

library(lspline)
m1 <- lm(y ~ lspline(x, c(5, 10)), data = d)
knitr::kable(broom::tidy(m1))

term                     estimate   std.error    statistic     p.value
(Intercept)             0.1343204   0.2148116    0.6252941   0.5325054
lspline(x, c(5, 10))1   1.9435458   0.0597698   32.5171747   0.0000000
lspline(x, c(5, 10))2  -2.9666750   0.0503967  -58.8664832   0.0000000
lspline(x, c(5, 10))3  -0.0335289   0.0518601   -0.6465255   0.5186955

Or parametrize with coefficients measuring the change in slope (with marginal=TRUE):

m2 <- lm(y ~ lspline(x, c(5, 10), marginal = TRUE), data = d)
knitr::kable(broom::tidy(m2))

term                                      estimate   std.error    statistic     p.value
(Intercept)                              0.1343204   0.2148116    0.6252941   0.5325054
lspline(x, c(5, 10), marginal = TRUE)1   1.9435458   0.0597698   32.5171747   0.0000000
lspline(x, c(5, 10), marginal = TRUE)2  -4.9102208   0.0975908  -50.3143597   0.0000000
lspline(x, c(5, 10), marginal = TRUE)3   2.9331462   0.0885445   33.1262479   0.0000000

The coefficients are

  • lspline(x, c(5, 10), marginal = TRUE)1 – the slope of the first segment
  • lspline(x, c(5, 10), marginal = TRUE)2 – the change in slope at knot x = 5; it is changing from 2 to -3, so by -5
  • lspline(x, c(5, 10), marginal = TRUE)3 – the change in slope at knot x = 10; it is changing from -3 to 0, so by 3
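As a quick sanity check (a sketch using the two models fitted above), the consecutive-segment slopes from m1 should be the cumulative sums of the marginal coefficients from m2:

# marginal coefficients accumulate to the per-segment slopes
all.equal(unname(cumsum(coef(m2)[-1])),
          unname(coef(m1)[-1]))
# should be TRUE (up to numerical precision)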

The two parametrisations (obviously) give identical predicted values:

all.equal(fitted(m1), fitted(m2))
## [1] TRUE

Graphically

fig +
  geom_smooth(method = "lm", formula = formula(m1), se = FALSE) +
  geom_vline(xintercept = c(5, 10), linetype = 2)

Knots at n equal-length intervals

Function elspline() sets the knots at points dividing the range of x into n equal length intervals.

m3 <- lm(y ~ elspline(x, 3), data = d)
knitr::kable(broom::tidy(m3))

term               estimate   std.error    statistic    p.value
(Intercept)       3.5484817   0.4603827     7.707678   0.00e+00
elspline(x, 3)1   0.4652507   0.1010200     4.605529   7.40e-06
elspline(x, 3)2  -2.4908385   0.1167867   -21.328105   0.00e+00
elspline(x, 3)3   0.9475630   0.2328691     4.069080   6.84e-05

Graphically

fig +
  geom_smooth(aes(group = 1), method = "lm", formula = formula(m3),
              se = FALSE, n = 200)

Knots at quantiles of x

Function qlspline() sets the knots at points dividing the range of x into q equal-frequency intervals.

m4 <- lm(y ~ qlspline(x, 4), data = d)
knitr::kable(broom::tidy(m4))

term               estimate   std.error    statistic     p.value
(Intercept)       0.0782285   0.3948061     0.198144   0.8431388
qlspline(x, 4)1   2.0398804   0.1802724    11.315548   0.0000000
qlspline(x, 4)2   1.2675186   0.1471270     8.615132   0.0000000
qlspline(x, 4)3  -4.5846478   0.1476810   -31.044273   0.0000000
qlspline(x, 4)4  -0.4965858   0.0572115    -8.679818   0.0000000

Graphically

fig +
  geom_smooth(method = "lm", formula = formula(m4), se = FALSE, n = 200)

Installation

Stable version from CRAN or development version from GitHub with

devtools::install_github("mbojan/lspline", build_vignettes = TRUE)

Acknowledgements

Inspired by Stata command mkspline and function ares::lspline from Junger & Ponce de Leon (2011). As such, the implementation follows Greene (2003), chapter 7.5.2.

  • Greene, William H. (2003) Econometric analysis. Pearson Education
  • Junger & Ponce de Leon (2011) ares: Environment air pollution epidemiology: a library for timeseries analysis. R package version 0.7.2 retrieved from CRAN archives.


Saving input and output with sp_execute_external_script

Mon, 04/10/2017 - 22:06

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

Again I was at the point where I needed to store and save to an external file all the R code that was executed through sp_execute_external_script.

Soon, you will find out several interesting things. To show them, I will start with the following example:

USE [WideWorldImporters];
GO

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = N'
        d <- InputDataSet
        c <- data.frame(Num_V1 = c(1,2,3))
        c
        OutputDataSet <- c'
    ,@input_data_1 = N'SELECT 1 AS Nmbrs_From_R'
WITH RESULT SETS ((Numbers_From_R INT));

The result is a column called "Numbers_From_R" with three rows, taken from the data frame. This is very easy and straightforward.

DMV

By using the dynamic management view sys.dm_exec_query_stats as follows:

SELECT
     QM_ST.[TEXT] AS [Query]
    ,DM_QS.last_execution_time
    ,DM_QS.query_hash
    ,DM_QS.query_plan_hash
FROM
    sys.dm_exec_query_stats AS DM_QS
    CROSS APPLY sys.dm_exec_sql_text(DM_QS.sql_handle) AS QM_ST
ORDER BY
    DM_QS.last_execution_time DESC;

Surprisingly I get only the following query returned:

sp_execute_external_script: SELECT 1 AS Nmbrs_From_R

which is far from what was executed in the first place!

EXECUTION PLANS

When using the sys.dm_exec_query_plan dynamic management view to generate the executed query plan, I get a similar result, with no R code and little sign of the SQL query that was passed to the sp_execute_external_script procedure.

Much the same result emerges when showing the actual execution plan in SSMS. Only the XML-UDX operator is shown.

So far, there is very little chance of getting any extra or additional information from the query statistics DMV or the execution plan.

SQL SERVER PROFILER

So opening SQL Profiler and running the example sp_execute_external_script code, I was finally able to see the actual R code within profiler:

Upon closer inspection, we can see that Profiler wraps the execution of the external procedure with SET STATISTICS XML ON/OFF. So we can store the results from Profiler in a table or trace file and later filter out the R code!

QUERY STORE

Query Store is a very useful new feature introduced with SQL Server 2016. It stores queries and execution times, which is needed for later performance analysis. So in this phase, let's just see if we can find the external procedure code in Query Store.

After executing the R external procedure, I run the following query to check the Query Store (QS):

SELECT
     QSQT.query_text_id
    ,QSQT.query_sql_text
    ,QSP.plan_id
FROM
    sys.query_store_plan AS QSP
    JOIN sys.query_store_query AS QSQ
      ON QSP.query_id = QSQ.query_id
    JOIN sys.query_store_query_text AS QSQT
      ON QSQ.query_text_id = QSQT.query_text_id;

And the results are – in a way – not surprising at all, since many of the Query Store statistics are based on DMVs. So the result for my external procedure is, again, not very informative when it comes to extracting the R code:

Something we have already seen a couple of times. And no sign of the execution of the R script. In fact, looking at this, it is hard even to tell that this was passed to the external RLaunchpad.exe program.

SINK

sink() is an R function that stores the output of executed R code in an external file. With either of the two T-SQL snippets below, I was never able to get either the results or the R code itself.

In case of results:

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = N'
        sink("C:\\DataTK\\logRSQLsession3.txt")
        d <- InputDataSet
        c <- data.frame(Num_V1 = c(1,2,3))
        c
        sink()
        OutputDataSet <- c'
    ,@input_data_1 = N'SELECT 1 AS Nmbrs_From_R'
WITH RESULT SETS ((Numbers_From_R INT));

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = N'
        c <- data.frame(Num_V1 = c(1,2,3))
        c
        sink("C:\\DataTK\\logRSQLsession3.txt")'
    ,@input_data_1 = N'SELECT 1 AS Nmbrs_From_R'
WITH RESULT SETS NONE;

In both cases the file is created, but it is just that. Empty file. No content whatsoever.

LOAD

save() and load() store intermediate results in a file for later analysis, or for semi-aggregated data used in further calculations. So, I have tested them as follows:

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = N'
        c <- data.frame(Num_V1 = c(1,2,3))
        c
        save(c, file="C:\\DataTK\\logRSQLsession3.rda")
        #load(file="C:\\DataTK\\logRSQLsession3.rda")'
    ,@input_data_1 = N'SELECT 1 AS Nmbrs_From_R'
WITH RESULT SETS NONE;

-- LOAD RESULTS
EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = N'
        load(file="C:\\DataTK\\logRSQLsession3.rda")
        OutputDataSet <- c'
    ,@input_data_1 = N'SELECT 1 AS Nmbrs_From_R'
WITH RESULT SETS ((Num_V1 INT));

 

EXTENSIBILITY LOG

The extensibility log stores information about the session, but it does not store the R code, the R environment, or the data – just session information. Navigate to:

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\LOG\ExtensibilityLog

to check the content and see if there is anything useful for your needs.

Conclusion

We are very limited in terms of exporting executed R code, results, or logs. The same applies to importing any additional code. We have seen that import and source do not work, whereas load() for loading *.rda files does work – at least something. There should be more ways to get into this, especially with Rterm or vanilla R, but the idea was to have everything run comfortably from the SSMS environment.

As you can see, there are few possibilities for storing R code separately or storing R execution logs in external files. But I presume I haven't exhausted all the options, so there should still be some ways to try to do this.

As always, the code is available at Github.

Happy Rrrrr!


In case you missed it: March 2017 roundup

Mon, 04/10/2017 - 21:33

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from March of particular interest to R users. 

A tutorial and comparison of the SparkR, sparklyr, rsparkling, and RevoScaleR packages for using R with Spark.

An analysis of Scrabble games between AI players.

The doAzureParallel package, a backend to "foreach" for parallel computations on Azure-based clusters.

The UK government project to automate reporting of official statistics with R.

Data science languages R and Python rank highly in the latest Redmonk popularity rankings.

FiveThirtyEight used R to find clusters of similar subreddits.

RTVS 1.0, which provides R support to Visual Studio, is now available.

The mrsdeploy package (part of Microsoft R Server) facilitates publishing an R function as a web service on Azure.

The Call for Papers for the EARL conferences in London and San Francisco closes on April 14.

Alteryx Designer has been integrated with Microsoft R.

StitchFix, the personal styling service, uses R and Python as part of their data science process.

A review of "Testing R Code", by Richie Cotton.

On the connection between AUC and the Mann-Whitney U-Statistic.

An overview of neural networks and R packages to train them.

Performance benchmarks of rxNeuralNetwork (in the MicrosoftML package).

Updates to the Data Science Virtual Machine for Linux.

A timelapse of the founding of cities since 3500BC.

A set of R scripts and sample data to predict employee attrition.

R 3.3.3 is now available.

The htmlwidgets gallery catalogs interactive web-based charts for R.

R-based analyses to predict the length of a hospital stay.

Scholarships to encourage diversity at useR!2017.

And some general interest stories (not necessarily related to R): simulating the galaxy;  a poor showing at Crufts; why some memes survive better than others;  a digital clock made in Life; and a tongue-in-cheek guide to reading sheet music.

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.


Forecasting: Time Series Exploration Exercises (Part-1)

Mon, 04/10/2017 - 17:50

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

R provides powerful tools for forecasting time series data such as sales volumes, population sizes, and earthquake frequencies. A number of those tools are also simple enough to be used without mastering sophisticated underlying theories. This set of exercises is the first in a series offering a possibility to practice in the use of such tools, which include the ARIMA model, exponential smoothing models, and others.
The first set provides training in the exploration of regularly spaced time series data (such as weekly, monthly, and quarterly), which may be useful for selecting a predictive model. The set covers:
– visual inspection of time series,
– estimation of trend and seasonal patterns,
– finding whether a series is stationary (i.e. whether it has a constant mean and variance),
– examination of correlation between lagged values of a time series (autocorrelation).
The exercises make use of functions from the packages forecast and tseries. The exercises are based on a dataset of retail sales volume by US clothing and clothing accessory stores for 1992-2016, retrieved from FRED, the Federal Reserve Bank of St. Louis database (download here). The data represent monthly sales in millions of dollars.
For other parts of the series, follow the tag forecasting.
Answers to the exercises are available here
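As a warm-up, here is a minimal sketch of the kind of workflow the exercises practice, run on the built-in AirPassengers series rather than the exercise data (so no answers are given away):

library(forecast)
library(tseries)

ap <- AirPassengers                     # a monthly ts object
plot(ap, ylim = c(0, max(ap)))          # visual inspection, y axis from zero
dec <- decompose(ap, type = "multiplicative")
plot(dec)                               # trend, seasonal, and irregular components
kpss.test(ap, null = "Trend")           # test for trend stationarity
Acf(diff(ap))                           # autocorrelation of the differenced series
Pacf(diff(ap))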

Exercise 1
Read the data from the file sales.csv.

Exercise 2
Transform the data into a time series object of the ts type (indicate that the data is monthly, and the starting period is January 1992).
Print the data.

Exercise 3
Plot the time series. Ensure that the y axis starts from zero.

Exercise 4
Use the gghistogram function from the forecast package to visually inspect the distribution of time series values. Add a kernel density estimate and a normal density function to the plot.

Exercise 5
Use the decompose function to break the series into seasonal, trend, and irregular components (apply multiplicative decomposition).
Plot the decomposed series.

Exercise 6
Explore the structure of the decomposed object, and find seasonal coefficients (multiples). Identify the three months with the greatest coefficients, and the three months with the smallest coefficients. (Note that the coefficients are equal in different years).

Exercise 7
Check whether the time series is trend-stationary (i.e. its mean and variance are constant with respect to a trend) using the function kpss.test from the tseries package. (Note that the null hypothesis of the test is that the series is trend-stationary).

Exercise 8
Use the diff function to create a differenced time series (i.e. a series that includes differences between the values of the original series), and test it for trend stationarity.

Exercise 9
Plot the differenced time series.

Exercise 10
Use the Acf and Pacf functions from the forecast package to explore autocorrelation of the differenced series. Find at which lags correlation between lagged values is statistically significant at 5% level.



Transaction Costs are Not an Afterthought; Transaction Costs in quantstrat

Mon, 04/10/2017 - 17:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

DISCLAIMER: Any losses incurred based on the content of this post are the responsibility of the trader, not the author. The author takes no responsibility for the conduct of others nor offers any guarantees.

Introduction: Efficient Market Hypothesis

Burton Malkiel, in the finance classic A Random Walk Down Wall Street, made the accessible, popular case for the efficient market hypothesis (EMH). One can sum up the EMH as, “the price is always right.” No trader can know more about the market; the market price for an asset, such as a stock, is always correct. This means that trading, which relies on forecasting the future movements of prices, is as profitable as forecasting whether a coin will land heads-up; in short, traders are wasting their time. The best one can do is buy a large portfolio of assets representing the composition of the market and earn the market return rate (about 8.5% a year). Don’t try to pick winners and losers; just pick a low-expense, “dumb” fund, and you’ll do better than any highly-paid mutual fund manager (who isn’t smart enough to be profitable).

The EMH comes in various forms and "strengths". The strong form of the EMH claims that prices cannot be beaten by anyone. Period. All information is contained in the market price; no one has any information that could "correct" the market price. The semi-strong form walks this back by claiming that prices react to new information instantaneously, so no one can consistently beat the market (this walks back the strong form, where new information is irrelevant). The weak form claims that only those traders with insider knowledge can beat the market; public information is of no help.

If true, the EMH condemns much of the finance sector and traders to practicing an elaborate exercise in futility, so naturally it's controversial. Obviously the weak version is the most plausible (if only because the other two versions imply the weak version), and my reading of Burton Malkiel's statements is that he's a believer in either the weak or semi-strong versions, but any version dooms investors to the "boring" market rate (since insider trading is illegal). Many don't like that idea.

There are good reasons to believe any form of the EMH is wrong. First, the EMH, if true, implies stock movement behavior exhibits certain characteristics that can be statistically tested. In particular, "extreme" price movements do not happen. Stock price movements would resemble a Wiener process (good news if you're a mathematician or financial engineer, since these processes have very nice properties), and price changes would follow a bell-curve shape (the Normal distribution) where 99.7% of price movements happen near the average; "extreme" events are almost never seen. But the famed mathematician Benoit Mandelbrot (the one famous for developing the theory of fractals) showed that extreme price movements, rather than being relatively uncommon, happen very frequently; we see price movements that should not be seen within a million years or longer every couple of decades or so. There are other assaults too from behavioral economics and other fields, along with anecdotal evidence such as the wealth of Warren Buffett.

A Random Walk Down Wall Street has been through (at least) eleven editions, so one should not be surprised that Malkiel has heard and addressed some of these issues. For example, instead of arguing against behavioral economics, he co-opts some findings from the theory, such as confirmation bias or the brain's propensity for pattern-seeking, that support his thesis that trying to beat the market is a futile effort (ignoring some of the theory's major criticisms of the EMH).1

As for some of the attacks against the markets’ alleged efficiency, Malkiel has a good defense; even if markets are not really, truly efficient (efficiency meaning “the price is always right”), for the rank-and-file investor, this doesn’t matter. They’re efficient enough for an investor to be unable to game them profitably. Fees and small mistakes will eat away any gains to be had; worse, they may reverse them from an opportunity-cost perspective, if not even a purely monetary one.

Malkiel has a point: index funds can win just by being low cost, thanks to being “dumb” by just matching the composition of the market. Every transaction has a cost. Small gains may fail to make up for the costs in research, and the opportunity to make those gains may disappear as soon as they’re spotted. But most damaging of all is the fees.

Fees are Important

Fees and transaction costs are the friction sapping the energy from a trading system until it grinds to a red-hot halt. They’re the house’s cut at a poker table that make the game a losing one for all involved except the house. They’re often mentioned as an afterthought in books; a strategy is presented without accounting for fees, and the author says, “We should account for fees,” and nothing more. This could lead to a neophyte trader believing that fees are more a nuisance than something to be seriously accounted for when backtesting. Nothing could be further from the truth.

Every effort should be made to minimize fees. Opt for the cheapest brokers (don't pay for the advice; the premium paid for what passes for expertise is likely not worth it, even if the advice were good, and it often isn't). Given the choice between a trading system making lots of trades and a trading system making few, opt for few. Remember that a single percentage point difference in returns means a big difference in profitability (thanks to compounding). Heck, people could become rich on Wall Street if only they could dodge the fees!
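To put a number on that, a quick back-of-the-envelope sketch (8.5% is the market rate quoted earlier; the one-point drag is illustrative):

# $10,000 compounded for 30 years, with and without a 1-point annual drag
10000 * 1.085^30   # roughly $115,600
10000 * 1.075^30   # roughly $87,500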

In my blog post on backtesting with Python, I introduced a problem to readers: adjust my backtesting system to account for transaction costs. Naturally, I never give a problem without solving it myself first (how else do I know it can be done?), and I retried it on the data sets I looked at in the post, adding in a 2% commission.

The result? The strategy that was barely profitable turned into a losing one (let alone losing compared to SPY).

The point was important enough that I felt sorry that I did not post this result before. I still will not post my Python solution; why spoil the problem? But I can still make that point in R (I never made equivalent R problems).

Cleaning Up the Trading System

Readers who read both the Python and R posts on backtesting will notice that they are not exactly equivalent. The Python trading system trades based on signal, while the R system trades based on regime (the R system will reinforce a position so it's always 10% of the portfolio, even after the initial crossing). The Python system trades in batches of 100, while the R system has no such restriction. Here, I redo the R trading system, trading a batch of tech stocks, and bring it back into line with the Python trading system.

I load in quantstrat and repeat many of the steps from earlier articles.

if (!require("quantstrat")) {
  install.packages("quantstrat", repos = "http://R-Forge.R-project.org")
  library(quantstrat)
}

if (!require("IKTrading")) {
  if (!require("devtools")) {
    install.packages("devtools")
  }
  library(devtools)
  install_github("IKTrading", username = "IlyaKipnis")
  library(IKTrading)
}

start <- as.Date("2010-01-01")
end <- as.Date("2016-10-01")

rm(list = ls(.blotter), envir = .blotter)  # Clear blotter environment
currency("USD")                            # Currency being used
Sys.setenv(TZ = "MDT")                     # Allows quantstrat to use timestamps
initDate <- "2009-12-31"                   # A date prior to first close price; needed (why?)

# Get new symbols
symbols <- c("AAPL", "MSFT", "GOOG", "FB", "TWTR", "NFLX", "AMZN", "YHOO",
             "SNY", "NTDOY", "IBM", "HPQ")
getSymbols(Symbols = symbols, src = "yahoo", from = start, to = end)

# Quickly define adjusted versions of each of these
`%s%` <- function(x, y) {paste(x, y)}
`%s0%` <- function(x, y) {paste0(x, y)}
for (s in symbols) {
  eval(parse(text = s %s0% "_adj <- adjustOHLC(" %s0% s %s0% ")"))
}
symbols_adj <- paste(symbols, "adj", sep = "_")

stock(symbols_adj, currency = "USD", multiplier = 1)

strategy_st <- portfolio_st <- account_st <- "SMAC-20-50"
rm.strat(portfolio_st)
rm.strat(strategy_st)
initPortf(portfolio_st, symbols = symbols_adj, initDate = initDate, currency = "USD")
initAcct(account_st, portfolios = portfolio_st, initDate = initDate,
         currency = "USD", initEq = 1000000)
initOrders(portfolio_st, store = TRUE)
strategy(strategy_st, store = TRUE)

As mentioned before, the strategy from my earlier article traded based on the current regime. This feature was implemented by using the signal function sigComparison(), which tracks whether one indicator is greater than another (or vice versa). To trade based on a single crossing, use the sigCrossover() function (which is similar to sigComparison() but only generates a signal at the first point of crossing).

For some reason, the quantstrat developers wrote sigCrossover() so it returns either TRUE or NA rather than TRUE or FALSE. This leads to bad behavior for my system, so I’ve written an alternative, sigCrossover2(), that exhibits the latter behavior.

sigCrossover2 <- function(label, data = mktdata, columns,
                          relationship = c("gt", "lt", "eq", "gte", "lte"),
                          offset1 = 0, offset2 = 0) {
  # A wrapper for sigCrossover, exhibiting the same behavior except returning
  # an object containing TRUE/FALSE instead of TRUE/NA
  res <- sigCrossover(label = label, data = data, columns = columns,
                      relationship = relationship, offset1 = offset1,
                      offset2 = offset2)
  res[is.na(res)] <- FALSE
  return(res)
}

Use this in my strategy instead of sigCrossover().

add.indicator(strategy = strategy_st, name = "SMA",
              arguments = list(x = quote(Cl(mktdata)), n = 20),
              label = "fastMA")
## [1] "SMAC-20-50"
add.indicator(strategy = strategy_st, name = "SMA",
              arguments = list(x = quote(Cl(mktdata)), n = 50),
              label = "slowMA")
## [1] "SMAC-20-50"

# Next comes trading signals
add.signal(strategy = strategy_st, name = "sigCrossover2",  # Remember me?
           arguments = list(columns = c("fastMA", "slowMA"),
                            relationship = "gt"),
           label = "bull")
## [1] "SMAC-20-50"
add.signal(strategy = strategy_st, name = "sigCrossover2",
           arguments = list(columns = c("fastMA", "slowMA"),
                            relationship = "lt"),
           label = "bear")
## [1] "SMAC-20-50"

Now the strategy will trade based on a single crossover instead of balancing the portfolio so that 10% of equity is devoted to a single position. I prefer this; the earlier strategy engaged in trades that would not be as profitable as those made at the point of crossover, driving up the number of transactions and thus the resulting fees.

Now, for trading in batches of 100. The original rule for opening a position used the function osMaxDollar() from the package IKTrading to size a position based on a dollar amount. The expression floor(getEndEq(account_st_2, Date = timestamp) * .1), passed to osMaxDollar()'s maxSize and tradeSize parameters, instructed the system to place trades so they did not exceed 10% of the equity in the account. I modify the osMaxDollar() function to accept a batchSize parameter so that we can place trades in batches of 100.

# Based on Ilya Kipnis's osMaxDollar(); lots of recycled code
osMaxDollarBatch <- function(data, timestamp, orderqty, ordertype, orderside,
                             portfolio, symbol, prefer = "Open", tradeSize,
                             maxSize, batchSize = 100, integerQty = TRUE, ...) {
  # An order sizing function that limits position size based on dollar value of
  # the position, optionally controlling for number of batches to purchase
  #
  # Args:
  #   data: ??? (held over from Mr. Kipnis's original osMaxDollar function)
  #   timestamp: The current date being evaluated (some object, like a string,
  #              from which time can be inferred)
  #   orderqty: ??? (held over from Mr. Kipnis's original osMaxDollar function)
  #   ordertype: ??? (held over from Mr. Kipnis's original osMaxDollar function)
  #   orderside: ??? (held over from Mr. Kipnis's original osMaxDollar function)
  #   portfolio: A string representing the portfolio being treated; will be
  #              passed to getPosQty
  #   symbol: A string representing the symbol being traded
  #   prefer: A string that indicates whether the Open or Closing price is
  #           used for determining the price of the asset in backtesting (set
  #           to "Close" to use the closing price)
  #   tradeSize: Numeric, indicating the dollar value to transact (using
  #              negative numbers for selling short)
  #   maxSize: Numeric, indicating the dollar limit to the position (use
  #            negative numbers for the short side)
  #   batchSize: The number of stocks purchased per batch (only applies if
  #              integerQty is TRUE); default value is 100, but setting to 1
  #              effectively nullifies the batchSize
  #   integerQty: A boolean indicating whether or not to truncate to the
  #               nearest integer of contracts/shares/etc.
  #   ...: ??? (held over from Mr. Kipnis's original osMaxDollar function)
  #
  # Returns:
  #   A numeric quantity representing the number of shares to purchase

  pos <- getPosQty(portfolio, symbol, timestamp)
  if (prefer == "Close") {
    price <- as.numeric(Cl(mktdata[timestamp, ]))
  } else {
    price <- as.numeric(Op(mktdata[timestamp, ]))
  }
  posVal <- pos * price
  if (orderside == "short") {
    dollarsToTransact <- max(tradeSize, maxSize - posVal)
    if (dollarsToTransact > 0) {
      dollarsToTransact = 0
    }
  } else {
    dollarsToTransact <- min(tradeSize, maxSize - posVal)
    if (dollarsToTransact < 0) {
      dollarsToTransact = 0
    }
  }
  qty <- dollarsToTransact / price
  if (integerQty) {
    # Controlling for batch size only makes sense if we were working with
    # integer quantities anyway; if we didn't care about being integers,
    # why bother?
    qty <- trunc(qty / batchSize) * batchSize
  }
  return(qty)
}

Now we add trading rules, replacing osMaxDollar() with our new osMaxDollarBatch():
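For reference, a sketch of the rules being added here, mirrored from the fee-paying version further below (identical except that no TxnFees argument is supplied):

add.rule(strategy = strategy_st, name = "ruleSignal",
         arguments = list(sigcol = "bull", sigval = TRUE,
                          ordertype = "market", orderside = "long",
                          replace = FALSE, prefer = "Open",
                          osFUN = osMaxDollarBatch,
                          maxSize = quote(floor(getEndEq(account_st,
                                                         Date = timestamp) * .1)),
                          tradeSize = quote(floor(getEndEq(account_st,
                                                           Date = timestamp) * .1)),
                          batchSize = 100),
         type = "enter", path.dep = TRUE, label = "buy")

add.rule(strategy = strategy_st, name = "ruleSignal",
         arguments = list(sigcol = "bear", sigval = TRUE,
                          orderqty = "all", ordertype = "market",
                          orderside = "long", replace = FALSE,
                          prefer = "Open"),
         type = "exit", path.dep = TRUE, label = "sell")

# Having set up the strategy, we now backtest
applyStrategy(strategy_st, portfolios = portfolio_st)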

Now the updated analytics:

# Now for analytics
updatePortf(portfolio_st)
## [1] "SMAC-20-50"
dateRange <- time(getPortfolio(portfolio_st)$summary)[-1]
updateAcct(account_st, dateRange)
## [1] "SMAC-20-50"
updateEndEq(account_st)
## [1] "SMAC-20-50"

tStats <- tradeStats(Portfolios = portfolio_st, use = "trades",
                     inclZeroDays = FALSE)
tStats[, 4:ncol(tStats)] <- round(tStats[, 4:ncol(tStats)], 2)
print(data.frame(t(tStats[, -c(1,2)])))
##                      AAPL_adj   AMZN_adj     FB_adj   GOOG_adj    HPQ_adj
## Num.Txns                39.00      33.00      23.00      35.00      29.00
## Num.Trades              20.00      17.00      12.00      18.00      15.00
## Net.Trading.PL       96871.62   99943.99  116383.00   27476.81   51431.03
## Avg.Trade.PL          4843.58    5879.06    9698.58    1526.49    3428.74
## Med.Trade.PL           803.16    2755.00    4681.50    -830.50    -947.54
## Largest.Winner       40299.33   33740.00   74888.01   27149.18   54379.08
## Largest.Loser        -7543.93  -14152.00  -17185.01   -9588.94  -15344.27
## Gross.Profits       134300.99  140947.00  141620.01   62944.47  121893.03
## Gross.Losses        -37429.37  -41003.01  -25237.01  -35467.66  -70462.01
## Std.Dev.Trade.PL     12463.46   13985.62   22585.57    8247.78   20283.11
## Percent.Positive        55.00      64.71      75.00      38.89      40.00
## Percent.Negative        45.00      35.29      25.00      61.11      60.00
## Profit.Factor            3.59       3.44       5.61       1.77       1.73
## Avg.Win.Trade        12209.18   12813.36   15735.56    8992.07   20315.51
## Med.Win.Trade         7272.58    7277.00    8810.00    7254.76   10230.34
## Avg.Losing.Trade     -4158.82   -6833.84   -8412.34   -3224.33   -7829.11
## Med.Losing.Trade     -2837.23   -5474.50   -4752.00   -3537.02   -8954.58
## Avg.Daily.PL          4216.84    4519.19   10124.27    1386.81    2571.42
## Med.Daily.PL           304.94    2241.50    4346.99   -1121.00   -2841.81
## Std.Dev.Daily.PL     12476.98   13232.69   23637.40    8479.65   20764.83
## Ann.Sharpe               5.37       5.42       6.80       2.60       1.97
## Max.Drawdown        -24834.98  -44379.01  -42976.02  -26932.62  -53384.75
## Profit.To.Max.Draw       3.90       2.25       2.71       1.02       0.96
## Avg.WinLoss.Ratio        2.94       1.87       1.87       2.79       2.59
## Med.WinLoss.Ratio        2.56       1.33       1.85       2.05       1.14
## Max.Equity          102428.61   99943.99  120250.00   36354.82   52580.84
## Min.Equity          -11759.00   -3269.00  -14696.00  -21199.34  -53384.75
## End.Equity           96871.62   99943.99  116383.00   27476.81   51431.03
##                       IBM_adj   MSFT_adj   NFLX_adj  NTDOY_adj    SNY_adj
## Num.Txns                40.00      39.00      33.00      37.00      44.00
## Num.Trades              20.00      20.00      17.00      19.00      22.00
## Net.Trading.PL       10761.37   28739.24  232193.00   19555.00  -25046.79
## Avg.Trade.PL           538.07    1436.96   13658.41    1029.21   -1138.49
## Med.Trade.PL          -651.11   -2351.12    2500.00   -1440.00     349.03
## Largest.Winner       20407.71   18337.28  151840.00   43452.01   14393.86
## Largest.Loser        -7512.68  -11360.92  -29741.57   -8685.00  -16256.19
## Gross.Profits        45828.56   88916.25  295344.86   79742.00   61804.49
## Gross.Losses        -35067.19  -60177.00  -63151.85  -60186.99  -86851.28
## Std.Dev.Trade.PL      6103.11    8723.81   39673.15   12193.29    8412.38
## Percent.Positive        40.00      45.00      64.71      42.11      54.55
## Percent.Negative        60.00      55.00      35.29      57.89      45.45
## Profit.Factor            1.31       1.48       4.68       1.32       0.71
## Avg.Win.Trade         5728.57    9879.58   26849.53    9967.75    5150.37
## Med.Win.Trade         3157.60    9507.66    8307.00    2560.50    3140.38
## Avg.Losing.Trade     -2922.27   -5470.64  -10525.31   -5471.54   -8685.13
## Med.Losing.Trade     -2153.73   -4955.57   -6835.07   -5832.00   -8354.55
## Avg.Daily.PL           538.07    1135.14   14355.81     226.94   -1138.49
## Med.Daily.PL          -651.11   -2858.40    2847.00   -1512.00     349.03
## Std.Dev.Daily.PL      6103.11    8854.93   40866.48   12019.71    8412.38
## Ann.Sharpe               1.40       2.04       5.58       0.30      -2.15
## Max.Drawdown        -37072.32  -30916.29  -54616.13  -50112.01  -36718.77
## Profit.To.Max.Draw       0.29       0.93       4.25       0.39      -0.68
## Avg.WinLoss.Ratio        1.96       1.81       2.55       1.82       0.59
## Med.WinLoss.Ratio        1.47       1.92       1.22       0.44       0.38
## Max.Equity           30247.68   42996.90  264941.01   40014.00   11471.08
## Min.Equity           -6824.64  -17726.78   -3255.72  -32476.99  -32673.80
## End.Equity           10761.37   28739.24  232193.00   19555.00  -25046.79
##                      TWTR_adj   YHOO_adj
## Num.Txns                 9.00      43.00
## Num.Trades               5.00      22.00
## Net.Trading.PL       27051.99  106619.99
## Avg.Trade.PL          5410.40    4846.36
## Med.Trade.PL           -27.01   -1296.00
## Largest.Winner        1800.00   54746.00
## Largest.Loser       -10362.00   -7700.00
## Gross.Profits        43340.99  155671.99
## Gross.Losses        -16289.00  -49051.99
## Std.Dev.Trade.PL     20764.84   15313.45
## Percent.Positive        40.00      31.82
## Percent.Negative        60.00      68.18
## Profit.Factor            2.66       3.17
## Avg.Win.Trade        21670.50   22238.86
## Med.Win.Trade        21670.50   14091.99
## Avg.Losing.Trade     -5429.67   -3270.13
## Med.Losing.Trade     -5900.00   -3105.01
## Avg.Daily.PL         -3622.25    4406.10
## Med.Daily.PL         -2963.50   -1404.00
## Std.Dev.Daily.PL      5565.94   15548.28
## Ann.Sharpe             -10.33       4.50
## Max.Drawdown        -60115.00  -33162.01
## Profit.To.Max.Draw       0.45       3.22
## Avg.WinLoss.Ratio        3.99       6.80
## Med.WinLoss.Ratio        3.67       4.54
## Max.Equity           42758.99  110806.00
## Min.Equity          -17356.00  -10732.99
## End.Equity           27051.99  106619.99

final_acct <- getAccount(account_st)
plot(final_acct$summary$End.Eq["2010/2016"], main = "Portfolio Equity")

getSymbols("SPY", from = start, to = end) ## [1] "SPY" # A crude estimate of end portfolio value from buying and holding SPY plot(final_acct$summary$End.Eq["2010/2016"] / 1000000, main = "Portfolio Equity", ylim = c(0.8, 2.5)) lines(SPY$SPY.Adjusted / SPY$SPY.Adjusted[[1]], col = "blue")

Modelling Fees

We’ve now replicated the Python analysis in R (or as close as we will get using quantstrat). Let’s now model our portfolio when a 2% commission is applied to every trade (that is, you pay the broker 2% of the total value of the trade, whether you buy or sell). quantstrat allows you to easily model such a fee. To do so, create a function that computes the fee given the number of shares being traded and the share price, then pass it to the TxnFees argument when you create the rule (this argument takes either a function, a function name, or a negative number representing the fee). Be sure to add this to all rules (though it need not be the same fee structure for each rule).

fee <- function(TxnQty, TxnPrice, Symbol) { # A function for computing a transaction fee that is 2% of total value of # transaction # # Args: # TxnQty: Numeric for number of shares being traded # TxnPrice: Numeric for price per share # Symbol: The symbol being traded (not used here, but will be passed) # # Returns: # The fee to be applied return(-0.02 * abs(TxnQty * TxnPrice)) } rm(list = ls(.blotter), envir = .blotter) # Clear blotter environment stock(symbols_adj, currency = "USD", multiplier = 1) rm.strat(portfolio_st) rm.strat(strategy_st) initPortf(portfolio_st, symbols = symbols_adj, initDate = initDate, currency = "USD") initAcct(account_st, portfolios = portfolio_st, initDate = initDate, currency = "USD", initEq = 1000000) initOrders(portfolio_st, store = TRUE) strategy(strategy_st, store = TRUE) add.indicator(strategy = strategy_st, name = "SMA", arguments = list(x = quote(Cl(mktdata)), n = 20), label = "fastMA") add.indicator(strategy = strategy_st, name = "SMA", arguments = list(x = quote(Cl(mktdata)), n = 50), label = "slowMA") add.signal(strategy = strategy_st, name = "sigCrossover2", # Remember me? arguments = list(columns = c("fastMA", "slowMA"), relationship = "gt"), label = "bull") add.signal(strategy = strategy_st, name = "sigCrossover2", arguments = list(columns = c("fastMA", "slowMA"), relationship = "lt"), label = "bear") add.rule(strategy = strategy_st, name = "ruleSignal", arguments = list(sigcol = "bull", sigval = TRUE, ordertype = "market", orderside = "long", replace = FALSE, TxnFees = "fee", # Apply the fee with the trade # If you wanted a flat fee, replace # this with a negative number # representing the fee prefer = "Open", osFUN = osMaxDollarBatch, maxSize = quote(floor(getEndEq(account_st, Date = timestamp) * .1)), tradeSize = quote(floor(getEndEq(account_st, Date = timestamp) * .1)), batchSize = 100), type = "enter", path.dep = TRUE, label = "buy") add.rule(strategy = strategy_st, name = "ruleSignal", arguments = list(sigcol = "bear", sigval = TRUE, orderqty = "all", ordertype = "market", orderside = "long", replace = FALSE, TxnFees = "fee", # Apply the fee with the trade prefer = "Open"), type = "exit", path.dep = TRUE, label = "sell") # Having set up the strategy, we now backtest applyStrategy(strategy_st, portfolios = portfolio_st)

Let’s now look at our portfolio.

# Now for analytics updatePortf(portfolio_st) ## [1] "SMAC-20-50" dateRange <- time(getPortfolio(portfolio_st)$summary)[-1] updateAcct(account_st, dateRange) ## [1] "SMAC-20-50" updateEndEq(account_st) ## [1] "SMAC-20-50" tStats <- tradeStats(Portfolios = portfolio_st, use="trades", inclZeroDays = FALSE) tStats[, 4:ncol(tStats)] <- round(tStats[, 4:ncol(tStats)], 2) print(data.frame(t(tStats[, -c(1,2)]))) ## AAPL_adj AMZN_adj FB_adj GOOG_adj HPQ_adj ## Num.Txns 39.00 33.00 23.00 35.00 29.00 ## Num.Trades 20.00 17.00 12.00 18.00 15.00 ## Net.Trading.PL 20624.18 43224.01 68896.98 -27891.65 -6635.25 ## Avg.Trade.PL 4843.58 5879.06 9698.58 1526.49 3428.74 ## Med.Trade.PL 803.16 2755.00 4681.50 -830.50 -947.54 ## Largest.Winner 37593.97 31266.76 71410.75 24869.77 51318.19 ## Largest.Loser -9426.91 -15583.36 -18874.11 -11201.90 -17051.87 ## Gross.Profits 134300.99 140947.00 141620.01 62944.47 121893.03 ## Gross.Losses -37429.37 -41003.01 -25237.01 -35467.66 -70462.01 ## Std.Dev.Trade.PL 12463.46 13985.62 22585.57 8247.78 20283.11 ## Percent.Positive 55.00 64.71 75.00 38.89 40.00 ## Percent.Negative 45.00 35.29 25.00 61.11 60.00 ## Profit.Factor 3.59 3.44 5.61 1.77 1.73 ## Avg.Win.Trade 12209.18 12813.36 15735.56 8992.07 20315.51 ## Med.Win.Trade 7272.58 7277.00 8810.00 7254.76 10230.34 ## Avg.Losing.Trade -4158.82 -6833.84 -8412.34 -3224.33 -7829.11 ## Med.Losing.Trade -2837.23 -5474.50 -4752.00 -3537.02 -8954.58 ## Avg.Daily.PL 2218.84 2736.55 7953.30 -212.11 542.98 ## Med.Daily.PL -1549.09 541.93 2355.65 -2396.58 -4741.84 ## Std.Dev.Daily.PL 12205.15 12945.81 23168.30 8308.18 20348.24 ## Ann.Sharpe 2.89 3.36 5.45 -0.41 0.42 ## Max.Drawdown -56123.63 -60668.36 -51083.34 -46522.72 -81892.46 ## Profit.To.Max.Draw 0.37 0.71 1.35 -0.60 -0.08 ## Avg.WinLoss.Ratio 2.94 1.87 1.87 2.79 2.59 ## Med.WinLoss.Ratio 2.56 1.33 1.85 2.05 1.14 ## Max.Equity 59846.94 43224.01 80524.92 4063.82 8591.39 ## Min.Equity -15308.91 -18513.81 -24333.09 -42458.90 -81892.46 ## End.Equity 20624.18 43224.01 68896.98 -27891.65 -6635.25 ## IBM_adj MSFT_adj NFLX_adj NTDOY_adj SNY_adj ## Num.Txns 40.00 39.00 33.00 37.00 44.00 ## Num.Trades 20.00 20.00 17.00 19.00 22.00 ## Net.Trading.PL -62316.03 -48328.40 163636.67 -53742.98 -111417.94 ## Avg.Trade.PL 538.07 1436.96 13658.41 1029.21 -1138.49 ## Med.Trade.PL -651.11 -2351.12 2500.00 -1440.00 349.03 ## Largest.Winner 18145.45 15954.84 146848.00 40548.11 12141.77 ## Largest.Loser -9256.52 -13175.01 -31173.09 -10517.40 -17910.22 ## Gross.Profits 45828.56 88916.25 295344.86 79742.00 61804.49 ## Gross.Losses -35067.19 -60177.00 -63151.85 -60186.99 -86851.28 ## Std.Dev.Trade.PL 6103.11 8723.81 39673.15 12193.29 8412.38 ## Percent.Positive 40.00 45.00 64.71 42.11 54.55 ## Percent.Negative 60.00 55.00 35.29 57.89 45.45 ## Profit.Factor 1.31 1.48 4.68 1.32 0.71 ## Avg.Win.Trade 5728.57 9879.58 26849.53 9967.75 5150.37 ## Med.Win.Trade 3157.60 9507.66 8307.00 2560.50 3140.38 ## Avg.Losing.Trade -2922.27 -5470.64 -10525.31 -5471.54 -8685.13 ## Med.Losing.Trade -2153.73 -4955.57 -6835.07 -5832.00 -8354.55 ## Avg.Daily.PL -1294.25 -853.51 12129.90 -1757.68 -3090.09 ## Med.Daily.PL -2449.41 -4777.80 783.80 -3466.08 -1631.09 ## Std.Dev.Daily.PL 5974.29 8677.26 40049.20 11764.24 8256.61 ## Ann.Sharpe -3.44 -1.56 4.81 -2.37 -5.94 ## Max.Drawdown -75697.82 -60653.33 -63238.29 -82945.57 -111417.94 ## Profit.To.Max.Draw -0.82 -0.80 2.59 -0.65 -1.00 ## Avg.WinLoss.Ratio 1.96 1.81 2.55 1.82 0.59 ## Med.WinLoss.Ratio 1.47 1.92 1.22 0.44 0.38 ## Max.Equity 8016.15 3615.01 218329.97 1177.74 0.00 ## 
Min.Equity -67681.68 -57038.32 -5157.36 -81767.83 -111417.94 ## End.Equity -62316.03 -48328.40 163636.67 -53742.98 -111417.94 ## TWTR_adj YHOO_adj ## Num.Txns 9.00 43.00 ## Num.Trades 5.00 22.00 ## Net.Trading.PL 9470.41 19783.95 ## Avg.Trade.PL 5410.40 4846.36 ## Med.Trade.PL -27.01 -1296.00 ## Largest.Winner 0.00 51665.84 ## Largest.Loser -12099.12 -9466.80 ## Gross.Profits 43340.99 155671.99 ## Gross.Losses -16289.00 -49051.99 ## Std.Dev.Trade.PL 20764.84 15313.45 ## Percent.Positive 40.00 31.82 ## Percent.Negative 60.00 68.18 ## Profit.Factor 2.66 3.17 ## Avg.Win.Trade 21670.50 22238.86 ## Med.Win.Trade 21670.50 14091.99 ## Avg.Losing.Trade -5429.67 -3270.13 ## Med.Losing.Trade -5900.00 -3105.01 ## Avg.Daily.PL -5536.07 2341.16 ## Med.Daily.PL -4925.13 -3332.34 ## Std.Dev.Daily.PL 5438.76 15231.86 ## Ann.Sharpe -16.16 2.44 ## Max.Drawdown -71707.02 -68750.59 ## Profit.To.Max.Draw 0.13 0.29 ## Avg.WinLoss.Ratio 3.99 6.80 ## Med.WinLoss.Ratio 3.67 4.54 ## Max.Equity 36769.43 42502.03 ## Min.Equity -34937.58 -55447.71 ## End.Equity 9470.41 19783.95 final_acct <- getAccount(account_st) plot(final_acct$summary$End.Eq["2010/2016"], main = "Portfolio Equity")

Before the fees were accounted for, we made a modest profit, though one worse than if we bought and held SPY. But with this commission structure, we barely break even!

Not surprisingly, our strategy looks even worse when compared to buy-and-hold-SPY.

plot(final_acct$summary$End.Eq["2010/2016"] / 1000000, main = "Portfolio Equity", ylim = c(0.8, 2.5)) lines(SPY$SPY.Adjusted / SPY$SPY.Adjusted[[1]], col = "blue")

The 2% commission is very punishing. With this commission, we would need to profit by about 4% in order for a trade to really, truly be profitable. This is harder than it sounds (and doesn’t even allow for losses), and the losses add up.
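
To see why, here is a quick back-of-the-envelope check (my own illustration, not part of the original analysis), assuming the 2% commission is charged on both the entry and the exit of a position:

# Rough break-even calculation for a 2% per-side commission
fee_rate <- 0.02
breakeven_gain <- (1 + fee_rate) / (1 - fee_rate) - 1
round(breakeven_gain, 4)  # about 0.0408, i.e. the price must rise roughly 4.1% just to break even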

Of course, if you’re using an online trading platform, you may opt for a broker that charges a flat rate, say, $10 per trade. This means that each trade’s profit would need to be at least $10 to beat the fee (and since the fee applies to both the entry and the exit, a round trip costs $20). Also, the larger the trade, the smaller the fee is relative to the trade’s value, making the trade more likely to be profitable.

Let’s suppose a $1,000,000 portfolio pays a $100 fee per trade. I model this below.

rm(list = ls(.blotter), envir = .blotter) # Clear blotter environment stock(symbols_adj, currency = "USD", multiplier = 1) rm.strat(portfolio_st) rm.strat(strategy_st) initPortf(portfolio_st, symbols = symbols_adj, initDate = initDate, currency = "USD") initAcct(account_st, portfolios = portfolio_st, initDate = initDate, currency = "USD", initEq = 1000000) initOrders(portfolio_st, store = TRUE) strategy(strategy_st, store = TRUE) add.indicator(strategy = strategy_st, name = "SMA", arguments = list(x = quote(Cl(mktdata)), n = 20), label = "fastMA") add.indicator(strategy = strategy_st, name = "SMA", arguments = list(x = quote(Cl(mktdata)), n = 50), label = "slowMA") add.signal(strategy = strategy_st, name = "sigCrossover2", # Remember me? arguments = list(columns = c("fastMA", "slowMA"), relationship = "gt"), label = "bull") add.signal(strategy = strategy_st, name = "sigCrossover2", arguments = list(columns = c("fastMA", "slowMA"), relationship = "lt"), label = "bear") add.rule(strategy = strategy_st, name = "ruleSignal", arguments = list(sigcol = "bull", sigval = TRUE, ordertype = "market", orderside = "long", replace = FALSE, TxnFees = -100, # Apply the fee with the trade prefer = "Open", osFUN = osMaxDollarBatch, maxSize = quote(floor(getEndEq(account_st, Date = timestamp) * .1)), tradeSize = quote(floor(getEndEq(account_st, Date = timestamp) * .1)), batchSize = 100), type = "enter", path.dep = TRUE, label = "buy") add.rule(strategy = strategy_st, name = "ruleSignal", arguments = list(sigcol = "bear", sigval = TRUE, orderqty = "all", ordertype = "market", orderside = "long", replace = FALSE, TxnFees = -100, prefer = "Open"), type = "exit", path.dep = TRUE, label = "sell") # Having set up the strategy, we now backtest applyStrategy(strategy_st, portfolios = portfolio_st)

Now for comparison with SPY.

updatePortf(portfolio_st) ## [1] "SMAC-20-50" dateRange <- time(getPortfolio(portfolio_st)$summary)[-1] updateAcct(account_st, dateRange) ## [1] "SMAC-20-50" updateEndEq(account_st) ## [1] "SMAC-20-50" final_acct <- getAccount(account_st) plot(final_acct$summary$End.Eq["2010/2016"] / 1000000, main = "Portfolio Equity", ylim = c(0.8, 2.5)) lines(SPY$SPY.Adjusted / SPY$SPY.Adjusted[[1]], col = "blue")

Not quite as punishing as the 2% rate, though the result is admittedly slightly worse than when no fee was applied at all. And while you may say, "No one would ever go with a broker charging $100 per trade when they can get $10 or $5 per trade," remember that most people don’t have $1,000,000 accounts (and I’m sure that if you applied a $5 fee to a more modest account, you’d discover that $5 per trade actually makes a big difference: consider that a challenge).

Conclusion

Fees are only the beginning of the potential, unmodelled avenues through which a backtest can produce overly optimistic results. There’s slippage, where orders are filled at unwanted prices that lead to larger losses than initially planned; this becomes a serious problem in periods of high volatility. Another potential problem, which cannot be modelled by backtesting at all, is the effect of your own order on the market. Prices in backtesting are treated as givens, but in the real world an order affects the price of the asset bought and could lead others to react in unexpected ways. There’s also the propensity for backtesting to overfit the data: the strategy appears to be profitable when it has merely learned the ups and downs of the historical period.

In a practice that relies on slim profits, where the market may only be just barely inefficient, there’s little room for error. Do what you can to accurately model what a trade would actually look like (especially when quantstrat makes doing so easy). Little price differences can lead to big differences in the end.

# Session details sessionInfo() ## R version 3.3.3 (2017-03-06) ## Platform: i686-pc-linux-gnu (32-bit) ## Running under: Ubuntu 16.04.2 LTS ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## attached base packages: ## [1] methods stats graphics grDevices utils datasets base ## ## other attached packages: ## [1] IKTrading_1.0 roxygen2_6.0.1 ## [3] digest_0.6.12 Rcpp_0.12.10 ## [5] quantstrat_0.9.1739 foreach_1.4.4 ## [7] blotter_0.9.1741 PerformanceAnalytics_1.4.4000 ## [9] FinancialInstrument_1.2.0 quantmod_0.4-7 ## [11] TTR_0.23-1 xts_0.9-7 ## [13] zoo_1.7-14 RWordPress_0.2-3 ## [15] optparse_1.3.2 knitr_1.15.1 ## ## loaded via a namespace (and not attached): ## [1] xml2_1.1.1 magrittr_1.5 getopt_1.20.0 lattice_0.20-35 ## [5] R6_2.2.0 highr_0.6 stringr_1.2.0 tools_3.3.3 ## [9] grid_3.3.3 commonmark_1.2 iterators_1.0.8 bitops_1.0-6 ## [13] codetools_0.2-15 RCurl_1.95-4.8 evaluate_0.10 stringi_1.1.3 ## [17] XMLRPC_0.3-0 XML_3.98-1.5

1 The EMH claims that information about past prices will not help predict future price movements, so technical analysis is pure alchemy. However, we need to remember that price data and future prices are being produced by people who do look at past prices. Suppose a technical indicator indicates a bullish movement in prices; for example, a fast moving average crossing above a slow moving average. Forget whether there’s any inherent reason this should lead to a bullish price movement; there’s a good chance there isn’t any, and all the narratives explaining the rationale of the indicator are rubbish (as I’m inclined to believe). If a trader sees the signal, and believes that enough traders: (a) see the signal; (b) interpret the signal in the same way; and (c) will act on the signal as it recommends, then that trader should do the same. Enough people believe the asset is undervalued to make it act as if it were undervalued, and the trader should go long. So technical signals could become self-fulfilling prophecies, like J.M. Keynes’ beauty contest.

To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website.

How and when: ridge regression with glmnet

Mon, 04/10/2017 - 14:25

(This article was first published on blogR, and kindly contributed to R-bloggers)

@drsimonj here to show you how to conduct ridge regression (linear regression with L2 regularization) in R using the glmnet package, and use simulations to demonstrate its relative advantages over ordinary least squares regression.

 Ridge regression

Ridge regression uses L2 regularisation to penalise the size of the coefficients while the parameters of a regression model are being learned. In the context of linear regression, it is best understood by comparison with Ordinary Least Squares (OLS). OLS defines the objective by which parameter estimates (intercepts and slopes) are calculated: it minimises the sum of squared residuals. L2 regularisation is a small addition to the OLS objective that also penalises large coefficient values, which makes the parameter estimates more stable. The outcome is typically a model that fits the training data less well than OLS but generalises better, because it is less sensitive to extreme variance in the data such as outliers.
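
To make the penalty concrete, the ridge objective can be written as follows (a standard textbook formulation added here for reference, not quoted from the glmnet documentation):

$$\hat{\beta}^{\text{ridge}} = \underset{\beta_0,\,\beta}{\arg\min} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \;+\; \lambda \sum_{j=1}^{p} \beta_j^2$$

Setting lambda to zero recovers OLS, while larger values of lambda shrink the coefficients towards zero. (glmnet also scales the loss term and standardises the predictors by default, so its reported lambda values sit on a slightly different scale.)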

 Packages

We’ll make use of the following packages in this post:

library(tidyverse) library(broom) library(glmnet)
 Ridge regression with glmnet

The glmnet package provides the functionality for ridge regression via glmnet(). Important things to know:

  • Rather than accepting a formula and data frame, it requires a vector input and matrix of predictors.
  • You must specify alpha = 0 for ridge regression.
  • Ridge regression involves tuning a hyperparameter, lambda. glmnet() will generate default values for you. Alternatively, it is common practice to define your own with the lambda argument (which we’ll do).

Here’s an example using the mtcars data set:

y <- mtcars$hp x <- mtcars %>% select(mpg, wt, drat) %>% data.matrix() lambdas <- 10^seq(3, -2, by = -.1) fit <- glmnet(x, y, alpha = 0, lambda = lambdas) summary(fit) #> Length Class Mode #> a0 51 -none- numeric #> beta 153 dgCMatrix S4 #> df 51 -none- numeric #> dim 2 -none- numeric #> lambda 51 -none- numeric #> dev.ratio 51 -none- numeric #> nulldev 1 -none- numeric #> npasses 1 -none- numeric #> jerr 1 -none- numeric #> offset 1 -none- logical #> call 5 -none- call #> nobs 1 -none- numeric

Because, unlike OLS regression done with lm(), ridge regression involves tuning a hyperparameter, lambda, glmnet() runs the model many times for different values of lambda. We can automatically find a value for lambda that is optimal by using cv.glmnet() as follows:

cv_fit <- cv.glmnet(x, y, alpha = 0, lambda = lambdas)

cv.glmnet() uses cross-validation to work out how well each model generalises, which we can visualise as:

plot(cv_fit)

The lowest point in the curve indicates the optimal lambda: the value of lambda that best minimised the error in cross-validation (the plot’s x-axis shows lambda on the log scale). We can extract this value as:

opt_lambda <- cv_fit$lambda.min opt_lambda #> [1] 3.162278

And we can extract all of the fitted models (like the object returned by glmnet()) via:

fit <- cv_fit$glmnet.fit summary(fit) #> Length Class Mode #> a0 51 -none- numeric #> beta 153 dgCMatrix S4 #> df 51 -none- numeric #> dim 2 -none- numeric #> lambda 51 -none- numeric #> dev.ratio 51 -none- numeric #> nulldev 1 -none- numeric #> npasses 1 -none- numeric #> jerr 1 -none- numeric #> offset 1 -none- logical #> call 5 -none- call #> nobs 1 -none- numeric

These are the two things we need to predict new data. For example, predicting values and computing an R2 value for the data we trained on:

y_predicted <- predict(fit, s = opt_lambda, newx = x) # Sum of Squares Total and Error sst <- sum(y^2) sse <- sum((y_predicted - y)^2) # R squared rsq <- 1 - sse / sst rsq #> [1] 0.9318896

The optimal model has accounted for 93% of the variance in the training data.

 Ridge v OLS simulations

By producing more stable parameters than OLS, ridge regression should be less prone to overfitting training data. Ridge regression might, therefore, predict training data less well than OLS, but better generalise to new data. This will particularly be the case when extreme variance in the training data is high, which tends to happen when the sample size is low and/or the number of features is high relative to the number of observations.

Below is a simulation experiment I created to compare the prediction accuracy of ridge regression and OLS on training and test data.

I first set up the functions to run the simulation:

# Compute R^2 from true and predicted values rsquare <- function(true, predicted) { sse <- sum((predicted - true)^2) sst <- sum(true^2) rsq <- 1 - sse / sst # For this post, impose floor... if (rsq < 0) rsq <- 0 return (rsq) } # Train ridge and OLS regression models on simulated data set with `n_train` # observations and a number of features as a proportion to `n_train`, # `p_features`. Return R squared for both models on: # - y values of the training set # - y values of a simualted test data set of `n_test` observations # - The beta coefficients used to simulate the data ols_vs_ridge <- function(n_train, p_features, n_test = 200) { ## Simulate datasets n_features <- floor(n_train * p_features) betas <- rnorm(n_features) x <- matrix(rnorm(n_train * n_features), nrow = n_train) y <- x %*% betas + rnorm(n_train) train <- data.frame(y = y, x) x <- matrix(rnorm(n_test * n_features), nrow = n_test) y <- x %*% betas + rnorm(n_test) test <- data.frame(y = y, x) ## OLS lm_fit <- lm(y ~ ., train) # Match to beta coefficients lm_betas <- tidy(lm_fit) %>% filter(term != "(Intercept)") %>% {.$estimate} lm_betas_rsq <- rsquare(betas, lm_betas) # Fit to training data lm_train_rsq <- glance(lm_fit)$r.squared # Fit to test data lm_test_yhat <- predict(lm_fit, newdata = test) lm_test_rsq <- rsquare(test$y, lm_test_yhat) ## Ridge regression lambda_vals <- 10^seq(3, -2, by = -.1) # Lambda values to search cv_glm_fit <- cv.glmnet(as.matrix(train[,-1]), train$y, alpha = 0, lambda = lambda_vals, nfolds = 5) opt_lambda <- cv_glm_fit$lambda.min # Optimal Lambda glm_fit <- cv_glm_fit$glmnet.fit # Match to beta coefficients glm_betas <- tidy(glm_fit) %>% filter(term != "(Intercept)", lambda == opt_lambda) %>% {.$estimate} glm_betas_rsq <- rsquare(betas, glm_betas) # Fit to training data glm_train_yhat <- predict(glm_fit, s = opt_lambda, newx = as.matrix(train[,-1])) glm_train_rsq <- rsquare(train$y, glm_train_yhat) # Fit to test data glm_test_yhat <- predict(glm_fit, s = opt_lambda, newx = as.matrix(test[,-1])) glm_test_rsq <- rsquare(test$y, glm_test_yhat) data.frame( model = c("OLS", "Ridge"), betas_rsq = c(lm_betas_rsq, glm_betas_rsq), train_rsq = c(lm_train_rsq, glm_train_rsq), test_rsq = c(lm_test_rsq, glm_test_rsq) ) } # Function to run `ols_vs_ridge()` `n_replications` times repeated_comparisons <- function(..., n_replications = 5) { map(seq(n_replications), ~ ols_vs_ridge(...)) %>% map2(seq(.), ~ mutate(.x, replicate = .y)) %>% reduce(rbind) }

Now run the simulations for varying numbers of training data and relative proportions of features (takes some time):

d <- purrr::cross_d(list( n_train = seq(20, 200, 20), p_features = seq(.55, .95, .05) )) d <- d %>% mutate(results = map2(n_train, p_features, repeated_comparisons))

Visualise the results…

For varying numbers of training data (averaging over number of features), how well do both models predict the training and test data?

d %>% unnest() %>% group_by(model, n_train) %>% summarise( train_rsq = mean(train_rsq), test_rsq = mean(test_rsq)) %>% gather(data, rsq, contains("rsq")) %>% mutate(data = gsub("_rsq", "", data)) %>% ggplot(aes(n_train, rsq, color = model)) + geom_line() + geom_point(size = 4, alpha = .3) + facet_wrap(~ data) + theme_minimal() + labs(x = "Number of training observations", y = "R squared")

As hypothesised, OLS fits the training data better but Ridge regression better generalises to new test data. Further, these effects are more pronounced when the number of training observations is low.

For varying relative proportions of features (averaging over numbers of training data) how well do both models predict the training and test data?

d %>% unnest() %>% group_by(model, p_features) %>% summarise( train_rsq = mean(train_rsq), test_rsq = mean(test_rsq)) %>% gather(data, rsq, contains("rsq")) %>% mutate(data = gsub("_rsq", "", data)) %>% ggplot(aes(p_features, rsq, color = model)) + geom_line() + geom_point(size = 4, alpha = .3) + facet_wrap(~ data) + theme_minimal() + labs(x = "Number of features as proportion\nof number of observation", y = "R squared")

Again, OLS has performed slightly better on training data, but Ridge better on test data. The effects are more pronounced when the number of features is relatively high compared to the number of training observations.

The following plot helps to visualise the relative advantage (or disadvantage) of Ridge to OLS over the number of observations and features:

d %>% unnest() %>% group_by(model, n_train, p_features) %>% summarise(train_rsq = mean(train_rsq), test_rsq = mean(test_rsq)) %>% group_by(n_train, p_features) %>% summarise(RidgeAdvTrain = train_rsq[model == "Ridge"] - train_rsq[model == "OLS"], RidgeAdvTest = test_rsq[model == "Ridge"] - test_rsq[model == "OLS"]) %>% gather(data, RidgeAdvantage, contains("RidgeAdv")) %>% mutate(data = gsub("RidgeAdv", "", data)) %>% ggplot(aes(n_train, p_features, fill = RidgeAdvantage)) + scale_fill_gradient2(low = "red", high = "green") + geom_tile() + theme_minimal() + facet_wrap(~ data) + labs(x = "Number of training observations", y = "Number of features as proportion\nof number of observation") + ggtitle("Relative R squared advantage of Ridge compared to OLS")

This shows the combined effect: that Ridge regression better transfers to test data when the number of training observations is low and/or the number of features is high relative to the number of training observations. OLS performs slightly better on the training data under similar conditions, indicating that it is more prone to overfitting training data than when ridge regularisation is employed.

 Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

To leave a comment for the author, please follow the link and comment on their blog: blogR.

Test driving Python integration in R, using the ‘reticulate’ package

Mon, 04/10/2017 - 08:01
Introduction

Not so long ago RStudio released the R package ‘reticulate‘, an R interface to Python. Of course, it was already possible to execute Python scripts from within R, but this integration takes it one step further: imported Python modules, classes and functions can be called inside an R session as if they were native R functions.

Below you’ll find some screenshot code snippets of using certain Python modules within R with the reticulate package. On my GitHub page you’ll find the R files from which these snippets were taken.
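
As a flavour of what the screenshots show, here is a minimal sketch of the reticulate workflow (my own example, not one of the snippets from the GitHub page; it assumes the Python modules are installed):

library(reticulate)

# Import a Python module as an R object
os <- import("os")
os$getcwd()              # call Python functions with $, as if they were R functions

# Third-party modules work the same way
np <- import("numpy")
np$mean(c(1, 2, 3, 4))   # R vectors are converted to Python objects automatically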

Using python packages

The nice thing about reticulate in RStudio is the support for code completion. When you have imported a python module, RStudio will recognize the methods that are available in the python module:

The clarifai module

Clarifai provides a set of computer vision APIs for image recognition, face detection, extracting tags, etc. There is an official Python module, and there is also an R package by Gaurav Sood, but it exposes less functionality. So I am going to use the Python module in R. The following code snippet shows how easy it is to call Python functions.

The output returned from the clarifai call is a nested list and can be quite intimidating at first sight. To browse through these nested lists and to get a better idea of what is in them, you can use the package listviewer:
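
For instance, a sketch along these lines (the nested list here is a made-up stand-in for the actual clarifai response):

library(listviewer)

# A made-up nested list standing in for the clarifai output
res <- list(outputs = list(list(data = list(
  concepts = list(list(name = "dog", value = 0.98),
                  list(name = "pet", value = 0.95))))))

jsonedit(res)   # opens an interactive widget for exploring the nested structure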

The pattern.nl module

The pattern.nl module contains a fast part-of-speech tagger for Dutch, sentiment analysis, and tools for Dutch verb conjugation and noun singularization & pluralization. At the moment it does not support Python 3. That is not a big deal: I am using Anaconda and created a Python 2.7 environment in which to install pattern.nl.

The nice thing about the reticulate package is that it allows you to choose a specific Python environment to use.
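
A sketch of how that selection might look (the environment name "py27-pattern" is hypothetical):

library(reticulate)

# Point reticulate at a specific conda environment before importing modules;
# use_virtualenv() or use_python() work similarly for other setups
use_condaenv("py27-pattern", required = TRUE)
pattern_nl <- import("pattern.nl")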

The pytorch module

pytorch is a Python package that provides tensor computations and deep neural networks. There is no ‘R torch’ equivalent, but we can use reticulate in R. There is an example of training a logistic regression in pytorch; see the code here. It takes just a little rewriting of this code to make it work in R. See the first few lines in the figure below.

Conclusion

As a data scientist you should know both R and Python; the reticulate package is no excuse for not learning Python! However, the reticulate package can be very useful if you want to do all of your analysis in the RStudio environment. It works very well.

For example, I have used rvest to scrape some Dutch news texts, then used the Python module pattern.nl for Dutch sentiment, and wrote an R Markdown document to present the results. In that workflow, the reticulate package is a nice way to keep everything in one environment.

Cheers, Longhow

RSiteCatalyst Version 1.4.12 (and 1.4.11) Release Notes

Mon, 04/10/2017 - 02:00

I released version 1.4.12 of RSiteCatalyst before I wrote the release notes for version 1.4.11, so this blog post will treat both releases as one. Users should upgrade directly to version 1.4.12 as the releases are cumulative in nature.

Get* method additions

Two Get* methods were added in this release: GetClickMapReporting and GetPreviousServerCalls, mostly for completeness. Analytics users will likely not need to use these methods, but they are useful for generating documentation.

Bug fixes
  • Fixed GetLogin function, adding selected_ims_group_list parameter to response (caused test suite failure)
  • Fixed issue with ViewProcessingRules where nested rules threw errors (#214)
  • Fixed issue with GetMarketingChannelRules where nested rules threw errors, matches column duplicated values across rows (#180)
  • Added ability to use a segment in QueueDataWarehouse function, which was previously implemented incorrectly (#216)
  • Fixed issue with QueueDataWarehouse not returning the proper number of results when enqueueOnly = FALSE
  • Fixed encoding for QueueDataWarehouse to correctly use UTF-8-BOM (#198)
  • Fixed parser for GetFeeds, to unnest ‘activity’ data frame into discrete columns
  • Fixed issue where the message "Error in if (!is.null(elements[i, ]$classification) && nchar(elements[i, : missing value where TRUE/FALSE needed" was displayed when using multiple elements in a Queue* function (#207)
Community Contributions (An Adobe Summit bounce?!)

In the past month, the number of GitHub issues submitted has increased dramatically, a good problem to have!

I encourage all users of the software to continue reporting bugs via GitHub issues, especially if you can provide a working code example. Even better, a fix via pull request will ensure that your bug is addressed in a timely manner and to the benefit of others in the community.

April R Course Finder update: Logistics Regression, New platforms and Complete Machine Learning

Sun, 04/09/2017 - 23:03

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Last year we launched R Course Finder, an online directory that helps you to find the right R course quickly. With so many R courses available online, we thought it was a good idea to offer a tool that helps people to compare these courses, before they decide where to spend their valuable time and (sometimes) money.

If you haven’t looked at it yet, go to the R Course Finder now by clicking here.

Last month we added 17 courses to the Course Finder. Currently we are at 171 courses on 18 different online platforms, and 3 offline Learning Institutes.

Newly added platforms include:

Highlighted Courses
Learning Path: Real-World Data Mining With R

When working with data, we often have to start with data mining; Learning Path: Real-World Data Mining With R offers a complete learning path for this essential task.

Logistic Regression Workshop using R – Step by Step Modeling

The course Logistic Regression Workshop using R – Step by Step Modeling really piqued our attention, as it seems to be one of the rare cases that promote content quality over advertisement. Its relatively short course description does not do justice to the content and in-depth discussion of logistic regression in R.

R: Complete Machine Learning Solutions

We have many different courses in our directory that cover the current hype in data science, machine learning. This course, R: Complete Machine Learning Solutions, covers the whole process end to end and leaves no room for interpretation. It can be very useful as a back reference for anyone who knows some of these concepts but wants to return and read about them.

Besides these courses, we also added:

Introduction to R: ETH Zurich
Data Science Certification Training – R Programming
Creating Interactive Presentations with Shiny and R
R: Complete Data Visualization Solutions
R Programming from Scratch
Volatility Trading Analysis with R

How you can help to make R Course Finder better
  • If you miss a course that is not included yet, please post a reminder in the comments and we’ll add it.
  • If you miss an important filter or search functionality, please let us know in the comments below.
  • If you already took one of the courses, please let all of us know about your experiences in the review section; an example is available here.

And, last but not least: If you like R Course Finder, please share this announcement with friends and colleagues using the buttons below.

Related exercise sets:
  1. Basic Tree 2 Exercises
  2. Multiple Regression (Part 3) Diagnostics
  3. Intermediate Tree 2
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Shiny server series part 1: setting up

Sun, 04/09/2017 - 17:46

(This article was first published on Jasper Ginn's blog, and kindly contributed to R-bloggers)

This guide is part of a series on setting up your own private server running shiny apps. There are many guides out there with great advice on how to set up an R shiny server and related software. I try to make a comprehensive guide based in part on those resources as well as my own experiences. I always aim to properly attribute information to its respective source. If you notice an issue, please contact me.

Welcome to the first part of the shiny server series!

Recently, I have had to set up several shiny servers that required some custom tweaks like user authentication, running shiny server on multiple ports and setting up SSL. With the help of numerous resources, I got everything running smoothly. I decided to publish these steps on my blog so that others might also use them. In this first part, we’ll be setting up the private server and installing shiny.

Resources used for this part

This guide draws from other guides on the web. The guides below were used to put together this resource.

  1. This guide shows you how to install R on Ubuntu 16.04
  2. Dean Attali’s blog post is one of the best and most comprehensive guides on setting up your own shiny server.
  3. This guide is used to couple a domain name to your server.
What you’ll need

In order to complete this entire series, I expect you have access to the following:

  • A Virtual Private Server (VPS, see below)
  • A domain name, such as www.iamadomainname.com. These are available on e.g. namecheap.

If you have these two things, you may proceed!

A VP…What?

A Virtual Private Server (VPS) is a service that you can purchase from providers such as DigitalOcean, Amazon AWS and Google Cloud Services. Instead of buying a physical server and planting it in your home somewhere, you essentially rent a small portion of a server from a provider which then looks like a normal server environment: you have your own OS and you can manage your own users and so on.

In this guide, I’ll be using DigitalOcean, but you can use any VPS from any provider as long as it runs on Ubuntu 16.04 and has at least 512MB RAM and one processor. You can use this link to sign up to DigitalOcean. You’ll receive $10 in credit.

Setting up a VPS on DigitalOcean

After signing up to DO, you will see the following screen

There won’t be much going on here … yet! Click on ‘create droplet’ to set up your VPS. The options are relatively straightforward here. You want to select a Ubuntu 16.04 distribution.

Then, you can decide how much power your droplet will have. I chose the smallest version, which works perfectly fine, but feel free to take a bigger size if you like.

Select an appropriate data centre (usually that means choosing one in or near your own country)

The final set of options can look arcane if you’re not used to them. Fortunately, the only option you really need to pay attention to is the SSH key.

An SSH key functions as a unique identifier for you, the owner of the VPS, and adds a layer of security to your server. DigitalOcean provides a tutorial for Windows users and Unix users on their website. This step is optional: you don’t have to do it, but I strongly recommend it.

After you’ve set up the SSH access, you can choose a suitable name for your droplet and press ‘create’.

Congratulations! You are now the proud owner of your own VPS.

Accessing your server and setting up shiny

Click on ‘droplets’ at the top right of the DO menu. Copy the IP address of your droplet and open up a terminal or PuTTY. Log into your server:

ssh root@<ip-address>
Setting up a user

Since we don’t want to use the root user to install everything, we’ll create a new user called ‘shiny’. When you are prompted to enter a name, office number etc. you can just hit enter:

adduser shiny

Next, we give the shiny user admin rights and switch to the shiny user:

# Give shiny admin rights
gpasswd -a shiny sudo
# Switch to shiny user
su shiny
Installing dependencies

Firstly, update the list of packages:

sudo apt-get update

Then add the CRAN repository to this list (copy these commands one by one):

# Add a trusted key
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
# Add the CRAN repo
sudo add-apt-repository 'deb [arch=amd64,i386] https://cran.rstudio.com/bin/linux/ubuntu xenial/'
# Update the list again
sudo apt-get update

If your droplet has 1G of RAM or less, we need to set up some swap space. This will allow the server to use hard-drive space as additional RAM when it needs to:

# Set up swap
sudo /bin/dd if=/dev/zero of=/var/swap.1 bs=1M count=1024
sudo /sbin/mkswap /var/swap.1
sudo /sbin/swapon /var/swap.1
sudo sh -c 'echo "/var/swap.1 swap swap defaults 0 0 " >> /etc/

Now, we’re ready to install R and other dependencies for shiny server:

# Install R
sudo apt-get -y install r-base r-base-dev
# Install shiny server dependencies
sudo apt-get -y install libapparmor1 gdebi-core
# These are used by R libraries
sudo apt-get install -y libxml2-dev libcurl4-openssl-dev libssl-dev

Copy and execute the following lines to install these R packages:

# Install shiny
sudo su - -c "R -e \"install.packages('shiny', repos='http://cran.rstudio.com/')\""
# Install devtools
sudo su - -c "R -e \"install.packages('devtools', repos='http://cran.rstudio.com/')\""
# Install digest
sudo su - -c "R -e \"install.packages('digest', repos='http://cran.rstudio.com/')\""

Now we’re ready to install shiny server:

# Download rstudio server:
wget https://download2.rstudio.org/rstudio-server-1.0.136-amd64.deb
# Download shiny server:
wget https://download3.rstudio.org/ubuntu-12.04/x86_64/shiny-server-1.5.1.834-amd64.deb
# Install
sudo gdebi rstudio-server-1.0.136-amd64.deb
sudo gdebi shiny-server-1.5.1.834-amd64.deb

The rstudio and shiny servers are now accessible on ports 8787 and 3838: you can access them by navigating to <ip-address>:8787 (or :3838) in a browser.

While this is fine, it’s a bit difficult to remember and not ideal for sharing with others. This is easily solvable! We can simply point a domain name to our new server.

Pointing a domain name to your server

This step involves two actions: one from the provider you bought the domain name from, and one on the DO dashboard.

Changing name servers

The first thing we’ll do is to change the name servers of your domain name so that they point to the DO name servers. Most providers will give this option and the process is largely the same, but I’ll be using my namecheap account.

Once you are logged into your namecheap account, you will see the following screen:

Click on the ‘manage’ button for the domain name you want to use. You will now see the following screen:

Click on the arrow (surrounded by green box) and select ‘custom DNS’ (surrounded by blue box). Add the DO name servers as shown in the image below:

Save your changes and close the tab.

Setting up A and CNAME records on DigitalOcean

In your DO dashboard, click on the ‘networking’ option

Enter your domain name (without ‘www’) and click ‘add domain’

We’re going to add one A-record and two CNAME records. Firstly, add ‘@’ under the hostname as in the image below and click ‘Create Record’:

Next, create a CNAME record. Under ‘hostname’ enter ‘*’, and under ‘is an alias of’, enter ‘@‘, as shown in the picture below:

Next, create another CNAME record where you input ‘www’ under ‘hostname’ and ‘@’ under ‘is an alias of’:

The entire setup should look like the image below

You are now done setting up your domain name. Nominally, it can take up to 24 hours before your domain name points to your server, but it usually takes only a few hours.

Configuring nginx on your server

OK. So now your server can be accessed using a custom domain name. This is great, but the rstudio and shiny servers are still only accessible using the ports on which they are hosted. We’ll now set up nginx on your server to manage the routes to the shiny and rstudio servers.

Go back to the terminal where you’re logged into the server, and execute the following:

sudo apt-get -y install nginx

Now, make a backup of the nginx configuration:

sudo cp /etc/nginx/sites-enabled/default /etc/nginx/sites-enabled/default-backup

Open up the configuration file

sudo nano /etc/nginx/sites-enabled/default

As in the image below, add ‘localhost’ to the server name (green box).

Then, copy the following lines to the config file and paste them in the location shown in the image (area surrounded by pink box).

location /apps/ {
  proxy_pass http://127.0.0.1:3838/;
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "upgrade";
}
location /editor/ {
  proxy_pass http://127.0.0.1:8787/;
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "upgrade";
}

Hit control+x and then Y+enter, and your changes will be saved. Your rstudio and shiny servers are now accessible using e.g. www.<your-domain-name>.<extension>/apps/

Adding shiny applications

The easiest way to install shiny applications is to create a github or bitbucket repository for the shiny application, which you then clone into the shiny server directory. However, first we need to set up user rights for our shiny user.

Setting up permissions

First, navigate to the shiny server directory where the apps live:

cd /srv/shiny-server

Then, give the shiny user read/write permissions:

sudo groupadd shiny-apps
sudo usermod -aG shiny-apps shiny
sudo chown -R shiny:shiny-apps .
sudo chmod g+w .
sudo chmod g+s .
Example: hosting an application

This part shows how you can store your shiny apps on github and host them on your server.

Sign up for a github account if you have not already. You can also choose to use bitbucket: the process is largely the same.

Create a repository and follow the steps to add files to it.

Follow the steps on the next screen to add files to your repository

When you’ve pushed your server.R and ui.R files, it should look like this

You can now clone this repository in your server by clicking the ‘clone or download’ button (surrounded by the pink box). Copy the url and go back to the terminal where you are logged into your server. Then, execute the following commands:

# Navigate to the folder with shiny applications
cd /srv/shiny-server
# Clone the repository
git clone https://github.com/JasperHG90/TenK-example.git
# Install dependencies
sudo su - -c "R -e \"devtools::install_github('JasperHG90/TenK')\""

You can now visit the shiny application by navigating to www.<your-domain-name>.<extension>/apps/TenK-example in a browser.

Some notes on installing R packages

When installing R packages, it’s best to use the following code:

sudo su - -c "R -e \"install.packages('<package-name>', repos='http://cran.rstudio.com/')\""

This ensures that packages are installed globally and that every user has access to them.

Using packrat

You can use packrat – a package management system authored by rstudio – for shiny applications. Packrat is quite useful because it creates a local repository (with, for example, specific package versions) of packages. The advantage of this approach is that it improves the portability of the application and ensures that apps that use the same dependencies but, for example, different versions of that dependency can easily run on the same system. On the flip side, it does not work very well with certain specific libraries such as rgdal. You can read more about packrat on this page.

Setting up packrat

Setting up packrat is easy. In rstudio, start a new project (or open an existing one).

Go to the packrat section and check the box

Packrat will now make a local repository with necessary packages. This can take a while.

Don’t forget to push these changes to your github repository.
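
If you prefer the console over the RStudio checkbox, a rough equivalent looks like this (a sketch only; the original post uses the GUI):

install.packages("packrat")
packrat::init()        # create the project-local library and packrat files
packrat::snapshot()    # record package versions whenever dependencies change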

Cloning the repo and restoring the library

Getting the repository with the packrat library is very similar to the process outlined above.

First, install the packrat package:

sudo su - -c "R -e \"install.packages('packrat', repos='http://cran.rstudio.com/')\""

Navigate to the /srv/shiny-server folder and clone the repository:

# Navigate to the folder
cd /srv/shiny-server
# Clone the repo
git clone https://github.com/JasperHG90/tenk-example-2.git

Now, enter the repository and start an R session:

# Enter the folder
cd tenk-example-2
# Start an R session
R

Packrat should automatically kick in and start installing packages. If it doesn’t, execute the following:

packrat::restore()

Execute q() to quit the R session

Future content in this series

This wraps up the first part of the shiny server series. You now have a VPS with a custom domain name and running rstudio plus shiny servers. This setup forms the basis on which the next installments of this series build.

Subsequent parts in this series will focus on the following topics

Part 2: running shiny server on multiple ports

We’ll run shiny server on multiple ports so that we have one server to host public shiny apps and another server to host private apps for a select group of people. We’ll add user authentication to this private server in part 4 of this series.

Part 3: adding an SSL security certificate

Before we can add user authentication to shiny, we want to set up an SSL certificate to strengthen security on our server. This encrypts traffic to and from your server.

Part 4: adding user authentication to your server using Auth0

In this part, we’ll use Auth0, a free service for up to 7,000 users, to add authentication to our shiny server running private applications. To this end, we’ll slightly adapt the excellent Auth0 tutorial to work with our setup.

Part 5: hosting a static website on your server

Most websites have a front end and a back end. The front end is what you see when you visit a page. The back end usually consists of one or more databases and services such as user authentication. Conversely, a static website is simply a front end: you pre-compile all necessary materials before hosting it on a server. Static websites do not require databases and are typically fast to load. Examples of services that help you make static websites are Jekyll and Pelican, to name a few.

You can use the GitHub approach outlined above for shiny applications for static websites as well. You create a repository, upload the static website, and then clone it in the directory from which nginx hosts its example page.

To leave a comment for the author, please follow the link and comment on their blog: Jasper Ginn's blog.

Web Scraping and Applied Clustering Global Happiness and Social Progress Index

Sun, 04/09/2017 - 15:02

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Increasing amount of data is available on the web. Web scraping is a technique developed to extract data from web pages automatically and transforming it into a data format for further data analysis and insights. Applied clustering is an unsupervised learning technique that refers to a family of pattern discovery and data mining tools with applications in machine learning, bioinformatics, image analysis, and segmentation of consumer types, among others. R offers several packages and tools for web scraping, data manipulation, statistical analysis and machine learning. The motivation for this post is to illustrate the applications of web scraping, dimension reduction and applied clustering tools in R.
There are two separate data sets for web scraping in this post. The first data set is from the recently released World Happiness Report 2017 by the United Nations Sustainable Development Solutions Network. The 2017 report, launched on March 20, the International Day of Happiness, contained global rankings for happiness and social well-being. The second data set for web scraping is the 2015 Social Progress Index of countries in the world. The Social Progress Index has been described as measuring the extent to which countries provide for the social and environmental needs of their citizens. In this exercise, the two data sets, joined by the country column, were pre-processed prior to principal component analysis (PCA) and clustering. The goal of the clustering approach in this post was to segment the rows of the over 150 countries in the data into separate groups (clusters). The expectation is that sets of countries within a cluster are as similar as possible to each other in terms of happiness and social progress, and as dissimilar as possible to the sets of countries assigned to different clusters.

Load Required Packages

require(rvest) require(magrittr) require(plyr) require(dplyr) require(reshape2) require(ggplot2) require(FactoMineR) require(factoextra) require(cluster) require(useful)
Web Scraping and Data Pre-processing
# Import the first data set from the site
url1 <- "https://en.wikipedia.org/wiki/World_Happiness_Report"
happy <- url1 %>% read_html() %>% html_nodes("table") %>% extract2(1) %>% html_table()
# inspect imported data structure
str(happy) ## 'data.frame': 155 obs. of 12 variables: ## $ Overall Rank : int 1 2 3 4 5 6 7 8 9 10 ... ## $ Change in rank : int 3 -1 0 -2 0 1 -1 0 0 0 ... ## $ Country : chr "Norway" "Denmark" "Iceland" "Switzerland" ... ## $ Score : num 7.54 7.52 7.5 7.49 7.47 ... ## $ Change in score : num 0.039 -0.004 0.003 -0.015 0.056 0.038 -0.088 -0.02 -0.029 -0.007 ... ## $ GDP per capita : num 1.62 1.48 1.48 1.56 1.44 ... ## $ Social support : num 1.53 1.55 1.61 1.52 1.54 ... ## $ Healthy life expectancy : num 0.797 0.793 0.834 0.858 0.809 0.811 0.835 0.817 0.844 0.831 ... ## $ Freedom to make life choices: num 0.635 0.626 0.627 0.62 0.618 0.585 0.611 0.614 0.602 0.613 ... ## $ Generosity : num 0.362 0.355 0.476 0.291 0.245 0.47 0.436 0.5 0.478 0.385 ... ## $ Trust : num 0.316 0.401 0.154 0.367 0.383 0.283 0.287 0.383 0.301 0.384 ... ## $ Dystopia : num 2.28 2.31 2.32 2.28 2.43 ...

Pre-process the first data set for analysis:

## Exclude columns with ranks and scores, retain the other columns
happy <- happy[c(3,6:11)]
### rename column headers
colnames(happy) <- gsub(" ", "_", colnames(happy), perl=TRUE)
names(happy) ## [1] "Country" "GDP_per_capita" ## [3] "Social_support" "Healthy_life_expectancy" ## [5] "Freedom_to_make_life_choices" "Generosity" ## [7] "Trust"
### standardize names of selected countries to conform to country names in the map database
happy$Country <- as.character(mapvalues(happy$Country, from = c("United States", "Congo (Kinshasa)", "Congo (Brazzaville)", "Trinidad and Tobago"), to = c("USA","Democratic Republic of the Congo", "Democratic Republic of the Congo", "Trinidad")))
Import the second data set from the web
### scrape social progress index data report from the site
url2 <- "https://en.wikipedia.org/wiki/List_of_countries_by_Social_Progress_Index"
social <- url2 %>% read_html() %>% html_nodes("table") %>% .[[2]] %>% html_table(fill=T)
# check imported data structure
str(social) ## 'data.frame': 163 obs. of 9 variables: ## $ Country : chr "Norway" "Sweden" "Switzerland" "Iceland" ... ## $ Rank (SPI) : chr "1" "2" "3" "4" ... ## $ Social Progress Index : chr "88.36" "88.06" "87.97" "87.62" ... ## $ Rank (BHN) : chr "9" "8" "2" "6" ... ## $ Basic Human Needs : chr "94.8" "94.83" "95.66" "95" ... ## $ Rank (FW) : chr "1" "3" "2" "4" ... ## $ Foundations of Well-being: chr "88.46" "86.43" "86.5" "86.11" ... ## $ Rank (O) : chr "9" "5" "10" "11" ... ## $ Opportunity : chr "81.82" "82.93" "81.75" "81.73" ...

Pre-process the second data set for analysis:

### Again, exclude columns with ranks, keep the rest social <- social[c(1,5,7,9)] ### rename column headers names(social) <- c("Country", "basic_human_needs", "foundations_well_being", "opportunity") ### Standardize country names to confirm with country names in the map dataset social$Country <- as.character(mapvalues(social$Country, from = c("United States", "Côte d'Ivoire","Democratic Republic of Congo", "Congo", "Trinidad and Tobago"), to=c("USA", "Ivory Cost","Democratic Republic of the Congo", "Democratic Republic of the Congo", "Trinidad"))) ## coerce character data columns to numeric social[, 2:4] <- sapply(social[, 2:4], as.numeric) ## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion ## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion ## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion

Join the two imported data sets

### perform left join soc.happy <- left_join(happy, social, by = c('Country' = 'Country')) ### check for missing values in the combined data set mean(is.na(soc.happy[, 2:10])) ## [1] 0.0353857

Left joining the two data files introduced missing values (approximately 3.5% of the total data set) in the combined data set. R offers a variety of imputation algorithms to populate missing values. In this post, a median imputation strategy was applied by replacing missing values with the median for each column.

### median imputation for(i in 1:ncol(soc.happy[, 2:10])) { soc.happy[, 2:10][is.na(soc.happy[, 2:10][,i]), i] <- median(soc.happy[, 2:10][,i], na.rm = TRUE) } ### summary of the raw data summary(soc.happy[,2:10]) ## GDP_per_capita Social_support Healthy_life_expectancy ## Min. :0.0000 Min. :0.000 Min. :0.0000 ## 1st Qu.:0.6600 1st Qu.:1.042 1st Qu.:0.3570 ## Median :1.0550 Median :1.252 Median :0.6050 ## Mean :0.9779 Mean :1.187 Mean :0.5474 ## 3rd Qu.:1.3150 3rd Qu.:1.412 3rd Qu.:0.7190 ## Max. :1.8710 Max. :1.611 Max. :0.9490 ## Freedom_to_make_life_choices Generosity Trust ## Min. :0.0000 Min. :0.0000 Min. :0.0000 ## 1st Qu.:0.3010 1st Qu.:0.1530 1st Qu.:0.0570 ## Median :0.4370 Median :0.2320 Median :0.0890 ## Mean :0.4078 Mean :0.2461 Mean :0.1224 ## 3rd Qu.:0.5140 3rd Qu.:0.3220 3rd Qu.:0.1530 ## Max. :0.6580 Max. :0.8380 Max. :0.4640 ## basic_human_needs foundations_well_being opportunity ## Min. :26.81 Min. :44.12 Min. :21.12 ## 1st Qu.:61.94 1st Qu.:63.09 1st Qu.:41.43 ## Median :75.73 Median :69.41 Median :49.55 ## Mean :71.82 Mean :68.29 Mean :52.12 ## 3rd Qu.:84.73 3rd Qu.:74.79 3rd Qu.:62.83 ## Max. :96.03 Max. :88.46 Max. :86.58

An important step before meaningful clustering and dimension reduction is transforming and scaling the variables. The simple function below rescales all variables to values between 0 and 1.

## transform variables to a range of 0 to 1
range_transform <- function(x) {
     (x - min(x))/(max(x)-min(x))
}
soc.happy[,2:10] <- as.data.frame(apply(soc.happy[,2:10], 2, range_transform))

### summary of transformed data shows success of transformation
summary(soc.happy[,2:10])

##  GDP_per_capita   Social_support   Healthy_life_expectancy
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.3528   1st Qu.:0.6468   1st Qu.:0.3762
##  Median :0.5639   Median :0.7772   Median :0.6375
##  Mean   :0.5227   Mean   :0.7367   Mean   :0.5768
##  3rd Qu.:0.7028   3rd Qu.:0.8765   3rd Qu.:0.7576
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
##  Freedom_to_make_life_choices   Generosity       Trust
##  Min.   :0.0000                 Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.4574                 1st Qu.:0.1826   1st Qu.:0.1228
##  Median :0.6641                 Median :0.2768   Median :0.1918
##  Mean   :0.6198                 Mean   :0.2936   Mean   :0.2638
##  3rd Qu.:0.7812                 3rd Qu.:0.3842   3rd Qu.:0.3297
##  Max.   :1.0000                 Max.   :1.0000   Max.   :1.0000
##  basic_human_needs foundations_well_being  opportunity
##  Min.   :0.0000    Min.   :0.0000          Min.   :0.0000
##  1st Qu.:0.5075    1st Qu.:0.4278          1st Qu.:0.3103
##  Median :0.7067    Median :0.5704          Median :0.4343
##  Mean   :0.6503    Mean   :0.5451          Mean   :0.4735
##  3rd Qu.:0.8368    3rd Qu.:0.6917          3rd Qu.:0.6372
##  Max.   :1.0000    Max.   :1.0000          Max.   :1.0000

Although the previous transformation would have been adequate for the next steps of this analysis, the function shown below re-scales all variables to a mean of 0 and a standard deviation of 1.

## Scaling variables to a mean of 0 and standard deviation of 1
sd_scale <- function(x) {
     (x - mean(x))/sd(x)
}
soc.happy[,2:10] <- as.data.frame(apply(soc.happy[,2:10], 2, sd_scale))

### summary of the scaled data
summary(soc.happy[,2:10])

##  GDP_per_capita    Social_support    Healthy_life_expectancy
##  Min.   :-2.3045   Min.   :-4.1377   Min.   :-2.2984
##  1st Qu.:-0.7492   1st Qu.:-0.5050   1st Qu.:-0.7994
##  Median : 0.1817   Median : 0.2271   Median : 0.2419
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000
##  3rd Qu.: 0.7944   3rd Qu.: 0.7849   3rd Qu.: 0.7206
##  Max.   : 2.1046   Max.   : 1.4787   Max.   : 1.6863
##  Freedom_to_make_life_choices   Generosity        Trust
##  Min.   :-2.7245                Min.   :-1.8320   Min.   :-1.2102
##  1st Qu.:-0.7137                1st Qu.:-0.6929   1st Qu.:-0.6467
##  Median : 0.1949                Median :-0.1048   Median :-0.3304
##  Mean   : 0.0000                Mean   : 0.0000   Mean   : 0.0000
##  3rd Qu.: 0.7093                3rd Qu.: 0.5652   3rd Qu.: 0.3023
##  Max.   : 1.6713                Max.   : 4.4067   Max.   : 3.3767
##  basic_human_needs  foundations_well_being  opportunity
##  Min.   :-2.6059    Min.   :-2.5434         Min.   :-1.9025
##  1st Qu.:-0.5721    1st Qu.:-0.5471         1st Qu.:-0.6559
##  Median : 0.2263    Median : 0.1180         Median :-0.1575
##  Mean   : 0.0000    Mean   : 0.0000         Mean   : 0.0000
##  3rd Qu.: 0.7474    3rd Qu.: 0.6842         3rd Qu.: 0.6576
##  Max.   : 1.4016    Max.   : 2.1228         Max.   : 2.1154

Simple correlation analysis

Now the data is ready for exploring how the variables relate to each other.

corr <- cor(soc.happy[, 2:10], method="pearson")

ggplot(melt(corr, varnames=c("x", "y"), value.name="correlation"),
       aes(x=x, y=y)) +
      geom_tile(aes(fill=correlation)) +
      scale_fill_gradient2(low="green", mid="yellow", high="red",
                           guide=guide_colorbar(ticks=FALSE, barheight = 5),
                           limits=c(-1,1)) +
      theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
      labs(title="Heatmap of Correlation Matrix", x=NULL, y=NULL)

Principal Component Analysis

Principal component analysis is a multivariate statistical technique widely used for examining relationships among several quantitative variables. PCA identifies patterns in the variables in order to reduce the dimensionality of the data set for downstream analyses such as multiple regression and clustering.

soc.pca <- PCA(soc.happy[, 2:10], graph=FALSE)
fviz_screeplot(soc.pca, addlabels = TRUE, ylim = c(0, 65))

Here is the plot:

The percentages on each bar indicate the proportion of total variance explained by the respective principal component. As seen in the scree plot, the first three principal components accounted for ~80% of the total variance.

soc.pca$eig

##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1  5.0714898              56.349887                          56.34989
## comp 2  1.4357885              15.953205                          72.30309
## comp 3  0.6786121               7.540134                          79.84323
## comp 4  0.6022997               6.692219                          86.53544
## comp 5  0.4007136               4.452373                          90.98782
## comp 6  0.3362642               3.736269                          94.72409
## comp 7  0.2011131               2.234590                          96.95868
## comp 8  0.1471443               1.634937                          98.59361
## comp 9  0.1265747               1.406386                         100.00000

The output suggests that the first two principal components are worth retaining. A practical rule of thumb for choosing the number of principal component axes is to keep only those components with eigenvalues greater than or equal to 1.
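
As a quick check of that rule against the eigenvalue table above, you can subset the table returned by PCA() (a small sketch of my own, not part of the original post):

## keep only the components whose eigenvalue is at least 1
soc.pca$eig[soc.pca$eig[, "eigenvalue"] >= 1, ]

Only the first two components pass, which agrees with the scree plot.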

# Contributions of variables to principal components
soc.pca$var$contrib[,1:3]

##                                   Dim.1        Dim.2        Dim.3
## GDP_per_capita               15.7263477 2.455323e+00 3.162470e-02
## Social_support               12.0654754 5.445993e-01 1.345610e+00
## Healthy_life_expectancy      15.1886385 2.259270e+00 3.317633e+00
## Freedom_to_make_life_choices  6.6999181 2.207049e+01 8.064596e+00
## Generosity                    0.3270114 4.189437e+01 5.406678e+01
## Trust                         4.3688692 2.570658e+01 3.211058e+01
## basic_human_needs            14.5402807 3.836956e+00 1.021076e+00
## foundations_well_being       15.1664220 1.232353e+00 4.169125e-02
## opportunity                  15.9170370 6.170376e-05 4.113192e-04

The first principal component explained approximately 57% of the total variation and appears to represent opportunity, GDP per capita, healthy life expectancy, foundations of well-being, and basic human needs.
The second principal component explained a further 16% of the total variation and, judging by the contributions table, was driven mainly by generosity, trust, and freedom to make life choices.

Applied Clustering

Clustering algorithms require the number of desired clusters (k) as an input. The practical issue is: what value should k take? In the absence of subject-matter knowledge, R offers various empirical approaches for selecting a value of k. One such tool for suggesting the best number of clusters is the NbClust package.

require(NbClust)
nbc <- NbClust(soc.happy[, 2:10], distance="manhattan",
               min.nc=2, max.nc=30, method="ward.D", index='all')

## Warning in pf(beale, pp, df2): NaNs produced

## *** : The Hubert index is a graphical method of determining the number of clusters.
##       In the plot of Hubert index, we seek a significant knee that corresponds to a
##       significant increase of the value of the measure i.e the significant peak in Hubert
##       index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
##       In the plot of D index, we seek a significant knee (the significant peak in Dindex
##       second differences plot) that corresponds to a significant increase of the value of
##       the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 3 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 16 as the best number of clusters
## * 1 proposed 28 as the best number of clusters
## * 2 proposed 29 as the best number of clusters
## * 1 proposed 30 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
## *******************************************************************

The NbClust algorithm suggested a 3-cluster solution for the combined data set, so we will use k = 3 in the next steps.

set.seed(4653)
pamK3 <- pam(soc.happy[, 2:10], diss=FALSE, 3, keep.data=TRUE)

# Number of countries assigned in the three clusters
fviz_silhouette(pamK3)

##   cluster size ave.sil.width
## 1       1   28          0.40
## 2       2   82          0.29
## 3       3   47          0.27

The number of countries assigned to each cluster is displayed in the table above.
An interesting aspect of the K-medoids algorithm is that it finds observations in the data that are typical of each cluster. So, the following code asks for a list of "the typical countries" selected by the algorithm as closest to the center of each cluster.

## which countries were typical of each cluster
soc.happy$Country[pamK3$id.med]

## [1] "Germany"      "Macedonia"    "Burkina Faso"

Putting it together

We can now display individual countries and superimpose their cluster assignments on the plane defined by the first two principal components.

soc.happy['cluster'] <- as.factor(pamK3$clustering)

fviz_pca_ind(soc.pca,
             label="none",
             habillage = soc.happy$cluster, # color by cluster
             palette = c("#00AFBB", "#E7B800", "#FC4E07", "#7CAE00", "#C77CFF", "#00BFC4"),
             addEllipses=TRUE)

Finally, displaying clusters of countries on a world map:

map.world <- map_data("world")

# LEFT JOIN
map.world_joined <- left_join(map.world, soc.happy, by = c('region' = 'Country'))

ggplot() +
     geom_polygon(data = map.world_joined,
                  aes(x = long, y = lat, group = group, fill=cluster, color=cluster)) +
     labs(title = "Applied Clustering World Happiness and Social Progress Index",
          subtitle = "Based on data from:https://en.wikipedia.org/wiki/World_Happiness_Report and\n https://en.wikipedia.org/wiki/List_of_countries_by_Social_Progress_Index",
          x=NULL, y=NULL) +
     coord_equal() +
     theme(panel.grid.major=element_blank(),
           panel.grid.minor=element_blank(),
           axis.text.x=element_blank(),
           axis.text.y=element_blank(),
           axis.ticks=element_blank(),
           panel.background=element_blank())

The map of clusters:

Concluding Remarks

Cluster analysis and dimension reduction algorithms such as PCA are widely used approaches for splitting data sets into distinct groups and for reducing the dimensionality of the variables, respectively. Together, these approaches 1) make the structure of the data easier to understand, and 2) make subsequent data mining tasks easier to perform. Hopefully, this post has illustrated the application of web scraping, data pre-processing, PCA, and cluster analysis using various tools available in R. Please let us know your comments and suggestions, and feel free to connect with the author on LinkedIn.

    Related Post

    1. Key Phrase Extraction from Tweets
    2. Financial time series forecasting – an easy approach
    3. Outlier detection and treatment with R
    4. Implementing Apriori Algorithm in R
    5. R for Publication: Lesson 6, Part 2 – Linear Mixed Effects Models

    To leave a comment for the author, please follow the link and comment on their blog: DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    A Python-Like walk() Function for R

    Sat, 04/08/2017 - 22:07

    (This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

    A really nice function available in Python is walk(), which recursively descends a directory tree, calling a user-supplied function in each directory within the tree. It might be used, say, to count the number of files, or maybe to remove all small files and so on. I had students in my undergraduate class write such a function in R for homework, and thought it may be interesting to present it here.

Among other things, readers who are not familiar with recursive function calls will learn about those here. I must add that all readers, even those with that background, will find this post rather subtle and requiring a bit of extra patience, but I believe you will find it worthwhile.

    Let’s start with an example, in which we count the number of files with a given name:

countinst <- function(startdir,flname) {
   walk(startdir,checkname,
      list(flname=flname,tot=0))
}

checkname <- function(drname,filelist,arg) {
   if (arg$flname %in% filelist)
      arg$tot <- arg$tot + 1
   arg
}

Say we try all this in a directory containing a subdirectory mydir, which contains a file x and two subdirectories, d1 and d2. The latter in turn contains another file named x. We then make the call countinst('mydir','x'). As can be seen above, that call will in turn make the call

walk('mydir',checkname,list(flname='x',tot=0))

    The walk() function, which I will present below, will start in mydir, and will call the user-specified function checkname() at every directory it encounters in the tree rooted at mydir, in this case mydir, d1 and d2.

At each such directory, walk() will pass to the user-specified function, in this case checkname(), first the current directory name, then a character vector reporting the names of all files (including directories) in that directory. In mydir, for instance, this vector will be c('d1','d2','x'). In addition, walk() will pass to checkname() an argument, formally named arg above, which will serve as a vehicle for accumulating running totals, in this case the total number of files of the given name.

    So, let’s look at walk() itself:

walk <- function(currdir,f,arg) {
   # "leave trail of bread crumbs"
   savetop <- getwd()
   setwd(currdir)
   fls <- list.files()
   arg <- f(currdir,fls,arg)
   # subdirectories of this directory
   dirs <- list.dirs(recursive=FALSE)
   for (d in dirs) arg <- walk(d,f,arg)
   setwd(savetop)  # go back to calling directory
   arg
}

    The comments are hopefully self-explanatory, but the key point is that within walk(), there is another call to walk()! This is recursion.

    Note how arg accumulates, as needed in this application. If on the other hand we wanted, say, simply to remove all files named ‘x’, we would just put in a dummy variable for arg. And though we didn’t need drname here, in some applications it would be useful.
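
For instance, a file-removal variant might look like the sketch below (the names removeinst() and rmname() are mine, not from the original post; here arg just carries the target file name and is returned unchanged, so it plays the role of the dummy mentioned above):

removeinst <- function(startdir, flname) {
   walk(startdir, rmname, list(flname = flname))
}

rmname <- function(drname, filelist, arg) {
   # walk() has already setwd()'d into this directory,
   # so the bare file name is enough for file.remove()
   if (arg$flname %in% filelist) file.remove(arg$flname)
   arg
}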

    For compute-intensive tasks, recursion is not very efficient in R, but it can be quite handy in certain settings.

    If you would like to conveniently try the above example, here is some test code:

test <- function() {
   unlink('mydir',recursive=TRUE)
   dir.create('mydir')
   file.create('mydir/x')
   dir.create('mydir/d1')
   dir.create('mydir/d2')
   file.create('mydir/d2/x')
   print(countinst('mydir','x'))
}

     

    To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    #4: Simpler shoulders()

    Sat, 04/08/2017 - 21:09

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    Welcome to the fourth post in the repulsively random R ramblings series, or R4 for short.

My twitter feed was buzzing about a nice (and as yet unpublished, ie not-on-CRAN) package https://github.com/dirkschumacher/thankr by Dirk Schumacher which compiles a list of packages (ordered by maintainer count) for your current session (or installation or …) with a view towards saying thank you to those whose packages we rely upon. Very nice indeed.

I had a quick look and ran it twice … and had a reaction of ewwww, really? as running it twice gave different results: on the second run a boatload of tibblyverse packages appeared. Because apparently kids these days can only slice data that has been tidied or something.

    So I had another quick look … and put together an alternative version using just base R (as there was only one subfunction that needed reworking):

source(file="https://raw.githubusercontent.com/dirkschumacher/thankr/master/R/shoulders.R")

format_pkg_df <- function(df) { # non-tibblyverse variant
    tb <- table(df[,2])
    od <- order(tb, decreasing=TRUE)
    ndf <- data.frame(maint=names(tb)[od], npkgs=as.integer(tb[od]))
    colpkgs <- function(m, df) {
        paste(df[ df$maintainer == m, "pkg_name"], collapse=",")
    }
    ndf[, "pkg"] <- sapply(ndf$maint, colpkgs, df)
    ndf
}

    Running this in the ESS session I had open gives:

R> shoulders()  ## by Dirk Schumacher, with small modifications
                               maint npkgs                                                                  pkg
1 R Core Team <R-core@r-project.org>     9 compiler,graphics,tools,utils,grDevices,stats,datasets,methods,base
2 Dirk Eddelbuettel <edd@debian.org>     4                                  RcppTOML,Rcpp,RApiDatetime,anytime
3  Matt Dowle <mattjdowle@gmail.com>     1                                                           data.table
R>

and for good measure a screenshot is below:

    I think we need a catchy moniker for R work using good old base R. SoberVerse? GrumbyOldFolksR? PlainOldR? Better suggestions welcome.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Correlation and Correlogram Exercises

    Sat, 04/08/2017 - 18:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)


Correlation analysis is one of the most popular techniques for data exploration. This set of exercises is intended to help you extend, speed up, and validate your correlation analysis. It allows you to practice:
    – calculating linear and nonlinear correlation coefficients,
    – testing those coefficients for statistical significance,
    – creating correlation matrices to study interdependence between variables in dataframes,
    – drawing graphical representations of those matrices (correlograms),
    – calculating coefficients for partial correlation between two variables (controlling for their correlation with other variables).
The exercises make use of functions from the packages Hmisc, corrgram, and ggm. Please install these packages, but do not load them until the exercises in which they are needed, to avoid a namespace conflict: the ggm package contains a function called rcorr that masks the rcorr function from the Hmisc package, and vice versa. If you want to return to the rcorr function from Hmisc after loading ggm, run detach(package:ggm).
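
As a small aside (not part of the exercise set), you can also sidestep the masking altogether by calling the Hmisc version with an explicit namespace prefix, for example:

Hmisc::rcorr(as.matrix(auto_num))

where auto_num is the numeric-only dataframe you will build in Exercise 4.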
Exercises are based on a reduced version of the auto dataset from the corrgram package (download here). The dataset contains characteristics of automobile models from 1979.
    Answers to the exercises are available here

    Exercise 1
    Calculate simple (linear) correlation between car price and its fuel economy (measured in miles per gallon, or mpg).

    Exercise 2
Use the cor.test function to check whether the obtained coefficient is statistically significant at the 5% level.

    Exercise 3
    Simple correlation assumes a linear relationship between variables, but it may be useful to relax this assumption. Calculate Spearman’s correlation coefficient for the same variables, and find its statistical significance.

    Exercise 4
    In R, it is possible to calculate correlation for all pairs of numeric variables in a dataframe at once. However, this requires excluding non-numeric variables first.
    Create a new dataframe, auto_num, that contains only columns with numeric values from the auto dataframe. You can do this using the Filter function.

    Exercise 5
    Use the cor function to create a matrix of correlation coefficients for variables in the auto_num dataframe.

    Exercise 6
    The standard cor.test function does not work with dataframes. However, statistical significance of correlation coefficients for a dataframe can be verified using the rcorr function from the Hmisc package.
    Transform the auto_num dataframe into a matrix (auto_mat), and use it to check significance of the correlation coefficients with the rcorr function.

    Exercise 7
    Use the corrgram function from the corrgram package to create a default correlogram to visualize correlations between variables in the auto dataframe.

    Exercise 8
    Create another correlogram that (1) includes only the lower panel, (2) uses pie diagrams to represent correlation coefficients, and (3) orders the variables using the default order.

    Exercise 9
    Create a new dataframe, auto_subset, by subsetting the auto dataframe to include only the Price, MPG, Hroom, and Rseat variables. Use the new dataframe to create a correlogram that (1) shows correlation coefficients on the lower panel, and (2) shows scatter plots (points) on the upper panel.

    Exercise 10
Use the correlations function from the ggm package to create a correlation matrix with both full and partial correlation coefficients for the auto_subset dataframe. Find the partial correlation between car price and its fuel economy.

    Related exercise sets:
    1. Accessing Dataframe Objects Exercises
    2. Cross Tabulation with Xtabs exercises
    3. Merging Dataframes Exercises
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory

    To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Exploring propensity score matching and weighting

    Sat, 04/08/2017 - 14:00

    (This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

    This post jots down some playing around with the pros, cons and limits of propensity score matching or weighting for causal social science research.

    Intro to propensity score matching

    One is often faced with an analytical question about causality and effect sizes when the only data around is from a quasi-experiment, not the random controlled trial one would hope for. This is common in many fields, but some of the most important occurrences are in public policy. For example, government programs to help individuals or firms are typically not allocated at random, but go to those with higher need, or higher potential to make something out of the assistance. This makes isolating the effect in the data of the treatment difficult, to say the least.

A frequently-used family of analytical methods for dealing with this is grouped under propensity score matching (although not all these methods literally "match"). Such methods model the probability of each unit (eg individual or firm) receiving the treatment, and then use these predicted probabilities or "propensities" to somehow balance the sample to make up for the confounding of the treatment with the other variables of interest.

    The simplest to explain member of this family involves creating a pseudo control group from the non-treatment individuals that resembles the treatment group in that they have similar propensities to get the treatment, but differs in that they just didn’t get it. Then analysis can proceed “as though” the data had been generated by an experiment. This approach is easy to explain to non-statisticians and has the great benefit of creating a tangible, concrete “control” group.

    But everything depends on the model of the probabilities of getting the treatment. If it’s a good model, you’re fine. If not – and in particular, if the chance of getting the treatment is related to unobserved variables which also impact on your response variable of interest – the propensity score approach helps very little if at all. A good text on all this (and much more) is Morgan and Winship’s Counterfactuals and Causal Inference: Methods and Principles for Social Research.

Job training example

Basics

A pretty thorough implementation of various propensity score methods in R comes in Daniel E. Ho, Kosuke Imai, Gary King, Elizabeth A. Stuart (2011). MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. Journal of Statistical Software, Vol. 42, No. 8, pp. 1-28. There's a good overview in the MatchIt vignette. I'll get started with data from one of their examples, which shows a typical application of this technique:

    “Our example data set is a subset of the job training program analyzed in Lalonde (1986)
    and Dehejia and Wahba (1999). MatchIt includes a subsample of the original data consisting
    of the National Supported Work Demonstration (NSW) treated group and the comparison
    sample from the Population Survey of Income Dynamics (PSID).1 The variables in
    this data set include participation in the job training program (treat, which is equal to 1
    if participated in the program, and 0 otherwise), age (age), years of education (educ), race
    (black which is equal to 1 if black, and 0 otherwise; hispan which is equal to 1 if hispanic,
    and 0 otherwise), marital status (married, which is equal to 1 if married, 0 otherwise), high
    school degree (nodegree, which is equal to 1 if no degree, 0 otherwise), 1974 real earnings
    (re74), 1975 real earnings (re75), and the main outcome variable, 1978 real earnings (re78).”

    First, loading up the functionality I need for the rest of this post

library(tidyverse)
library(forcats)
library(scales)
library(MASS)       # for rlm
library(boot)       # for inv.logit
library(MatchIt)
library(boot)       # for bootstrapping
library(doParallel) # for parallel processing
library(mgcv)       # for gam

#==============example from the MatchIt vignette=================
data(lalonde)

# naive comparison - people who got the job training program
# have lower incomes in 1978 - because the training was given
# to people with income problems. So comparison is unfair:
lalonde %>%
   group_by(treat) %>%
   summarise(Income1978 = mean(re78), n = n())

  treat Income1978     n
1     0   6984.170   429
2     1   6349.144   185

    So we see that the 429 people who didn’t get the job training “treatment” had an average income about $635 more than the 185 beneficiaries. Program failure!

Obviously that's unfair on the program, so we use matchit and match.data to create an artificial control group that resembles the treatment group in terms of age, education, ethnicity, marital status, and income in 1974 and 1975:

# Choose one of the large variety of propensity score matching methods to model propensity
match_model <- matchit(treat ~ age + educ + black + hispan + nodegree + married + re74 + re75,
                       data = lalonde, method = "nearest")
match_data <- match.data(match_model)

# Simple comparison is now much fairer
match_data %>%
   group_by(treat) %>%
   summarise(Income1978 = mean(re78), n = n())

  treat Income1978     n
1     0   5440.941   185
2     1   6349.144   185

    That gives us an average treatment effect of $908. Notice that the new control group is the same size as the treatment group; the rest of the observations have been discarded.

    You don’t need to limit yourself to simple comparisons, although in principle they should work. A regression with the matched control and treatment data, even using the same explanatory variables as were used in the matching model, helps address the inevitable lack of complete balance between the two groups. That gives us a result of $960 (output not shown). I use the robust M-estimator rather than ordinary least squares to deal with breaches of the classical regression assumptions (particularly the non-Normal response, with variance increasing as its mean increases).

# regression model estimate with matched data
coef(rlm(re78 ~ age + treat + educ + black + hispan + nodegree + married + re74 + re75,
         data = match_data))["treat"]

Regression is simpler and gives similar results

    However, a similar estimate could have come from a simpler, single-step regression with the original data, skipping the propensity score modelling altogether (there are arguments pro and con). The regression below estimates a treatment effect of $1,183. I call this “similar” because the uncertainty around all these estimates is huge, which I’ll demonstrate further down the post.

# regression model estimate with original data
coef(rlm(re78 ~ age + treat + educ + black + hispan + nodegree + married + re74 + re75,
         data = lalonde))["treat"]

Weighting might be better than matching

    Creating a control group by matching has the distressing side-effect of throwing away large amounts of the data, because the control group is shrunk down to the same size as the treatment group. A possibly better use of the propensity scores is to keep all observations in play but weight them according to the propensity score – one of the methods described by Peter Austin in this article on “An introduction to propensity score methods for reducing the effects of confounding in observational studies”.

Under "inverse probability of treatment weighting", proposed by Imbens in 2000, observations that receive the treatment are given weight of 1/p and those that did not receive the treatment are given weight of 1/(1 - p), where p is the probability of getting the treatment. That is, each observation is given weight of the inverse of the probability of the treatment they actually got. Intuitively, treatment cases that resemble the controls are interesting and given more weight, and control cases that look like they should have got the treatment are interesting and get more weight. Analysis can then proceed via weighted average values, or a regression with explanatory variables (which may or may not be the same variables as those used in the model of propensity for treatment).

    I implemented this with Lalonde’s data without using the MatchIt package, partly to convince myself I understood how it worked, and partly because I wanted more flexibility in modelling propensity than is supported by that package. I’d looked at the generalized linear model with binomial family response that was used for its propensity score matching, and noticed that age was showing up as unhelpful in determining treatment. My expectation is that age is probably related in a non-linear fashion to the probability of getting job training treatment, so I used a generalized additive model to allow for this…

mod <- gam(treat ~ s(age) + educ + black + hispan + nodegree + married + re74 + re75,
           data = lalonde, family = "binomial")
plot(mod, shade = TRUE, main = "Non-linear function of age in treatment propensity")

    …which turns out to be the case:

    Converting predicted probabilities into weights is straightforward. In the code below I have a quick look at the resulting density of weights.

lalonde <- lalonde %>%
   mutate(propensity_gam = predict(mod, type = "response"),
          weight_gam = 1 / propensity_gam * treat + 1 / (1 - propensity_gam) * (1 - treat))

ggplot(lalonde, aes(x = weight_gam, fill = as.factor(treat))) +
   geom_density(alpha = 0.5, colour = "grey50") +
   geom_rug() +
   scale_x_log10(breaks = c(1, 5, 10, 20, 40)) +
   ggtitle("Distribution of inverse probability weights")

    One of the criticisms of this inverse probability of treatment weighting approach is that individual observations can get very high weights and become unduly influential. For example, in an incomplete article by Posner and Ash (there may be a complete version somewhere else):

    “While this method can be shown to have nice mathematical properties, it does not work well in practice. Consider a lone treated observation that happens to have a very low probability of being treated…. The value of the inverse of the propensity score will be extremely high, asymptotically infinity. The effect size obtained will be dominated by this single value, and any fluctuations in it will produce wildly varied results, which is an undesirable property.”

    Ouch. Posner and Ash go on to suggest alternative ways of weighting less vulnerable to these problems. As a simple starter, in the below I try the obvious first step of truncating the weights at 10. Here are the average incomes of the treatment and non-treatment groups using the full set of inverse probability weights, and another set truncated at 10.

lalonde %>%
   group_by(treat) %>%
   summarise(Income_allweights = round(weighted.mean(re78, weight_gam)),
             Income_truncweights = round(weighted.mean(re78, pmin(10, weight_gam))),
             n = n())

  treat Income_allweights Income_truncweights     n
1     0              6446                6435   429
2     1              6814                7167   185

    Note that we now have all 429 of the non-treatment cases, a definite advantage over the matching methods.

    We’re not limited to simple weighted averages of course; we can use these weights in the analysis of our choice including our robust regression model. Ho et al (2007) suggest that (in Morgan and Winship’s paraphrase):

    “the general procedure one should carry out in any multivariate analysis that aspires to generate causal inferences is to first balance one’s data as much as possible with a matching routine and then estimate a regression model on the matched data. From this perspective, matching is a preprocessor, which can be used to prepare the data for subsequent analysis with something such as a regression model.”

    The “doubly robust” approach to regression modelling to identify causal effects uses weights from propensity score matching as well as a regression.

    “There has been considerable debate as to which approach to confounder control is to be preferred, as the first [ie single step regression] is biased if the outcome regression model is misspecified while the second approach [ie propensity score matching] is biased if the treatment regression, ie propensity, model is misspecified. This controversy could be resolved if an estimator were available that was guaranteed to be consistent … whenever at least one of the two models was correct. … We refer to such combined methods as doubly-robust or doubly-protected as they can protect against misspecification of either the outcome or treatment model, although not against simultaneous misspecification of both.”

    Robins and Rotnitzky 2001, quoted in Morgan and Winship Counterfactuals and Causal Analysis

    So here is my robust M estimator regression, using inverse propensity of treatment as weights, where propensity of treatment was modelled earlier as a generalized additive model on all explanatory variables and non-linearly with age:

coef(rlm(re78 ~ age + treat + educ + black + hispan + nodegree + married + re74 + re75,
         data = lalonde, weights = as.vector(weight_gam)))["treat"]

    This gives a treatment estimate of $910 and I think it’s my best point estimate so far.

    These weighting methods seem better than simply matching, if only because they retain all the data, but crucially they are still vulnerable to model mis-specification and unobserved explanatory variables. There’s a good critical discussion in this article by Freedman and Berk:

    “Regressions can be weighted by propensity scores in order to reduce bias. However,
    weighting is likely to increase random error in the estimates, and to bias the
    estimated standard errors downward, even when selection mechanisms are well understood.
    Moreover, in some cases, weighting will increase the bias in estimated
    causal parameters. If investigators have a good causal model, it seems better just to
    fit the model without weights. If the causal model is improperly specified, there can
    be significant problems in retrieving the situation by weighting, although weighting
    may help under some circumstances.”

    And with respect to the “doubly robust” approach:

    “you need to get at least one of the two models (and preferably both) nearly right in order for weighting to help much. If both models are wrong, weighting could easily be a dead end. There are papers suggesting that under some circumstances, estimating a shaky causal model and a shaky selection model should be doubly robust. Our results indicate that under other circumstances, the technique is doubly frail.”

    When I see multi-stage approaches like propensity score matching or weighting – just like structural equation models, and two or three stage least squares – that aim to deal with causality by adding complexity, I always get very nervous; the more so when I read criticisms like those above. I’ve done some simulations exploring this issue but a write-up will have to wait for a separate post.

Bootstrap or cross-validation must envelop the propensity modelling

The literature has a range of (conflicting) views on estimating the uncertainty of statistics estimated after propensity score matching or weighting. Morgan and Winship report in a footnote that the bootstrap does not work particularly well for matching, because the resampling process leaves fewer distinct cases to match to during the propensity modelling stage. However, Austin and Small in 2014 reported tests of bootstrap methods that seem quite effective. Definitely, one of the appeals of weighting (rather than matching) is that it should make the overall process more suitable for sitting inside a bootstrap. However, I don't have time for a really thorough review of this literature, so let's just note that estimating uncertainty after propensity score matching is a moving area of uncertainty itself.

    I use the method described by Austin and Small as the “complex bootstrap”, which involves resampling from the original data and performing the propensity modelling and matching for each resample. The propensity modelling is a big source of our uncertainty in the final estimates of interest. So that source of uncertainty needs to be repeated each time we mimic the sampling process with our bootstrap (the same applies to other pre-processing steps, like imputation). Austin and Small report that this results in a small overestimate of sampling variability (standard error higher by 7% than it should be; compared to 4% for a simpler bootstrap approach) but my principle, following the approach in Harrell’s Regression Modeling Strategies and elsewhere, is to always include pre-processing inside the bootstrap. I suspect that with less ideal data than Austin and Small use in their simulations (on my reading, they assume all relevant variables are available, amongst other ideals) this will pay off. Anyway, the results reported below look plausible.

    Here’s code that bootstraps four of the methods of estimating treatment effect above:

    • simple comparison of matched data;
    • robust regression with matched data;
    • robust regression with data weighted by inverse propensity of treatment;
    • robust regression with the original data.
#=================bootstrap for confidence intervals=================
# Set up a cluster for parallel computing
cluster <- makeCluster(7) # only any good if you have at least 7 processors :)
registerDoParallel(cluster)

clusterEvalQ(cluster, {
   library(MatchIt)
   data(lalonde)
})

#' Function to estimate treatment effect by four different methods
#' @return a vector of four estimates of treatment effect. See comments
#' in function for description of which is which
my_function <- function(x, i){
   resampled_data <- x[i, ]

   match_data <- match.data(
      matchit(treat ~ age + educ + black + hispan + nodegree + married + re74 + re75,
              data = resampled_data, method = "nearest")
   )

   # simple mean of matched groups
   est1 <- with(match_data, mean(re78[treat == 1]) - mean(re78[treat == 0]))

   # regression model estimate with matched groups
   est2 <- coef(rlm(re78 ~ treat + age + educ + black + hispan + nodegree + married + re74 + re75,
                    data = match_data))["treat"]

   # regression model with IPTW
   mod <- gam(treat ~ s(age) + educ + black + hispan + nodegree + married + re74 + re75,
              data = resampled_data, family = "binomial")
   resampled_data <- resampled_data %>%
      mutate(propensity_gam = predict(mod, type = "response"),
             weight_gam = 1 / propensity_gam * treat + 1 / (1 - propensity_gam) * (1 - treat))
   est3 <- coef(rlm(re78 ~ age + treat + educ + black + hispan + nodegree + married + re74 + re75,
                    data = resampled_data, weights = as.vector(weight_gam)))["treat"]

   # regression model estimate with original data
   est4 <- coef(rlm(re78 ~ treat + age + educ + black + hispan + nodegree + married + re74 + re75,
                    data = resampled_data))["treat"]

   return(c(est1, est2, est3, est4))
}

# test gives same results as when done earlier in the post:
my_function(lalonde, 1:nrow(lalonde))

booted <- boot(lalonde, statistic = my_function, R = 5000,
               parallel = "snow", cl = cluster)

booted_df <- as.data.frame(booted$t)
names(booted_df) <- c("Simple difference with matched data",
                      "Regression with matched data",
                      "Regression with weighted data",
                      "Regression with original data")

booted_tidy <- booted_df %>%
   gather("Method", "Estimate") %>%
   mutate(Method = fct_reorder(Method, Estimate))

booted_summary <- booted_tidy %>%
   group_by(Method) %>%
   summarise(lower = quantile(Estimate, 0.025),
             upper = quantile(Estimate, 0.975)) %>%
   gather(type, Estimate, -Method)

ggplot(booted_tidy, aes(x = Estimate, fill = Method)) +
   geom_density(alpha = 0.4, colour = "grey50") +
   geom_segment(data = booted_summary, aes(xend = Estimate), y = 0, yend = 2e-04) +
   geom_text(data = booted_summary, aes(label = round(Estimate)), y = 2.5e-04, size = 3) +
   facet_wrap(~Method) +
   theme(legend.position = "none") +
   scale_x_continuous(label = dollar) +
   ggtitle("Treatment effects from four different modelling methods",
           "Distribution and 95% confidence intervals of estimated impact on income in 1978 of a job training program") +
   labs(x = "Estimated treatment effect",
        caption = "Lalonde's 1986 data on a job training program",
        fill = "")

> # biases revealed by the bootstrap:
> booted

Bootstrap Statistics :
     original       bias    std. error
t1*  908.2022 -268.85736     766.5107
t2*  959.5030  -75.55153     655.4119
t3*  909.9001   12.09643     838.2337
t4* 1182.5056   33.35265     682.6993

    Those are big confidence intervals – reflecting the small sample size and the difficulty of picking up an impact of an intervention amongst all the complexities of income determinants. I’m satisfied that those are ok depictions of the uncertainty.

    All three of the two stage methods that model propensity first give extremely similar results; and the simpler one-stage regression on the original unweighted data gives a result within a third of a standard error. In practice which would I do? I’d probably do it both ways and if there are wildly different results I’d worry, treating that as a diagnostic warning.

Basically, with this sort of data the big gain comes from any modelling strategy that accepts treatment as endogenous and hence controls for the other variables that are related to both it and the response. Once you've adopted a method that does that, the choice among the particular methods is, to me, not particularly material.

    To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    R code to accompany Real-World Machine Learning (Chapter 5): Event Modeling

    Sat, 04/08/2017 - 13:00
    Abstract

    The rwml-R Github repo is updated with R code for the event modeling examples from Chapter 5 of the book “Real-World Machine Learning” by Henrik Brink, Joseph W. Richards, and Mark Fetherolf. Examples given include reading large data files with the fread function from data.table, optimization of model parameters with caret, computing and plotting ROC curves with ggplot2, engineering date-time features, and determining the relative importance of features with the varImp function in caret.

    Event-Modeling Data

    The data for the event modeling examples in chapter 5 of the book is from the
    Kaggle Event Recommendation Engine Challenge.
The rules of the challenge prohibit redistribution of the data, so the reader must
    download the data from Kaggle.

    Using fread

    In order to work through the examples, features from the rather large
events.csv file are processed several times. To save time, an alternative to
    the read.csv function is needed. This is where the fread function from the
    data.table library comes in. It is similar to read.table but
    faster and more convenient. On my MacBook Pro, it took only seven seconds to read
    the event_id, lat, and lng features from the >1GB events.csv data file.

events <- fread(file.path(dataDir, "events.csv"), sep=",",
                colClasses = c("integer64", rep("NULL",6),
                               "numeric", "numeric", rep("NULL",101)))

Initial cross-validated ROC curve and AUC metric

    Once a training data set is built with features from the events.csv,
    train.csv, and users.csv files, the caret train function is used
to train a random forest model
    evaluated using 10-fold cross-validation with the
    receiver operating characteristic (ROC) curve as the metric.
    The ROC curve and area under the curve (AUC) for the model (when applied
    to held-out test data from the training data set) are shown in the figure
    below. An AUC of 0.86 is achieved with the initial set of six features.
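
A minimal sketch of what such a training call looks like with caret is below; the data frame train_df and the outcome column interested are placeholder names of my own (the actual feature-building and modeling code lives in the rwml-R repo):

library(caret)
set.seed(42)
# 10-fold cross-validation, optimizing the ROC metric
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
rf_fit <- train(interested ~ ., data = train_df,
                method = "rf", metric = "ROC", trControl = ctrl)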


    Inclusion of date-time and bag-of-words features lead to over-fitting

Ten date-time features (such as "week of the year" and "day of the week")
    are extracted from the timestamp column of the events.csv file.
    When added to the model, the AUC actually decreases slightly,
    indicating over-fitting. The AUC decreases even more when available
"bag-of-words" data is included in the model.
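
For illustration, two of the ten date-time features described above could be derived along these lines (a hypothetical sketch; the column name timestamp and the exact feature set are assumptions, and the repo's code may differ):

## parse the raw timestamps, then pull out calendar features
ts <- as.POSIXct(events$timestamp, tz = "UTC")
events$week_of_year <- as.integer(format(ts, "%V"))  # ISO week of the year
events$day_of_week  <- as.integer(format(ts, "%u"))  # 1 = Monday, ..., 7 = Sunday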

    Feature importance

    The varImp function from the caret package computes the
    relative importance of each variable for objects produced by the train
    method. A plot of the relative importance for the top seven variables
    in the final random forest model is shown below.
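
In code this amounts to something like the following (a sketch, not the repo's exact code; rf_fit stands in for the trained caret model object):

imp <- varImp(rf_fit)   # scaled variable importance from the trained model
plot(imp, top = 7)      # plot only the seven most important features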

    Feedback welcome

    If you have any feedback on the rwml-R project, please
    leave a comment below or use the Tweet button.
    As with any of my projects, feel free to fork the rwml-R repo
    and submit a pull request if you wish to contribute.
    For convenience, I’ve created a project page for rwml-R with
    the generated HTML files from knitr, including a page with
    all of the event-modeling examples from chapter 5.

