R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Timing in R

Mon, 11/20/2017 - 01:00

(This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers)

As time goes on, your R scripts are probably getting longer and more complicated, right? Timing parts of your script could save you precious time when re-running code over and over again. Today I’m going to go through the 4 main functions for doing so.

Nested timings

1) Sys.time()

Sys.time() takes a “snap-shot” of the current time and so it can be used to record start and end times of code.

start_time = Sys.time()
Sys.sleep(0.5)
end_time = Sys.time()

To calculate the difference, we just use a simple subtraction

end_time - start_time
## Time difference of 0.5061 secs

Notice it creates a neat little message for the time difference.
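
Since the result of the subtraction is a difftime object, you can also ask for the difference in fixed units, which is handy if you want to store or compare timings programmatically. A minimal sketch reusing start_time and end_time from above:

# request the difference in seconds explicitly
difftime(end_time, start_time, units = "secs")

# as.numeric() drops the difftime class if you just want a plain number
as.numeric(difftime(end_time, start_time, units = "secs"))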

2) The tictoc package

You can install the CRAN version of tictoc via

install.packages("tictoc")

whilst the most recent development is available via the tictoc GitHub page.

library("tictoc")

Like Sys.time(), tictoc also gives us the ability to nest timings within code. However, we now have some more options to customise our timing. At its most basic it acts like Sys.time():

tic()
Sys.sleep(0.5)
toc()
## 0.506 sec elapsed

Now for a more contrived example.

# start timer for the entire section; notice we can name sections of code
tic("total time")
# start timer for first subsection
tic("Start time til half way")
Sys.sleep(2)
# end timer for the first subsection; log = TRUE tells toc to store the result in the log
toc(log = TRUE)
## Start time til half way: 2.037 sec elapsed

Now to start the timer for the second subsection

tic("Half way til end")
Sys.sleep(2)
# end timer for second subsection
toc(log = TRUE)
## Half way til end: 2.012 sec elapsed
# end timer for entire section
toc(log = TRUE)
## total time: 4.067 sec elapsed

We can view the results as a list (format = TRUE returns the log entries as nicely formatted strings), rather than raw output

tic.log(format = TRUE)
## [[1]]
## [1] "Start time til half way: 2.037 sec elapsed"
##
## [[2]]
## [1] "Half way til end: 2.012 sec elapsed"
##
## [[3]]
## [1] "total time: 4.067 sec elapsed"
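
If you would rather have raw numbers than formatted strings, tic.log(format = FALSE) should return the underlying log entries. The element names used below (tic, toc, msg) are my reading of the tictoc documentation, so check str(tic.log(format = FALSE)) on your own log first:

raw_log <- tic.log(format = FALSE)
# each entry records the start time, end time and the label given to tic()
timings <- sapply(raw_log, function(x) x$toc - x$tic)
names(timings) <- sapply(raw_log, function(x) x$msg)
timings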

We can also reset the log via

tic.clearlog()

Comparing functions

1) system.time()

Why oh WHY did R choose to give system.time() a lower case s and Sys.time() an upper case S? Anyway… system.time() can be used to time functions without having to take note of the start and end times.

system.time(Sys.sleep(0.5))
##    user  system elapsed
##   0.000   0.000   0.501

system.time(Sys.sleep(1))
##    user  system elapsed
##   0.000   0.000   1.003

We only want to take notice of the “elapsed” time; for the definition of the “user” and “system” times, see this thread.
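
Since system.time() returns a named vector (of class proc_time), you can pull out just the elapsed component if you want to store it; a small sketch:

st = system.time(Sys.sleep(0.5))
# keep only the wall-clock time as a plain number
st[["elapsed"]]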

For a repeated timing, we would use the replicate() function.

system.time(replicate(10, Sys.sleep(0.1)))
##    user  system elapsed
##   0.000   0.000   1.007

2) The microbenchmark package

You can install the CRAN version of microbenchmark via

install.packages("microbenchmark")

Alternatively you can install the latest update via the microbenchmark GitHub page.

library("microbenchmark")

At its most basic, microbenchmark() can be used to time single pieces of code.

# times = 10: repeat the test 10 times
# unit = "s": output in seconds
microbenchmark(Sys.sleep(0.1), times = 10, unit = "s")
## Unit: seconds
##           expr    min     lq   mean median     uq    max neval
## Sys.sleep(0.1) 0.1001 0.1006 0.1005 0.1006 0.1006 0.1006    10

Notice we get a nicely formatted table of summary statistics. We can record our times in anything from seconds to nanoseconds(!!!!). Already this is better than system.time(). Not only that, but we can easily compare sections of code, and name them for an easy-to-read output.

sleep = microbenchmark(sleepy = Sys.sleep(0.1),
                       sleepier = Sys.sleep(0.2),
                       sleepiest = Sys.sleep(0.3),
                       times = 10,
                       unit = "s")
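
The object returned by microbenchmark() keeps every individual timing, so you can inspect it before plotting; a quick sketch (summary() accepts a unit argument too, and the raw times are stored in nanoseconds):

# summary statistics for each named expression
summary(sleep)
# the raw timings, e.g. for your own analysis
head(as.data.frame(sleep))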

As well as this (more?!), microbenchmark comes with two built-in plotting functions.

microbenchmark:::autoplot.microbenchmark(sleep)

microbenchmark:::boxplot.microbenchmark(sleep)

These provide quick and efficient ways of visualising our timings.
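
Note that the ::: isn’t strictly necessary: autoplot.microbenchmark() and boxplot.microbenchmark() are registered S3 methods, so with ggplot2 loaded the plain generics should dispatch to them. A sketch of the same plots:

library("ggplot2")
# same two plots as above, via the generic functions
autoplot(sleep)
boxplot(sleep)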

Conclusion

Sys.time() and system.time() have their place, but in most cases we can do better. The tictoc and microbenchmark packages are particularly useful and make it easy to store timings for later use, and the range of options for both packages stretches far past what Sys.time() and system.time() offer. The built-in plotting functions are handy, too.

Thanks for chatting!


Where to live in the Netherlands based on temperature XKCD style

Mon, 11/20/2017 - 01:00

(This article was first published on Clean Code, and kindly contributed to R-bloggers)

After seeing a plot of best places to live in Spain and the USA based on the weather, I had to
chime in and do the same thing for the Netherlands. The idea is simple: determine where you want to live based on your temperature preferences.

First the end result:

How to read this plot?

In this xkcd comic we see that the top left of the graph represents “if you hate cold and hate heat”. If you go down from the top left to the bottom left, the winters get colder, ending in “if you love cold and hate heat”. Going to the right, the heat (and humidity) goes up, ending in “if you love cold and love heat”. Going up to the top right: “if you hate cold and love heat”.

This post explains how to make the plot; to see where I got the data and what procedures I followed, look at https://github.com/RMHogervorst/xkcd_weather_cities_nl.

Inspiration

According to this post by Maëlle Salmon, inspired by xkcd, we can determine our ideal city by looking at winter temperature and humidex (a combination of humidity and summer heat).

I’ve seen major cities in the USA (post by Maëlle) and where to live in Spain by Claudia Guirao. There is even one on France, in French of course.

So I had to make one for the Netherlands too. There is just one small detail:
the Netherlands is tiny; the United States is approximately 237 times larger than the Netherlands. From The Hague to the German border directly east is 164 km (101 miles), and from Leeuwarden to Maastricht in the south is 262 km (162 miles). Essentially my home country has a moderate sea climate with warm summers and relatively mild winters.

I expect there to be no real big differences between the cities. I use the average temperature over December, January and February for the winter temperature, and calculate the humidex using the comf package (a short sketch of that calculation follows the list below). The humidex is a combination of humidity and temperature:

  • 20 to 29: Little to no discomfort
  • 30 to 39: Some discomfort
  • 40 to 45: Great discomfort; avoid exertion
  • Above 45: Dangerous; heat stroke quite possible
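
For reference, this is roughly how a humidex value is computed with comf; calcHumx() is used later in this post with an air temperature in degrees Celsius and a relative humidity in percent, though the argument order here is my assumption (check ?calcHumx):

library("comf")
# humidex for a 29 C day at 76% relative humidity (the values used for Bonaire below)
calcHumx(29, 76)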

For cities I went the extremely lazy way and took all of the available weather stations provided by the Dutch weather service (KNMI, the Royal Netherlands Meteorological Institute). There are 46 stations in the Netherlands. These are not always in the cities, but they broadly cover the entire country.

Plot like XKCD

The xkcd package has wonderful plotting abilities and a cool font that you have to install. I copied and modified the code from Maëlle’s post, because it is really good!

If you want to see how I did the data cleaning go to the github page.

First we plot all of the stations in the Netherlands; most of these places will probably not be familiar to non-Dutch people.

library("xkcd")
library("ggplot2")
library("extrafont")
library("ggrepel")

xrange <- range(result$humidex_avg)
yrange <- range(result$wintertemp)

set.seed(3456)
ggplot(result, aes(humidex_avg, wintertemp)) +
  geom_point() +
  geom_text_repel(aes(label = NAME), family = "xkcd",
                  max.iter = 50000, size = 3) +
  ggtitle("Where to live in The Netherlands \nbased on your temperature preferences",
          subtitle = "Data from KNMI, 2016-2017") +
  xlab("Summer heat and humidity via Humidex") +
  ylab("Winter temperature in Celsius degrees") +
  xkcdaxis(xrange = xrange, yrange = yrange) +
  theme_xkcd() +
  theme(text = element_text(size = 13, family = "xkcd"))

But what does that mean in the grand scheme of things? As you might notice, the range is very small. It would be interesting to add some US cities and perhaps some Spanish cities to compare against. For fun I also added the Dutch Caribbean islands.

xrange2 <- range(c(18, 40))  # modified these by hand to increase the margins
yrange2 <- range(c(-5, 40))

USA <- tribble(
  ~NAME, ~humidex_avg, ~wintertemp,
  "DETROIT, MI", 27, 0,
  "NASHVILLE, TN", 33, 9,
  "FORT MEYERS, FL", 37, 20
)
SPAIN <- tribble(
  ~NAME, ~humidex_avg, ~wintertemp,
  "MADRID, SPAIN", 27, 8,
  "TENERIFE, SPAIN", 24, 13,
  "BARCELONA, SPAIN", 32, 10
)
D_CARI <- tribble(
  ~NAME, ~humidex_avg, ~wintertemp,
  "Bonaire, Caribbean Netherlands", 27, calcHumx(29, 76),
  "Sint Eustatius, Caribbean Netherlands", 28, calcHumx(28, 77),
  "Saba, Caribbean Netherlands", 30, calcHumx(30, 76)
)

set.seed(3456)
ggplot(result, aes(humidex_avg, wintertemp)) +
  geom_point(alpha = .7) +
  geom_text_repel(aes(label = NAME), family = "xkcd",
                  max.iter = 50000, size = 2) +
  geom_text_repel(data = USA, aes(humidex_avg, wintertemp, label = NAME),
                  family = "xkcd", max.iter = 50000, size = 2, color = "blue") +
  geom_point(data = USA, aes(humidex_avg, wintertemp), color = "blue") +
  geom_text_repel(data = SPAIN, aes(humidex_avg, wintertemp, label = NAME),
                  family = "xkcd", max.iter = 50000, size = 2, color = "red") +
  geom_point(data = SPAIN, aes(humidex_avg, wintertemp), color = "red") +
  geom_text_repel(data = D_CARI, aes(humidex_avg, wintertemp, label = NAME),
                  family = "xkcd", max.iter = 50000, size = 2, color = "orange") +
  geom_point(data = D_CARI, aes(humidex_avg, wintertemp), color = "orange") +
  labs(title = "Where to live in The Netherlands based on \nyour temperature preferences \nCompared with some places in Spain, Caribbean NL and USA",
       subtitle = "Data from KNMI, 2016-2017, added Spain and US cities",
       x = "Summer heat and humidity via Humidex",
       y = "Winter temperature in Celsius degrees",
       caption = "includes Caribbean Netherlands") +
  xkcdaxis(xrange = xrange2, yrange = yrange2) +
  theme_xkcd() +
  theme(text = element_text(size = 16, family = "xkcd"))

Finally we can compare the winter temperature and summer humidex in the Netherlands by placing the points on a map, using the coordinates of the measuring stations.

NL <- map_data(map = "world", region = "Netherlands")

result %>%
  rename(LON = `LON(east)`, LAT = `LAT(north)`) %>%
  ggplot(aes(LON, LAT)) +
  geom_point(aes(color = wintertemp)) +
  geom_text_repel(aes(label = NAME), family = "xkcd", size = 3, max.iter = 50000) +
  geom_polygon(data = NL, aes(x = long, y = lat, group = group),
               fill = NA, color = "black") +
  coord_map() +
  labs(title = "Wintertemperature in NL", subtitle = "data from KNMI (2016, 2017)",
       x = "", y = "") +
  theme_xkcd() +
  theme(text = element_text(size = 16, family = "xkcd"))

result %>%
  rename(LON = `LON(east)`, LAT = `LAT(north)`) %>%
  ggplot(aes(LON, LAT)) +
  geom_point(aes(color = humidex_avg)) +
  geom_text_repel(aes(label = NAME), family = "xkcd", size = 3, max.iter = 50000) +
  geom_polygon(data = NL, aes(x = long, y = lat, group = group),
               fill = NA, color = "black") +
  coord_map() +
  labs(title = "Humidex in NL", subtitle = "data from KNMI (2016, 2017)",
       x = "", y = "") +
  theme_xkcd() +
  theme(text = element_text(size = 12, family = "xkcd")) +
  scale_color_continuous(low = "white", high = "red")


Now show us what your country looks like!

Where to live in the Netherlands based on temperature XKCD style was originally published at Clean Code on November 20, 2017.


Content Evaluation: What is the Value of Social Media?

Mon, 11/20/2017 - 01:00

(This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

Like most bloggers, I check my analytics stats on a regular basis. What I do not really look at is social shares. For most blog posts, traffic follows a clear pattern: two days of increased traffic, followed by a steady decrease to the base traffic volume.
The amplitude varies massively depending on how interesting/viral the post was. However, in the long term this virality does not drive the majority of the traffic; rather, the search ranking of individual posts does, as 80% of the base traffic comes through organic search results.

There is plenty of discussion on the value of a share/like, with very different take-aways.
I thought it would be great to analyse whether social shares actually increase my traffic.

In addition, I realized that it is pretty easy to get the number of Facebook and LinkedIn shares for each of my blog posts from r-bloggers. (Please find the script at the end of the post.)

For the analysis, I scraped the social share numbers and merged them with my Google Analytics stats.
In order to compare the traffic between blog posts, I normalise the total traffic per post by the number of days the post has been online.
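
As a rough sketch of that normalisation step (the column names Date and pageviews are assumptions based on the code further down, not necessarily the exact ones in my analytics export):

# days each post has been online, then average daily pageviews
dff$DaysOnline <- as.numeric(Sys.Date() - dff$Date)
dff$normalizedPageViews <- dff$pageviews / dff$DaysOnline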

To get a first picture of the data we can plot the normalized traffic over the number of shares (both by blog post).

Next, we can analyse how the variables are correlated. I included the total pageviews as reference.

library(corrplot)  # for corrplot.mixed()

M <- cor(dff[, c("shares", "linkedIn1", "pageviews", "normalizedPageViews")])
corrplot.mixed(M)

We see that LinkedIn and Facebook shares are positively correlated with r = .56, and that the correlation between shares and normalized pageviews is about the same.
Total pageviews, in contrast, have a much lower correlation with shares, both from Facebook and from LinkedIn.

To determine the “value” of social media in driving traffic volume, we can regress the normalized traffic numbers on the number of social shares.

library(stargazer)  # for the regression table

dff$DaysOnline <- as.numeric(Sys.Date() - dff$Date)
fit1 <- lm(normalizedPageViews ~ shares + linkedIn1, data = dff)
fit2 <- lm(normalizedPageViews ~ shares + linkedIn1, data = dff)
fit3 <- lm(normalizedPageViews ~ shares + linkedIn1 + DaysOnline, data = dff)
stargazer(fit1, fit2, fit3, type = "html",
          title = "The value of shares for daily traffic")

The value of shares for daily traffic
Dependent variable: normalizedPageViews (standard errors in parentheses)

                       (1)                     (2)                     (3)
shares                 0.043** (0.017)         0.043** (0.017)         0.026 (0.022)
linkedIn1              0.138 (0.126)           0.138 (0.126)           0.101 (0.129)
DaysOnline                                                             -0.008 (0.006)
Constant               1.462 (1.114)           1.462 (1.114)           5.744 (3.571)
Observations           33                      33                      33
R2                     0.341                   0.341                   0.375
Adjusted R2            0.297                   0.297                   0.310
Residual Std. Error    5.042 (df = 30)         5.042 (df = 30)         4.994 (df = 29)
F Statistic            7.755*** (df = 2; 30)   7.755*** (df = 2; 30)   5.802*** (df = 3; 29)
Note: *p<0.1; **p<0.05; ***p<0.01

As in the correlation plot, we see that highly shared posts have more daily visits. If you take the estimates at face value, one can also state that a LinkedIn share is worth about three times a Facebook share in terms of additional traffic.
In the last regression I also included a variable (DaysOnline) to capture my learning effect: the longer a post has been online (e.g. posts published when I started blogging), the lower its daily traffic.

While writing this post, I realized that the script can also be used to analyse, across r-bloggers as a whole:

  • which topics are highly shared
  • which authors are popular
  • how social sharing has changed over time

Have fun!


Automatic output format in Rmarkdown

Sun, 11/19/2017 - 19:21

(This article was first published on Data Literacy - The blog of Andrés Gutiérrez, and kindly contributed to R-bloggers)

I am writing an R Markdown document with plenty of tables, and I want them in a decent format, e.g. rendered with kable. However, I don’t want to format them one by one. For example, I have created the following data frame with dplyr:

data2 %>%
  group_by(uf) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

One solution to the output format of this data frame would be to assign it to a named object in R, and then format it by using the kable function.

t1 <- data2 %>%
  group_by(uf) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

knitr::kable(t1)

However, if your document has hundreds of these queries and you need a faster way to compile it, while keeping the kable style automatically, avoiding giving a name to every data frame, and even avoiding calling the kable function on that name, you can use the printr package. Just add the following piece of code inside a chunk at the beginning of your document and voilà.

library(printr)

Now all of your data frames will have a decent style, and you do not need to worry about this issue. For example, I have knitted a presentation using printr and the first code chunk in this post, and the tables in the resulting slides come out nicely formatted.


Why is R slow? some explanations and MKL/OpenBLAS setup to try to fix this

Sun, 11/19/2017 - 17:00

(This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers)

Introduction

Many users tell me that R is slow. With old R releases that is 100% true, given that old R versions used R’s own numerical libraries instead of optimized numerical libraries.

But numerical libraries do not explain the complete story. In many cases slow execution can be attributed to inefficient code, and more precisely to not following one or more of these good practices:

  • Using the byte-code compiler
  • Vectorizing operations
  • Using simple data structures (e.g. matrices instead of data frames in large computations)
  • Re-using results

I would add another good practice: “use the tidyverse”. Because tidyverse packages such as dplyr benefit from Rcpp, their C++ backend can make them faster than the equivalent operations in base (i.e. plain vanilla) R.

The idea of this post is to clarify some ideas. R does not compete with C or C++, because they are different languages; R with the data.table package may compete with Python with the numpy library. This does not mean that I’m defending R over Python or vice versa. The reason is that both R and Python are implemented as interpreters, while C and C++ are compiled, and this means that C and C++ will always be faster because, in really over-simplified terms, compiled implementations are closer to the machine.

Basic setup for general usage

As an Ubuntu user I can say the basic R installation from Canonical or CRAN repositories works for most of the things I do on my laptop.

When I use RStudio Server Pro© that’s a different story, because there I really want to optimize things: when I work with large data (i.e. 100GB in RAM), even a 3% gain in resource efficiency or execution time is valuable.

Installing R with OpenBLAS will give you a tremendous performance boost, and that will work for most laptop situations. I explain how to do that in detail for Ubuntu 17.10 and Ubuntu 16.04, but a general setup would be as simple as one of these two options (with an optional RStudio install as a bonus):

# 1: Install from Canonical (default Ubuntu repository)
sudo apt-get update && sudo apt-get upgrade
sudo apt-get install libopenblas-dev r-base

# 2: Install from CRAN mirror
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9
printf '#CRAN mirror\ndeb https://cran.rstudio.com/bin/linux/ubuntu artful/\ndeb-src https://cran.rstudio.com/bin/linux/ubuntu artful/\n' | sudo tee -a /etc/apt/sources.list.d/cran-mirror.list
sudo apt-get update && sudo apt-get upgrade
sudo apt-get install libopenblas-dev r-base

# 3: Install RStudio (bonus)
cd ~/Downloads
wget https://download1.rstudio.org/rstudio-xenial-1.1.383-amd64.deb
sudo apt-get install gdebi
sudo gdebi rstudio-xenial-1.1.383-amd64.deb
printf '\nexport QT_STYLE_OVERRIDE=gtk\n' | sudo tee -a ~/.profile

Option (1) is a substitute for (2). It’s totally up to you which one to use, and both will give you a really fast R compared to installing it without OpenBLAS.
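
If you want a quick sanity check that the optimized BLAS is actually being picked up, a rough sketch is to time a matrix multiplication, which leans almost entirely on BLAS; with OpenBLAS or MKL this should run many times faster than with the reference libraries (exact timings will vary by machine):

set.seed(1)
# a 2000 x 2000 matrix multiply is dominated by BLAS routines
m <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(m %*% m)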

Benchmarking different R setups

I already use R with OpenBLAS just like the setup above. I will compile parallel R instances to do the benchmarking.

Installing Intel© MKL numerical libraries

My benchmarks indicate that in my case it’s worth taking the time to install Intel© MKL. The execution time is strongly reduced for some operations when compared to R with OpenBLAS.

Run this to install MKL:

# keys taken from https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-apt-repo
wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
sudo sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list'
sudo apt-get update && sudo apt-get install intel-mkl-64bit

Installing CRAN R with MKL

To compile it from source (in this case it’s the only option) run these lines:

# key added after sudo apt-get update returned a warning, following this guide:
# https://support.rstudio.com/hc/en-us/articles/218004217-Building-R-from-source
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9
printf '#CRAN mirror\ndeb https://cran.rstudio.com/bin/linux/ubuntu artful/\ndeb-src https://cran.rstudio.com/bin/linux/ubuntu artful/\n' | sudo tee -a /etc/apt/sources.list.d/cran-mirror.list

# you need to enable the multiverse repo or packages such as xvfb won't be found
sudo rm -rf /etc/apt/sources.list
printf 'deb http://us.archive.ubuntu.com/ubuntu artful main restricted universe multiverse
deb-src http://us.archive.ubuntu.com/ubuntu artful main restricted universe multiverse\n
deb http://security.ubuntu.com/ubuntu artful-security main restricted universe multiverse
deb-src http://security.ubuntu.com/ubuntu artful-security main restricted universe multiverse\n
deb http://us.archive.ubuntu.com/ubuntu artful-updates main restricted universe multiverse
deb-src http://us.archive.ubuntu.com/ubuntu artful-updates main restricted universe multiverse\n' | sudo tee -a /etc/apt/sources.list

sudo apt-get update
sudo apt-get clean
sudo apt-get autoclean
sudo apt-get autoremove
sudo apt-get upgrade --with-new-pkgs
sudo apt-get build-dep r-base

cd ~/GitHub/r-with-intel-mkl
wget https://cran.r-project.org/src/base/R-3/R-3.4.2.tar.gz
tar xzvf R-3.4.2.tar.gz
cd R-3.4.2

source /opt/intel/mkl/bin/mklvars.sh intel64
MKL="-Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread -lmkl_core -Wl,--end-group -fopenmp -ldl -lpthread -lm"
./configure --prefix=/opt/R/R-3.4.2-intel-mkl --enable-R-shlib --with-blas="$MKL" --with-lapack
make && sudo make install

printf '\nexport RSTUDIO_WHICH_R=/usr/local/bin/R\nexport RSTUDIO_WHICH_R=/opt/R/R-3.4.2-intel-mkl\n' | tee -a ~/.profile

Installing CRAN R with OpenBLAS

So as not to interfere with my working installation, I decided to compile a parallel instance from source:

cd ~/GitHub/r-with-intel-mkl/
rm -rf R-3.4.2
tar xzvf R-3.4.2.tar.gz
cd R-3.4.2
./configure --prefix=/opt/R/R-3.4.2-openblas --enable-R-shlib --with-blas --with-lapack
make && sudo make install
printf 'export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R\n' | tee -a ~/.profile

Installing CRAN R with no optimized numerical libraries

There is a lot of discussion, and strong evidence from different stakeholders in the R community, indicating that this is by far the most inefficient option. I compiled it just to make the benchmark complete:

cd ~/GitHub/r-with-intel-mkl/
rm -rf R-3.4.2
tar xzvf R-3.4.2.tar.gz
cd R-3.4.2
./configure --prefix=/opt/R/R-3.4.2-defaults --enable-R-shlib
make && sudo make install
printf 'export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-defaults/bin/R\n' | tee -a ~/.profile

Installing Microsoft© R Open with MKL

This R version includes MKL by default and it’s supposed to be easy to install. I could not make it run, which is a pity, because different articles (like this post by Brett Klamer) state that this R version is really efficient, though no different from standard CRAN R with MKL numerical libraries.

In any case here’s the code to install this version:

cd ~/GitHub/r-with-intel-mkl
wget https://mran.blob.core.windows.net/install/mro/3.4.2/microsoft-r-open-3.4.2.tar.gz
tar xzvf microsoft-r-open-3.4.2.tar.gz
cd microsoft-r-open
sudo ./install.sh
printf 'export RSTUDIO_WHICH_R=/opt/microsoft/ropen/3.4.2/lib64/R/bin/R\n' | tee -a ~/.profile

# it was not possible to start /opt/microsoft/ropen/3.4.2/lib64/R/bin/R
# the error is:
# *** caught segfault ***
# address 0x50, cause 'memory not mapped'

# removing Microsoft R
# https://mran.microsoft.com/documents/rro/installation#revorinst-uninstall steps did not work
sudo apt-get remove 'microsoft-r-.*'
sudo apt-get autoclean && sudo apt-get autoremove

Benchmark results

My scripts above edit ~/.profile. This is so that I can open RStudio and work with the differently configured R instances on my computer.

I released the benchmark results and scripts on GitHub. The idea is to run the same scripts from ATT© and Microsoft© to see how different setups perform.

To work with CRAN R with MKL I had to edit ~/.profile because of how I configured the instances. So I ran nano ~/.profile and commented out lines in the last part of the file to obtain this result:

#export RSTUDIO_WHICH_R=/usr/bin/R
export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-intel-mkl/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-defaults/bin/R
#export RSTUDIO_WHICH_R=/opt/microsoft/ropen/3.4.2/lib64/R/bin/R

After that I log out and then log in to open RStudio to run the benchmark.

The other two cases are similar: the benchmark results were obtained by editing ~/.profile, logging out and in, and opening RStudio with the corresponding instance.

As an example, each result starts with the R version and the corresponding numerical libraries used in that session. All other results are reported in the same way.
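
A simple way to confirm which numerical libraries a given session is linked against (from R 3.4 onwards) is sessionInfo(), which reports the BLAS and LAPACK libraries in use; a minimal check:

# look at the "Matrix products", "BLAS" and "LAPACK" lines of the output
sessionInfo()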

And here are the results from ATT© benchmarking script:

All times in seconds. Columns: CRAN R with MKL / CRAN R with OpenBLAS / CRAN R with no optimized libraries.

Creation, transp., deformation of a 2500×2500 matrix (sec):  0.68 / 0.68 / 0.67
2400×2400 normal distributed random matrix ^1000:            0.56 / 0.56 / 0.56
Sorting of 7,000,000 random values:                          0.79 / 0.79 / 0.79
2800×2800 cross-product matrix (b = a’ * a):                 0.3 / 0.36 / 14.55
Linear regr. over a 3000×3000 matrix (c = a \ b’):           0.17 / 0.22 / 6.98
FFT over 2,400,000 random values:                            0.33 / 0.33 / 0.33
Eigenvalues of a 640×640 random matrix:                      0.22 / 0.49 / 0.74
Determinant of a 2500×2500 random matrix:                    0.2 / 0.22 / 2.99
Cholesky decomposition of a 3000×3000 matrix:                0.31 / 0.21 / 5.76
Inverse of a 1600×1600 random matrix:                        0.2 / 0.21 / 2.79
3,500,000 Fibonacci numbers calculation (vector calc):       0.54 / 0.54 / 0.54
Creation of a 3000×3000 Hilbert matrix (matrix calc):        0.23 / 0.24 / 0.23
Grand common divisors of 400,000 pairs (recursion):          0.27 / 0.29 / 0.3
Creation of a 500×500 Toeplitz matrix (loops):               0.28 / 0.28 / 0.28
Escoufier’s method on a 45×45 matrix (mixed):                0.22 / 0.23 / 0.28
Total time for all 15 tests:                                 5.3 / 5.62 / 37.78
Overall mean (weighted mean):                                0.31 / 0.32 / 0.93

And here are the results from Microsoft© benchmarking script:

All times in seconds. Columns: CRAN R with MKL / CRAN R with OpenBLAS / CRAN R with no optimized libraries.

Matrix multiply:                  5.985 / 13.227 / 165.18
Cholesky Factorization:           1.061 / 2.349 / 26.762
Singular Value Decomposition:     7.268 / 18.325 / 47.076
Principal Components Analysis:    14.932 / 40.612 / 162.338
Linear Discriminant Analysis:     26.195 / 43.75 / 117.537

Actions after benchmarking results

I decided to try Intel MKL and I’ll write another post benchmarking things I do every day, beyond what is covered in these scripts.

To clean my system I deleted all R instances but MKL:

sudo apt-get remove r-base r-base-dev
sudo apt-get remove 'r-cran.*'
sudo apt-get autoclean && sudo apt-get autoremove
sudo apt-get build-dep r-base
sudo rm -rf /opt/R/R-3.4.2-openblas
sudo rm -rf /opt/R/R-3.4.2-defaults
sudo ln -s /opt/R/R-3.4.2-intel-mkl/bin/R /usr/bin/R

I edited ~/.profile so the final lines are:

export RSTUDIO_WHICH_R=/usr/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-intel-mkl/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-defaults/bin/R
#export RSTUDIO_WHICH_R=/opt/microsoft/ropen/3.4.2/lib64/R/bin/R

And I also decided to configure my user packages directory from zero:

# remove installed user packages
rm -rf ~/R

# create new user packages directory
mkdir ~/R/
mkdir ~/R/x86_64-pc-linux-gnu-library/
mkdir ~/R/x86_64-pc-linux-gnu-library/3.4

# install common packages
R --vanilla << EOF
install.packages(c("tidyverse","data.table","dtplyr","devtools","roxygen2","bit64","pacman"), repos = "https://cran.rstudio.com/")
q()
EOF

# Export to HTML/Excel
R --vanilla << EOF
install.packages(c("htmlTable","openxlsx"), repos = "https://cran.rstudio.com/")
q()
EOF

# Blog tools
R --vanilla << EOF
install.packages(c("knitr","rmarkdown"), repos='http://cran.us.r-project.org')
q()
EOF
sudo apt-get install python-pip
sudo pip install --upgrade --force-reinstall markdown rpy2==2.7.8 pelican==3.6.3

# PDF extraction tools
sudo apt-get install libpoppler-cpp-dev default-jre default-jdk
sudo R CMD javareconf
R --vanilla << EOF
library(devtools)
install.packages(c("rjava","pdftools"), repos = "https://cran.rstudio.com/")
install_github("ropensci/tabulizer")
q()
EOF

# TTF/OTF fonts usage
R --vanilla << EOF
install.packages("showtext", repos = "https://cran.rstudio.com/")
q()
EOF

# Cairo for graphic devices
sudo apt-get install libgtk2.0-dev libxt-dev libcairo2-dev
R --vanilla << EOF
install.packages("Cairo", repos = "https://cran.rstudio.com/")
q()
EOF

Concluding remarks

The benchmarks presented here are in no way a definitive end to the long discussion on numerical libraries. My results provide some evidence that, because of the higher speed of some operations, I should use MKL.

One of the advantages of the setup I explained is that you can use MKL with Python. In that case numpy calculations will be boosted.

Using MKL with AMD© processors might not provide an important improvement compared to using OpenBLAS. This is because MKL uses processor-specific instructions that work well with Intel i3 or i5 processors but not necessarily with non-Intel models.


M4 Forecasting Competition

Sun, 11/19/2017 - 01:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

The “M” competitions organized by Spyros Makridakis have had an enormous influence on the field of forecasting. They focused attention on what models produced good forecasts, rather than on the mathematical properties of those models. For that, Spyros deserves congratulations for changing the landscape of forecasting research through this series of competitions.
Makridakis & Hibon (JRSSA, 1979) was the first serious attempt at a large empirical evaluation of forecast methods.


RStudio Keyboard Shortcuts for Pipes

Sat, 11/18/2017 - 19:32

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

I have just released some simple RStudio add-ins that are great for creating keyboard shortcuts when working with pipes in R.

You can install the add-ins from here (which also includes both installation instructions and use instructions/examples).


Statebins Reimagined

Sat, 11/18/2017 - 16:33

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

A long time ago, in a github repo far, far away, there lived a tiny package dubbed statebins that made it possible to create equal-area, square U.S. state cartograms in R. Three years have come and gone and — truth be told — I’ve never been happy with that package. It never felt “right” and that gnawing feeling finally turned into action with a “re-imagining” of the API.

Previously, on statebins…

There were three different functions in the old-style package:

  • one for discrete scales (it automated ‘cuts’)
  • one for continuous scales
  • one for manual scales

It also did some hack-y stuff with grobs to try to get things to look good without putting too much burden on the user.

All that “mostly” worked, but I always ended up doing some painful workaround when I needed more custom charts (not that I have to use this package much given the line of work I’m in).

Downsizing statebins

Now, there’s just one function for making the cartograms — statebins() — and another for applying a base theme — theme_statebins(). The minimalisation has some advantages that we’ll take a look at now, starting with the most basic example (the one on the manual page):

data(USArrests)
USArrests$state <- rownames(USArrests)

statebins(USArrests, value_col = "Assault", name = "Assault") +
  theme_statebins(legend_position = "right")

Two things should stand out there:

  • you got scale_fill_distiller() for free!
  • labels are dark/light depending on the tile color

Before we go into ^^, it may be helpful to show the new function interface:

statebins(state_data, state_col = "state", value_col = "value",
          dark_label = "black", light_label = "white", font_size = 3,
          state_border_col = "white", state_border_size = 2,
          ggplot2_scale_function = ggplot2::scale_fill_distiller, ...)

You pass in the state name/abbreviation & value columns like the old interface but also specify colors for the dark & light labels (set hex code color with 00 ending alpha values if you don’t want labels but Muricans are pretty daft and generally need the abbreviations on the squares). You can set the font size, too (we’ll do that in a bit) and customize the border color (usually to match the background of the target medium). BUT, you also pass in the ggplot2 scale function you want to use and the named parameters for it (that’s what the ... is for).

So, yes I’ve placed more of a burden on you if you want discrete cuts, but I’ve also made the package way more flexible and made it possible to keep the labels readable without you having to lift an extra coding finger.

The theme()-ing is also moved out to a separate theme function which makes it easier for you to further customize the final output.
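
For instance, because theme_statebins() returns an ordinary ggplot2 theme, you can keep layering theme() tweaks on top of it; a small sketch reusing the USArrests example above (the specific tweaks are purely illustrative):

library(statebins)
library(ggplot2)

data(USArrests)
USArrests$state <- rownames(USArrests)

statebins(USArrests, value_col = "Assault", name = "Assault") +
  labs(title = "Assault arrests per 100,000") +
  theme_statebins(legend_position = "right") +
  # further customisation layered on top of the base theme
  theme(plot.title = element_text(size = 16, face = "bold"))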

But that’s not all!

There are now squares for Puerto Rico, the Virgin Islands and New York City (the latter two were primarily for new features/data in cdcfluview but they are good to have available). Let’s build out a larger example with some of these customizations (we’ll make up some data to do that):

library(statebins)
library(tidyverse)
library(viridis)

data(USArrests)

# make up some data for the example
rownames_to_column(USArrests, "state") %>%
  bind_rows(
    data_frame(
      state = c("Virgin Islands", "Puerto Rico", "New York City"),
      Murder = rep(mean(max(USArrests$Murder), 3)),
      Assault = rep(mean(max(USArrests$Assault), 3)),
      Rape = rep(mean(max(USArrests$Rape), 3)),
      UrbanPop = c(93, 95, 100)
    )
  ) -> us_arrests

statebins(us_arrests, value_col = "Assault",
          ggplot2_scale_function = viridis::scale_fill_viridis) +
  labs(title = "USArrests + made up data") +
  theme_statebins("right")

Cutting to the chase

I still think it makes more sense to use binned data in these cartograms, and while you no longer get that for “free”, it’s not difficult to do:

adat <- suppressMessages(read_csv("http://www.washingtonpost.com/wp-srv/special/business/states-most-threatened-by-trade/states.csv?cache=1"))

mutate(
  adat,
  share = cut(avgshare94_00, breaks = 4, labels = c("0-1", "1-2", "2-3", "3-4"))
) %>%
  statebins(
    value_col = "share",
    ggplot2_scale_function = scale_fill_brewer,
    name = "Share of workforce with jobs lost or threatened by trade"
  ) +
  labs(title = "1994-2000") +
  theme_statebins()

More manual labor

You can also still use hardcoded colors, but it’s a little more work on your end (but not much!):

election_2012 <- suppressMessages(read_csv("https://raw.githubusercontent.com/hrbrmstr/statebins/master/tmp/election2012.csv"))

mutate(election_2012, value = ifelse(is.na(Obama), "Romney", "Obama")) %>%
  statebins(
    font_size = 4,
    dark_label = "white", light_label = "white",
    ggplot2_scale_function = scale_fill_manual,
    name = "Winner",
    values = c(Romney = "#2166ac", Obama = "#b2182b")
  ) +
  theme_statebins()

BREAKING NEWS: Rounded corners

A Twitter request ended up turning into a new feature this afternoon (after I made this post) => rounded corners:

data(USArrests)
USArrests$state <- rownames(USArrests)

statebins(USArrests, value_col = "Assault", name = "Assault", round = TRUE) +
  theme_statebins(legend_position = "right")

FIN

It’ll be a while before this hits CRAN and I’m not really planning on keeping the old interface when the submission happens. So, it’ll be on GitHub for a bit to let folks chime in on what additional features you want and whether you really need to keep the deprecated functions around in the package.

So, kick the tyres and don’t hesitate to shoot over some feedback!


Seasonality of plagues by @ellis2013nz

Sat, 11/18/2017 - 12:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Seasonality of plague deaths

In the course of my ramblings through history, I recently came across Mark R. Welford and Brian H. Bossak's Validation of Inverse Seasonal Peak Mortality in Medieval Plagues, Including the Black Death, in Comparison to Modern Yersinia pestis-Variant Diseases. This article is part of an interesting epidemiological literature on the seasonality of medieval and early modern European plagues and the twentieth century outbreaks of plague caused by the Yersinia pestis bacterium. Y. pestis is widely credited (if that’s the word) for the 6th century Plague of Justinian, the mid 14th century Black Death, and the Third Pandemic from the mid nineteenth to mid twentieth centuries. The Black Death killed between 75 and 200 million people in Eurasia in only a few years; the Third Pandemic killed at least 12 million. These are important diseases to understand!

In their 2009 article, Welford and Bossak added additional evidence and rigour to the argument that the difference in seasonality of peak deaths between the first two suspected bubonic plague pandemics and the Third Pandemic indicates a material difference in the diseases, whether it be a variation in Y. pestis itself, its mode and timing of transmission, or even the presence of an unknown human-to-human virus. This latter explanation would account for the extraordinary speed of spread of the Medieval Black Death, which I understand is surprising for a rodent-borne bacterial disease. These are deep waters; as a newcomer to the topic there’s too much risk of me getting it wrong if I try to paraphrase, so I’ll leave it there and encourage you to read specialists in the area.

Plot polishing

One of Welford and Bossak’s graphics caught my eye. Their original is reproduced at the bottom of this post. It makes the point of the differing seasonalities in the different outbreaks of plague, but is pretty cluttered. I set myself the job of improving it as a way of familiarising myself with the issues. My first go differed from theirs only in minimal respects:

  • split the graphic into four facets for each plague agent / era to allow easier comparison
  • direct labels of the lines rather than relying on a hard-to-parse legend
  • added a grey ribbon showing a smoothed overall trend, to give a better idea of the “average” seasonality of the particular era’s outbreak in each facet
  • removed gridlines

All the R code for today’s post, from download to animation, is collected at the end of the post.

I think the grey ribbon of the smoothed average line is the best addition here. It makes it much clearer that there is an average seasonality in the European (medieval and early modern) plagues not apparent in the Indian and Chinese examples.

I wasn’t particularly happy with the logarithm transform of the vertical axis. While this is a great transformation for showing changes in terms of relative difference (eg growth rates), when we’re showing seasonality like this the natural comparison is of the distance of each point from the bottom. It is natural to interpret the space as proportionate to a number, as though there is no transformation. In effect, the transformation is hiding the extent of the seasonality.

I suspect the main reason for the use of the transform at all was to avoid the high numbers for some outbreaks hiding the variation in the smaller ones. An alternative approach is to allow the vertical axes of some of the facets to use varying scales. This is legitimate so long as any comparison of magnitudes is within each facet, and comparisons across facets are of trends and patterns, not magnitude. That is the case here. Here’s how that looks:

I think this is a significant improvement. The concentration of deaths in a few months of each outbreak is now much clearer.

There’s an obvious problem with both the charts so far, which is the clutter of text from the labels. I think they are an improvement on the legend, but they are a bit of mess. This is in fact a classic problem of how to present longitudinal data in what is often termed a “spaghetti plot” for obvious reasons. Generally, such charts only work if we don’t particularly care which line is which.

I attempted to address this problem by aggregating the samples to the country level, while keeping different lines for different combinations of country and year. I made a colour palette matched to countries so they would have the same colour over time. I also changed the scale to being the proportion of deaths in each month from the total year’s outbreak. If what we’re after is in fact the proportion of deaths in each month (ie the seasonality) rather than the magnitudes, then let’s show that on the graphic:

This is starting to look better.

Animating

While I think it was worthwhile aggregating the deaths in those small medieval towns and villages to get a less cluttered plot, it does lose some lovely granular detail. Did you know for instance that the medieval Welsh town of Abergwillar is completely invisible on the internet other than in discussion of the plague in 1349? One way around this problem of longitudinal data, where we have a time series for many cases which are themselves of some interest, is to animate through the cases. This also gave me a chance to use Thomas Lin Pedersen’s handy tweenr R package to smooth the transitions between each case study.

Note how I’ve re-used the country colour palette, and used the extra visual space given by spreading facets out over time (in effect) to add some additional contextual information, like the latitude of each location.

Some innocent thoughts on the actual topic

This was a toe in the water for something I’d stumbled across while reading a history book. There’s lots of interesting questions here. Plague is in the news again today. There’s a big outbreak in Madagascar, with 171 deaths between August and mid November 2017 and more than 2,000 infections. And while I have been writing this blog, the Express has reported on North Korea stockpiling plague bacteria for biological warfare.

On the actual topic of whether seasonality of deaths tells us that the medieval Black Death was different from 19th century Chinese and Indian plagues, I’d like to see more data. For example, what were the death rates by month in China in the 14th century, where the outbreak probably began? Presumably not available.

Nineteenth century Bombay was different from a medieval European village in many important respects. Bombay’s winter in fact is comparable in temperature to Europe’s summer, so the coincidence of these as peak times for plague deaths perhaps has something in common. On the other hand, Manchuria is deadly, freezing cold in its winter when pneumonic plague deaths peaked in our data.

Interesting topic.

R code

Here’s the R code that did all that. It’s nothing spectacular. I actually typed by hand the data from the original table, which was only available, as far as I can see, as an image of a table. Any interest is likely to be in the particular niceties of some of the plot polishing, e.g. using a consistent palette for country colour when colour is actually mapped to the country-year combination in the data.

Although tweenr is often used in combination with the animation R package, I prefer to do what that package does manually: save the individual frames somewhere I have set myself and call ImageMagick directly via a system() call. I find the animation package adds a layer of problems – particularly on Windows – that sometimes gives me additional headaches. Calling another application via system() works fine.

library(tidyverse)
library(scales)
library(openxlsx)
library(directlabels)
library(forcats)
library(RColorBrewer)
library(stringr)
library(tweenr)

#========download and prepare data=========
download.file("https://github.com/ellisp/ellisp.github.io/raw/master/data/plague-mortality.xlsx",
              destfile = "plague-mortality.xlsx", mode = "wb")

orig <- read.xlsx("plague-mortality.xlsx", sheet = "data")

d1 <- orig %>%
  rename(place = Place, year = Date, agent = Agent, country = Country, latitude = Latitude) %>%
  gather(month, death_rate, -place, -year, -agent, -country, -latitude) %>%
  # put the months' levels in correct order so they work in charts etc
  mutate(month = factor(month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")),
         month_number = as.numeric(month)) %>%
  mutate(place_date = paste(place, year),
         agent = fct_reorder(agent, year)) %>%
  arrange(place, year, month_number)

# define a caption for multiple use:
the_caption <- str_wrap("Graphic by Peter Ellis, reworking data and figures published by Mark R. Welford and Brian H. Bossak 'Validation of Inverse Seasonal Peak Mortality in Medieval Plagues, Including the Black Death, in Comparison to Modern Yersinia pestis-Variant Diseases'", 100)

#============faceted version of original graphic=====================
p1 <- d1 %>%
  ggplot(aes(x = month_number, y = death_rate, colour = place_date)) +
  geom_smooth(aes(group = NULL), se = TRUE, colour = NA, size = 4) +
  geom_line(aes(x = month_number), size = 0.9) +
  scale_y_log10("Number of deaths in a month", limits = c(1, 1000000), label = comma) +
  scale_x_continuous("", breaks = 1:12, labels = levels(d1$month), limits = c(0.5, 12)) +
  facet_wrap(~agent) +
  theme(panel.grid = element_blank()) +
  labs(caption = the_caption) +
  ggtitle("Seasonal deaths by different plague agents",
          "Medieval and early modern Black Death and plague peaked in summer, whereas early 20th Century outbreaks peaked in late winter")

direct.label(p1, list("top.bumpup", fontfamily = "myfont", cex = 0.8)) # good

# version with free y axes as an alternative to the log transform
p1b <- p1 +
  facet_wrap(~agent, scales = "free_y") +
  scale_y_continuous()

direct.label(p1b, list("top.bumpup", fontfamily = "myfont", cex = 0.8))

#==============================y axis as proportion===================
d2 <- d1 %>%
  group_by(country, year, month_number, agent) %>%
  summarise(deaths = sum(death_rate, na.rm = TRUE)) %>%
  mutate(country_date = paste(country, year)) %>%
  group_by(country_date) %>%
  mutate(prop_deaths = deaths / sum(deaths, na.rm = TRUE)) %>%
  # convert the zeroes back to NA to be consistent with original presentation:
  mutate(prop_deaths = ifelse(prop_deaths == 0, NA, prop_deaths))

# Defining a palette. This is a little complex because although colour
# is going to be mapped to country_year, I want each country to have its own
# colour regardless of the year. I'm going to use Set1 of Cynthia Brewer's colours,
# except for the yellow which is invisible against a white background
pal1 <- data_frame(country = unique(d2$country))
pal1$colour <- brewer.pal(nrow(pal1) + 1, "Set1")[-6]

pal2 <- distinct(d2, country, country_date) %>%
  left_join(pal1, by = "country")

pal3 <- pal2$colour
names(pal3) <- pal2$country_date

# draw graphic with colour mapped to country-year combination, but colours come out
# at country level:
p2 <- d2 %>%
  ggplot(aes(x = month_number, y = prop_deaths, colour = country_date)) +
  geom_smooth(aes(group = NULL), se = TRUE, colour = NA, size = 4) +
  # geom_point() +
  geom_line(aes(x = month_number)) +
  scale_color_manual(values = pal3) +
  scale_y_continuous("Proportion of the year's total deaths in each month\n",
                     label = percent, limits = c(0, 0.65)) +
  scale_x_continuous("", breaks = 1:12, labels = levels(d1$month), limits = c(0.5, 12)) +
  facet_wrap(~agent) +
  theme(panel.grid = element_blank()) +
  ggtitle("Seasonal deaths by different plague agents",
          "Medieval and early modern Black Death and plague peaked in summer, whereas early 20th Century outbreaks peaked in late winter") +
  labs(caption = the_caption)

direct.label(p2, list("top.bumpup", fontfamily = "myfont", cex = 0.8))

#============animation===================
d3 <- d1 %>%
  group_by(place_date, country, latitude) %>%
  mutate(prop_deaths = death_rate / sum(death_rate, na.rm = TRUE),
         prop_deaths = ifelse(is.na(prop_deaths), 0, prop_deaths)) %>%
  ungroup() %>%
  mutate(place_date = fct_reorder(place_date, year),
         place_date_id = as.numeric(place_date)) %>%
  arrange(place_date_id, month_number)

place_dates <- levels(d3$place_date)

# change this to a list of data.frames, one for each state
d3_l <- lapply(place_dates, function(x){
  d3 %>%
    filter(place_date == x) %>%
    select(prop_deaths, month, place_date, agent)
})

# use tweenr to create a single data frame combining those 20+ frames in the list,
# with interpolated values for smooth animation:
d3_t <- tween_states(d3_l, tweenlength = 5, statelength = 8,
                     ease = "cubic-in-out", nframes = 600) %>%
  # caution - tween_states loses the information on the ordering of factors
  mutate(month = factor(as.character(month), levels = levels(d1$month)),
         place_date = factor(as.character(place_date), levels = place_dates))

# make a temporary folder to store some thousands of PNG files as individual frames of the animation
unlink("0114-tmp", recursive = TRUE)
dir.create("0114-tmp")

for(i in 1:max(d3_t$.frame)){
  png(paste0("0114-tmp/", i + 10000, ".png"), 2000, 1000, res = 200)

  the_data <- filter(d3_t, .frame == i)
  the_title <- paste("Deaths from", the_data[1, "agent"], "in", the_data[1, "place_date"])

  tmp <- d3 %>%
    filter(place_date == the_data[1, "place_date"]) %>%
    select(place_date, country, latitude) %>%
    distinct()

  the_xlab <- with(tmp, paste0(place_date, ", ", country, ", ", latitude, " degrees North"))

  the_high_point <- the_data %>%
    filter(prop_deaths == max(prop_deaths)) %>%
    slice(1) %>%
    mutate(prop_deaths = prop_deaths + 0.03) %>%
    mutate(lab = agent)

  the_colour <- pal2[pal2$country == tmp$country, ]$colour[1]

  print(
    ggplot(the_data, aes(x = as.numeric(month), y = prop_deaths)) +
      geom_ribbon(aes(ymax = prop_deaths), ymin = 0,
                  fill = the_colour, colour = "grey50", alpha = 0.1) +
      scale_x_continuous(the_xlab, breaks = 1:12, labels = levels(d1$month), limits = c(0.5, 12)) +
      ggtitle(the_title, "Distribution by month of the total deaths that year") +
      theme(panel.grid = element_blank()) +
      geom_text(data = the_high_point, aes(label = lab)) +
      labs(caption = the_caption) +
      scale_y_continuous("Proportion of the year's total deaths in each month\n",
                         label = percent, limits = c(0, 0.68))
  )

  dev.off()

  # counter:
  if (i / 20 == round(i / 20)){cat(i)}
}

# combine all those frames into a single animated gif
pd <- setwd("0114-tmp")
system('magick -loop 0 -delay 8 *.png "0114-plague-deaths.gif"')
setwd(pd)

# clean up
unlink("plague-mortality.xlsx")

The original

As promised, here’s the original Welford and Bossak image. Sadly, they dropped Abergwillar; and they have data for Poona which I couldn’t find in a table in their paper.


Highlights from the Connect(); conference

Fri, 11/17/2017 - 22:44

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Connect();, the annual Microsoft developer conference, is wrapping up now in New York. The conference was the venue for a number of major announcements and talks. Here are some highlights related to data science, machine learning, and artificial intelligence:

Lastly, I wanted to share this video presented at the conference from Stack Overflow. Keep an eye out for R community luminary David Robinson programming in R!

You can find more from the Connect conference, including on-demand replays of the talks and keynotes, at the link below.

Microsoft: Connect(); November 15-17, 2017


padr version 0.4.0 now on CRAN

Fri, 11/17/2017 - 21:30

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

I am happy to share that the latest version of padr just hit CRAN. This new version comprises bug fixes, performance improvements and new functions for formatting datetime variables. But above all, it introduces the custom paradigm that enables you to do asymmetric analysis.

Improvements to existing functions

thicken used to get slowish when the data frame got large. Several adjustments resulted in a considerable performance gain. It is now approximately 6 times faster than in version 0.3.0.

Both pad and thicken used to break with uninformative errors when the datetime variable contained missing values. Both functions now allow for missing values. thicken leaves the record containing the missing datetime value where it is and enters a missing value in the column added to the data frame. pad moves all records with missing values in the datetime variable to the end of the data frame and ignores them for padding. Both functions throw a warning when the datetime variable has missing values.
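To see this in action, here is a minimal sketch with a toy data frame containing one missing timestamp (the data and column names are mine, just for illustration):

library(padr)
library(dplyr)

# toy data: three observations, one with a missing timestamp
sales <- data.frame(
  time_stamp = as.POSIXct(c("2017-11-01 09:10:00", NA, "2017-11-01 11:45:00")),
  amount     = c(3.1, 2.5, 4.0)
)

# thicken keeps the NA record in place and puts NA in the added column,
# with a warning about the missing value
sales %>% thicken("hour")

# pad moves the NA record to the end and ignores it when padding
sales %>% pad("hour")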

New functions

span_date and span_time are wrappers around seq.Date and seq.POSIXt and were previously described in this blog post.
Two new functions are introduced to reformat datetime variables. These can be especially useful for showing the data in a plot or a graph. center_interval shifts the datetime point from the beginning of the interval to the (approximate) center. Bar, line and dot plots in particular will reflect the data more accurately after centering.

library(tidyverse) library(padr) jan_first <- emergency %>% filter(as.Date(time_stamp, tz = "EST") == as.Date("2016-01-01")) %>% thicken("3 hour") %>% count(time_stamp_3_hour) ggplot(jan_first, aes(time_stamp_3_hour, n)) + geom_bar(stat = "identity")

jan_first %>% mutate(ts_center = center_interval(time_stamp_3_hour)) %>% ggplot(aes(ts_center, n)) + geom_bar(stat = "identity")

Secondly, there is the format_interval function, which creates a character representation of the interval by applying strftime to both the start and end points of each interval.

jan_first %>% mutate(hours = format_interval(time_stamp_3_hour, start_format = "%b%d %H", end_format = "%H", sep = "-")) %>% ggplot(aes(hours, n)) + geom_bar(stat = "identity")

When using the interval “week”, one might want to start the weeks on a different day than Sunday. In thicken, the start_val should then be specified with a date that falls on the desired weekday. Finding this date by hand is a bit tedious, which is why closest_weekday is introduced.

thicken(coffee, "week", start_val = closest_weekday(coffee$time_stamp, 6)) ## time_stamp amount time_stamp_week ## 1 2016-07-07 09:11:21 3.14 2016-07-02 ## 2 2016-07-07 09:46:48 2.98 2016-07-02 ## 3 2016-07-09 13:25:17 4.11 2016-07-09 ## 4 2016-07-10 10:45:11 3.14 2016-07-09

The custom suite

thicken and pad are highly automated, because they assume all datetime points to be equally spaced. Internally they span vectors of the desired interval. However, it might be useful to have asymmetrical periods between datetime points, especially when there are periods in which the number of observations is consistently lower, such as nightly hours or weekend days. The functions thicken_cust and pad_cust work like the original functions, except that the user has to provide their own spanning vector to which the observations are mapped.

Let's do an analysis of vehicle accidents in the emergency data set. We want to distinguish between morning rush hour (7-10), daytime (10-16), evening rush hour (16-19) and nighttime (19-7). This is the full analysis.

accidents_span <- span_around(emergency$time_stamp, "hour", start_shift = "2 hour") %>% subset_span(list(hour = c(7, 10, 16, 19))) emergency %>% filter(title == "Traffic: VEHICLE ACCIDENT -") %>% thicken_cust(accidents_span, "hour") %>% count(hour) %>% pad_cust(accidents_span) %>% fill_by_value() %>% mutate(day = as.Date(hour, tz = "EST"), period = format_interval(hour, start_format = "%H", sep = "-")) %>% ggplot(aes(day, n, col = period)) + geom_line() + facet_wrap(~period)

The helper functions span_around and subset_span are used for building the spanning vector. The first takes the original datetime variable and spans a variable of the requested interval around it. This saves you the manual trouble of finding the min and the max of the variable and determining which points fall before and after them in order to build a variable of that interval. subset_span, subsequently, keeps only the datetime points in the input that meet the criteria given in a list. In the example these are the hours 7, 10, 16, and 19. In total there are eight different datetime parts you can subset on.
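For example, building just the spanning vector on its own looks like this (reusing the emergency data from the package; the object names here are my own):

library(padr)

# span hourly points around the range of the timestamps, starting two hours early
full_span <- span_around(emergency$time_stamp, "hour", start_shift = "2 hour")

# keep only the hours that mark the boundaries of the four periods
accidents_span <- subset_span(full_span, list(hour = c(7, 10, 16, 19)))

head(accidents_span)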

The remainder of the analysis is largely similar to a regular padr analysis. Instead of an interval to thicken and pad to, you use the asymmetrically spaced accidents_span variable in both thicken_cust and pad_cust. thicken_cust will then map each observation to a datetime point in the spanned vector. pad_cust will insert rows for each of the points that are in the spanned vector but not in the datetime variable.

Thanks

This release completes the initial plans I had for padr. For the next release I plan on refactoring and further increasing performance. Do you find anything still missing from padr? Did you find a bug or an inconsistency? Please notify me by sending an email or filing an issue on the GitHub page.


Gold-Mining – Week 11 (2017)

Fri, 11/17/2017 - 18:39

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Week 11 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

The post Gold-Mining – Week 11 (2017) appeared first on Fantasy Football Analytics.


Network analysis video – 7th Scottish QGIS user group

Fri, 11/17/2017 - 13:43

(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

I spoke yesterday at the Scottish QGIS user group. This is a vibrant and welcoming community and has attendees who travel long distances to be there. I think this time Prague was the furthest city anybody came from.

My talk centred on network analysis using GRASS, R and QGIS. You can read how to do some of this work here and here. Watch the presentation (and the rest of the day) on Youtube and view the slide deck at the end of this post. A massive thanks to Ross, Tom and team for organising the event.


Teaching to machines: What does learning in machine learning entail?

Fri, 11/17/2017 - 04:40

(This article was first published on Memo's Island, and kindly contributed to R-bloggers)

Preamble

Figure 1: The oldest learning institution in the world, the University of Bologna (Source: Wikipedia).

Machine Learning (ML) is now a de facto skill for every quantitative job, and almost every industry has embraced it, even though the fundamentals of the field are not new at all. However, what does it mean to teach a machine? Unfortunately, even for moderately technical people coming from different backgrounds, the answer to this question is not apparent at first. This sounds like a conceptual and jargon issue, but it lies in the very success of supervised learning algorithms.

What is a machine in machine learning

First of all, machine here does not mean a machine in the conventional sense, but rather computational modules or sets of instructions. It is called a machine because thermodynamics can be applied to analyse these computational modules: their algorithmic complexity can be mapped to energy consumption and work production; recall the Landauer Principle. Charles Bennett has an article on this principle, here.

Learning curve for machines

What about the learning bit? Learning is a natural process for humans, especially for the young. We learn new things naturally, without explicit thought, or with instruction and some training, as practised at the University of Bologna a millennium ago, see Figure 1. Usually it takes time for us to learn something new, and the trajectory of that process is called the learning curve, due to Hermann Ebbinghaus. Learning in machine learning is rooted in exactly this Ebbinghaus approach. But the question remains: how does this manifest quantitatively in machine learning? That question is addressed here.

The concept of a learning curve, and the effect of sample size in training machine learning models, matters for both experienced and novice practitioners. This type of analysis is often omitted when producing supervised learning solutions. With the advent of deep learning architectures, i.e. multi-layered neural networks, the concept becomes even more pronounced.

Quantifying a learning curve comes down to measuring performance over increasing experience. If the performance of the machine, or human, increases with experience, we say that learning is achieved; this is how we distinguish a good learner.
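As a rough illustration (my own toy sketch in R, not from the original post): fit a simple classifier on increasingly large training sets and watch how held-out accuracy improves with "experience".

set.seed(1)

# simulate a simple binary classification problem
n  <- 5000
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(1.5 * x1 - x2))
train <- data.frame(x1, x2, y)[1:4000, ]
test  <- data.frame(x1, x2, y)[4001:5000, ]

# accuracy on the held-out set as the training sample grows
sizes <- c(25, 50, 100, 250, 500, 1000, 2000, 4000)
acc <- sapply(sizes, function(m) {
  fit  <- glm(y ~ x1 + x2, data = train[1:m, ], family = binomial)
  pred <- predict(fit, newdata = test, type = "response") > 0.5
  mean(pred == test$y)
})

# the learning curve: performance (accuracy) against experience (sample size)
plot(sizes, acc, type = "b", log = "x",
     xlab = "Training sample size", ylab = "Held-out accuracy")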

On the misconception of unsupervised learning

Figure 2: Donald Hebb, father of unsupervised learning (Source: Wikipedia).

A common misconception surrounds what unsupervised learning means. Clustering, or categorising unlabelled data, is not learning in the Ebbinghaus sense. Goodness-of-fit or cluster-validation exercises do not account for an increase in experience, or at least this is not established in the literature, judging from cluster-validation techniques; see Jain and Dubes's Algorithms for Clustering Data. Wikipedia defines unsupervised learning as "inferring a function to describe hidden structure from 'unlabeled' data"; this is function inference, not learning.

Then what is unsupervised learning? It originates from Hebbian theory in neuroscience: "cells that fire together wire together". This implies that unsupervised learning is about how information is encoded, not how it is labelled. One good practical model that could be classified as unsupervised learning is the so-called spiking network model.

Outlook

The concept of learning from a machine learning perspective is summarised in this post. In data science, it is good practice to differentiate between what we call learning and function inference/optimisation. Being attentive to these details will help us.


pool package on CRAN

Fri, 11/17/2017 - 01:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

The pool package makes it easier for Shiny developers to connect to databases. Up until now, there wasn’t a clearly good way to do this. As a Shiny app author, if you connect to a database globally (outside of the server function), your connection won’t be robust because all sessions would share that connection (which could leave most users hanging when one of them is using it, or even all of them if the connection breaks). But if you try to connect each time that you need to make a query (e.g. for every reactive you have), your app becomes a lot slower, as it can take in the order of seconds to establish a new connection. The pool package solves this problem by taking care of when to connect and disconnect, allowing you to write performant code that automatically reconnects to the database only when needed.

So, if you are a Shiny app author who needs to connect and interact with databases inside your apps, keep reading because this package was created to make your life easier.

What the pool package does

The pool package adds a new level of abstraction when connecting to a database: instead of directly fetching a connection from the database, you will create an object (called a “pool”) with a reference to that database. The pool holds a number of connections to the database. Some of these may be currently in-use and some of these may be idle, waiting for a new query or statement to request them. Each time you make a query, you are querying the pool, rather than the database. Under the hood, the pool will either give you an idle connection that it previously fetched from the database or, if it has no free connections, fetch one and give it to you. You never have to create or close connections directly: the pool knows when it should grow, shrink or keep steady. You only need to close the pool when you’re done.

Since pool integrates with both DBI and dplyr, there are very few things that will be new to you, if you’re already using either of those packages. Essentially, you shouldn’t feel the difference, with the exception of creating and closing a “Pool” object (as opposed to connecting and disconnecting a “DBIConnection” object). See this copy-pasteable app that uses pool and dplyr to query a MariaDB database (hosted on AWS) inside a Shiny app.
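For a flavour of what that looks like inside an app, here is a minimal self-contained sketch (my own toy example using a file-backed SQLite database rather than the MariaDB app linked above; it assumes shiny, pool, DBI, dplyr, dbplyr and RSQLite are installed):

library(shiny)
library(pool)
library(DBI)
library(dplyr)

# seed a throwaway file-backed SQLite database with the built-in quakes data
db_file <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), db_file)
dbWriteTable(con, "quakes", quakes)
dbDisconnect(con)

# one pool for the whole app; closed when the app stops
pool <- dbPool(RSQLite::SQLite(), dbname = db_file)
onStop(function() poolClose(pool))

ui <- fluidPage(
  numericInput("mag", "Minimum magnitude", value = 5, min = 4, max = 6.5, step = 0.1),
  tableOutput("quakes")
)

server <- function(input, output, session) {
  output$quakes <- renderTable({
    min_mag <- input$mag
    # each query borrows an idle connection from the pool and hands it back afterwards
    tbl(pool, "quakes") %>%
      filter(mag >= min_mag) %>%
      head(10) %>%
      collect()
  })
}

shinyApp(ui, server)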

Very briefly, here’s how you’d connect to a database, write a table into it using DBI, query it using dplyr, and finally disconnect (you must have DBI, dplyr and pool installed and loaded in order to be able to run this code):

conn <- dbConnect(RSQLite::SQLite(), dbname = ":memory:") dbWriteTable(conn, "quakes", quakes) tbl(conn, "quakes") %>% select(-stations) %>% filter(mag >= 6) ## # Source: lazy query [?? x 4] ## # Database: sqlite 3.19.3 [:memory:] ## lat long depth mag ## ## 1 -20.70 169.92 139 6.1 ## 2 -13.64 165.96 50 6.0 ## 3 -15.56 167.62 127 6.4 ## 4 -12.23 167.02 242 6.0 ## 5 -21.59 170.56 165 6.0 dbDisconnect(conn)

And here’s how you’d do the same using pool:

pool <- dbPool(RSQLite::SQLite(), dbname = ":memory:") dbWriteTable(pool, "quakes", quakes) tbl(pool, "quakes") %>% select(-stations) %>% filter(mag >= 6) ## # Source: lazy query [?? x 4] ## # Database: sqlite 3.19.3 [:memory:] ## lat long depth mag ## ## 1 -20.70 169.92 139 6.1 ## 2 -13.64 165.96 50 6.0 ## 3 -15.56 167.62 127 6.4 ## 4 -12.23 167.02 242 6.0 ## 5 -21.59 170.56 165 6.0 poolClose(pool)

What problem pool was created to solve

As mentioned before, the goal of the pool package is to abstract away the logic of connection management and the performance cost of fetching a new connection from a remote database. These concerns are especially prominent in interactive contexts, like Shiny apps. (So, while this package is of most practical value to Shiny developers, there is no harm if it is used in other contexts.)

The rest of this post elaborates some more on the specific problems of connection management inside of Shiny, and how pool addresses them.

The connection management spectrum

When you’re connecting to a database, it’s important to manage your connections: when to open them (taking into account that this is a potentially long process for remote databases), how to keep track of them, and when to close them. This is always true, but it becomes especially relevant for Shiny apps, where not following best practices can lead to many slowdowns (from inadvertently opening too many connections) and/or many leaked connections (i.e. forgetting to close connections once you no longer need them). Over time, leaked connections could accumulate and substantially slow down your app, as well as overwhelming the database itself.

Oversimplifying a bit, we can think of connection management in Shiny as a spectrum ranging from the extreme of just having one connection per app (potentially serving several sessions of the app) to the extreme of opening (and closing) one connection for each query you make. Neither of these approaches is great: the former is fast, but not robust, and the reverse is true for the latter.

In particular, opening only one connection per app makes it fast (because, in the whole app, you only fetch one connection) and your code is kept as simple as possible. However:

  • it cannot handle simultaneous requests (e.g. two sessions open, both querying the database at the same time);
  • if the connection breaks at some point (maybe the database server crashed), you won’t get a new connection (you have to exit the app and re-run it);
  • finally, if you are not quite at this extreme, and you use more than one connection per app (but fewer than one connection per query), it can be difficult to keep track of all your connections, since you’ll be opening and closing them in potentially very different places.

While the other extreme of opening (and closing) one connection for each query you make resolves all of these points, it is terribly slow (each time we need to access the database, we first have to fetch a connection), and you need a lot more (boilerplate) code to connect and disconnect the connection within each reactive/function.

If you’d like to see actual code that illustrates these two approaches, check this section of the pool README.

The best of both worlds

The pool package was created so that you don’t have to worry about this at all. Since pool abstracts away the logic of connection management, for the vast majority of cases, you never have to deal with connections directly. Since the pool “knows” when it should have more connections and how to manage them, you have all the advantages of the second approach (one connection per query), without the disadvantages. You are still using one connection per query, but that connection is always fetched and returned to the pool, rather than getting it from the database directly. This is a whole lot faster and more efficient. Finally, the code is kept just as simple as the code in the first approach (only one connection for the entire app), since you don’t have to continuously call dbConnect and dbDisconnect.

Feedback

This package has quietly been around for a year and it's now finally on CRAN, following lots of the changes in the database world (both in DBI and dplyr). All pool-related feedback is welcome. Issues (bugs and feature requests) can be posted to the GitHub tracker. Requests for help with code or other questions can be posted to community.rstudio.com/c/shiny, which I check regularly (they can, of course, also be posted to Stack Overflow, but I'm extremely likely to miss it).


Six tips for running a successful unconference

Fri, 11/17/2017 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)


Attendees at the May 2017 rOpenSci unconference. Photo credit: Nistara Randhawa

In May 2017, I helped run a wildly successful “unconference” that had a huge positive impact on the community I serve. rOpenSci is a non-profit initiative enabling open and reproducible research by creating technical infrastructure in the form of staff- and community-contributed software tools in the R programming language that lower barriers to working with scientific data sources on the web, and creating social infrastructure through a welcoming and diverse community of software users and developers. Our 4th annual unconference brought together 70 people to hack on projects they dreamed up and to give them opportunities to meet and work together in person. One third of the participants had attended before, and two thirds were first-timers, selected from an open call for applications. We paid all costs up front for anyone who requested this in order to lower barriers to participation.

It’s called an “unconference” because there is no schedule set before the event – participants discuss project ideas online in advance and projects are selected by participant-voting at the start. I’m sharing some tips here on how to do this well for this particular flavour of unconference.

1. Have a code of conduct

Having a code of conduct that the organizers promote in the welcome goes a long way to creating a welcoming and safe environment and preventing violations in the first place.

2. Host online discussion of project ideas before the unconference

Our unconference centered on teams working on programming projects, rather than discussions, so prior to the unconference, we asked all participants to suggest project ideas using an open online system, called GitHub, that allows everyone to see and comment on ideas or just share enthusiastic emoji to show support.

3. Have a pre-unconference video-chat with first-time participants

Our AAAS CEFP training emphasizes the importance of extending a personalized welcome to community members, so I was inspired to make the bold move of talking with more than 40 first-time participants prior to the unconference. I asked each person to complete a short questionnaire to get them to consider the roles they anticipated playing prior to our chat. Frequently, within an hour of a video-chat, without prompting, I would see the person post a new project idea or share their thoughts about someone else’s. Other times, our conversation gave the person an opportunity to say “this idea maybe isn’t relevant but…” and I would help them talk through it, inevitably leading to “oh my gosh this is such a cool idea”. I got a huge return on my investment. People’s questions like “how do you plan to have 20 different projects present their work?” led to better planning on my part. Specific ideas for improvements came from me responding with “well…how would YOU do it?”

Between the emails, slack channel, issues on GitHub, and personal video chats, I felt completely at ease going into the unconf (where I knew next to no one!).

-rOpenSci unconf17 participant

4. Run an effective ice breaker

I adapted the “Human Barometer” ice breaker to enable 70 people to share opinions across all perceived levels of who a participant is and introduce themselves to the entire group within a 1 hour period. Success depended on creating questions that were relevant to the unconference crowd, and on visually keeping track of who had spoken up in order to call on those who had not. Ice breakers and the rOpenSci version of the Human Barometer will be the subject of a future CEFP blog post.

5. Have a plan to capture content

So much work and money go into running a great unconference, you can’t afford to do it without a plan to capture and disseminate stories about the people and the products. Harness the brain work that went into the ideas! I used the concept of content pillars to come up with a plan. Every project group was given a public repository on GitHub to house their code and documentation. In a 2-day unconference with 70 people in 20 projects, how do people present their results?! We told everyone that they had just three minutes to present, and that the only presentation material they could use was their project README (the page of documentation in their code repository). No slides allowed! This kept their focus on great documentation for the project rather than an ephemeral set of pretty slides. Practically speaking, this meant that all presentations were accessible from a single laptop connected to the projector and that to access their presentation, a speaker had only to click on the link to their repo. Where did the essence of this great idea come from? From a pre-unconference chat of course!

In the week following the unconference, we used the projects’ README documentation to create a series of five posts released Monday through Friday, noting every one of the 20 projects with links to their code repositories. To get more in-depth stories about people and projects, I let participants know we were keen to host community-contributed blog posts and that accepted posts would be tweeted to rOpenSci’s >13,000 followers. Immediately after the unconference, we invited selected projects to contribute longer form narrative blog posts and posted these once a week. The series ended with Unconf 2017: The Roads Not Taken, about all the great ideas that were not yet pursued and inviting people to contribute to these.

All of this content is tied together in one blog post to summarize the unconference and link to all staff- and community-contributed posts in the unconference projects series as well as to posts of warm personal and career-focussed reflections by some participants.

6. Care about other people’s success

Community managers do a lot of “emotional work”. In all of this, my #1 key to running a successful unconference is to genuinely and deeply care that participants arrive feeling ready to hit the ground running and leave feeling like they got what they wanted out of the experience. Ultimately, the success of our unconference is more about the human legacy – building people’s capacity as software users and developers, and the in-person relationships they develop within our online community – than it is about the projects.

“I’ve steered clear of ‘hackathons’ because they feel intimidating and the ‘bad’ kind of competitive. But, this….this was something totally different.”

– rOpenSci unconf17 participant

Additional Resources


The City of Chicago uses R to issue beach safety alerts

Thu, 11/16/2017 - 23:27

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Among the many interesting talks I saw at the Domino Data Science Pop-Up in Chicago earlier this week was the presentation by Gene Lynes and Nick Lucius from the City of Chicago. The City of Chicago Tech Plan encourages smart communities and open government, and as part of that initiative the city has undertaken dozens of open-source, open-data projects in areas such as food safety inspections, preventing the spread of West Nile virus, and keeping sidewalks clear of snow.

This talk was on the Clear Water initiative, a project to monitor the water quality of Chicago's many public beaches on Lake Michigan, and to issue safety alerts (or in serious cases, beach closures) when E. coli levels in the water get too high. The problem is that E. coli levels can change rapidly: levels can be normal for weeks and then spike for a single day. But traditional culture tests take many days to return results, and while rapid DNA-based tests do exist, it's not possible to conduct these tests daily at every beach.

The solution is to build a predictive model, which uses meteorological data and rapid DNA tests for some beaches, combined with historical (culture-based) evaluations of water quality, to predict E. coli levels at all beaches every day. The analysis is performed using R (you can find the R code at this Github repository).
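The city's actual code is in the GitHub repository linked above; purely to illustrate the general shape of such a model, here is a toy sketch on simulated data (every variable name and coefficient below is invented):

set.seed(42)

# simulated beach-days: weather plus a rapid DNA reading from a nearby "sentinel" beach
n <- 1000
beach_days <- data.frame(
  rainfall_24h = rexp(n, rate = 2),                 # rain in the last 24 hours
  water_temp   = rnorm(n, mean = 20, sd = 3),       # water temperature
  sentinel_dna = rlnorm(n, meanlog = 4, sdlog = 1)  # rapid DNA E. coli reading nearby
)

# simulate whether the (slow) culture test later exceeded the advisory threshold
p_exceed <- plogis(-4 + 1.2 * beach_days$rainfall_24h + 0.002 * beach_days$sentinel_dna)
beach_days$exceedance <- rbinom(n, 1, p_exceed)

# a simple classifier predicting exceedance from same-day information
fit <- glm(exceedance ~ rainfall_24h + water_temp + sentinel_dna,
           data = beach_days, family = binomial)

# flag beach-days whose predicted probability crosses a chosen alert threshold
beach_days$alert <- predict(fit, type = "response") > 0.5
table(predicted_alert = beach_days$alert, actual_exceedance = beach_days$exceedance)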

The analysis was developed in conjunction with citizen scientists at Chi Hack Night and statisticians from DePaul University. In 2017, the model was piloted in production in Chicago to issue beach safety alerts and to create a live map of beach water quality. This new R-based model predicted 60 additional occurrences of poor water quality, compared with the process used in prior years.

Still, water quality is hard to predict: once the slower culture tests come back and there is an actual result to compare with, the model's accuracy works out to 38%, with fewer than 2% false alarms. (The city plans to use clustering techniques to further improve that number.) That model uses rapid DNA testing at five beaches to predict all beaches along Lake Michigan. A Shiny app (linked below) lets you explore the impact of testing at a different set of beaches, and adjusting the false positive rate, on the overall accuracy of the model.

You can find more details about the City of Chicago Clear Water initiative at the link below.

City of Chicago: Clear Water


Data science courses in R (/python/and others) for $10 at Udemy (Black Friday sale)

Thu, 11/16/2017 - 22:05

Udemy is offering readers of R-bloggers access to its global online learning marketplace for only $10 per course! This deal (offering over 50%-90% discount) is for hundreds of their courses – including many R-Programming, data science, machine learning etc.

Click here to browse ALL (R and non-R) courses

Advanced R courses: 

Introductory R courses: 

Top data science courses (not in R): 

 


New IBM Plex Sans Support in hrbrthemes + Automating Axis Text Justification

Thu, 11/16/2017 - 15:07

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

IBM has a new set of corporate typefaces — dubbed “Plex” — and has released them with a generous open license. IBM Plex Sans is not too shabby:

(that image was grifted from a Font Squirrel preview page)

The digit glyphs are especially nice for charts and the font itself is just different enough from Helvetica that I felt it was worth including in hrbrthemes. I converted the fonts to TTF (for best compatibility with R ‘stuff’), added them to the package (so you can install them from there), cloned the other ipsum functions into _ps versions and tried to figure out the best fonts for various theme elements. Here’s the result so far:

(I really like those digits.)

I haven’t tried theme_ipsum_ps() on anything but macOS, so if you kick the tyres and encounter problems, please file an issue.

Flushing ticks’ text

Readers with keen eyes will see that the axis text labels have different hjust settings on the ends (vs the 0.5 in the inner ones). I don’t always set them that way, as it is not always appropriate to do so (IMO). When I do, it’s usually been a “figure out the breaks and then hand-craft the theme() components” endeavour. Well, not anymore (at least in simple cases), thanks to the new flush_ticks().

It’s a dead-simple process but it requires knowing the breaks and you pretty much have to build the plot to do that, so the function takes your completed ggplot2 object, introspects it (post-temporary build) and sets the tick text justification in the style seen in the above image. The code below generates that image and demonstrates use of flush_ticks():

library(hrbrthemes) library(ggplot2) ggplot(mpg, aes(displ, hwy)) + geom_jitter(aes(color=class, fill=class), size=3, shape=21, alpha=1/2) + scale_x_continuous(expand=c(0,0), limits=c(1, 8), breaks=1:8) + scale_y_continuous(expand=c(0,0), limits=c(10, 50)) + scale_color_ipsum() + scale_fill_ipsum() + facet_wrap(~class, scales="free") + labs( title="IBM Plex Sans Test", subtitle="This is a subtitle to see the how it looks in IBM Plex Sans", caption="Source: hrbrthemes & IBM" ) + theme_ipsum_ps(grid="XY", axis="xy") + theme(legend.position="none") -> gg flush_ticks(gg)

As indicated, this won’t presently work for free scales, but it’s going to save me a few minutes for each plot I would have done the more manual way. Releasing it into the wild will also help catch edge cases I haven’t considered yet. I’ve added an option to flush_ticks() to generate theme(axis.text...) statements rather than just returning the plot, so you can include the theme() code directly instead of relying on the function to display the plot. I may even make it an RStudio addin (though folks who would like to do that are encouraged to submit a PR).

I hope both new hrbrthemes features are useful to folks and they should be on CRAN in mid-December with an R markdown set of themes to go with the package.


Visualizing how confounding biases estimates of population-wide (or marginal) average causal effects

Thu, 11/16/2017 - 01:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

When we are trying to assess the effect of an exposure or intervention on an outcome, confounding is an ever-present threat to our ability to draw the proper conclusions. My goal (starting here and continuing in upcoming posts) is to think a bit about how to characterize confounding in a way that makes it possible to literally see why improperly estimating intervention effects might lead to bias.

Confounding, potential outcomes, and causal effects

Typically, we think of a confounder as a factor that influences both exposure and outcome. If we ignore the confounding factor in estimating the effect of an exposure, we can easily over- or underestimate the size of the effect due to the exposure. If sicker patients are more likely than healthier patients to take a particular drug, the relatively poor outcomes of those who took the drug may be due to the initial health status rather than the drug.

A slightly different view of confounding is tied to the more conceptual framework of potential outcomes, which I wrote a bit about earlier. A potential outcome is the outcome we would observe if an individual were subjected to a particular exposure. We may or may not observe the potential outcome – this depends on the actual exposure. (To simplify things here, I will assume we are interested only in two different exposures.) \(Y_0\) and \(Y_1\) represent the potential outcomes for an individual with and without exposure, respectively. We observe \(Y_0\) if the individual is not exposed, and \(Y_1\) if she is.

The causal effect of the exposure for the individual \(i\) can be defined as \(Y_{1i} – Y_{0i}\). If we can observe each individual in both states (with and without the exposure) long enough to measure the outcome \(Y\), we are observing both potential outcomes and can measure the causal effect for each individual. Averaging across all individuals in the sample provides an estimate of the population average causal effect. (Think of a crossover or N-of-1 study.)

Unfortunately, in the real world, it is rarely feasible to expose an individual to multiple conditions. Instead, we use one group as a proxy for the other. For example, the control group represents what would have happened to the exposed group had the exposed group not been exposed. This approach only makes sense if the control group is identical in every way to the exposure group (except for the exposure, of course.)

Our goal is to compare the distribution of outcomes for the control group with the exposed group. We often simplify this comparison by looking at the means of each distribution. The average causal effect (across all individuals) can be written as \(E(Y_1 – Y_0)\), where \(E()\) is the expectation or average. In reality, we cannot directly measure this since only one potential outcome is observed for each individual.

Using the following logic, we might be able to convince ourselves that we can use observed measurements to estimate unobservable average causal effects. First, we can say \(E(Y_1 – Y_0) = E(Y_1) – E(Y_0)\), because expectation is linear. Next, it seems fairly reasonable to say that \(E(Y_1 | A = 1) = E(Y | A = 1)\), where \(A=1\) for exposure, \(A=0\) for control. In words, this states that the average potential outcome of exposure for the exposed group is the same as what we actually observe for the exposed group (this is the consistency assumption in causal inference theory). Along the same lines, \(E(Y_0 | A = 0) = E(Y | A = 0)\). Finally, if we can say that \(E(Y_1) = E(Y_1 | A = 1)\) – the potential outcome of exposure for everyone is equal to the potential outcome of exposure for those exposed – then we can say that \(E(Y_1) = E(Y | A = 1)\) (the potential outcome with exposure for everyone is the same as the observed outcome for the exposed). Similarly, we can make the same argument to conclude that \(E(Y_0) = E(Y | A = 0)\). At the end of this train of logic, we conclude that we can estimate \(E(Y_1 – Y_0)\) using observed data only: \(E(Y | A = 1) – E(Y | A = 0)\).
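Written out in display form, that chain of equalities is

\[ \begin{aligned}
E(Y_1 - Y_0) &= E(Y_1) - E(Y_0) \\
&= E(Y_1 | A = 1) - E(Y_0 | A = 0) \\
&= E(Y | A = 1) - E(Y | A = 0),
\end{aligned}
\]

where the second line uses the assumption that the potential outcomes do not differ between the exposed and unexposed groups, and the third line uses the consistency assumption described above.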

This nice logic fails if \(E(Y_1) \ne E(Y | A = 1)\) and/or \(E(Y_0) \ne E(Y | A = 0)\). That is, this nice logic fails when there is confounding.

This is all a very long-winded way of saying that confounding arises when the distributions of potential outcomes for the population are different from those distributions for the subgroups we are using for analysis. For example, if the potential outcome under exposure for the population as a whole (\(Y_1\)) differs from the observed outcome for the subgroup that was exposed (\(Y|A=1\)), or the potential outcome without exposure for the entire population (\(Y_0\)) differs from the observed outcome for the subgroup that was not exposed (\(Y|A=0\)), any estimates of population level causal effects using observed data will be biased.

However, if we can find a factor \(L\) (or factors) where

\[ \begin{aligned}
P(Y_1 | L=l) &= P(Y | A = 1 \text{ and } L=l) \\
P(Y_0 | L=l) &= P(Y | A = 0 \text{ and } L=l)
\end{aligned}
\] both hold for all levels or values of \(L\), we can remove confounding (and get unbiased estimates of the causal effect) by “controlling” for \(L\). In some cases, the causal effect we measure will be conditional on \(L\), sometimes it will be a population-wide average (or marginal) causal effect, and sometimes it will be both.

What confounding looks like …

The easiest way to illustrate the population/subgroup contrast is to generate data from a process that includes confounding. In this first example, the outcome is continuous, and is a function of both the exposure (\(A\)) and a covariate (\(L\)). For each individual, we can generate both potential outcomes \(Y_0\) and \(Y_1\). (Note that both potential outcomes share the same individual level noise term \(e\) – this is not a necessary assumption.) This way, we can “know” the true population, or marginal causal effect of exposure. The observed outcome \(Y\) is determined by the exposure status. For the purposes of plotting a smooth density curve, we generate a very large sample – 2 million.

library(simstudy) defC <- defData(varname = "e", formula = 0, variance = 2, dist = "normal") defC <- defData(defC, varname = "L", formula = 0.4, dist = "binary") defC <- defData(defC, varname = "Y0", formula = "1 + 4*L + e", dist = "nonrandom") defC <- defData(defC, varname = "Y1", formula = "5 + 4*L + e", dist = "nonrandom") defC <- defData(defC, varname = "A", formula = "0.3 + 0.3 * L", dist = "binary") defC <- defData(defC, varname = "Y", formula = "1 + 4*A + 4*L + e", dist = "nonrandom") set.seed(2017) dtC <- genData(n = 2000000, defC) dtC[1:5] ## id e L Y0 Y1 A Y ## 1: 1 2.02826718 1 7.0282672 11.028267 1 11.0282672 ## 2: 2 -0.10930734 0 0.8906927 4.890693 0 0.8906927 ## 3: 3 1.04529790 0 2.0452979 6.045298 0 2.0452979 ## 4: 4 -2.48704266 1 2.5129573 6.512957 1 6.5129573 ## 5: 5 -0.09874778 0 0.9012522 4.901252 0 0.9012522

Feel free to skip over this code – I am just including in case anyone finds it useful to see how I generated the following series of annotated density curves:

library(ggplot2) getDensity <- function(vector, weights = NULL) { if (!is.vector(vector)) stop("Not a vector!") if (is.null(weights)) { avg <- mean(vector) } else { avg <- weighted.mean(vector, weights) } close <- min(which(avg < density(vector)$x)) x <- density(vector)$x[close] if (is.null(weights)) { y = density(vector)$y[close] } else { y = density(vector, weights = weights)$y[close] } return(data.table(x = x, y = y)) } plotDens <- function(dtx, var, xPrefix, title, textL = NULL, weighted = FALSE) { dt <- copy(dtx) if (weighted) { dt[, nIPW := IPW/sum(IPW)] dMarginal <- getDensity(dt[, get(var)], weights = dt$nIPW) } else { dMarginal <- getDensity(dt[, get(var)]) } d0 <- getDensity(dt[L==0, get(var)]) d1 <- getDensity(dt[L==1, get(var)]) dline <- rbind(d0, dMarginal, d1) brk <- round(dline$x, 1) p <- ggplot(aes(x=get(var)), data=dt) + geom_density(data=dt[L==0], fill = "#ce682f", alpha = .4) + geom_density(data=dt[L==1], fill = "#96ce2f", alpha = .4) if (weighted) { p <- p + geom_density(aes(weight = nIPW), fill = "#2f46ce", alpha = .8) } else p <- p + geom_density(fill = "#2f46ce", alpha = .8) p <- p + geom_segment(data = dline, aes(x = x, xend = x, y = 0, yend = y), size = .7, color = "white", lty=3) + annotate(geom="text", x = 12.5, y = .24, label = title, size = 5, fontface = 2) + scale_x_continuous(limits = c(-2, 15), breaks = brk, name = paste(xPrefix, var)) + theme(panel.grid = element_blank(), axis.text.x = element_text(size = 12), axis.title.x = element_text(size = 13) ) if (!is.null(textL)) { p <- p + annotate(geom = "text", x = textL[1], y = textL[2], label = "L=0", size = 4, fontface = 2) + annotate(geom = "text", x = textL[3], y = textL[4], label="L=1", size = 4, fontface = 2) + annotate(geom = "text", x = textL[5], y = textL[6], label="Population distribution", size = 4, fontface = 2) } return(p) } library(gridExtra) grid.arrange(plotDens(dtC, "Y0", "Potential outcome", "Full\npopulation", c(1, .24, 5, .22, 2.6, .06)), plotDens(dtC[A==0], "Y", "Observed", "Unexposed\nonly"), plotDens(dtC, "Y1", "Potential outcome", "Full\npopulation"), plotDens(dtC[A==1], "Y", "Observed", "Exposed\nonly"), nrow = 2 )

Looking at the various plots, we can see a few interesting things. The density curves on the left represent the entire population. The conditional distributions of the potential outcomes at the population level are all normally distributed, with means that depend on the exposure and covariate \(L\). We can also see that the population-wide distribution of \(Y_0\) and \(Y_1\) (in blue) are non-symmetrically shaped, because they are a mixture of the conditional normal distributions, weighted by the proportion of each level of \(L\). Since the proportions for the top and bottom plots are in fact the population proportion, the population-level density curves for \(Y_0\) and \(Y_1\) are similarly shaped, with less mass on the higher end, because individuals are less likely to have an \(L\) value of 1:

dtC[, .(propLis1 = mean(L))] ## propLis1 ## 1: 0.399822

The shape of the marginal distribution of \(Y_1\) is identical to \(Y_0\) (in this case, because that is the way I generated the data), but shifted to the right by an amount equal to the causal effect. The conditional effect sizes are 4, as is the population or marginal effect size.

The subgroup plots on the right are a different story. In this case, the distributions of \(L\) vary across the exposed and unexposed groups:

dtC[, .(propLis1 = mean(L)), keyby = A] ## A propLis1 ## 1: 0 0.2757937 ## 2: 1 0.5711685

So, even though the distributions of (observed) \(Y\) conditional on \(L\) are identical to their potential outcome counterparts in the whole population – for example, \(P(Y | A=0 \text{ and } L = 1) = P(Y_0 | L = 1)\) – the marginal distributions of \(Y\) are quite different for the exposed and unexposed. For example, \(P(Y | A = 0) \ne P(Y_0)\). This is directly due to the fact that the mixing weights (the proportions of \(L\)) are different for each of the groups. In the unexposed group, about 28% have \(L=1\), but for the exposed group, about 57% do. Using the subgroup data only, the conditional effect sizes are still 4 (comparing mean outcomes \(Y\) within each level of \(L\)). However the difference in means between the marginal distributions of each subgroup is about 5.2 (calculated by 7.3 – 2.1). This is confounding.
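A quick numerical check of those figures, using the dtC data table generated above (these are the same quantities read off the plots):

# conditional means: within each level of L, exposed minus unexposed is 4
dtC[, .(meanY = mean(Y)), keyby = .(L, A)]

# marginal means of the observed subgroups: roughly 2.1 and 7.3, a gap of about 5.2
dtC[, .(meanY = mean(Y)), keyby = A]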

No confounding

Just so we can see that when the covariate \(L\) has nothing to do with the probability of exposure, the marginal distributions of the subgroups do in fact look like their population-level potential outcome marginal distributions:

defC <- updateDef(defC, "A", newformula = 0.5) # change data generation dtC <- genData(n = 2000000, defC) dtC[, .(propLis1 = mean(L)), keyby = A] # subgroup proportions ## A propLis1 ## 1: 0 0.4006499 ## 2: 1 0.3987437 dtC[, .(propLis1 = mean(L))] # population/marginal props ## propLis1 ## 1: 0.3996975 grid.arrange(plotDens(dtC, "Y0", "Potential outcome", "Population", c(1, .24, 5, .22, 2.6, .06)), plotDens(dtC[A==0], "Y", "Observed", "Unexposed"), plotDens(dtC, "Y1", "Potential outcome", "Population"), plotDens(dtC[A==1], "Y", "Observed", "Exposed"), nrow = 2 )

Estimation of causal effects (now with confounding)

Generating a smaller data set, we estimate the causal effects using simple calculations and linear regression:

library(broom) # change back to confounding defC <- updateDef(defC, "A", newformula = ".3 + .3 * L") dtC <- genData(2500, defC)

The true average (marginal) causal effect from the average difference in potential outcomes for the entire population:

dtC[, mean(Y1 - Y0)] ## [1] 4

And the true average causal effects conditional on the covariate \(L\):

dtC[, mean(Y1 - Y0), keyby = L] ## L V1 ## 1: 0 4 ## 2: 1 4

If we try to estimate the marginal causal effect by using a regression model that does not include \(L\), we run into problems. The estimate of 5.2 we see below is the same biased estimate we saw in the plot above. This model is reporting the differences of the means (across both levels of \(L\)) for the two subgroups, which we know (because we saw) are not the same as the potential outcome distributions in the population due to different proportions of \(L\) in each subgroup:

tidy(lm(Y ~ A, data = dtC)) ## term estimate std.error statistic p.value ## 1 (Intercept) 2.027132 0.06012997 33.71251 1.116211e-205 ## 2 A 5.241004 0.09386145 55.83766 0.000000e+00

If we estimate a model that conditions on \(L\), the estimates are on target because in the context of normal linear regression without interaction terms, conditional effects are the same as marginal effects (when confounding has been removed, or think of the comparisons being made within the orange groups and green groups in the first set of plots above, not within the purple groups):

tidy(lm(Y ~ A + L , data = dtC)) ## term estimate std.error statistic p.value ## 1 (Intercept) 0.9178849 0.03936553 23.31697 5.809202e-109 ## 2 A 4.0968358 0.05835709 70.20288 0.000000e+00 ## 3 L 3.9589109 0.05862583 67.52844 0.000000e+00

Inverse probability weighting (IPW)

What follows briefly here is just a sneak preview of IPW (without any real explanation), which is one way to recover the marginal mean using observed data with confounding. For now, I am ignoring the question of why you might be interested in knowing the marginal effect when the conditional effect estimate provides the same information. Suffice it to say that the conditional effect is not always the same as the marginal effect (think of data generating processes that include interactions or non-linear relationships), and sometimes the marginal effect estimate may be the best that we can do, or at least that we can do easily.

If we weight each individual observation by the inverse probability of exposure, we can remove confounding and estimate the marginal effect of exposure on the outcome. Here is a quick simulation example.

After generating the dataset (the same large one we started out with so you can compare) we estimate the probability of exposure \(P(A=1 | L)\), assuming that we know the correct exposure model. This is definitely a questionable assumption, but in this case, we actually do. Once the model has been fit, we assign the predicted probability to each individual based on her value of \(L\).

set.seed(2017) dtC <- genData(2000000, defC) exposureModel <- glm(A ~ L, data = dtC, family = "binomial") tidy(exposureModel) ## term estimate std.error statistic p.value ## 1 (Intercept) -0.847190 0.001991708 -425.3584 0 ## 2 L 1.252043 0.003029343 413.3053 0 dtC[, pA := predict(exposureModel, type = "response")]

The IPW is not based exactly on \(P(A=1 | L)\) (which is commonly used in propensity score analysis), but rather, the probability of the actual exposure at each level of \(L\): \(P(A=a | L)\), where \(a\in(0,1)\):

# Define two new columns defC2 <- defDataAdd(varname = "pA_actual", formula = "A * pA + (1-A) * (1-pA)", dist = "nonrandom") defC2 <- defDataAdd(defC2, varname = "IPW", formula = "1/pA_actual", dist = "nonrandom") # Add weights dtC <- addColumns(defC2, dtC) round(dtC[1:5], 2) ## id e L Y0 Y1 A Y pA pA_actual IPW ## 1: 1 2.03 1 7.03 11.03 1 11.03 0.6 0.6 1.67 ## 2: 2 -0.11 0 0.89 4.89 0 0.89 0.3 0.7 1.43 ## 3: 3 1.05 0 2.05 6.05 0 2.05 0.3 0.7 1.43 ## 4: 4 -2.49 1 2.51 6.51 1 6.51 0.6 0.6 1.67 ## 5: 5 -0.10 0 0.90 4.90 0 0.90 0.3 0.7 1.43

To estimate the marginal effect, we use the function lm again, but with weights specified by IPW. The true value of the marginal effect of exposure (based on the population-wide potential outcomes) was 4.0. I know I am repeating myself here, but first I am providing the biased estimate that we get when we ignore covariate \(L\), to convince you that the relationship between exposure and outcome is indeed confounded:

tidy(lm(Y ~ A , data = dtC)) ## term estimate std.error statistic p.value ## 1 (Intercept) 2.101021 0.002176711 965.2275 0 ## 2 A 5.184133 0.003359132 1543.2956 0

And now, with the simple addition of the weights but still not including \(L\) in the model, our weighted estimate of the marginal effect is spot on (but with such a large sample size, this is not so surprising):

tidy(lm(Y ~ A , data = dtC, weights = IPW)) ## term estimate std.error statistic p.value ## 1 (Intercept) 2.596769 0.002416072 1074.789 0 ## 2 A 4.003122 0.003416842 1171.585 0

And finally, here is a plot of the IPW-adjusted density. You might think I am just showing you the plots for the unconfounded data again, but you can see from the code (and I haven’t hidden anything) that I am still using the data set with confounding. In particular, you can see that I am calling the routine plotDens with weights.

grid.arrange(plotDens(dtC, "Y0", "Potential outcome", "Population", c(1, .24, 5, .22, 2.6, .06)), plotDens(dtC[A==0], "Y", "Observed", "Unexposed", weighted = TRUE), plotDens(dtC, "Y1", "Potential outcome", "Population"), plotDens(dtC[A==1], "Y", "Observed", "Exposed", weighted = TRUE), nrow = 2 )

As I mentioned, I hope to write more on IPW, and marginal structural models, which make good use of this methodology to estimate effects that can be challenging to get a handle on.

