R-bloggers: R news and tutorials contributed by 750 R bloggers

Divided Differences Method of Polynomial Interpolation

Thu, 07/27/2017 - 20:00

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

Part of 6 in the series Numerical Analysis

The divided differences method is a numerical procedure for interpolating a polynomial given a set of points. Unlike Neville’s method, which is used to approximate the value of an interpolating polynomial at a given point, the divided differences method constructs the interpolating polynomial in Newton form.

Consider a table of values of x and the corresponding values of a function f at those points:

x: x_0, x_1, ..., x_n
y: f(x_0), f(x_1), ..., f(x_n)

Also assume that P_n is the nth Lagrange polynomial that corresponds with the function f at these points. The polynomial P_n can be expressed using the divided differences of the function f with respect to the x-values:

P_n(x) = a_0 + a_1(x - x_0) + a_2(x - x_0)(x - x_1) + ... + a_n(x - x_0)(x - x_1) ... (x - x_{n-1})

Therefore the constants a_0, a_1, ..., a_n must be found to construct the polynomial. To find these constants, the divided differences are recursively generated until n iterations have been completed. We start with the zeroth divided difference of the function f with respect to x_i, which is simply the value of f at that point. Bracket notation is introduced to distinguish the divided differences:

f[x_i] = f(x_i)

Now begins the recursive generation of the divided differences. The first divided difference is then the function f with respect to the values x_i and x_{i+1}:

f[x_i, x_{i+1}] = (f[x_{i+1}] - f[x_i]) / (x_{i+1} - x_i)

The second divided difference follows:

f[x_i, x_{i+1}, x_{i+2}] = (f[x_{i+1}, x_{i+2}] - f[x_i, x_{i+1}]) / (x_{i+2} - x_i)

This iteration continues until the nth divided difference:

f[x_0, x_1, ..., x_n] = (f[x_1, x_2, ..., x_n] - f[x_0, x_1, ..., x_{n-1}]) / (x_n - x_0)

Thus the interpolating polynomial resulting from the divided differences method takes the form:

P_n(x) = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) + ... + f[x_0, x_1, ..., x_n](x - x_0)(x - x_1) ... (x - x_{n-1})

Divided Differences Example

Consider the following set of points and the corresponding values of a function f at those points. This data set is the same as in the previous post on Neville’s method. We will perform the divided differences method to find the interpolating polynomial of these points and approximate the polynomial at x = 8.4.

x      f(x)
8.1    16.9446
8.3    17.56492
8.6    18.50515
8.7    18.82091

First, f[x_0, x_1] is found:

f[x_0, x_1] = (17.56492 - 16.9446) / (8.3 - 8.1) = 3.1016

Then, f[x_1, x_2] is computed:

f[x_1, x_2] = (18.50515 - 17.56492) / (8.6 - 8.3) = 3.1341

With f[x_0, x_1] and f[x_1, x_2], we find f[x_0, x_1, x_2]:

f[x_0, x_1, x_2] = (3.1341 - 3.1016) / (8.6 - 8.1) = 0.065

f[x_2, x_3] is then found:

f[x_2, x_3] = (18.82091 - 18.50515) / (8.7 - 8.6) = 3.1576

We can then compute f[x_1, x_2, x_3]:

f[x_1, x_2, x_3] = (3.1576 - 3.1341) / (8.7 - 8.3) = 0.05875

Lastly, we calculate f[x_0, x_1, x_2, x_3]:

f[x_0, x_1, x_2, x_3] = (0.05875 - 0.065) / (8.7 - 8.1) = -0.010417

The interpolating polynomial, P_3(x), is then constructed:

P_3(x) = 16.9446 + 3.1016(x - 8.1) + 0.065(x - 8.1)(x - 8.3) - 0.010417(x - 8.1)(x - 8.3)(x - 8.6)

We can see the interpolated polynomial passes through the points with the following:

library(ggplot2)

p3x <- function(x) {
  f <- 16.9446 + 3.1016 * (x - 8.1) + 0.065 * (x - 8.1) * (x - 8.3) -
    0.010417 * (x - 8.1) * (x - 8.3) * (x - 8.6)
  return(f)
}

x <- c(8.1, 8.3, 8.6, 8.7)
fx <- c(16.9446, 17.56492, 18.50515, 18.82091)
dat <- data.frame(cbind(x, fx))

ggplot(dat, aes(x = x, y = fx)) +
  geom_point(size = 5, col = 'blue') +
  stat_function(fun = p3x, size = 1.25, alpha = 0.4)

Using the function above, we can also see that the interpolated polynomial resulting from the divided differences method returns the same approximated value of the function at x = 8.4 as Neville’s method.

p3x(8.4)
## [1] 17.87709

Divided Differences in R

The following is an implementation of the divided differences method of polynomial interpolation in R. The function utilizes the rSymPy library to build the interpolating polynomial and approximate the value of the function for a given value of x.

divided.differences <- function(x, y, x0) {
  require(rSymPy)
  n <- length(x)
  q <- matrix(data = 0, n, n)
  q[,1] <- y
  f <- as.character(round(q[1,1], 5))
  fi <- ''

  for (i in 2:n) {
    for (j in i:n) {
      q[j,i] <- (q[j,i-1] - q[j-1,i-1]) / (x[j] - x[j-i+1])
    }
    fi <- paste(fi, '*(x - ', x[i-1], ')', sep = '', collapse = '')
    f <- paste(f, ' + ', round(q[i,i], 5), fi, sep = '', collapse = '')
  }

  x <- Var('x')
  sympy(paste('e = ', f, collapse = '', sep = ''))
  approx <- sympy(paste('e.subs(x, ', as.character(x0), ')', sep = '', collapse = ''))

  return(list('Approximation from Interpolation' = as.numeric(approx),
              'Interpolated Function' = f,
              'Divided Differences Table' = q))
}

Let’s use the function to interpolate given the values of x and the corresponding values of the function f(x) from the example above, and approximate the function when x = 8.4.

divided.differences(x, fx, 8.4)
## Loading required package: rSymPy
## Loading required package: rJython
## Loading required package: rJava
## Loading required package: rjson
## $`Approximation from Interpolation`
## [1] 17.87709
##
## $`Interpolated Function`
## [1] "16.9446 + 3.1016*(x - 8.1) + 0.065*(x - 8.1)*(x - 8.3) + -0.01042*(x - 8.1)*(x - 8.3)*(x - 8.6)"
##
## $`Divided Differences Table`
##          [,1]   [,2]    [,3]        [,4]
## [1,] 16.94460 0.0000 0.00000  0.00000000
## [2,] 17.56492 3.1016 0.00000  0.00000000
## [3,] 18.50515 3.1341 0.06500  0.00000000
## [4,] 18.82091 3.1576 0.05875 -0.01041667

The function returns the approximated value, the interpolated polynomial used to approximate the function, and the divided differences table which contains the intermediate calculations we saw earlier. I wasn’t able to find a function in R or a package that performs the divided differences method, so, unfortunately, I have nothing against which to compare our function. However, the results agree with our manual calculations, which should be satisfactory for this introductory purpose.
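
One quick cross-check (not part of the original post) is to note that the degree-3 interpolating polynomial through four points is unique, so an exact cubic fit with lm() must reproduce the same value at x = 8.4:

fit <- lm(fx ~ poly(x, 3))                   # exact fit: four points, cubic polynomial
predict(fit, newdata = data.frame(x = 8.4))  # should agree with p3x(8.4), about 17.87709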

References

Burden, R. L., & Faires, J. D. (2011). Numerical analysis (9th ed.). Boston, MA: Brooks/Cole, Cengage Learning.

The post Divided Differences Method of Polynomial Interpolation appeared first on Aaron Schlegel.



RStudio meets MilanoR – Presentations, photos and video!

Thu, 07/27/2017 - 18:39

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Hello R-users,

On June 29th we had the great pleasure of hosting RStudio in Milano at Impact Hub. It went absolutely well, with great participation. Thank you all!

This post is just to share the materials with those of you who could not make it, or who just want to go through them again and again!

We had two exceptional guests: Nathan Stephens, director of solutions engineering at RStudio, and Joseph Rickert, member of the R Consortium board of directors, accompanied by Pete Knast, enterprise advocate at RStudio. They came all the way from the U.S. just to meet the MilanoR Community and the Milano Data Science Community!

The event started with a short introduction of the two hosting communities: MilanoR community and Milano Data Science community. Mariachiara spoke about the events that we have been organising this year: please take a look at the slides for past, present and future events of MilanoR community. Gianmario introduced the Data Science Milan community, the independent group of professionals with the goal of promoting and pioneering knowledge and innovation of the data-driven revolution in the Italian peninsula and beyond.


MilanoR – Past, present and future 

by Mariachiara Fortuna

 

Welcome Presentation 

by Gianmario Spacagna

 

Then came our special guests: first, Nathan Stephens went through the latest news from RStudio. Which direction is RStudio taking? What’s next? Through an interactive and entertaining demonstration, Nathan showed us the newest RStudio products: the tidyverse, R Notebooks and RStudio Connect. These tools enable better data science with R and will make working with R in teams simpler as well as more efficient. He gave us insights into how RStudio is evolving towards a more efficient way to communicate and share results. Read the slides below to learn more about Nathan’s presentation.

Publishing and self-service analytics with R

by Nathan Stephens

 

Last, Joseph Rickert introduced us to the financial support that the R Consortium gives to R communities and R-related initiatives. He gave us a quick overview of the many R communities spread all around the world. It was fascinating and inspiring to hear how we all share the same enthusiasm for R. It certainly was for the group of young women who, the day after the meeting, gave birth to the MilanoR Ladies! Thank you Joe for contributing to enlarging the R community!

Community support: R Consortium & RStudio

by Joseph Rickert

 

At the end of the meeting we had the chance to network while enjoying some nice refreshments kindly offered by RStudio.

The whole meeting was streamed live on both Facebook and YouTube, so the full recording is available on our Facebook channel as well as on YouTube. Thanks @Akiro and @Fabio for taking care of it!

Meeting full video

 

 

We will never be able to thank enough our sponsors, Quantide and RStudio, who made all of this possible, Nathan and Joseph, who gave two inspiring talks, and of course all of you for supporting us with such enthusiasm.

Thank you so much, and enjoy your summer holidays!

 

The post RStudio meets MilanoR – Presentations, photos and video! appeared first on MilanoR.



Quick Way of Installing all your old R libraries on a New Device

Thu, 07/27/2017 - 06:26

(This article was first published on R – Discovering Python & R, and kindly contributed to R-bloggers)

I recently bought a new laptop and began installing essential software all over again, including R of course! And I wanted all the libraries that I had installed in my previous laptop. Instead of installing libraries one by one all over again, I did the following:

Step 1: Save a list of packages installed in your old computing device (from your old device).


installed <- as.data.frame(installed.packages())
write.csv(installed, 'installed_previously.csv')

This saves information on installed packages in a csv file named installed_previously.csv. Now copy or e-mail this file to your new device and access it from your working directory in R.

Step 2: Create a list of libraries from your old list that were not already installed when you freshly download R (from your new device).


installedPreviously <- read.csv('installed_previously.csv')
baseR <- as.data.frame(installed.packages())
# Compare package names rather than whole data frames, so that
# install.packages() later receives a character vector of package names
toInstall <- setdiff(installedPreviously$Package, baseR$Package)

We now have a list of the libraries that were installed on your previous computer but are not among the packages that come with a fresh R install. You can now go ahead and install these libraries.

Step 3: Install the libraries in this list.


install.packages(toInstall)

That’s it. Save yourself the trouble of installing packages one by one all over again.
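
For reference, here is the whole workflow on the new machine collapsed into one short sketch (it assumes installed_previously.csv has already been copied into the working directory):

installedPreviously <- read.csv('installed_previously.csv', stringsAsFactors = FALSE)
alreadyInstalled <- rownames(installed.packages())
toInstall <- setdiff(installedPreviously$Package, alreadyInstalled)
if (length(toInstall) > 0) install.packages(toInstall)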



How Virtual Tags have transformed SCADA data analysis

Thu, 07/27/2017 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

This article describes how to use virtual tags to analyse SCADA data. Virtual tags provide context to SCADA or Historian data by combining information from various tags with metadata about these tags. Continue reading →
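
The linked article has the full details; purely as an illustration of the idea (the data frames and column names below are hypothetical, not taken from the original post), a virtual tag can be thought of as a derived series built by joining raw tag readings to tag metadata:

library(dplyr)

# Hypothetical historian readings and tag metadata
scada <- data.frame(TagName  = c("FIT001", "FIT002", "FIT001", "FIT002"),
                    DateTime = as.POSIXct(c("2017-07-01 00:00", "2017-07-01 00:00",
                                            "2017-07-01 01:00", "2017-07-01 01:00")),
                    Value    = c(12.1, 3.4, 11.8, 3.9),
                    stringsAsFactors = FALSE)
tags <- data.frame(TagName = c("FIT001", "FIT002"),
                   Site    = "Pump Station A",
                   Measure = c("Inflow", "Outflow"),
                   stringsAsFactors = FALSE)

# A 'virtual tag': net flow per site and timestamp, derived from the raw tags
virtual <- scada %>%
  left_join(tags, by = "TagName") %>%
  group_by(Site, DateTime) %>%
  summarise(NetFlow = sum(Value[Measure == "Inflow"]) - sum(Value[Measure == "Outflow"]))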

The post How Virtual Tags have transformed SCADA data analysis appeared first on The Devil is in the Data.



UseR Brussels 2017

Thu, 07/27/2017 - 02:00

(This article was first published on Clean Code, and kindly contributed to R-bloggers)

I went to useR! 2017 in Brussels. It was my first time at a useR! conference (I have been to the first satRday), AND I LOVED IT!
There were many interesting talks (I am definitely going to use Fast-and-Frugal Trees in the future, for instance) and I saw a lot of Shiny applications and R professionals.

But best of all, I talked to a lot of people, some of whom I had only spoken to online.
I thanked some people for their help with my packages and generally had a lot of fun!

One of the fun things I did was asking people about packages that should be created but
do not yet exist. I put them here: https://github.com/RMHogervorst/shouldbeapackage. Contribute if you want!

One of the packages that should exist is one that gives you suggestions for karaoke, preferably power ballads.
Of course, I used one of the breaks to create a GitHub page and a package. However, a suggestion app needs songs, so
Kevin O’Brien and I tricked people into adding to the list of power ballads.

What we still need, though, is a Shiny app that you can access from your phone when you are IN the karaoke bar.

We put the package under a new GitHub group: R-eoke
And yes, submit your songs!

See you around.

UseR Brussels 2017 was originally published by at Clean Code on July 27, 2017.



Span Dates and Times without Overhead

Wed, 07/26/2017 - 21:00

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

I am working on v0.4.0 of the padr package this summer. Two new features that will be added are wrappers around seq.Date and seq.POSIXt. Since it is going to take a while before the new release is on CRAN, I will go ahead and give an early presentation of these functions. Date and datetime parsing in base R is powerful and comprehensive, but also tedious, and it can slow you down in your programming or analysis. Luckily, good wrappers and alternatives exist, such as the ymd_h{m}{s} suite from lubridate and Dirk Eddelbuettel’s anytime. These functions remove much of the overhead of date and datetime parsing, allowing for quick formatting of vectors in all kinds of formats. They also alleviate the pain of using seq.Date() and seq.POSIXt a little, because the from and to arguments need to be already-parsed dates or datetimes. Take the following example.

seq(as.POSIXct("2017-07-25 00:00:00"), as.POSIXct("2017-07-25 03:00:00"), by = "hour")
## [1] "2017-07-25 00:00:00 CEST" "2017-07-25 01:00:00 CEST"
## [3] "2017-07-25 02:00:00 CEST" "2017-07-25 03:00:00 CEST"

library(lubridate)
seq(ymd_h("20170725 00"), ymd_h("20170725 03"), by = "hour")
## [1] "2017-07-25 00:00:00 UTC" "2017-07-25 01:00:00 UTC"
## [3] "2017-07-25 02:00:00 UTC" "2017-07-25 03:00:00 UTC"

I think, however, that there is still some overhead in the second specification. By overhead I mean specifying things that feel redundant, things that could be set to some kind of default. Since the whole idea behind padr is automating away redundant and tedious actions in preparing datetime data, providing alternative functions that ask for as little as possible is a natural addition. This resulted in span_date and span_time. They remove overhead by:

  • allowing for specification of from and to directly as integer or character in lubridatish format.

  • setting the unspecified datetime parts to a default of 1 for month and day, and 0 for hour, minute, and second.

  • assuming the desired interval (the by argument in seq.Date and seq.POSIXt) to be the lowest of the datetime parts specified in either from or to.

If this is still a little abstract, let me give some examples. The example above becomes:

devtools::install_github("EdwinTh/padr") # download the dev version
library(padr)
span_time("20170725 00", "20170725 03")
## [1] "2017-07-25 00:00:00 UTC" "2017-07-25 01:00:00 UTC"
## [3] "2017-07-25 02:00:00 UTC" "2017-07-25 03:00:00 UTC"

We can simplify this even further: specifying the 00 for the hour in from is not strictly necessary. Since the hour is already specified in to, the interval will remain hour if we omit it.

span_time("20170725", "20170725 03") ## [1] "2017-07-25 00:00:00 UTC" "2017-07-25 01:00:00 UTC" ## [3] "2017-07-25 02:00:00 UTC" "2017-07-25 03:00:00 UTC"

We can even use an integer instead of a character for from. When there are no time parts involved, a character is not necessary. Since we use it in span_time, it will be parsed to POSIXct, not to Date.

span_time(20170725, "20170725 03")
## [1] "2017-07-25 00:00:00 UTC" "2017-07-25 01:00:00 UTC"
## [3] "2017-07-25 02:00:00 UTC" "2017-07-25 03:00:00 UTC"

to does not have to be specified; we can use len_out instead. The interval is then derived from from alone. To get January 1st of every year from 2010 to 2014, we can do both

span_date(2010, 2014)
## [1] "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01" "2014-01-01"

and

span_date(2010, len_out = 5)
## [1] "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01" "2014-01-01"

If you want the interval to be different from the default, you can specify it.

span_date(2016, 2017, interval = "month")
## [1] "2016-01-01" "2016-02-01" "2016-03-01" "2016-04-01" "2016-05-01"
## [6] "2016-06-01" "2016-07-01" "2016-08-01" "2016-09-01" "2016-10-01"
## [11] "2016-11-01" "2016-12-01" "2017-01-01"

Note however, that you can often also specify the interval by providing more information in from or to.

span_date(201601, 2017)
## [1] "2016-01-01" "2016-02-01" "2016-03-01" "2016-04-01" "2016-05-01"
## [6] "2016-06-01" "2016-07-01" "2016-08-01" "2016-09-01" "2016-10-01"
## [11] "2016-11-01" "2016-12-01" "2017-01-01"

I hope you find these little wrappers around seq.Date and seq.POSIXt useful and that they enable you to conquer dates and datetimes a little quicker. You can obtain the functions by downloading the dev version of padr, as I did above. If you can think of improvements to the functions before they hit CRAN, please tell me. Issues filed, pull requests, emails, and tweets are much appreciated.



ggplot2 – Easy way to mix multiple graphs on the same page

Wed, 07/26/2017 - 18:14

(This article was first published on Easy Guides, and kindly contributed to R-bloggers)

To arrange multiple ggplot2 graphs on the same page, the standard R functions – par() and layout() – cannot be used.

The basic solution is to use the gridExtra R package, which comes with the following functions:

  • grid.arrange() and arrangeGrob() to arrange multiple ggplots on one page
  • marrangeGrob() for arranging multiple ggplots over multiple pages.

However, these functions make no attempt to align the plot panels; instead, the plots are simply placed into the grid as they are, so the axes are not aligned.

If axis alignment is required, you can switch to the cowplot package, which includes the function plot_grid() with the argument align. However, the cowplot package doesn’t contain any solution for multi-page layouts. Therefore, we provide the function ggarrange() [in ggpubr], a wrapper around the plot_grid() function, to arrange multiple ggplots over multiple pages. It can also create a common unique legend for multiple plots.

This article will show you, step by step, how to combine multiple ggplots on the same page, as well as over multiple pages, using helper functions available in the following R packages: ggpubr, cowplot and gridExtra. We’ll also describe how to export the arranged plots to a file.


Prerequisites

Required R package

You need to install the R package ggpubr (version >= 0.1.3) to easily create ggplot2-based publication-ready plots.

We recommend installing the latest development version from GitHub as follows:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")

If the installation from GitHub fails, try to install from CRAN as follows:

install.packages("ggpubr")

Note that the installation of ggpubr will install the gridExtra and cowplot packages, so you don’t need to re-install them.

Load ggpubr:

library(ggpubr)

Demo data sets

Data: ToothGrowth and mtcars data sets.

# ToothGrowth
data("ToothGrowth")
head(ToothGrowth)
   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
4  5.8   VC  0.5
5  6.4   VC  0.5
6 10.0   VC  0.5

# mtcars
data("mtcars")
mtcars$name <- rownames(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
head(mtcars[, c("name", "wt", "mpg", "cyl")])
                               name    wt  mpg cyl
Mazda RX4                 Mazda RX4 2.620 21.0   6
Mazda RX4 Wag         Mazda RX4 Wag 2.875 21.0   6
Datsun 710               Datsun 710 2.320 22.8   4
Hornet 4 Drive       Hornet 4 Drive 3.215 21.4   6
Hornet Sportabout Hornet Sportabout 3.440 18.7   8
Valiant                     Valiant 3.460 18.1   6

Create some plots

Here, we’ll use ggplot2-based plotting functions available in ggpubr. You can use any ggplot2 functions to create the plots that you want for arranging them later.

We’ll start by creating 4 different plots:

  • Box plots and dot plots using the ToothGrowth data set
  • Bar plots and scatter plots using the mtcars data set

You’ll learn how to combine these plots in the next sections using specific functions.

  • Create a box plot and a dot plot:
# Box plot (bp)
bxp <- ggboxplot(ToothGrowth, x = "dose", y = "len",
                 color = "dose", palette = "jco")
bxp

# Dot plot (dp)
dp <- ggdotplot(ToothGrowth, x = "dose", y = "len",
                color = "dose", palette = "jco", binwidth = 1)
dp

Arrange multiple ggplots on the same page

  • Create an ordered bar plot and a scatter plot:

Create an ordered bar plot. Change the fill color by the grouping variable “cyl”. Sorting is done within each cyl group (sort.by.groups = TRUE).

# Bar plot (bp)
bp <- ggbarplot(mtcars, x = "name", y = "mpg",
          fill = "cyl",           # change fill color by cyl
          color = "white",        # Set bar border colors to white
          palette = "jco",        # jco journal color palette. see ?ggpar
          sort.val = "asc",       # Sort the value in ascending order
          sort.by.groups = TRUE,  # Sort inside each group
          x.text.angle = 90       # Rotate vertically x axis texts
          )
bp + font("x.text", size = 8)

# Scatter plots (sp)
sp <- ggscatter(mtcars, x = "wt", y = "mpg",
          add = "reg.line",               # Add regression line
          conf.int = TRUE,                # Add confidence interval
          color = "cyl", palette = "jco", # Color by groups "cyl"
          shape = "cyl"                   # Change point shape by groups "cyl"
          ) +
  stat_cor(aes(color = cyl), label.x = 3) # Add correlation coefficient
sp

Arrange multiple ggplots on the same page

Arrange on one page

To arrange multiple ggplots on one single page, we’ll use the function ggarrange() [in ggpubr], which is a wrapper around the function plot_grid() [in cowplot]. Compared to the standard plot_grid() function, ggarrange() can arrange multiple ggplots over multiple pages.

ggarrange(bxp, dp, bp + rremove("x.text"),
          labels = c("A", "B", "C"),
          ncol = 2, nrow = 2)

Arrange multiple ggplots on the same page

Alternatively, you can also use the function plot_grid() [in cowplot]:

library("cowplot") plot_grid(bxp, dp, bp + rremove("x.text"), labels = c("A", "B", "C"), ncol = 2, nrow = 2)

or, the function grid.arrange() [in gridExtra]:

library("gridExtra") grid.arrange(bxp, dp, bp + rremove("x.text"), ncol = 2, nrow = 2) Annotate the arranged figure

R function: annotate_figure() [in ggpubr].

figure <- ggarrange(sp, bp + font("x.text", size = 10),
                    ncol = 1, nrow = 2)

annotate_figure(figure,
                top = text_grob("Visualizing mpg", color = "red", face = "bold", size = 14),
                bottom = text_grob("Data source: \n mtcars data set", color = "blue",
                                   hjust = 1, x = 1, face = "italic", size = 10),
                left = text_grob("Figure arranged using ggpubr", color = "green", rot = 90),
                right = "I'm done, thanks :-)!",
                fig.lab = "Figure 1", fig.lab.face = "bold"
                )

Arrange multiple ggplots on the same page

Note that, the function annotate_figure() supports any ggplots.

Align plot panels

A real use case is, for example, when plotting survival curves with the risk table placed under the main plot.

To illustrate this case, we’ll use the survminer package. First, install it using install.packages(“survminer”), then type this:

# Fit survival curves
library(survival)
fit <- survfit(Surv(time, status) ~ adhere, data = colon)

# Plot survival curves
library(survminer)
ggsurv <- ggsurvplot(fit, data = colon,
                     palette = "jco",                        # jco palette
                     pval = TRUE, pval.coord = c(500, 0.4),  # Add p-value
                     risk.table = TRUE                       # Add risk table
                     )
names(ggsurv)
[1] "plot"          "table"         "data.survplot" "data.survtable"

ggsurv is a list including the components:

  • plot: survival curves
  • table: the risk table plot

You can arrange the survival plot and the risk table as follow:

ggarrange(ggsurv$plot, ggsurv$table,
          heights = c(2, 0.7),
          ncol = 1, nrow = 2)

Arrange multiple ggplots on the same page

It can be seen that the axes of the survival plot and the risk table are not aligned vertically. To align them, specify the argument align as follows.

ggarrange(ggsurv$plot, ggsurv$table,
          heights = c(2, 0.7),
          ncol = 1, nrow = 2, align = "v")

Arrange multiple ggplots on the same page

Change column/row span of a plot

Use ggpubr R package

We’ll use nested ggarrange() functions to change column/row span of plots.

For example, using the R code below:

  • the scatter plot (sp) will live in the first row and spans over two columns
  • the box plot (bxp) and the dot plot (dp) will be first arranged and will live in the second row with two different columns
ggarrange(sp,                                                 # First row with scatter plot
          ggarrange(bxp, dp, ncol = 2, labels = c("B", "C")), # Second row with box and dot plots
          nrow = 2,
          labels = "A"                                        # Labels of the scatter plot
          )

Arrange multiple ggplots on the same page

Use cowplot R package

The combination of the functions ggdraw() + draw_plot() + draw_plot_label() [in cowplot] can be used to place graphs at particular locations with a particular size.

ggdraw(). Initialize an empty drawing canvas:

ggdraw()

Note that, by default, coordinates run from 0 to 1, and the point (0, 0) is in the lower left corner of the canvas (see the figure below).

draw_plot(). Places a plot somewhere onto the drawing canvas:

draw_plot(plot, x = 0, y = 0, width = 1, height = 1)
  • plot: the plot to place (ggplot2 or a gtable)
  • x, y: The x/y location of the lower left corner of the plot.
  • width, height: the width and the height of the plot

draw_plot_label(). Adds a plot label to the upper left corner of a graph. It can handle vectors of labels with associated coordinates.

draw_plot_label(label, x = 0, y = 1, size = 16, ...)
  • label: a vector of labels to be drawn
  • x, y: Vector containing the x and y position of the labels, respectively.
  • size: Font size of the label to be drawn

For example, you can combine multiple plots, with particular locations and different sizes, as follow:

library("cowplot") ggdraw() + draw_plot(bxp, x = 0, y = .5, width = .5, height = .5) + draw_plot(dp, x = .5, y = .5, width = .5, height = .5) + draw_plot(bp, x = 0, y = 0, width = 1, height = 0.5) + draw_plot_label(label = c("A", "B", "C"), size = 15, x = c(0, 0.5, 0), y = c(1, 1, 0.5))

Arrange multiple ggplots on the same page

Use gridExtra R package

The function arrangeGrob() [in gridExtra] helps to change the row/column span of a plot.

For example, using the R code below:

  • the scatter plot (sp) will live in the first row and spans over two columns
  • the box plot (bxp) and the dot plot (dp) will live in the second row with two plots in two different columns
library("gridExtra") grid.arrange(sp, # First row with one plot spaning over 2 columns arrangeGrob(bxp, dp, ncol = 2), # Second row with 2 plots in 2 different columns nrow = 2) # Number of rows

Arrange multiple ggplots on the same page

It’s also possible to use the argument layout_matrix in the grid.arrange() function, to create a complex layout.

In the R code below layout_matrix is a 2x2 matrix (2 columns and 2 rows). The first row is all 1s, that’s where the first plot lives, spanning the two columns; the second row contains plots 2 and 3 each occupying one column.

grid.arrange(bp,       # bar plot spanning two columns
             bxp, sp,  # box plot and scatter plot
             ncol = 2, nrow = 2,
             layout_matrix = rbind(c(1,1), c(2,3)))

Arrange multiple ggplots on the same page

Note that, it’s also possible to annotate the output of the grid.arrange() function using the helper function draw_plot_label() [in cowplot].

To easily annotate the grid.arrange() / arrangeGrob() output (a gtable), you should first transform it to a ggplot using the function as_ggplot() [in ggpubr ]. Next you can annotate it using the function draw_plot_label() [in cowplot].

library("gridExtra") library("cowplot") # Arrange plots using arrangeGrob # returns a gtable (gt) gt <- arrangeGrob(bp, # bar plot spaning two columns bxp, sp, # box plot and scatter plot ncol = 2, nrow = 2, layout_matrix = rbind(c(1,1), c(2,3))) # Add labels to the arranged plots p <- as_ggplot(gt) + # transform to a ggplot draw_plot_label(label = c("A", "B", "C"), size = 15, x = c(0, 0, 0.5), y = c(1, 0.5, 0.5)) # Add labels p

Arrange multiple ggplots on the same page

In the above R code, we used arrangeGrob() instead of grid.arrange().

Note that the main difference between these two functions is that grid.arrange() automatically draws the arranged plots.

As we want to annotate the arranged plots before drawing them, the function arrangeGrob() is preferred in this case.

Use grid R package

The grid R package can be used to create a complex layout with the help of the function grid.layout(). It also provides the helper function viewport() to define a region, or viewport, on the layout. The function print() is used to place plots in a specified region.

The different steps can be summarized as follows:

  1. Create plots : p1, p2, p3, ….
  2. Move to a new page on a grid device using the function grid.newpage()
  3. Create a layout 2X2 - number of columns = 2; number of rows = 2
  4. Define a grid viewport : a rectangular region on a graphics device
  5. Print a plot into the viewport
library(grid)
# Move to a new page
grid.newpage()

# Create layout : nrow = 3, ncol = 2
pushViewport(viewport(layout = grid.layout(nrow = 3, ncol = 2)))

# A helper function to define a region on the layout
define_region <- function(row, col){
  viewport(layout.pos.row = row, layout.pos.col = col)
}

# Arrange the plots
print(sp, vp = define_region(row = 1, col = 1:2))  # Span over two columns
print(bxp, vp = define_region(row = 2, col = 1))
print(dp, vp = define_region(row = 2, col = 2))
print(bp + rremove("x.text"), vp = define_region(row = 3, col = 1:2))

Arrange multiple ggplots on the same page

Use common legend for combined ggplots

To place a common unique legend in the margin of the arranged plots, the function ggarrange() [in ggpubr] can be used with the following arguments:

  • common.legend = TRUE: place a common legend in a margin
  • legend: specify the legend position. Allowed values include one of c(“top”, “bottom”, “left”, “right”)
ggarrange(bxp, dp, labels = c("A", "B"),
          common.legend = TRUE, legend = "bottom")

Arrange multiple ggplots on the same page

Scatter plot with marginal density plots

# Scatter plot colored by groups ("Species")
sp <- ggscatter(iris, x = "Sepal.Length", y = "Sepal.Width",
                color = "Species", palette = "jco",
                size = 3, alpha = 0.6) +
  border()

# Marginal density plot of x (top panel) and y (right panel)
xplot <- ggdensity(iris, "Sepal.Length", fill = "Species",
                   palette = "jco")
yplot <- ggdensity(iris, "Sepal.Width", fill = "Species",
                   palette = "jco") +
  rotate()

# Cleaning the plots
yplot <- yplot + clean_theme()
xplot <- xplot + clean_theme()

# Arranging the plot
ggarrange(xplot, NULL, sp, yplot,
          ncol = 2, nrow = 2, align = "hv",
          widths = c(2, 1), heights = c(1, 2),
          common.legend = TRUE)

Arrange multiple ggplots on the same page

Mix table, text and ggplot2 graphs

In this section, we’ll show how to plot a table and text alongside a chart. The iris data set will be used.

We start by creating the following plots:

  1. a density plot of the variable “Sepal.Length”. R function: ggdensity() [in ggpubr]
  2. a plot of the summary table containing the descriptive statistics (mean, sd, … ) of Sepal.Length.
    • R function for computing descriptive statistics: desc_statby() [in ggpubr].
    • R function to draw a textual table: ggtexttable() [in ggpubr].
  3. a plot of a text paragraph. R function: ggparagraph() [in ggpubr].

We finish by arranging/combining the three plots using the function ggarrange() [in ggpubr]

# Density plot of "Sepal.Length"
#::::::::::::::::::::::::::::::::::::::
density.p <- ggdensity(iris, x = "Sepal.Length",
                       fill = "Species", palette = "jco")

# Draw the summary table of Sepal.Length
#::::::::::::::::::::::::::::::::::::::
# Compute descriptive statistics by groups
stable <- desc_statby(iris, measure.var = "Sepal.Length",
                      grps = "Species")
stable <- stable[, c("Species", "length", "mean", "sd")]
# Summary table plot, medium orange theme
stable.p <- ggtexttable(stable, rows = NULL,
                        theme = ttheme("mOrange"))

# Draw text
#::::::::::::::::::::::::::::::::::::::
text <- paste("iris data set gives the measurements in cm",
              "of the variables sepal length and width",
              "and petal length and width, respectively,",
              "for 50 flowers from each of 3 species of iris.",
              "The species are Iris setosa, versicolor, and virginica.", sep = " ")
text.p <- ggparagraph(text = text, face = "italic", size = 11, color = "black")

# Arrange the plots on the same page
ggarrange(density.p, stable.p, text.p,
          ncol = 1, nrow = 3,
          heights = c(1, 0.5, 0.3))

Arrange multiple ggplots on the same page

Insert a graphical element inside a ggplot

The function annotation_custom() [in ggplot2] can be used for adding tables, plots or other grid-based elements within the plotting area of a ggplot. The simplified format is :

annotation_custom(grob, xmin, xmax, ymin, ymax)
  • grob: the external graphical element to display
  • xmin, xmax : x location in data coordinates (horizontal location)
  • ymin, ymax : y location in data coordinates (vertical location)
Place a table within a ggplot

We’ll use the plots - density.p and stable.p - created in the previous section (Mix table, text and ggplot2 graphs).

density.p + annotation_custom(ggplotGrob(stable.p),
                              xmin = 5.5, ymin = 0.7,
                              xmax = 8)

Arrange multiple ggplots on the same page

Place a box plot within a ggplot
  1. Create a scatter plot of y = “Sepal.Width” by x = “Sepal.Length” using the iris data set. R function ggscatter() [ggpubr]
  2. Create separately the box plot of x and y variables with transparent background. R function: ggboxplot() [ggpubr].
  3. Transform the box plots into graphical objects called a “grob” in Grid terminology. R function ggplotGrob() [ggplot2].
  4. Place the box plot grobs inside the scatter plot. R function: annotation_custom() [ggplot2].

As the inset box plot overlaps with some points, a transparent background is used for the box plots.

# Scatter plot colored by groups ("Species")
#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
sp <- ggscatter(iris, x = "Sepal.Length", y = "Sepal.Width",
                color = "Species", palette = "jco",
                size = 3, alpha = 0.6)

# Create box plots of x/y variables
#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# Box plot of the x variable
xbp <- ggboxplot(iris$Sepal.Length, width = 0.3, fill = "lightgray") +
  rotate() +
  theme_transparent()
# Box plot of the y variable
ybp <- ggboxplot(iris$Sepal.Width, width = 0.3, fill = "lightgray") +
  theme_transparent()

# Create the external graphical objects
# called a "grob" in Grid terminology
xbp_grob <- ggplotGrob(xbp)
ybp_grob <- ggplotGrob(ybp)

# Place box plots inside the scatter plot
#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
xmin <- min(iris$Sepal.Length); xmax <- max(iris$Sepal.Length)
ymin <- min(iris$Sepal.Width); ymax <- max(iris$Sepal.Width)
yoffset <- (1/15)*ymax; xoffset <- (1/15)*xmax

# Insert xbp_grob inside the scatter plot
sp + annotation_custom(grob = xbp_grob, xmin = xmin, xmax = xmax,
                       ymin = ymin-yoffset, ymax = ymin+yoffset) +
  # Insert ybp_grob inside the scatter plot
  annotation_custom(grob = ybp_grob,
                    xmin = xmin-xoffset, xmax = xmin+xoffset,
                    ymin = ymin, ymax = ymax)

Arrange multiple ggplots on the same page

Add background image to ggplot2 graphs

Import the background image. Use either the function readJPEG() [in jpeg package] or the function readPNG() [in png package] depending on the format of the background image.

To test the example below, make sure that the png package is installed. You can install it using install.packages(“png”) R command.

# Import the image
img.file <- system.file(file.path("images", "background-image.png"),
                        package = "ggpubr")
img <- png::readPNG(img.file)

Combine a ggplot with the background image. R function: background_image() [in ggpubr].

library(ggplot2)
library(ggpubr)
ggplot(iris, aes(Species, Sepal.Length)) +
  background_image(img) +
  geom_boxplot(aes(fill = Species), color = "white") +
  fill_palette("jco")

Arrange multiple ggplots on the same page

Change box plot fill color transparency by specifying the argument alpha. Value should be in [0, 1], where 0 is full transparency and 1 is no transparency.

library(ggplot2)
library(ggpubr)
ggplot(iris, aes(Species, Sepal.Length)) +
  background_image(img) +
  geom_boxplot(aes(fill = Species), color = "white", alpha = 0.5) +
  fill_palette("jco")

Arrange multiple ggplots on the same page

Another example, overlaying a map of France and a ggplot2 graph:

mypngfile <- download.file("https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/France_Flag_Map.svg/612px-France_Flag_Map.svg.png",
                           destfile = "france.png", mode = 'wb')
img <- png::readPNG('france.png')
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  background_image(img) +
  geom_point(aes(color = Species), alpha = 0.6, size = 5) +
  color_palette("jco") +
  theme(legend.position = "top")

Arrange multiple ggplots on the same page

Arrange over multiple pages

If you have a long list of ggplots, say n = 20 plots, you may want to arrange the plots and to place them on multiple pages. With 4 plots per page, you need 5 pages to hold the 20 plots.

The function ggarrange() [in ggpubr] provides a convenient solution to arrange multiple ggplots over multiple pages. After specifying the arguments nrow and ncol, the function ggarrange() computes automatically the number of pages required to hold the list of the plots. It returns a list of arranged ggplots.

For example the following R code,

multi.page <- ggarrange(bxp, dp, bp, sp, nrow = 1, ncol = 2)

returns a list of two pages with two plots per page. You can visualize each page as follow:

multi.page[[1]] # Visualize page 1
multi.page[[2]] # Visualize page 2

You can also export the arranged plots to a pdf file using the function ggexport() [in ggpubr]:

ggexport(multi.page, filename = "multi.page.ggplot2.pdf")

PDF file: Multi.page.ggplot2


Note that, it’s also possible to use the function marrangeGrob() [in gridExtra] to create a multi-pages output.

library(gridExtra)
res <- marrangeGrob(list(bxp, dp, bp, sp), nrow = 1, ncol = 2)
# Export to a pdf file
ggexport(res, filename = "multi.page.ggplot2.pdf")
# Visualize interactively
res

Nested layout with ggarrange()

We’ll arrange the plots created in the sections Mix table, text and ggplot2 graphs and Create some plots above.

p1 <- ggarrange(sp, bp + font("x.text", size = 9),
                ncol = 1, nrow = 2)
p2 <- ggarrange(density.p, stable.p, text.p,
                ncol = 1, nrow = 3,
                heights = c(1, 0.5, 0.3))
ggarrange(p1, p2, ncol = 2, nrow = 1)

Arrange multiple ggplots on the same page

Export plots

R function: ggexport() [in ggpubr].

First, create a list of 4 ggplots corresponding to the variables Sepal.Length, Sepal.Width, Petal.Length and Petal.Width in the iris data set.

plots <- ggboxplot(iris, x = "Species",
                   y = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                   color = "Species", palette = "jco"
                   )
plots[[1]]  # Print the first plot
plots[[2]]  # Print the second plot, and so on...

Next, you can export the individual plots to a file (pdf, eps or png), one plot per page. It’s also possible to arrange the plots (2 plots per page) when exporting them.

Export individual plots to a pdf file (one plot per page):

ggexport(plotlist = plots, filename = "test.pdf")

Arrange and export. Specify nrow and ncol to display multiple plots on the same page:

ggexport(plotlist = plots, filename = "test.pdf",
         nrow = 2, ncol = 1)

Acknowledgment

We sincerely thank all developers for their efforts behind the packages that ggpubr depends on, namely:

Infos

This analysis has been performed using R software (ver. 3.3.2) and ggpubr (ver. 0.1.4.999).






thinking with data with "Modern Data Science with R"

Wed, 07/26/2017 - 14:39

One of the biggest challenges educators face is how to teach statistical thinking integrated with data and computing skills to allow our students to fluidly think with data.  Contemporary data science requires a tight integration of knowledge from statistics, computer science, mathematics, and a domain of application. For example, how can one model high earnings as a function of other features that might be available for a customer? How do the results of a decision tree compare to a logistic regression model? How does one assess whether the underlying assumptions of a chosen model are appropriate?  How are the results interpreted and communicated? 


While there are a lot of other useful textbooks and references out there (e.g., R for Data Science, Practical Data Science with R, Intro to Data Science with Python) we saw a need for a book that incorporates statistical and computational thinking to solve real-world problems with data.  The result was Modern Data Science with R, a comprehensive data science textbook for undergraduates that features meaty, real-world case studies integrated with modern data science methods.  (Figure 8.2 above was taken from a case study in the supervised learning chapter.)

Part I (introduction to data science) motivates the book and provides an introduction to data visualization, data wrangling, and ethics.  Part II (statistics and modeling) begins with fundamental concepts in statistics, supervised learning, unsupervised learning, and simulation.  Part III (topics in data science) reviews dynamic visualization, SQL, spatial data, text as data, network statistics, and moving towards big data.  A series of appendices cover the mdsr package, an introduction to R, algorithmic thinking, reproducible analysis, multiple regression, and database creation.

We believe that several features of the book are distinctive:

  1. minimal prerequisites: while some background in statistics and computing is ideal, appendices provide an introduction to R, how to write a function, and key statistical topics such as multiple regression
  2. ethical considerations are raised early, to motivate later examples
  3. recent developments in the R ecosystem (e.g., RStudio and the tidyverse) are featured

Rather than focus exclusively on case studies or programming syntax, this book illustrates how statistical programming in R/RStudio can be leveraged to extract meaningful information from a variety of data in the service of addressing compelling statistical questions.  

This book is intended to help readers with some background in statistics and modest prior experience with coding develop and practice the appropriate skills to tackle complex data science projects. We’ve taught a variety of courses using it, ranging from an introduction to data science to a sophomore-level data science course, and as part of a senior capstone class. We’ve made three chapters freely available for download: data wrangling I, data ethics, and an introduction to multiple regression. An instructor’s solution manual is available, and we’re working to create a series of lab activities (e.g., text as data). (The code to generate the above figure can be found in the supervised learning materials at http://mdsr-book.github.io/instructor.html.)


An unrelated note about aggregators:
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.


Oh, the places you’ll go – mapping packages

Wed, 07/26/2017 - 14:24

(This article was first published on Blog, and kindly contributed to R-bloggers)

by Aimée Gott, Senior Consultant

If you follow us on Twitter you might have noticed that Mango has been doing a bit of travelling this summer. He’s been to San Francisco for the EARL conference, Brussels for UseR, as well as a few other places to teach some training courses. He even managed to sneak in a little holiday in LA where he was disappointed to find there are no cats honoured on the Hollywood walk of fame.

It was when we landed in LA that we got talking about the largest airports in the world (me and our marketing manager Karis – don’t worry, I am not talking to stuffed cats just yet!). After a little guessing Google came to the rescue, and I was quite surprised by the airports that topped the list. That conversation stuck with me through the many airports I have since passed through and it came back to me again after seeing James Cheshire give another inspiring talk at LondonR recently.

James gave a great talk full of inspiring visualisations —many map based— that made me think it was about time to dust off my notes on mapping tools in R and have a play. Along the way I got slightly distracted by the data itself but it was a great chance to refresh myself on the pros and cons of a couple of my favourite mapping packages – ggmap and leaflet.

The data

I didn’t want to spend long collecting and cleaning so I used data available on Wikipedia showing the top 50 airports in terms of passenger numbers from 2000 to 2016. Frustratingly, for 2000 to 2004 the data for only 30 airports were available. For this reason, I only used the top 30 for all years.
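
The post doesn’t show the data-collection step, but as a rough illustration (the URL and the table position are assumptions on my part), a Wikipedia ranking table can be pulled into R with rvest along these lines:

library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_busiest_airports_by_passenger_traffic"
busiest <- read_html(url) %>%
  html_nodes("table.wikitable") %>%
  .[[1]] %>%                    # assume the first wikitable holds the ranking
  html_table(fill = TRUE)
head(busiest)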

ggmap package

I started out looking at ggmap. In general, my mapping requirements are not worldwide data but local data —country or region level— so this is a great tool. Because it’s built on top of ggplot2 it meant I didn’t have to learn another tool, but simply needed to extend the capability I already have. Unfortunately, it’s not the best tool for worldwide data as you can’t zoom to show the whole world map. Because of this, I decided to focus on Europe.

With very little effort I was able to use ggmap to obtain the latitude and longitude of the airports in the data, generate a Google map of Europe and overlay the airports, sized based on the number of passengers that passed through them in the given year. The geocode function was in fact what made this task much simpler, as it can find the locations of the airports automatically rather than my having to search for additional data.

I didn’t like the Google map though and decided to play with some of the other options. There are quite a few for the base map in ggmap – some from Google, such as satellite, and some completely different ones like ‘watercolor’, which gives a completely different look that I personally like because it takes away a lot of the additional information that we don’t really need in this particular plot i.e. borders – although I would have liked to see some of the major cities.
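
None of the plotting code made it into the post, so the following is only a sketch of the workflow described above, under stated assumptions: the airports data frame and its passenger figures are made-up placeholders, and recent versions of ggmap require a registered API key for geocode() and Google maps.

library(ggmap)

# Hypothetical airports with illustrative passenger counts
airports <- data.frame(
  airport    = c("Heathrow Airport", "Paris Charles de Gaulle Airport", "Frankfurt Airport"),
  passengers = c(76e6, 66e6, 61e6),
  stringsAsFactors = FALSE
)

# geocode() looks up a latitude and longitude for each airport name
airports <- cbind(airports, geocode(airports$airport))

# A Stamen 'watercolor' base map of Europe, with airports sized by passenger numbers
europe <- get_map(location = "Europe", zoom = 4,
                  source = "stamen", maptype = "watercolor")
ggmap(europe) +
  geom_point(data = airports, aes(x = lon, y = lat, size = passengers),
             colour = "darkblue", alpha = 0.7)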

Leaflet package

Moving on to leaflet I was reminded that I need to use it more often. It’s (almost too) rich in features and I had to stop myself from making my dots appear as planes.

In terms of the code, it uses magrittr to add layers to visualisations and it wasn’t a lot of work to get the data appearing on a world map with pop-ups to identify the airport, year and passenger numbers.

The biggest challenge was setting the relative sizes of the points. Unlike ggmap —which uses ggplot2 and handles the sizing for us— we have to be more specific with leaflet. It took a few runs of the plot to get to a size I was happy with. But once I was at this point I could easily repeat the layer to view other years on the same graphic. This is more clunky as you have to manually filter the data and add the layers but the trade-off is that you can add switches to turn layers on and off.
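
Again, the post doesn’t include the code, so this is just a sketch of how such a map could be built (it assumes the same hypothetical airports data frame as above, extended with lon, lat and a year column), with one group per year and a control to switch layers on and off:

library(leaflet)
library(dplyr)

leaflet() %>%
  addTiles() %>%
  addCircleMarkers(data = filter(airports, year == 2000),
                   lng = ~lon, lat = ~lat,
                   radius = ~passengers / 5e6,  # manual, trial-and-error scaling
                   popup = ~paste(airport, year, passengers),
                   group = "2000") %>%
  addCircleMarkers(data = filter(airports, year == 2016),
                   lng = ~lon, lat = ~lat,
                   radius = ~passengers / 5e6,
                   popup = ~paste(airport, year, passengers),
                   group = "2016") %>%
  addLayersControl(overlayGroups = c("2000", "2016"))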

This was where I got distracted by the data. By being able to see the global data for multiple years in one graphic it was clear that over the last 16 years there has been a shift in the location of the airports that carry the most passengers from North America to Asia.

So, one last graphic took me back to ggplot2. The data has been scaled to account for the fact that air passenger numbers have continued to increase over the last 20 years. Interestingly, there has been a very clear increase in the number of passengers travelling through Asian airports; in fact, in 2016 half of the top 30 airports were located in Asia.
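
That final chart isn’t reproduced in the post either; a minimal sketch of this kind of summary, assuming a long-format data frame with year, region and passengers columns, might be:

library(dplyr)
library(ggplot2)

airports %>%
  group_by(year, region) %>%
  summarise(passengers = sum(passengers)) %>%
  group_by(year) %>%
  mutate(share = passengers / sum(passengers)) %>%  # scale within each year
  ggplot(aes(x = year, y = share, colour = region)) +
  geom_line(size = 1)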

Is there a better package?

To return to the two packages, they both have their strengths and weaknesses, and it would really depend on my needs as to which I would choose for a particular visualisation. For interactivity, I would without a doubt go straight to leaflet; for something more suited to local data or intended for static reporting, ggmap would be my preference.



Cleaner Generic Functions with RCPP_RETURN Macros

Wed, 07/26/2017 - 02:00
TL;DR
  • C++ templates and function overloading are incompatible with R’s C API, so
    polymorphism must be achieved via run-time dispatch, handled explicitly by
    the programmer.
  • The traditional technique for operating on SEXP objects in a generic
    manner entails a great deal of boilerplate code, which can be unsightly,
    unmaintainable, and error-prone.
  • The desire to provide polymorphic functions which operate on vectors
    and matrices is common enough that Rcpp provides the utility macros
    RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX to simplify the process.
  • Subsequently, these macros were extended to handle an (essentially)
    arbitrary number of arguments, provided that a C++11 compiler is used.
Background

To motivate a discussion of polymorphic functions, imagine that we desire a
function (ends) which, given an input vector x and an integer n, returns
a vector containing the first and last n elements of x concatenated.
Furthermore, we require ends to be a single interface which is capable of
handling multiple types of input vectors (integers, floating point values,
strings, etc.), rather than having a separate function for each case. How can
this be achieved?

R Implementation

A naïve implementation in R might look something like this:

ends <- function(x, n = 6L) {
    n <- min(n, length(x) %/% 2)
    c(head(x, n), tail(x, n))
}

ends(1:9)
[1] 1 2 3 4 6 7 8 9

ends(letters, 3)
[1] "a" "b" "c" "x" "y" "z"

ends(rnorm(20), 2)
[1] -0.560476 -0.230177  0.701356 -0.472791

The simple function above demonstrates a key feature of many dynamically-typed
programming languages, one which has undoubtedly been a significant factor in their
rise to popularity: the ability to write generic code with little-to-no
additional effort on the part of the developer. Without getting into a discussion
of the pros and cons of static vs. dynamic typing, it is evident that being able
to dispatch a single function generically on multiple object types, as opposed to,
e.g. having to manage separate implementations of ends for each vector type,
helps us to write more concise, expressive code. Being an article about Rcpp,
however, the story does not end here, and we consider how this problem might
be approached in C++, which has a much more strict type system than R.
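For contrast, here is a small sketch (with hypothetical names, not taken from the original post) of what maintaining per-type implementations would look like in R itself using S3 dispatch; this is exactly the duplication that the single ends() above spares us:

ends2 <- function(x, n = 6L) UseMethod("ends2")
ends2.integer <- function(x, n = 6L) { n <- min(n, length(x) %/% 2); c(head(x, n), tail(x, n)) }
ends2.character <- function(x, n = 6L) { n <- min(n, length(x) %/% 2); c(head(x, n), tail(x, n)) }
# ...and so on for every other type we want to support
ends2.default <- function(x, n = 6L) stop("type not supported")
ends2(1:9)
ends2(letters, 3)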

C++ Implementation(s)

For simplicity, we begin by considering solutions in the context of a “pure”
(re: not called from R) C++ program. Eschewing more complicated tactics
involving run-time dispatch (virtual functions, etc.), the C++ language
provides us with two straightforward methods of achieving this at compile time:

  1. Function Overloading (ad hoc polymorphism)
  2. Templates (parametric polymorphism)

The first case can be demonstrated as follows:

#include <algorithm> #include <iostream> #include <string> #include <vector> typedef std::vector<int> ivec; ivec ends(const ivec& x, std::size_t n = 6) { n = std::min(n, x.size() / 2); ivec res(2 * n); std::copy(x.begin(), x.begin() + n, res.begin()); std::copy(x.end() - n, x.end(), res.begin() + n); return res; } typedef std::vector<double> dvec; dvec ends(const dvec& x, std::size_t n = 6) { n = std::min(n, x.size() / 2); dvec res(2 * n); std::copy(x.begin(), x.begin() + n, res.begin()); std::copy(x.end() - n, x.end(), res.begin() + n); return res; } typedef std::vector<std::string> svec; // and so on... int main() { ivec x, xres; dvec y, yres; for (int i = 0; i < 20; i++) { x.push_back(i); y.push_back(i + 0.5); } xres = ends(x, 4); yres = ends(y); for (std::size_t i = 0; i < xres.size(); i++) { std::cout << xres[i] << "\n"; } for (std::size_t i = 0; i < yres.size(); i++) { std::cout << yres[i] << "\n"; } }

Although the above program meets our criteria, the code duplication is profound.
Being seasoned C++ programmers, we recognize this
as a textbook use case for templates and refactor accordingly:

#include <algorithm> #include <iostream> #include <string> #include <vector> template <typename T> T ends(const T& x, std::size_t n = 6) { n = std::min(n, x.size() / 2); T res(2 * n); std::copy(x.begin(), x.begin() + n, res.begin()); std::copy(x.end() - n, x.end(), res.begin() + n); return res; } typedef std::vector<int> ivec; typedef std::vector<double> dvec; typedef std::vector<std::string> svec; // and so on... int main() { // as before }

This approach is much more maintainable as we have a single implementation
of ends rather than one implementation per typedef. With this in hand, we
now look to make our C++ version of ends callable from R via Rcpp.

Rcpp Implementation (First Attempt)

Many people, myself included, have attempted some variation of the following at
one point or another:

#include <Rcpp.h> // [[Rcpp::export]] template <typename T> T ends(const T& x, std::size_t n = 6) { n = std::min(n, x.size() / 2); T res(2 * n); std::copy(x.begin(), x.begin() + n, res.begin()); std::copy(x.end() - n, x.end(), res.begin() + n); return res; }

Sadly this does not work: magical as Rcpp attributes may be, there are limits
to what they can do, and at least for the time being, translating C++ template
functions into something compatible with R’s C API is out of the question. Similarly,
the first C++ approach from earlier is also not viable, as the C programming
language does not support function overloading. In fact, C does not
support any flavor of type-safe static polymorphism, meaning that our generic
function must be implemented through run-time polymorphism, as touched on in
Kevin Ushey’s Gallery article Dynamic Wrapping and Recursion with Rcpp.
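To make the idea of run-time dispatch a bit more concrete before diving into the C++ version, here is a rough R-level sketch (illustrative only; the real work happens in compiled code) of the same branching on the object's type:

ends_dispatch <- function(x, n = 6L) {
  switch(typeof(x),
    integer = ,
    double = ,
    character = ,
    logical = ,
    complex = {
      n <- min(n, length(x) %/% 2)
      c(head(x, n), tail(x, n))
    },
    {
      warning(sprintf("Invalid type '%s'.", typeof(x)))
      NULL
    }
  )
}
ends_dispatch(1:9)
ends_dispatch(list())   # falls through to the default branch, with a warning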

Rcpp Implementation (Second Attempt)

Armed with the almighty TYPEOF macro and a SEXPTYPE cheatsheet, we
modify the template code like so:

#include <Rcpp.h> using namespace Rcpp; namespace impl { template <int RTYPE> Vector<RTYPE> ends(const Vector<RTYPE>& x, int n) { n = std::min((R_xlen_t)n, x.size() / 2); Vector<RTYPE> res(2 * n); std::copy(x.begin(), x.begin() + n, res.begin()); std::copy(x.end() - n, x.end(), res.begin() + n); return res; } } // impl // [[Rcpp::export]] SEXP ends(SEXP x, int n = 6) { switch (TYPEOF(x)) { case INTSXP: { return impl::ends(as<IntegerVector>(x), n); } case REALSXP: { return impl::ends(as<NumericVector>(x), n); } case STRSXP: { return impl::ends(as<CharacterVector>(x), n); } case LGLSXP: { return impl::ends(as<LogicalVector>(x), n); } case CPLXSXP: { return impl::ends(as<ComplexVector>(x), n); } default: { warning( "Invalid SEXPTYPE %d (%s).\n", TYPEOF(x), type2name(x) ); return R_NilValue; } } } ends(1:9) [1] 1 2 3 4 6 7 8 9 ends(letters, 3) [1] "a" "b" "c" "x" "y" "z" ends(rnorm(20), 2) [1] -1.067824 -0.217975 -0.305963 -0.380471 ends(list()) Warning in ends(list()): Invalid SEXPTYPE 19 (VECSXP). NULL

Some key remarks:

  1. Following the ubiquitous Rcpp idiom, we have converted our ends template to use
    an integer parameter instead of a type parameter. This is a crucial point, and
    later on, we will exploit it to our benefit.
  2. The template implementation is wrapped in a namespace in order to avoid a
    naming conflict; this is a personal preference but not strictly necessary.
    Alternatively, we could get rid of the namespace and rename either the template
    function or the exported function (or both).
  3. We use the opaque type SEXP for our input / output vector since we need a
    single input / output type. In this particular situation, replacing SEXP with
    the Rcpp type RObject would also be suitable as it is a generic class capable
    of representing any SEXP type.
  4. Since we have used an opaque type for our input vector, we must cast it
    to the appropriate Rcpp::Vector type accordingly within each case label. (For
    further reference, the list of vector aliases can be found here). Finally, we could dress each return value in Rcpp::wrap to convert
    the Rcpp::Vector to a SEXP, but it isn’t necessary because Rcpp attributes
    will do this automatically (if possible).

At this point we have a polymorphic function, written in C++, and callable from
R. But that switch statement sure is an eyesore, and it will need to be
implemented every time we wish to export a generic function to R. Aesthetics
aside, a more pressing concern is that boilerplate such as this increases the
likelihood of introducing bugs into our codebase – and since we are leveraging
run-time dispatch, these bugs will not be caught by the compiler. For example,
there is nothing to prevent this from compiling:

// ... case INTSXP: { return impl::ends(as<CharacterVector>(x), n); } // ...

In our particular case, such mistakes likely would not be too disastrous, but
it should not be difficult to see how situations like this can put you (or a
user of your library!) on the fast track to segfault.

Obligatory Remark on Macro Safety

The C preprocessor is undeniably one of the more controversial aspects of the
C++ programming language, as its utility as a metaprogramming tool is rivaled
only by its potential for abuse. A proper discussion of the various pitfalls
associated with C-style macros is well beyond the scope of this article, so
the reader is encouraged to explore this topic on their own. On the bright side,
the particular macros that we will be discussing are sufficiently limited in
complexity and scope that misuse is much more likely to result in a compiler
error than a silent bug, so practically speaking, one can expect a fair bit of
return for relatively little risk.

Synopsis

At a high level, we summarize the RCPP_RETURN macros as follows:

  • There are two separate macros for dealing with vectors and matrices,
    RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX, respectively.
  • In either case, code is generated for the following SEXPTYPEs:
    • INTSXP (integers)
    • REALSXP (numerics)
    • RAWSXP (raw bits)
    • LGLSXP (logicals)
    • CPLXSXP (complex numbers)
    • STRSXP (characters / strings)
    • VECSXP (lists)
    • EXPRSXP (expressions)
  • In C++98 mode, each macro accepts two arguments:
    1. A template function
    2. A SEXP object
  • In C++11 mode (or higher), each macro additionally accepts zero or more
    arguments which are forwarded to the template function.

Finally, the template function must meet the following criteria:

  • It is templated on a single, integer parameter.
  • In the C++98 case, it accepts a single SEXP (or something convertible to
    SEXP) argument.
  • In the C++11 case, it may accept more than one argument, but the first
    argument is subject to the previous constraint.

Examining our templated impl::ends function from the previous section, we see
that it meets the first requirement, but fails the second, due to its second
parameter n. Before exploring how ends might be adapted to meet the (C++98)
template requirements, it will be helpful to demonstrate correct usage with a few
simple examples.

Fixed Return Type

We consider two situations where our input type is generic, but our output
type is fixed:

  1. Determining the length (number of elements) of
    a vector, in which an int is always returned.
  2. Determining the dimensions (number of rows and number of columns)
    of a matrix, in which a length-two IntegerVector is always returned.

First, our len function:

#include <Rcpp.h> using namespace Rcpp; namespace impl { template <int RTYPE> int len(const Vector<RTYPE>& x) { return static_cast<int>(x.size()); } } // impl // [[Rcpp::export]] int len(RObject x) { RCPP_RETURN_VECTOR(impl::len, x); }

(Note that we omit the return keyword, as it is part of the macro definition.)
Testing this out on the various supported vector types:

classes <- c( "integer", "numeric", "raw", "logical", "complex", "character", "list", "expression" ) sapply(seq_along(classes), function(i) { x <- vector(mode = classes[i], length = i) all.equal(len(x), length(x)) }) [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Similarly, creating a generic function that determines the dimensions of an
input matrix is trivial:

#include <Rcpp.h> using namespace Rcpp; namespace impl { template <int RTYPE> Vector<INTSXP> dims(const Matrix<RTYPE>& x) { return Vector<INTSXP>::create(x.nrow(), x.ncol()); } } // impl // [[Rcpp::export]] IntegerVector dims(RObject x) { RCPP_RETURN_MATRIX(impl::dims, x); }

And checking this against base::dim,

classes <- c( "integer", "numeric", "raw", "logical", "complex", "character", "list", "expression" ) sapply(seq_along(classes), function(i) { x <- matrix( vector(mode = classes[i], length = i ^ 2), nrow = i ) all.equal(dims(x), dim(x)) }) [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

everything seems to be in order.

It’s worth pointing out that, for various reasons, it is possible to pass a
matrix object to an Rcpp function which calls RCPP_RETURN_VECTOR:

len(1:9) [1] 9 len(matrix(1:9, 3)) [1] 9

Although this is sensible in the case of len – and even saves us from
implementing a matrix-specific version – there may be situations where
this behavior is undesirable. To distinguish between the two object types we
can rely on the API function Rf_isMatrix:

#include <Rcpp.h> using namespace Rcpp; namespace impl { template <int RTYPE> int len(const Vector<RTYPE>& x) { return static_cast<int>(x.size()); } } // impl // [[Rcpp::export]] int len2(RObject x) { if (Rf_isMatrix(x)) { stop("matrix objects not supported."); } RCPP_RETURN_VECTOR(impl::len, x); } len2(1:9) [1] 9 tryCatch( len2(matrix(1:9, 3)), error = function(e) print(e) )

We don’t have to worry about the opposite scenario, as this is already handled
within Rcpp library code:

tryCatch( dims(1:5), error = function(e) print(e) )

Generic Return Type

In many cases our return type will correspond to our input type. For example,
exposing the Rcpp sugar function rev is trivial:

#include <Rcpp.h> using namespace Rcpp; template <int RTYPE> Vector<RTYPE> Rev(const Vector<RTYPE>& x) { return rev(x); } // [[Rcpp::export]] RObject rev2(RObject x) { RCPP_RETURN_VECTOR(Rev, x); } rev2(1:5) [1] 5 4 3 2 1 rev2(as.list(1:5 + 2i)) [[1]] [1] 5+2i [[2]] [1] 4+2i [[3]] [1] 3+2i [[4]] [1] 2+2i [[5]] [1] 1+2i rawToChar(rev2(charToRaw("abcde"))) [1] "edcba"

As a slightly more complex example, suppose we would like to write a function
to sort matrices which preserves the dimensions of the input, since
base::sort falls short of the latter stipulation:

sort(matrix(c(1, 3, 5, 7, 9, 2, 4, 6, 8), 3)) [1] 1 2 3 4 5 6 7 8 9

There are two obstacles we need to overcome:

  1. The Matrix class does not implement its own sort method. However,
    since Matrix inherits from Vector,
    we can sort the matrix as a Vector and construct the result from this
    sorted data with the appropriate dimensions.
  2. As noted previously, the RCPP_RETURN macros will generate code to handle
    exactly 8 SEXPTYPEs; no less, no more. Some functions, like Vector::sort,
    are not implemented for all eight of these types, so in order to avoid a
    compilation error, we need to add template specializations.

With this in mind, we have the following implementation of msort:

#include <Rcpp.h> using namespace Rcpp; // primary template template <int RTYPE> Matrix<RTYPE> Msort(const Matrix<RTYPE>& x) { return Matrix<RTYPE>( x.nrow(), x.ncol(), clone(x).sort().begin() ); } // template specializations for raw vectors, // lists, and expression vectors // // we can just throw an exception, as base::sort // does the same template <> Matrix<RAWSXP> Msort(const Matrix<RAWSXP>& x) { stop("sort not allowed for raw vectors."); } template <> Matrix<VECSXP> Msort(const Matrix<VECSXP>& x) { stop("sort not allowed for lists."); } template <> Matrix<EXPRSXP> Msort(const Matrix<EXPRSXP>& x) { stop("sort not allowed for expression vectors."); } // [[Rcpp::export]] RObject msort(RObject x) { RCPP_RETURN_MATRIX(Msort, x); }

Note that elements will be sorted in column-major order since we filled our
result using this constructor. We can verify that msort works as intended by checking a few test cases:

(x <- matrix(c(1, 3, 5, 7, 9, 2, 4, 6, 8), 3)) [,1] [,2] [,3] [1,] 1 7 4 [2,] 3 9 6 [3,] 5 2 8 msort(x) [,1] [,2] [,3] [1,] 1 4 7 [2,] 2 5 8 [3,] 3 6 9 sort(x) [1] 1 2 3 4 5 6 7 8 9 (x <- matrix(c("a", "c", "z", "y", "b", "x"), 3)) [,1] [,2] [1,] "a" "y" [2,] "c" "b" [3,] "z" "x" msort(x) [,1] [,2] [1,] "a" "x" [2,] "b" "y" [3,] "c" "z" sort(x) [1] "a" "b" "c" "x" "y" "z" x <- matrix(as.list(1:9), 3); str(x) List of 9 $ : int 1 $ : int 2 $ : int 3 $ : int 4 $ : int 5 $ : int 6 $ : int 7 $ : int 8 $ : int 9 - attr(*, "dim")= int [1:2] 3 3 tryCatch( msort(x), error = function(e) print(e) ) tryCatch( sort(x), error = function(e) print(e) )

Revisiting the ‘ends’ Function

Having familiarized ourselves with basic usage of the RCPP_RETURN macros, we
can return to the problem of implementing our ends function with
RCPP_RETURN_VECTOR. Just to recap the situation, the template function
passed to the macro must meet the following two criteria in C++98 mode:

  1. It is templated on a single, integer parameter (representing the
    Vector type).
  2. It accepts a single SEXP (or convertible to SEXP) argument.

Currently ends has the signature

template <int RTYPE> Vector<RTYPE> ends(const Vector<RTYPE>&, int);

meaning that the first criterion is met, but the second is not. In order to
preserve the functionality provided by the int parameter, we effectively
need to generate a new template function which has access to the user-provided
value at run-time, but without passing it as a function parameter.

The technique we are looking for is called partial function application, and it can be implemented
using one of my favorite C++ tools: the functor. Contrary to typical functor
usage, however, our implementation features a slight twist: rather than
using a template class with a non-template function call operator, as is the
case with std::greater, etc., we are
going to make operator() a template itself:

#include <Rcpp.h> using namespace Rcpp; class Ends { private: int n; public: Ends(int n) : n(n) {} template <int RTYPE> Vector<RTYPE> operator()(const Vector<RTYPE>& x) { n = std::min((R_xlen_t)n, x.size() / 2); Vector<RTYPE> res(2 * n); std::copy(x.begin(), x.begin() + n, res.begin()); std::copy(x.end() - n, x.end(), res.begin() + n); return res; } }; // [[Rcpp::export]] RObject ends(RObject x, int n = 6) { RCPP_RETURN_VECTOR(Ends(n), x); }

Not bad, right? All in all, the changes are fairly minor:

  • The function body of Ends::operator() is identical to that of
    impl::ends.
  • n is now a private data member rather than a function parameter, which
    gets initialized in the constructor.
  • Instead of passing a free-standing template function to RCPP_RETURN_VECTOR,
    we pass the expression Ends(n), where n is supplied at run-time from the
    R session. In turn, the macro will invoke Ends::operator() on the SEXP
    (RObject, in our case), using the specified n value.
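As an aside, the same partial-application idea is easy to picture in R itself: a closure that captures n plays the role of the Ends functor (a conceptual sketch only, not part of the Rcpp code):

make_ends <- function(n) {
  function(x) {
    n <- min(n, length(x) %/% 2)
    c(head(x, n), tail(x, n))
  }
}
ends3 <- make_ends(3)   # 'n' is baked into the returned function
ends3(letters)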

We can demonstrate this on various test cases:

ends(1:9) [1] 1 2 3 4 6 7 8 9 ends(letters, 3) [1] "a" "b" "c" "x" "y" "z" ends(rnorm(20), 2) [1] -0.694707 -0.207917 0.123854 0.215942

A Modern Alternative

As alluded to earlier, a more modern compiler (supporting C++11 or later)
will free us from the “single SEXP argument” restriction, which means
that we no longer have to move additional parameters into a function
object. Here is ends re-implemented using the C++11 version of
RCPP_RETURN_VECTOR (note the // [[Rcpp::plugins(cpp11)]]
attribute declaration):

// [[Rcpp::plugins(cpp11)]] #include <Rcpp.h> using namespace Rcpp; namespace impl { template <int RTYPE> Vector<RTYPE> ends(const Vector<RTYPE>& x, int n) { n = std::min((R_xlen_t)n, x.size() / 2); Vector<RTYPE> res(2 * n); std::copy(x.begin(), x.begin() + n, res.begin()); std::copy(x.end() - n, x.end(), res.begin() + n); return res; } } // impl // [[Rcpp::export]] RObject ends(RObject x, int n = 6) { RCPP_RETURN_VECTOR(impl::ends, x, n); } ends(1:9) [1] 1 2 3 4 6 7 8 9 ends(letters, 3) [1] "a" "b" "c" "x" "y" "z" ends(rnorm(20), 2) [1] 0.379639 -0.502323 0.181303 -0.138891

The current definition of RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX allows for up
to 24 arguments to be passed, although in principle the true upper bound
depends on your compiler’s implementation of the __VA_ARGS__ macro, which
is likely greater than 24. Having said this, if you find yourself trying
to pass around more than 3 or 4 parameters at once, it’s probably time
to do some refactoring.


Navigating the R Package Universe

Wed, 07/26/2017 - 02:00

(This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

Earlier this month, I, along with John Nash, Spencer Graves, and Ludovic Vannoorenberghe, organized a session at useR!2017 focused on discovering, learning about, and evaluating R packages. You can check out the recording of the session.

There are more than 11,000 packages on CRAN, and R users must approach this abundance of packages with effective strategies to find what they need and choose which packages to invest time in learning how to use. Our session centered on this issue, with three themes in our discussion.

Unification

John has been interested in working on wrappers, packages that call other, related packages for a common set of tasks. With a unified wrapper package, a user only has to learn one API but then can use many different implementations for a certain task. John has been particularly involved in numerical optimization techniques and presented possibilities there and beyond.

More generally, and as the session revealed in the breakout discussion, there are opportunities to merge either packages or their functionality. The key issues require, however, human cooperation and some give and take in a realm where egos can take precedence over the efficiency of the R ecosystem.

There were also suggestions that can be interpreted as the unification of the presentation of packages. Overlapping with the “guidance” and “search” themes, these ideas seek to provide selective presentations of packages.

Guidance

Julia explored resources that exist to guide users to packages for certain tasks. R users can turn to long-established resources like CRAN Task Views, or newer options under current development such as the packagemetrics package or the CRANsearcher RStudio add-in. Before useR!, Julia organized a survey about how R users learn about R packages, which informed our discussion.

Search

Spencer has worked on the issue of searching for R functions in his sos package, and led the last section focused specifically on how users can find what they are looking for. Ludovic spoke during this part of our talk about RDocumentation, how it works, how it was built, and how we as a community can use it and contribute to making it more useful.

Moving forward

After the main presentation, we broke out into three smaller sessions focused on these topics for discussion and brainstorming. Both the main session and then our three following breakout sessions were well-attended. We are so happy about the participation from the community we saw, and hope to use people’s enthusiasm and ideas to move forward with some steps that will improve parts of the R ecosystem. In the coming weeks, look for three more blog posts (from me and the other contributors) on these three topics with more details and ideas on projects. Perhaps something will resonate with you and you can get involved!



June 2017 New Package Picks

Wed, 07/26/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Two hundred and thirty-eight new packages were added to CRAN in June. Below are my picks for the “Top 40”, organized into six categories: Biostatistics, Data, Machine Learning, Miscellaneous, Statistics and Utilities. Some packages, including geofacet and secret, already seem to be gaining traction.

Biostatistics

BIGL v1.0.1: Implements response surface methods for drug synergy analysis, including generalized and classical Loewe formulations and the Highest Single Agent methodology. There are vignettes on Methodology and Synergy Analysis.

colorpatch v0.1.2: Provides functions to show color patches for encoding fold changes (e.g., log ratios) and confidence values within a diagram; especially useful for rendering gene expression data and other types of differential experiments. See the vignette.

eesim v0.1.0: Provides functions to create simulated time series of environmental exposures (e.g., temperature, air pollution) and health outcomes for use in power analysis and simulation studies in environmental epidemiology. The vignette gives an overview of the package.

personalized v0.0.2: Provides functions for fitting and validating subgroup identification and personalized medicine models under the general subgroup identification framework of Chen et al. The vignette provides a brief tutorial.

tidygenomics v0.1.0: Provides methods to deal with genomic intervals the “tidy way”. The vignette explains how they work.

Data

alfred v0.1.1: Provides direct access to the ALFRED and FRED databases. The vignette gives a brief example.

CityWaterBalance v0.1.0: Provides functions to retrieve data and estimate unmeasured flows of water through an urban network. Data for US cities can be gathered via web services using this package and dependencies. See the vignette for an introduction to the package.

censusapi v0.2.0: Provides a wrapper for the U.S. Census Bureau APIs that returns data frames of census data and metadata. Available data sets include the Decennial Census, American Community Survey, Small Area Health Insurance Estimates, Small Area Income and Poverty Estimates, and Population Estimates and Projections. There is a brief vignette.

dataverse v0.2.0: Provides access to Dataverse version 4 APIs, enabling data search, retrieval, and deposit. There are four vignettes: Introduction, Search and Discovery, Retrieval and Data Archiving.

data.world v1.1.1: Provides high-level tools for working with data.world data sets. There is a Quickstart Guide and a vignette for writing Queries.

SimMultiCorrData v0.1.0: Provides functions to generate continuous, binary, ordinal, and count variables with a specified correlation matrix that can be used to simulate data sets that mimic real-world situations (e.g., clinical data sets, plasmodes). There are several vignettes including an Overall Workflow for Data Simulation and a Comparison to Other Packages.

tidycensus v0.1.2: Provides an integrated R interface to the decennial US Census and American Community Survey APIs, and the US Census Bureau’s geographic boundary files.

ukbtools v0.9.0: Provides tools to work with UK Biobank datasets. The vignette shows how to get started.

wpp2017 v1.0-1: Provides an interface to data sets from the United Nations’ World Population Prospects 2017.

Machine Learning

cld3 v1.0: Provides an interface to Google’s experimental Compact Language Detector 3 algorithm, a neural network model for language identification that is the successor of cld2.

datafsm v0.2.0: Implements a method that automatically generates models of dynamic decision-making that both have strong predictive power and are interpretable in human terms. The vignette provides an example.

diceR v0.1.0: Provides functions for cluster analysis using an ensemble clustering framework. The vignette shows some examples.

glmertree v0.1-1: Implements recursive partitioning based on (generalized) linear mixed models (GLMMs) combining lmer() and glmer() from lme4 and lmtree() and glmtree() from partykit. The vignette shows an example.

greta v0.2.0: Lets users write statistical models in R and fit them by MCMC on CPUs and GPUs, using Google TensorFlow. There is a website, a Getting Started Guide, and vignettes providing Examples and Technical Details.

penaltyLearning v2017.07.11: Implements algorithms from Learning Sparse Penalties for Change-point Detection using Max Margin Interval Regression. There is a vignette.

SentimentAnalysis v1.2-0: Implements functions to perform sentiment analysis of textual data using various existing dictionaries, such as Harvard IV, or finance-specific dictionaries, and create customized dictionaries. The vignette provides an introduction.

Miscellaneous

convexjlr v0.5.1: Provides a high-level wrapper for Julia package Convex.jl, which makes it easy to describe and solve convex optimization problems. There is a very nice vignette that shows how to optimize the parameters for several machine learning models.

interp v1.0-29: Implements bivariate data interpolation on both regular and irregular grids using either linear methods or splines.

pkggraph v0.2.0: Allows users to interactively explore and plot package dependencies for CRAN.

parallelDist v0.1.1: Provides a parallelized alternative to R’s native dist function to calculate distance matrices for continuous, binary, and multi-dimensional input matrices with support for a broad variety of distance functions from the stats, prox and dtw R packages. The vignette offers some results on performance.

Stats

anchoredDistR v1.0.3: Supplements the MAD# software that implements the Method of Anchored Distributions for inferring geostatistical parameters. There is a vignette.

bssm v01.1-1: Efficient methods for Bayesian inference of state space models via particle Markov chain Monte Carlo and importance sampling type corrected Markov chain Monte Carlo. There is a vignette on Bayesian Inference of State Space Models and an example of a Logistic Growth Model.

factorMerger v0.3.1: Provides a set of tools to support the results of post-hoc testing and to enable extraction of the hierarchical structure of factors. There is an Introduction and vignettes on Cox Regression Factor Merging and Multidimensional Gaussian Merging.

MittagLeffleR v0.1.0: Provides density, distribution, and quantile functions as well as random variate generation for the Mittag-Leffler distribution based on the algorithm by Garrappa. There are short vignettes for the math, distribution functions and random variate generation.

walker v0.2.0: Provides functions for building dynamic Bayesian regression models where the regression coefficients can vary over time as random walks. The vignette shows some examples.

Utilities

charlatan v0.1.0: Provides functions to make fake data, including addresses, person names, dates, times, colors, coordinates, currencies, DOIs, jobs, phone numbers, ‘DNA’ sequences, doubles and integers from distributions and within a range. The Introduction will get you started.

colordistances v0.8.0: Provides functions to load and display images, selectively mask specified background colors, bin pixels by color, quantitatively measure color similarity among images, and cluster images by object color similarity. There is an Introduction and vignettes on Pixel Binning Methods and Color Distance Metrics.

dbplyr v1.1.0: Implements a dplyr back end for databases that allows working with remote database tables as if they were in-memory data frames. There is an Introduction, a vignette for Adding a new DBI backend, and one for SQL translation.
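For a flavour of the workflow (a minimal sketch, assuming the DBI and RSQLite packages are also installed), a remote table can be queried with ordinary dplyr verbs:

library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars, "mtcars")        # stage a toy table in the database
mtcars_db <- tbl(con, "mtcars")       # a remote table, not an in-memory data frame
mtcars_db %>% group_by(cyl) %>% summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>% collect()
DBI::dbDisconnect(con)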

geofacet v0.1.5: Provides geofaceting functionality (the ability to arrange a sequence of plots for different geographical entities into a grid that preserves some geographical orientation) for ggplot2. There is a Package Reference vignette and an Introduction. The package is already getting some traction.

ggformula v0.4.0: Provides a formula interface to ggplot2. There is a vignette explaining how it works.

gqlr v0.0.1: Provides an implementation of the GraphQL query language created by Facebook for describing data requirements on complex application data models. gqlr should be useful for integrating R computations into production applications that use GraphQL.

later v0.3: Allows users to execute arbitrary R or C functions some time after the current time, after the R execution stack has emptied. The vignette shows how to use later from C++.

secret v1.0.0: Allows sharing sensitive information like passwords, API keys, etc., in R packages, using public key cryptography. There is a vignette.

sessioninfo v1.0.0: Provides functions to query and print information about the current R session. It is similar to utils::sessionInfo(), but includes more information.

webglobe v1.0.2: Provides functions to display geospatial data on an interactive 3D globe. There is a vignette.




Passing two SQL queries to sp_execute_external_script

Tue, 07/25/2017 - 22:14

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

Recently, I got a question on one of my previous blog posts asking whether it is possible to pass two queries in the same run as arguments to the external procedure sp_execute_external_script.

Some of the arguments of the procedure sp_execute_external_script are enumerated. This is the case for the input dataset, and as the name of the argument @input_data_1 suggests, one can easily (and quite reasonably) assume that there is also an @input_data_2 argument, and so on. Unfortunately, this is not the case: the external procedure can accept only one T-SQL dataset, passed in through this parameter.

There are many reasons for that; one is the cost of sending several datasets to the external process and back. In effect, this forces the user to rethink and pre-prepare the dataset (meaning, do all the data munging beforehand) prior to sending it into the external procedure.

But there are workarounds on how to pass additional query/queries to sp_execute_external_script. I am not advocating this, and I strongly disagree with such usage, but here it is.

First, I will create two small datasets using T-SQL:

USE SQLR; GO DROP TABLE IF EXISTS dataset; GO CREATE TABLE dataset (ID INT IDENTITY(1,1) NOT NULL ,v1 INT ,v2 INT CONSTRAINT pk_dataset PRIMARY KEY (id) ) SET NOCOUNT ON; GO INSERT INTO dataset(v1,v2) SELECT TOP 1 (SELECT TOP 1 number FROM master..spt_values WHERE type IN ('EOB') ORDER BY NEWID()) AS V1 ,(SELECT TOP 1 number FROM master..spt_values WHERE type IN ('EOD') ORDER BY NEWID()) AS v2 FROM master..spt_values GO 50

This dataset will be used directly in the @input_data_1 argument. The next one will be accessed through R code:

CREATE TABLE external_dataset (ID INT IDENTITY(1,1) NOT NULL ,v1 INT CONSTRAINT pk_external_dataset PRIMARY KEY (id) ) SET NOCOUNT ON; GO INSERT INTO external_dataset(v1) SELECT TOP 1 (SELECT TOP 1 number FROM master..spt_values WHERE type IN ('EOB') ORDER BY NEWID()) AS V1 FROM master..spt_values GO 50

Normally, one would use a single dataset like this:

EXEC sp_execute_external_script @language = N'R' ,@script = N'OutputDataSet <- data.frame(MySet);' ,@input_data_1 = N'SELECT TOP 5 v1, v2 FROM dataset;' ,@input_data_1_name = N'MySet' WITH RESULT SETS (( Val1 INT ,Val2 INT ))

But by “injecting” an ODBC connection into the R code, we can allow the external procedure to connect back to SQL Server and fetch an additional dataset.

This can be done by following:

EXECUTE AS USER = 'RR'; GO DECLARE @Rscript NVARCHAR(MAX) SET @Rscript = ' library(RODBC) myconn <-odbcDriverConnect("driver={SQL Server}; Server=SICN-KASTRUN;database=SQLR;uid=RR;pwd=Read!2$16") External_source <- sqlQuery(myconn, "SELECT v1 AS v3 FROM external_dataset") close(myconn) Myset <- data.frame(MySet) #Merge both datasets mergeDataSet <- data.frame(cbind(Myset, External_source));' EXEC sp_execute_external_script @language = N'R' ,@script = @Rscript ,@input_data_1 = N'SELECT v1, v2 FROM dataset;' ,@input_data_1_name = N'MySet' ,@output_data_1_name = N'mergeDataSet' WITH RESULT SETS (( Val1 INT ,Val2 INT ,Val3 INT )) REVERT; GO

And the result will be the two datasets merged, with three columns in total:

which correspond to two datasets:

-- Check the results! SELECT * FROM dataset SELECT * FROM external_dataset

There are, as already mentioned, several arguments against this approach, and I would not recommend it. Some are:

  • validating and keeping R code in one place
  • performance issues
  • additional costs of data transferring
  • using ODBC connectors
  • installing additional R packages (in my case RODBC package)
  • keeping different datasets in one place
  • security issues
  • additional login/user settings
  • firewall inbound/outbound rules setting

This, of course, can also be achieved with the *.XDF file format, if the data are stored locally or on the server as files.
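For illustration only, here is a sketch of that alternative, assuming RevoScaleR is available to the SQL Server R runtime and that an XDF file has already been staged at a (hypothetical) path on the server; rxDataStep() with no output file returns a data frame:

# inside the @script block, instead of the RODBC round trip
xdf <- RxXdfData("C:\\temp\\external_dataset.xdf")   # hypothetical path
External_source <- rxDataStep(inData = xdf)
mergeDataSet <- data.frame(cbind(MySet, External_source))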

As always, code is available at Github.

Happy R-SQLing!



Effect Size Statistics for Anova Tables #rstats

Tue, 07/25/2017 - 19:20

(This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

My sjstats-package has been updated on CRAN. The past updates introduced new functions for various purposes, e.g. predictive accuracy of regression models or improved support for the marvelous glmmTMB-package. The current update, however, added some ANOVA tools to the package.

In this post, I want to give a short overview of these new functions, which report different effect size measures. These are useful beyond significance tests (p-values), because they estimate the magnitude of effects, independent of sample size. sjstats provides the following functions:

  • eta_sq()
  • omega_sq()
  • cohens_f()
  • anova_stats()

First, we need a sample model:

library(sjstats) # load sample data data(efc) # fit linear model fit <- aov( c12hour ~ as.factor(e42dep) + as.factor(c172code) + c160age, data = efc )

All functions accept objects of class aov or anova, so you can also use model fits from the car package, which allows fitting Anovas with different types of sums of squares. Other objects, like lm, will be coerced to anova internally.

The following functions return the effect size statistic as named numeric vector, using the model’s term names.

Eta Squared

The eta squared is the proportion of the total variability in the dependent variable that is accounted for by the variation in the independent variable. It is the ratio of the sum of squares for each group level to the total sum of squares. It can be interpreted as percentage of variance accounted for by a variable.

For variables with 1 degree of freedom (in the numerator), the square root of eta squared is equal to the correlation coefficient r. For variables with more than 1 degree of freedom, eta squared equals R2. This makes eta squared easily interpretable. Furthermore, these effect sizes can easily be converted into effect size measures that can be, for instance, further processed in meta-analyses.
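As a quick sanity check (a minimal sketch, assuming the fit model defined above), eta squared can also be computed by hand from the ANOVA table, as each term's sum of squares divided by the total sum of squares:

a <- anova(fit)
ss <- a[["Sum Sq"]]
setNames(ss[-nrow(a)] / sum(ss), rownames(a)[-nrow(a)])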

Eta squared can be computed simply with:

eta_sq(fit) #> as.factor(e42dep) as.factor(c172code) c160age #> 0.266114185 0.005399167 0.048441046

Partial Eta Squared

The partial eta squared value is the ratio of the sum of squares for each group level to the sum of squares for each group level plus the residual sum of squares. It is more difficult to interpret, because its value strongly depends on the variability of the residuals. Partial eta squared values should be reported with caution, and Levine and Hullett (2002) recommend reporting eta or omega squared rather than partial eta squared.
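Continuing the hand computation from the sketch above, partial eta squared simply swaps the denominator: each term's sum of squares is divided by that term's sum of squares plus the residual sum of squares:

a <- anova(fit)
ss <- a[["Sum Sq"]]
ss_res <- ss[nrow(a)]
setNames(ss[-nrow(a)] / (ss[-nrow(a)] + ss_res), rownames(a)[-nrow(a)])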

Use the partial-argument to compute partial eta squared values:

eta_sq(fit, partial = TRUE) #> as.factor(e42dep) as.factor(c172code) c160age #> 0.281257128 0.007876882 0.066495448

Omega Squared

While eta squared estimates tend to be biased in certain situations, e.g. when the sample size is small or the independent variables have many group levels, omega squared estimates are corrected for this bias.

Omega squared can be simply computed with:

omega_sq(fit) #> as.factor(e42dep) as.factor(c172code) c160age #> 0.263453157 0.003765292 0.047586841

Cohen’s F

Finally, cohens_f() computes Cohen’s F effect size for all independent variables in the model:

cohens_f(fit) #> as.factor(e42dep) as.factor(c172code) c160age #> 0.62555427 0.08910342 0.26689334

Complete Statistical Table Output

The anova_stats() function takes a model input and computes a comprehensive summary, including the above effect size measures, returned as tidy data frame (as tibble, to be exact):

anova_stats(fit) #> # A tibble: 4 x 11 #> term df sumsq meansq statistic p.value etasq partial.etasq omegasq cohens.f power #> #> 1 as.factor(e42dep) 3 577756.33 192585.444 108.786 0.000 0.266 0.281 0.263 0.626 1.00 #> 2 as.factor(c172code) 2 11722.05 5861.024 3.311 0.037 0.005 0.008 0.004 0.089 0.63 #> 3 c160age 1 105169.60 105169.595 59.408 0.000 0.048 0.066 0.048 0.267 1.00 #> 4 Residuals 834 1476436.34 1770.307 NA NA NA NA NA NA NA

Like the other functions, the input may also be an object of class anova, so you can also use model fits from the car package, which allows fitting Anovas with different types of sums of squares:

anova_stats(car::Anova(fit, type = 3)) #> # A tibble: 5 x 11 #> term sumsq meansq df statistic p.value etasq partial.etasq omegasq cohens.f power #> #> 1 (Intercept) 26851.070 26851.070 1 15.167 0.000 0.013 0.018 0.012 0.135 0.973 #> 2 as.factor(e42dep) 426461.571 142153.857 3 80.299 0.000 0.209 0.224 0.206 0.537 1.000 #> 3 as.factor(c172code) 7352.049 3676.025 2 2.076 0.126 0.004 0.005 0.002 0.071 0.429 #> 4 c160age 105169.595 105169.595 1 59.408 0.000 0.051 0.066 0.051 0.267 1.000 #> 5 Residuals 1476436.343 1770.307 834 NA NA NA NA NA NA NA

References

Levine TR, Hullett CR. Eta Squared, Partial Eta Squared, and Misreporting of Effect Size in Communication Research. Human Communication Research 28(4); 2002: 612–625.

Tagged: anova, R, rstats



R⁶ — General (Attys) Distributions

Tue, 07/25/2017 - 18:15

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Matt @stiles is a spiffy data journalist at the @latimes and he posted an interesting chart on U.S. Attorneys General longevity (given that the current US AG is on thin ice):

Only Watergate and the Civil War have prompted shorter tenures as AG (if Sessions were to leave now). A daily viz: https://t.co/aJ4KDsC5kC pic.twitter.com/ZoiEV3MhGp

— Matt Stiles (@stiles) July 25, 2017

I thought it would be neat (since Matt did the data scraping part already) to look at AG tenure distribution by party, while also pointing out where Sessions falls.

Now, while Matt did scrape the data, it’s tucked away in a JavaScript variable in an iframe on the page that contains his vis.

It’s still easier to get it from there than to re-scrape Wikipedia (as Matt did), thanks to the V8 package by @opencpu.

The following code:

  • grabs the vis iframe
  • extracts and evaluates the target javascript to get a nice data frame
  • performs some factor re-coding (for better grouping and to make it easier to identify Sessions)
  • plots the distributions using the beeswarm quasirandom algorithm
library(V8) library(rvest) library(ggbeeswarm) library(hrbrthemes) library(tidyverse) pg <- read_html("http://mattstiles.org/dailygraphics/graphics/attorney-general-tenure-20172517/child.html?initialWidth=840&childId=pym_0&parentTitle=Chart%3A%20If%20Ousted%2C%20Jeff%20Sessions%20Would%20Have%20a%20Historically%20Short%20Tenure%20%7C%20The%20Daily%20Viz&parentUrl=http%3A%2F%2Fthedailyviz.com%2F2017%2F07%2F25%2Fchart-if-ousted-jeff-sessions-would-have-a-historically-short-tenure%2F") ctx <- v8() ctx$eval(html_nodes(pg, xpath=".//script[contains(., 'DATA')]") %>% html_text()) ctx$get("DATA") %>% as_tibble() %>% readr::type_convert() %>% mutate(party = ifelse(is.na(party), "Other", party)) %>% mutate(party = fct_lump(party)) %>% mutate(color1 = case_when( party == "Democratic" ~ "#313695", party == "Republican" ~ "#a50026", party == "Other" ~ "#4d4d4d") ) %>% mutate(color2 = ifelse(grepl("Sessions", label), "#2b2b2b", "#00000000")) -> ags ggplot() + geom_quasirandom(data = ags, aes(party, amt, color = color1)) + geom_quasirandom(data = ags, aes(party, amt, color = color2), fill = "#ffffff00", size = 4, stroke = 0.25, shape = 21) + geom_text(data = data_frame(), aes(x = "Republican", y = 100, label = "Jeff Sessions"), nudge_x = -0.15, family = font_rc, size = 3, hjust = 1) + scale_color_identity() + scale_y_comma(limits = c(0, 4200)) + labs(x = "Party", y = "Tenure (days)", title = "U.S. Attorneys General", subtitle = "Distribution of tenure in office, by days & party: 1789-2017", caption = "Source data/idea: Matt Stiles ") + theme_ipsum_rc(grid = "XY")

I turned the data into a CSV and stuck it in this gist if folks want to play w/o doing the js scraping.



SQL Server 2017 release candidate now available

Tue, 07/25/2017 - 18:03

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

SQL Server 2017, the next major release of the SQL Server database, has been available as a community preview for around 8 months, but now the first full-featured release candidate is available for public preview. For those looking to do data science with data in SQL Server, there are a number of new features compared to SQL Server 2016 that might be of interest.

SQL Server 2017 Release Candidate 1 is available for download now. For more details on these and other new features in this release, check out the link below.

SQL Server Blog: SQL Server 2017 CTP 2.1 now available




Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-4)

Tue, 07/25/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Statistics is often taught in school by and for people who like mathematics. As a consequence, in those classes the emphasis is put on learning equations, solving calculus problems and creating mathematical models instead of building an intuition for probabilistic problems. But if you are reading this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let's learn how we can tackle useful statistical problems by writing simple R queries and how to think in probabilistic terms.

Until now, in this series of exercise sets, we have used only continuous probability distributions, which are functions defined on all the real numbers of a certain interval. As a consequence, random variables that have those distributions can assume an infinity of values. However, a lot of random situations have only a finite number of possible outcomes, and using a continuous probability distribution to analyze them is not really useful. In today's set, we'll introduce the concept of discrete probability distributions, which can be used in those situations, and some examples of problems in which they can be used.

Answers to the exercises are available here.

Exercise 1
Just as continuous probability distributions are characterized by a probability density function, discrete probability distributions are characterized by a probability mass function, which gives the probability that a random variable is equal to a particular value.

The first probability mass function we will use today is the binomial distribution, which is used to simulate n iterations of a random process that can either result in a success, with a probability of p, or a failure, with a probability of (1-p). Basically, if you want to simulate something like a coin flip, the binomial distribution is the tool you need.
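For instance (a generic illustration with numbers unrelated to the exercise below), the probability of getting exactly 3 heads in 10 flips of a fair coin is:

dbinom(3, size = 10, prob = 0.5)
# [1] 0.1171875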

Suppose you roll a 20-sided die 200 times and you want to know the probability of getting a 20 exactly five times in your rolls. Use the dbinom(n, size, prob) function to compute this probability.

Exercise 2
For the binomial distribution, the possible outcomes are mutually exclusive, meaning that the probability of observing one of two outcomes can be calculated by adding the probabilities of both outcomes. This principle can be generalized to any number of outcomes. For example, the probability of getting three tails or less when you flip a coin 10 times is equal to the probability of getting 0 tails plus the probability of getting 1 tail plus the probability of getting 2 tails plus the probability of getting 3 tails.

Knowing this, use the dbinom() function to compute the probability of getting six correct responses on a test made of 10 true-or-false questions if you answer randomly. Then, use the pbinom() function to compute the cumulative probability of the binomial distribution in that situation.

Exercise 3
Another consequence of this property is that if we know the probability of a set of outcomes, we can compute the probability of one of its subsets by subtracting the probability of the unwanted outcomes. For example, the probability of getting two or three tails when you flip a coin 10 times is equal to the probability of getting at most 3 tails minus the probability of getting at most 1 tail.

Knowing this, compute the probability of getting 6 or more correct answers on the test described in the previous exercise.

Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to:

  • Work with different binomial and logistic regression techniques
  • Know how to compare regression models and choose the right fit
  • And much more

Exercise 4
Let’s say that in an experiment a success is defined as getting a 1 when you roll a 20-sided die. Use the barplot() function to represent the probability of getting from 0 to 10 successes if you roll the die 10 times. What happens to the barplot if you roll a 10-sided die instead? If you roll a 3-sided die?

Exercise 5
Another discrete probability distribution close to the binomial distribution is the Poisson distribution, which gives the probability that a number of events occur during a fixed amount of time if we know the average rate of their occurrence. For example, we could use this distribution to estimate the number of visitors on a website if we know the average number of visitors per second. In this case, we must assume two things: first, that the website has visitors from around the world, so that the rate of visitors is constant throughout the day; and second, that a visitor arriving on the site is not influenced by the previous visitor, since a process can only be described by the Poisson distribution if the events are independent from each other.
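As a generic illustration (with numbers unrelated to the exercise below), the probability of observing exactly 3 events in an interval when the average rate is 2 events per interval is:

dpois(3, lambda = 2)
# [1] 0.180447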

Use the dpois() function to estimate the probability of having 85 visitors on a website in the next hour if on average 80 individuals connect to the site per hour. What is the probability of getting 2000 unique visitors on the website in a day?

Exercise 6
The Poisson distribution can also be used to compute the probability of an event occurring in an amount of space, as long as the unit of the average rate is compatible with the unit of measure of the space you use. Suppose that a fishing boat catches 1/2 ton of fish when its net goes through 5 square kilometers of sea. If the boat combs 20 square kilometers, what is the probability that it catches 5 tons of fish?

Exercise 7
Until now, we have used the Poisson distribution to compute the probability of observing precisely n occurrences of an event. In practice, we are often interested in knowing the probability that an event occurs n times or less. To do so, we can use the ppois() function to compute the cumulative Poisson distribution. If we are interested in the probability of observing strictly more than n occurrences, we can use the same function and set the parameter lower.tail to FALSE.
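For example (again with arbitrary numbers, unrelated to the exercise), the probability of observing strictly more than 3 events at an average rate of 2 can be obtained either with lower.tail = FALSE or as the complement of the cumulative probability:

ppois(3, lambda = 2, lower.tail = FALSE)
# [1] 0.1428765
1 - ppois(3, lambda = 2)
# [1] 0.1428765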

In the situation of exercise 6, what is the probability that the boat catches 5 tons of fish or less? What is the probability that it catches more than 5 tons of fish?

Note that, just as in a binomial experiment, the events in a Poisson process are independent, so you can add or subtract probabilities of events to compute the probability of a particular set of events.
Exercise 8
Draw the Poisson distribution for average rates of 1, 3, 5 and 10.

Exercise 9
The last discrete probability distribution we will use today is the negative binomial distribution, which gives the probability of observing a certain number of successes before observing a fixed number of failures. For example, imagine that a professional football player will retire at the end of the season. This player has scored 495 goals in his career and would really like to reach the 500-goal mark before retiring. If he is set to play 8 games until the end of the season and scores one goal every three games on average, we can use the negative binomial distribution to compute the probability that he will meet his goal in his last game, supposing that he won't score more than one goal per game.

Use the dnbinom() function to compute this probability. In this case, the number of success is 5, the probability of success is 1/3 and the number of failures is 3.

Exercise 10
As for the Poisson distribution, R gives us the option to compute the cumulative negative binomial distribution with the function pnbinom(). Again, the lower.tail parameter gives you the option to compute the probability of observing more than n successes if it is set to FALSE.

In the situation of the last exercise, what is the probability that the football player will score at most 5 goals before the end of his career?

Related exercise sets:
  1. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-2)
  2. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-1)
  3. Lets Begin with something sample
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory


Experiment designs for Agriculture

Tue, 07/25/2017 - 15:33

(This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers)

This post is more for personal use than anything else. It is just a collection of code and functions to produce some of the most used experimental designs in agriculture and animal science. I will not go into details about these designs. If you want to know more about what to use in which situation, you can find material at the following links:

  • Design of Experiments (Penn State): https://onlinecourses.science.psu.edu/stat503/node/5
  • Statistical Methods for Bioscience (Wisconsin-Madison): http://www.stat.wisc.edu/courses/st572-larget/Spring2007/
  • R packages to create several designs are presented here: https://cran.r-project.org/web/views/ExperimentalDesign.html
  • A very good tutorial about the use of the package agricolae can be found here: https://cran.r-project.org/web/packages/agricolae/vignettes/tutorial.pdf

Completely Randomized Design

This is probably the most common design, and it is generally used when conditions are uniform, so we do not need to account for variations due, for example, to soil conditions. In R we can create a simple CRD with the function expand.grid and then some randomization:

TR.Structure = expand.grid(rep=1:3, Treatment1=c("A","B"), Treatment2=c("A","B","C"))
Data.CRD = TR.Structure[sample(1:nrow(TR.Structure),nrow(TR.Structure)),]

Data.CRD = cbind(PlotN=1:nrow(Data.CRD), Data.CRD[,-1])

write.csv(Data.CRD, "CompleteRandomDesign.csv", row.names=F)
The first line creates a basic treatment structure, with rep identifying the replicate number; it looks like this:

> TR.Structure
rep Treatment1 Treatment2
1 1 A A
2 2 A A
3 3 A A
4 1 B A
5 2 B A
6 3 B A
7 1 A B
8 2 A B
9 3 A B
10 1 B B
11 2 B B
12 3 B B
13 1 A C
14 2 A C
15 3 A C
16 1 B C
17 2 B C
18 3 B C

The second line randomizes the whole data.frame to obtain a CRD; then, with cbind, we add a column at the beginning with an ID for each plot, while also dropping the rep column.
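One small addition of my own, not part of the original workflow: fixing the random seed before calling sample() makes the randomized layout reproducible.

# Optional: set a seed so the same CRD layout is produced on every run
set.seed(123)
Data.CRD = TR.Structure[sample(1:nrow(TR.Structure), nrow(TR.Structure)), ]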

Add Control

To add a control we need to write two separate lines, one for the treatment structure and the other for the control:

TR.Structure = expand.grid(rep=1:3, Treatment1=c("A","B"), Treatment2=c("A","B","C"))
CR.Structure = expand.grid(rep=1:3, Treatment1=c("Control"), Treatment2=c("Control"))

Data.CCRD = rbind(TR.Structure, CR.Structure)
This will generate the following table:

> Data.CCRD
rep Treatment1 Treatment2
1 1 A A
2 2 A A
3 3 A A
4 1 B A
5 2 B A
6 3 B A
7 1 A B
8 2 A B
9 3 A B
10 1 B B
11 2 B B
12 3 B B
13 1 A C
14 2 A C
15 3 A C
16 1 B C
17 2 B C
18 3 B C
19 1 Control Control
20 2 Control Control
21 3 Control Control

As you can see, the control rows are kept completely separate from the treatment structure. Now we just need to randomize, again using the function sample:

Data.CCRD = Data.CCRD[sample(1:nrow(Data.CCRD),nrow(Data.CCRD)),]

Data.CCRD = cbind(PlotN=1:nrow(Data.CCRD), Data.CCRD[,-1])

write.csv(Data.CCRD, "CompleteRandomDesign_Control.csv", row.names=F)
Block Design with Control

The starting point is the same as before. The difference comes when we need to randomize: in a CRD we randomize over the entire table, but with blocks we need to do it block by block.

TR.Structure = expand.grid(Treatment1=c("A","B"), Treatment2=c("A","B","C"))
CR.Structure = expand.grid(Treatment1=c("Control"), Treatment2=c("Control"))

Data.CBD = rbind(TR.Structure, CR.Structure)

Block1 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]
Block2 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]
Block3 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]

Data.CBD = rbind(Block1, Block2, Block3)

BlockID = rep(1:3, each=nrow(Block1)) # block number (1, 2 or 3) for each row

Data.CBD = cbind(Block = BlockID, Data.CBD)

write.csv(Data.CBD, "BlockDesign_Control.csv", row.names=F)
As you can see from the code above, we created three objects, one for each block, and used the function sample to randomize each of them separately.
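As a quick sanity check (this snippet is my addition, assuming the Data.CBD object built above), every block should contain each treatment combination, including the control, exactly once, and all blocks should hold the same number of plots:

# Cross-tabulate treatments by block and count plots per block
with(Data.CBD, table(Block, Treatment1, Treatment2))
table(Data.CBD$Block)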

Other Designs with Agricolae

The package agricolae includes many designs, which I am sure will cover all your needs in terms of setting up field and lab experiments. We will look at some of them, so first let's install the package:

install.packages("agricolae")
library(agricolae)

The main syntax for design in agricolae is the following:

Trt1 = c("A","B","C")
design.crd(trt=Trt1, r=3)

The result is the output below:

> design.crd(trt=Trt1, r=3)
$parameters
$parameters$design
[1] "crd"

$parameters$trt
[1] "A" "B" "C"

$parameters$r
[1] 3 3 3

$parameters$serie
[1] 2

$parameters$seed
[1] 1572684797

$parameters$kinds
[1] "Super-Duper"

$parameters[[7]]
[1] TRUE


$book
plots r Trt1
1 101 1 A
2 102 1 B
3 103 2 B
4 104 2 A
5 105 1 C
6 106 3 A
7 107 2 C
8 108 3 C
9 109 3 B

As you can see, the function takes only one argument for treatments and another for replicates. Therefore, if we need a more complex treatment structure, we first have to build it ourselves:

Trt1 = c("A","B","C")
Trt2 = c("1","2")
Trt3 = c("+","-")

TRT.tmp = as.vector(sapply(Trt1, function(x){paste0(x,Trt2)}))
TRT = as.vector(sapply(TRT.tmp, function(x){paste0(x,Trt3)}))
TRT.Control = c(TRT, rep("Control", 3))

As you can see, we now have three treatment factors, which are merged into unique label strings with nested calls to sapply:

> TRT
[1] "A1+" "A1-" "A2+" "A2-" "B1+" "B1-" "B2+" "B2-" "C1+" "C1-" "C2+" "C2-"
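As an aside, an equivalent way to build the same set of labels (my alternative, not from the original post) is to combine expand.grid with paste0; the ordering differs from the sapply version, but the set of treatment labels is identical:

# Alternative construction of the combined treatment labels
AllTrt = expand.grid(Trt1, Trt2, Trt3)
TRT.alt = paste0(AllTrt$Var1, AllTrt$Var2, AllTrt$Var3)
sort(TRT.alt) # same 12 labels as sort(TRT)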

Then we need to include the control, after which we can pass the object TRT.Control to the function design.crd and obtain the data.frame directly with $book:

> design.crd(trt=TRT.Control, r=3)$book
plots r TRT.Control
1 101 1 A2+
2 102 1 B1+
3 103 1 Control
4 104 1 B2+
5 105 1 A1+
6 106 1 C2+
7 107 2 A2+
8 108 1 C2-
9 109 2 Control
10 110 1 B2-
11 111 3 Control
12 112 1 Control
13 113 2 C2-
14 114 2 Control
15 115 1 C1+
16 116 2 C1+
17 117 2 B2-
18 118 1 C1-
19 119 2 C2+
20 120 3 C2-
21 121 1 A2-
22 122 2 C1-
23 123 2 A1+
24 124 3 C1+
25 125 1 B1-
26 126 3 Control
27 127 3 A1+
28 128 2 B1+
29 129 2 B2+
30 130 3 B2+
31 131 1 A1-
32 132 2 B1-
33 133 2 A2-
34 134 1 Control
35 135 3 C2+
36 136 2 Control
37 137 2 A1-
38 138 3 B1+
39 139 3 Control
40 140 3 A2-
41 141 3 A1-
42 142 3 A2+
43 143 3 B2-
44 144 3 C1-
45 145 3 B1-

Other possible designs are:

#Randomized Complete Block Design
design.rcbd(trt=TRT.Control, r=3)$book

#Balanced Incomplete Block Design
design.bib(trt=TRT.Control, r=7, k=3)

#Split-Plot Design
design.split(Trt1, Trt2, r=3, design=c("crd"))

#Latin Square
design.lsd(trt=TRT.tmp)$sketch

Others not included above are: Alpha designs, Cyclic designs, Augmented block designs, Graeco-Latin square designs, Lattice designs, Strip-Plot designs and Incomplete Latin Square designs.

Final Note

For repeated measures and crossover designs, I think we can create the layouts simply by using the function expand.grid again and including time and subjects, as I did in my previous post about Power Analysis. However, there is also the package Crossover, which deals specifically with crossover designs, and on this page you can find more packages that deal with clinical designs: https://cran.r-project.org/web/views/ClinicalTrials.html
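As a rough sketch of that expand.grid idea (the subject count, column names and AB/BA sequences below are hypothetical, not taken from the post): a two-treatment, two-period crossover in which each subject is randomly assigned to an AB or BA sequence.

# Hypothetical two-period crossover layout built with expand.grid()
CO = expand.grid(Subject = 1:10, Period = 1:2)
Sequence = sample(rep(c("AB", "BA"), length.out = 10)) # one sequence per subject
CO$Treatment = ifelse(CO$Period == 1,
                      substr(Sequence[CO$Subject], 1, 1),
                      substr(Sequence[CO$Subject], 2, 2))
CO[order(CO$Subject, CO$Period), ]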

To leave a comment for the author, please follow the link and comment on their blog: R tutorial for Spatial Statistics.

Twitter Coverage of the Bioinformatics Open Source Conference 2017

Tue, 07/25/2017 - 13:55

July 21-22 saw the 18th incarnation of the Bioinformatics Open Source Conference, which generally precedes the ISMB meeting. I had the great pleasure of attending BOSC way back in 2003 and delivering a short presentation on Bioperl. I knew almost nothing in those days, but everyone was very kind and appreciative.

My trusty R code for Twitter conference hashtags pulled out 3268 tweets and without further ado here is:

The ISMB/ECCB meeting wraps today and analysis of Twitter coverage for that meeting will appear here in due course.

Filed under: bioinformatics, meetings, R, statistics Tagged: bosc


The new MODIStsp website (based on pkgdown) is online!

Tue, 07/25/2017 - 11:58

(This article was first published on Spatial Processing in R, and kindly contributed to R-bloggers)

The MODIStsp website, which had lain abandoned on GitHub Pages for several months, recently underwent a major overhaul thanks to pkgdown. The new site is now available at http://lbusett.github.io/MODIStsp/

We hope that the revised website will make it much easier to navigate MODIStsp-related material than either GitHub or the standard CRAN documentation, and will therefore help users exploit MODIStsp functionality more fully and more easily.

The restyling was possible thanks to the very nice "pkgdown" R package (http://hadley.github.io/pkgdown), which makes it easy to create a static documentation website.

Though pkgdown already does quite a good job of creating a bare-bones website from just the material available in a standard devtools-based R package file structure, I was surprised at how easily the default site could be tweaked to obtain a very nice (IMO) final result without needing any particular background in HTML, CSS, etc.!

(On that note, I plan to post a short guide soon with a few tips and tricks I learnt over the last few days for setting up and configuring a pkgdown website, so stay tuned!)


To leave a comment for the author, please follow the link and comment on their blog: Spatial Processing in R.
