R bloggers
R news and tutorials contributed by hundreds of R bloggers

Scico and the Colour Conundrum

Wed, 05/30/2018 - 02:00

(This article was first published on Data Imaginist, and kindly contributed to R-bloggers)

I’m happy to once again announce the release of a package. This time it happens to be a rather unplanned and quick new package, which is fun for a change. The package in question is scico, which provides access to the colour palettes developed by Fabio Crameri, as well as scale functions for ggplot2 so they can be used there. As there is not a lot to talk about in such a simple package, I’ll also spend some time discussing why the choice of colour is important beyond aesthetic considerations, and discuss how the quality of a palette might be assessed.

An overview of the package

scico provides a total of 17 different continuous palettes, all of which are available with the scico() function. For anyone who has used viridis(), the scico() API will feel very familiar:

library(scico)
scico(15, palette = 'oslo')
##  [1] "#000000" "#09131E" "#0C2236" "#133352" "#19456F" "#24588E" "#3569AC"
##  [8] "#4D7CC6" "#668CCB" "#7C99CA" "#94A8C9" "#ABB6C7" "#C4C7CC" "#E1E1E1"
## [15] "#FFFFFF"

In order to get a quick overview of all the available palettes use the scico_palette_show() function:

scico_palette_show()

As can be seen, the collection consists of both sequential and diverging palettes, each of which has its own uses depending on the data you want to show. A special mention goes to the oleron palette, which is intended for topographical height data and produces the well-known atlas look. Be sure to center this palette around 0, or else you will end up with very misleading maps.

ggplot2 support is provided with the scale_[colour|color|fill]_scico() functions, which work as expected:

library(ggplot2)
volcano <- data.frame(
  x = rep(seq_len(ncol(volcano)), each = nrow(volcano)),
  y = rep(seq_len(nrow(volcano)), ncol(volcano)),
  Altitude = as.vector(volcano)
)

ggplot(volcano, aes(x = x, y = y, fill = Altitude)) +
  geom_raster() +
  theme_void() +
  scale_fill_scico(palette = 'turku')
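Relatedly, centring the oleron palette mentioned above on zero can be done by making the scale limits symmetric around 0. A minimal sketch, assuming a hypothetical data frame dem with x, y and height columns (height negative below sea level) and that scale_fill_scico() passes standard continuous-scale arguments such as limits through to ggplot2:

# Hypothetical sketch: centre the diverging 'oleron' palette on zero by using
# symmetric limits; `dem` (with columns x, y, height) is made up for illustration.
lim <- max(abs(dem$height))
ggplot(dem, aes(x = x, y = y, fill = height)) +
  geom_raster() +
  scale_fill_scico(palette = 'oleron', limits = c(-lim, lim))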

This is more or less all there is for this package… Now, let’s get to the meat of the discussion.

What’s in a colour?

If you’ve ever wondered why we are not all just using rainbow colours in our plots (after all, rainbows are pretty…), it’s because our choice of colour scale has a deep impact on what changes in the underlying data our eyes can perceive. The rainbow colour scale is still very common and notoriously bad – see e.g. Borland & Taylor (2007), The Rainbow Colour Map (repeatedly) considered harmful, and How The Rainbow Color Map Misleads – because it fails on two requirements that are fundamental to designing good colour scales: perceptual uniformity and colour-blind safety. Both of these issues have been taken into account when designing the scico palettes, but let’s tackle them one by one:

Colour blindness

Up to 10% of northern European males have the most common type of colour blindness (deuteranomaly, also known as red-green colour blindness), while the number is lower for other population groups. In addition, other, rarer types of colour blindness exist as well. In any case, the chance that a person with a colour vision deficiency will look at your plots is pretty high.

As we have to assume that the plots we produce will be looked at by people with colour vision deficiency, we must make sure that the colours we use to encode data can be clearly read by them (ornamental colours are less important as they – hopefully – don’t impact the conclusion of the graphic). Thankfully, there are ways to simulate how colours are perceived by people with various types of colour blindness. Let’s look at the rainbow colour map:

library(pals)
pal.safe(rainbow, main = 'Rainbow scale')

As can be seen, there are huge areas of the scale where key tints disappear, making it impossible to correctly map colours back to their original data values. Put this in contrast to one of the scico palettes:

pal.safe(scico(100, palette = 'tokyo'), main = 'Tokyo scale')

While colour blindness certainly has an effect here, it is less detrimental, as changes along the scale can still be perceived and the same tint does not occur at multiple places.

Perceptual uniformity

While lack of colour-blind safety “only” affects a subgroup of your audience, lack of perceptual uniformity affects everyone – even you. Behind the slightly highbrow name lies the criterion that equal jumps in the underlying data should result in equal jumps in perceived colour difference. Put another way, every step along the palette should be perceived as giving the same amount of difference in colour.

One way to assess perceptual uniformity is by looking at small oscillations inside the scale. Let’s return to our favourite worst rainbow scale:

pal.sineramp(rainbow, main = 'Rainbow scale')

We can see that there are huge differences in how clearly the oscillations appear along the scale, and around the green area they even disappear. In comparison, the scico palettes produce much more even results:

pal.sineramp(scico(100, palette = 'tokyo'), main = 'Tokyo scale')
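To put rough numbers on this, one can also measure the distance between consecutive palette colours in a roughly perceptual colour space such as CIE Lab, using only base R. This is a quick sketch of the idea (Lab Euclidean distance is only an approximate stand-in for perceived difference, not the metric used to design the palettes):

# Approximate perceptual step sizes: convert a palette to Lab space and look at
# the Euclidean distance between neighbouring colours (roughly equal steps are good).
pal <- scico(10, palette = 'tokyo')
lab <- grDevices::convertColor(t(col2rgb(pal)) / 255, from = "sRGB", to = "Lab")
sqrt(rowSums(diff(lab)^2))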

But wait – there’s more!

This is just a very short overview into the world of colour perception and how it affects information visualisation. The pals package contains more functions to assess the quality of colour palettes, some of which have been collected in an ensemble function:

pal.test(scico(100, palette = 'broc'), main = 'Broc scale')

It also has a vignette that explains in more detail how the different plots can be used to look into different aspects of the palette.

scico is also not the only package that provides well-designed, safe colour palettes. RColorBrewer has been a beloved utility for a long time, as has the more recent viridis. Still, choice is good, and using the same palettes for a prolonged time can make them seem old and boring, so the more the merrier.
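For reference, pulling a handful of colours from each of these families looks like this (RColorBrewer caps the number of colours per palette, while viridis and scico can return any number):

# Nine colours from each family (purely illustrative; any palette names work)
library(RColorBrewer)
library(viridis)
library(scico)

brewer.pal(9, "Blues")
viridis(9)
scico(9, palette = 'oslo')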

A last honourable mention is the overview of palettes in R that Emil Hvitfeldt has put together. Not all of the palettes in it (the lion’s share, actually) have been designed with the issues discussed above in mind, but sometimes that’s OK – at least you now know how to assess the impact of your choice and weigh it against the other considerations you have.

Always be wary of colours


To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist.

User authentication in R Shiny – sneak peek of shiny.users and shiny.admin packages

Wed, 05/30/2018 - 02:00

(This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers)

At Appsilon we frequently build advanced R/Shiny dashboards that need user authentication. I would like to share with you how we implement user management – user accounts and the authorization process – and how we gather usage statistics to get a better understanding of how the app is used.

For user authentication, we created the shiny.users package. For user administration and usage statistics, we created the shiny.admin package. We will show you a sneak peek of how these packages work. They are not released yet – but we opened an Early Adopter Waitlist – sign up and we will contact you as soon as these packages are ready for production testing!

Shiny.users – convenient authentication on the application level

Here is an example of how shiny.users works when you add it to your shiny application:

Below you can view the code of this example:

library(shiny)
library(shiny.users)

demo_users <- list(
  list(
    username = "demo-appsilon",
    password = "QZt85C46Ab",
    roles = list("basic", "admin")
  ),
  list(
    username = "john",
    password = "C4SRx3xJcy",
    roles = list("basic")
  )
)

ui <- shinyUI(fluidPage(
  div(class = "container", style = "padding: 4em",
    login_screen_ui('login_screen'),
    uiOutput("authorized_content")
  )
))

server <- shinyServer(function(input, output) {
  users <- initialize_users(demo_users)
  callModule(login_screen, 'login_screen', users)
  output$authorized_content <- renderUI({
    if (!is.null(users$user())) {
      ... # application content
    }
  })
})

shinyApp(ui, server)

This is a simple demo with users defined in a list. In a real-world scenario, you would store user credentials in a database or authenticate users with an external API. This is also covered in shiny.users.
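As a very rough illustration of what such a lookup could look like, here is a hypothetical sketch; it is not the shiny.users API (which is not public yet), the credentials table and column names are made up, and hashes are assumed to come from sodium::password_store():

# Hypothetical sketch: verify a submitted username/password against a table of
# hashed credentials. Not the shiny.users API; just one way such a check could work.
library(sodium)

check_credentials <- function(credentials, username, password) {
  # `credentials`: data frame with columns username and password_hash,
  # e.g. read from a database; hashes created with sodium::password_store()
  row <- credentials[credentials$username == username, , drop = FALSE]
  if (nrow(row) != 1) return(FALSE)
  password_verify(row$password_hash, password)
}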

In bigger apps with more users, you will also need user management, authorization management (different roles, e.g. admin, writer, reader) and usage statistics. For such advanced use cases, we created the shiny.admin package.

Shiny.admin – user management and usage statistics

This package is an extension to the shiny.users package. When you add it to your Shiny app, you instantly get an admin panel under the URL http://application-url/#!/admin. This panel is only accessible to users with the “admin” role.
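A hypothetical sketch of what role-based gating could look like in plain Shiny, reusing the users$user() reactive from the demo above (the actual shiny.admin mechanism may well differ, since the package is not released yet):

# Hypothetical sketch, not the shiny.admin implementation: only render a piece
# of UI when the logged-in user carries the "admin" role.
output$admin_panel <- renderUI({
  user <- users$user()
  if (!is.null(user) && "admin" %in% unlist(user$roles)) {
    ... # admin-only content
  }
})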

Features of shiny.admin:

  • Adding and removing users
  • Managing existing users – their passwords, roles and metadata
  • Statistics – counting unique users and sessions (similar to Google Analytics)
  • Session recording – recording defined actions, e.g. provided input values and clicked elements. You can view all session details (similar to Hotjar)

We observed huge gains from the above features: we understood exactly how our apps were used by our users. Here are some screenshots from shiny.admin:

I hope you like the idea of the shiny.users and shiny.admin packages. Let us know what you would expect from user accounts and admin panel features. Do you think there is a need for developing such packages? Tell us about your use cases and needs in the comments or contact us at dev@appsilondatascience.com.

Also, don’t forget to sign up to shiny.users and shiny.admin waitlist!





Read the original post at
Appsilon Data Science Blog.


To leave a comment for the author, please follow the link and comment on their blog: Appsilon Data Science Blog.

A recipe for recipes

Tue, 05/29/2018 - 16:15

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

If you build statistical or machine learning models, the recipes package can be useful for data preparation. A recipe object is a container that holds all the steps that should be performed to go from the raw data set to the set that is fed into a model algorithm. Once your recipe is ready it can be executed on a data set at once, performing all the steps. This works not only on the train set on which the recipe was created, but also on new data, such as test sets and data that should be scored by the model. This assures that new data gets the exact same preparation as the train set, and thus can be validly scored by the learned model. The author of recipes, Max Kuhn, has provided abundant material to familiarize yourself with the richness of the package, see here, here, and here. I will not dwell on how to use the package. Rather, I’d like to share what in my opinion is a good way to create new steps and checks for the package1. Use of the package is probably intuitive. Developing new steps and checks, however, does require a little more understanding of the package’s inner workings. With this procedure, or recipe if you like, I hope you will find that adding (and maybe even contributing) your own steps and checks becomes easier and more organized.
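As a minimal sketch of that train/score workflow (train_set and new_set are hypothetical data frames with the same numeric columns, and bake() is shown with the newdata argument as it was named at the time of writing):

# Minimal sketch of the workflow described above; `train_set` and `new_set`
# are hypothetical data frames with identical numeric columns.
library(recipes)

rec <- recipe(train_set) %>%
  step_center(everything()) %>%
  step_scale(everything())

rec_trained <- prep(rec, training = train_set)
baked_train <- bake(rec_trained, newdata = train_set)
baked_new   <- bake(rec_trained, newdata = new_set)  # gets the train set's centering/scaling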

S3 classes in recipes

Let’s build a very simple recipe:

library(tidyverse)
library(recipes)

rec <- recipe(mtcars) %>%
  step_center(everything()) %>%
  step_scale(everything()) %>%
  check_missing(everything())

rec %>% class()
## [1] "recipe"

As mentioned, a recipe is a container for the steps and checks. It is a list of class recipe on which the prep, bake, print and tidy methods do the work as described in their respective documentation files. The steps and checks added to the recipe are stored inside this list. As you can see below, each of them has its own subclass, as well as the generic step or check class.

rec$steps %>% map(class)
## [[1]]
## [1] "step_center" "step"
##
## [[2]]
## [1] "step_scale" "step"
##
## [[3]]
## [1] "check_missing" "check"

Each subclass has the same four methods defined as the recipe class. When one of the methods is called on the recipe, it calls the same method on all the steps and checks that have been added to the recipe. (The exception is prep.recipe, which calls not only prep on all the steps and checks, but also bake.) This means that when implementing a new step or check, you should provide these four methods. Additionally, we’ll need the function that is called by the user to add it to the recipe object, and a constructor (if you are not sure what this is, see 15.3.1 of Advanced R by Hadley Wickham).

The recipes for recipes

When writing a new step or check you will probably be inclined to copy-paste an existing step and start tweaking it from the top. Thus, first writing the function, then the constructor, and then the methods one by one. I think this is suboptimal and can get messy quickly. My preferred way is to start by not bothering about recipes at all, but to write a general function that does the preparation work on a single vector. Once you are happy with this function, you sit and think about which arguments to this function should be provided upfront. These should be added as arguments to the function that is called by the user. Next you think about which function arguments are statistics that should be learned on the train set in the prep part of the recipe. You’ll then go on and do the constructor that is called by both the main function and the prep method. Only then you’ll write the function that is called by the user. Your custom step or check is completed by writing the four methods; since the functionality is already created, these are mostly bookkeeping.

For both checks and steps I have made a skeleton available, in which all the “overhead” that should be in a new step or check is present. This is more convenient than copy-pasting an existing example and trying to figure out what is step-specific and should be erased. You’ll find the skeletons on GitHub. Note that for creating your own steps and checks you should clone the package source code, since the helper functions that are used are not exported by the package.

Putting it to practice

We are going to do two examples in which the recipe for recipes is applied.

Example 1: A signed log

First up is a signed log (inspired by Practical Data Science with R), which takes the log of the absolute value of a variable, multiplied by its original sign, thus enabling us to take logs of negative values. If a variable is between -1 and 1, we’ll set it to 0, otherwise things get messy.

1) preparing the function

This is what the function on a vector could look like, if we did not bother about recipes:

signed_log <- function(x, base = exp(1)) {
  ifelse(abs(x) < 1,
         0,
         sign(x) * log(abs(x), base = base))
}

2) think about the arguments

The only argument of the function is base, which should be provided upfront when adding the step to a recipe object. There are no statistics to be learned in the prep step.

3) the constructor

Now we are going to write the first of the recipes functions. This is the constructor that produces new instances of the object of class step_signed_log. The first four arguments are present in each step or check; they are therefore part of the skeletons. The terms argument will hold the information on which columns the step should be performed. For role, trained and skip, please see the documentation in one of the skeletons. base is step_signed_log-specific, as used in 1). In prep.step_signed_log the tidyselect arguments will be converted to a character vector holding the actual column names. columns will store these names in the step_signed_log object. This container argument would not be necessary if the column names were also present in another way. For instance, step_center has the argument means, which will be populated by the means of the variables of the train set by its prep method. In the bake method the names of the columns to be prepared are already provided by the names of the means, and there is no need to use the columns argument.

step_signed_log_new <- function(terms = NULL, role = NA, skip = FALSE,
                                trained = FALSE, base = NULL, columns = NULL) {
  step(
    subclass = "signed_log",
    terms = terms,
    role = role,
    skip = skip,
    trained = trained,
    base = base,
    columns = columns
  )
}

4) the function to add it to the recipe

Next up is the function that will be called by the user to add the step to its recipe. You’ll see the internal helper function add_step is called. It will expand the recipe with the step_signed_log object that is produced by the constructor we just created.

step_signed_log <- function(recipe, ..., role = NA, skip = FALSE,
                            trained = FALSE, base = exp(1), columns = NULL) {
  add_step(
    recipe,
    step_signed_log_new(
      terms = ellipse_check(...),
      role = role,
      skip = skip,
      trained = trained,
      base = base,
      columns = columns
    )
  )
}

5) the prep method

As recognized in 2), we don’t have to do much in the prep method of this particular step, since the preparation of new sets does not depend on statistics learned on the train set. The only thing we do here is apply the internal terms_select function to replace the tidyselect selections by the actual names of the columns on which step_signed_log should be applied. We call the constructor again, indicating that the step is trained and supplying the column names in the columns argument.

prep.step_signed_log <- function(x, training, info = NULL, ...) {
  col_names <- terms_select(x$terms, info = info)
  step_signed_log_new(
    terms = x$terms,
    role = x$role,
    skip = x$skip,
    trained = TRUE,
    base = x$base,
    columns = col_names
  )
}

6) the bake method

We are now ready to apply the baking function, designed at 1), inside the recipes framework. We loop through the variables, apply the function and return the updated data frame.

bake.step_signed_log <- function(object, newdata, ...) {
  col_names <- object$columns
  for (i in seq_along(col_names)) {
    col <- newdata[[ col_names[i] ]]
    newdata[, col_names[i]] <- ifelse(abs(col) < 1,
                                      0,
                                      sign(col) * log(abs(col), base = object$base))
  }
  as_tibble(newdata)
}

7) the print method

This assures pretty printing of the recipe object to which step_signed_log is added. You use the internal printer function with a message specific for the step.

print.step_signed_log <- function(x, width = max(20, options()$width - 30), ...) {
  cat("Taking the signed log for ", sep = "")
  printer(x$columns, x$terms, x$trained, width = width)
  invisible(x)
}

8) the tidy method

Finally, tidy will add a line for this step to the data frame when the tidy method is called on a recipe.

tidy.step_signed_log <- function(x, ...) {
  if (is_trained(x)) {
    res <- tibble(terms = x$columns)
  } else {
    res <- tibble(terms = sel2char(x$terms))
  }
  res
}

Let’s do a quick check to see if it works as expected:

recipe(data_frame(x = 1)) %>%
  step_signed_log(x) %>%
  prep() %>%
  bake(data_frame(x = -3:3))
## # A tibble: 7 x 1
##            x
##        <dbl>
## 1 -1.0986123
## 2 -0.6931472
## 3  0.0000000
## 4  0.0000000
## 5  0.0000000
## 6  0.6931472
## 7  1.0986123

Example 2: A range check

Model predictions might be invalid when the range of a variable in new data has shifted from the range of the variable in the train set. Let’s do a second example, in which we check if the range of a numeric variable is approximately equal to the range of the variable in the train set. We do so by checking that the variable’s minimum value in the new data is not smaller than its minimum value in the train set, and that its maximum value does not exceed the maximum in the train set. We allow for some slack (a proportion of the variable’s range in the train set) to account for natural variation.
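As a quick worked example of the slack rule: with a train-set range of 0 to 100 and the default slack_prop of 0.05, new values may run from -5 to 105 before the check errors (these are exactly the bounds that show up in the error messages at the end of this example):

# Worked example of the slack computation described above
lower <- 0; upper <- 100; slack_prop <- 0.05
slack <- (upper - lower) * slack_prop   # 5
c(lower - slack, upper + slack)         # -5 105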

1) preparing the function

As mentioned, checks are about throwing informative errors if assumptions are not met. This is a function we could apply on new variables, without bothering about recipes:

range_check_func <- function(x, lower, upper, slack_prop = 0.05, colname = "x") {
  min_x <- min(x)
  max_x <- max(x)
  slack <- (upper - lower) * slack_prop
  if (min_x < (lower - slack) & max_x > (upper + slack)) {
    stop("min ", colname, " is ", min_x, ", lower bound is ", lower - slack, "\n",
         "max x is ", max_x, ", upper bound is ", upper + slack, call. = FALSE)
  } else if (min_x < (lower - slack)) {
    stop("min ", colname, " is ", min_x, ", lower bound is ", lower - slack, call. = FALSE)
  } else if (max_x > (upper + slack)) {
    stop("max ", colname, " is ", max_x, ", upper bound is ", upper + slack, call. = FALSE)
  }
}

2) think about the arguments

The slack_prop is a choice that the user of the check should make upfront. It is thus an argument of check_range. Then there are two statistics to be learned in the prep method: lower and upper. These should be arguments of the function and the constructor as well. However, when calling the function these are always NULL; they are container arguments that will be filled when calling prep.check_range.

3) the constructor

We start again with the four arguments present in every step or check. Subsequently we add the three arguments that we recognized to be part of the check.

check_range_new <- function(terms = NULL, role = NA, skip = FALSE, trained = FALSE,
                            lower = NULL, upper = NULL, slack_prop = NULL) {
  check(subclass = "range",
        terms = terms,
        role = role,
        trained = trained,
        lower = lower,
        upper = upper,
        slack_prop = slack_prop)
}

4) the function to add it to the recipe

As we know by now, it is just calling the constructor and adding it to the recipe.

check_range <- function(recipe, ..., role = NA, skip = FALSE, trained = FALSE,
                        lower = NULL, upper = NULL, slack_prop = 0.05) {
  add_check(
    recipe,
    check_range_new(
      terms = ellipse_check(...),
      role = role,
      skip = skip,
      trained = trained,
      lower = lower,
      upper = upper,
      slack_prop = slack_prop
    )
  )
}

5) the prep method

Here the method is getting a lot more interesting, because we actually have work to do. We are calling vapply on each of the columns the check should be applied on, to derive the minimum and maximum. Again the constructor is called and the learned statistics are now populating the lower and upper arguments.

prep.check_range <- function(x, training, info = NULL, ...) {
  col_names <- terms_select(x$terms, info = info)
  lower_vals <- vapply(training[ , col_names], min, c(min = 1), na.rm = TRUE)
  upper_vals <- vapply(training[ , col_names], max, c(max = 1), na.rm = TRUE)
  check_range_new(
    x$terms,
    role = x$role,
    trained = TRUE,
    lower = lower_vals,
    upper = upper_vals,
    slack_prop = x$slack_prop
  )
}

6) the bake method

The hard work has been done already. We just get the columns on which to apply the check and check them with the function we created at 1).

bake.check_range <- function(object, newdata, ...) {
  col_names <- names(object$lower)
  for (i in seq_along(col_names)) {
    colname <- col_names[i]
    range_check_func(newdata[[ colname ]],
                     object$lower[colname],
                     object$upper[colname],
                     object$slack_prop,
                     colname)
  }
  as_tibble(newdata)
}

7) the print method

print.check_range <- function(x, width = max(20, options()$width - 30), ...) {
  cat("Checking range of ", sep = "")
  printer(names(x$lower), x$terms, x$trained, width = width)
  invisible(x)
}

8) the tidy method

tidy.check_range <- function(x, ...) {
  if (is_trained(x)) {
    res <- tibble(terms = x$columns)
  } else {
    res <- tibble(terms = sel2char(x$terms))
  }
  res
}

Again, we check quickly if it works

cr1 <- data_frame(x = 0:100)
cr2 <- data_frame(x = -1:101)
cr3 <- data_frame(x = -6:100)
cr4 <- data_frame(x = 0:106)

recipe_cr <- recipe(cr1) %>% check_range(x) %>% prep()

cr2_baked <- recipe_cr %>% bake(cr2)
cr3_baked <- recipe_cr %>% bake(cr3)
## Error: min x is -6, lower bound is -5
cr4_baked <- recipe_cr %>% bake(cr4)
## Error: max x is 106, upper bound is 105

Conclusion

If you would like to add your own data preparation steps and data checks to the recipes package, I advise you to do this in a structured way so you are not distracted by the bookkeeping while implementing the functionality. I propose eight subsequent parts to develop a new step or check:

1) Create a function on a vector that could be applied in the bake method, but does not bother about recipes yet.
2) Recognize which function arguments should be provided upfront and which should be learned in the prep method.
3) Create a constructor in the form step_<name>_new or check_<name>_new.
4) Create the actual function to add the step or check to a recipe, in the form step_<name> or check_<name>.
5) Write the prep method.
6) Write the bake method.
7) Write the print method.
8) Write the tidy method.

As mentioned, the source code is maintained here. Make sure to clone the latest version to access the package internal functions.

  1. Together with Max I added the checks framework to the package. Where steps are transformations of variables, checks are assertions about them. If a check passes, nothing happens. If it fails, it will break the bake method of the recipe and throw an informative error. 


To leave a comment for the author, please follow the link and comment on their blog: That’s so Random.

EARL Boston 2018 Keynote Speaker announcement: Bob Rudis

Tue, 05/29/2018 - 16:06

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Mango Solutions are delighted to announce our Keynote Speaker for the EARL Boston Conference: Bob Rudis, [Master] Chief Data Scientist at Rapid7.

Bob has more than 20 years’ experience using data to help defend global Fortune 100 companies, and in his current role at Rapid7 he specializes in research on internet-scale exposure. He was formerly a Security Data Scientist and Managing Principal at Verizon, overseeing the team that produces the annual Data Breach Investigations Report.

If you’re on Twitter, you’ll know Bob as a serial tweeter (he’s even got the blue tick) – if you don’t already follow him, you can find him at @hrbrmstr. He’s also an avid blogger at rud.is, author (Data-Driven Security) and speaker.

As a prominent and avuncular participant of the twitter #rstats conversation and prolific package author, Bob is a respected and valued member of the global R community.

We’re looking forward to seeing him speak on 13 November in Boston!

Take the stage with Bob

Abstract submissions for the US Roadshow close on 30 April 2018. You could be on the agenda with Bob in Boston as one of our speakers if you would like to share the R successes in your organisation.

Submit your abstract here.

Early bird tickets now available

Tickets for all EARL Conferences are now available:
London: 11-13 September
Seattle: 7 November
Houston: 9 November
Boston: 13 November


To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.

The Fix Is In: Finding infix functions inside contributed R package “utilities” files

Tue, 05/29/2018 - 14:42

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Regular readers will recall the “utility belt” post from back in April of this year. This is a follow-up to a request asking for a list of all the % infix functions in those files.

We’re going to:

  • collect up all of the sources
  • parse them
  • find all the definitions of % infix functions
  • write them to a file

We’ll start by grabbing the data from the previous post and look at it as a refresher:

library(stringi)
library(tidyverse)

utils <- read_rds(url("https://rud.is/dl/utility-belt.rds"))

utils
## # A tibble: 1,746 x 13
##    permsissions links owner     group     size  month day  year_hr path   date       pkg    fil   file_src
##  1 -rw-r--r--       0 hornik    users      1658 Jun   05   2016    AHR/R… 2016-06-05 AHR    util… "## \\int f(x)dg(x) …
##  2 -rw-r--r--       0 ligges    users     12609 Dec   13   2016    ALA4R… 2016-12-13 ALA4R  util… "## some utility fun…
##  3 -rw-r--r--       0 hornik    users         0 Feb   24   2017    AWR.K… 2017-02-24 AWR.…  util… ""
##  4 -rw-r--r--       0 ligges    users      4127 Aug   30   2017    Alpha… 2017-08-30 Alph…  util… "#\n#' Assign API ke…
##  5 -rw-r--r--       0 ligges    users       121 Jan   19   2017    Amylo… 2017-01-19 Amyl…  util… "make_decision <- fu…
##  6 -rw-r--r--       0 herbrandt herbrandt    52 Aug   10   2017    BANES… 2017-08-10 BANE…  util… "#' @importFrom dply…
##  7 -rw-r--r--       0 ripley    users     36977 Jan   06   2015    BEQI2… 2015-01-06 BEQI2  util… "#' \tRemove Redunda…
##  8 -rw-r--r--       0 hornik    users     34198 May   10   2017    BGDat… 2017-05-10 BGDa…  util… "# A more memory-eff…
##  9 -rwxr-xr-x       0 ligges    users      3676 Aug   14   2016    BGLR/… 2016-08-14 BGLR   util… "\n readBinMat=funct…
## 10 -rw-r--r--       0 ripley    users      2547 Feb   04   2015    BLCOP… 2015-02-04 BLCOP  util… "###################…
## # ... with 1,736 more rows

Note that we somewhat expected the file source to potentially come in handy at a later date and also expected the need to revisit that post, so the R data file [←direct link to RDS] included a file_src column.

Now, let’s find all the source files with at least one infix definition, collect them together and parse them so we can do more code spelunking:

filter(utils, stri_detect_fixed(file_src, "`%")) %>% # only find sources with infix definitions
  pull(file_src) %>%
  paste0(collapse="\n\n") %>%
  parse(text = ., keep.source=TRUE) -> infix_src

str(infix_src, 1)
## length 1364 expression(dplyr::`%>%`, `%||%` <- function(a, b) if (is.null(a)) b else a, get_pkg_path <- function(ctx) { pkg_| __truncated__ ...
##  - attr(*, "srcref")=List of 1364
##  - attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile'
##  - attr(*, "wholeSrcref")= 'srcref' int [1:8] 1 0 15768 0 0 0 1 15768
##   ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile'

We can now take all of that lovely parsed source and tokenize it to work with the discrete elements in a very tidy manner:

infix_parsed <- tbl_df(getParseData(infix_src)) # tbl_df() is mainly for pretty printing

infix_parsed
## # A tibble: 118,242 x 9
##    line1  col1 line2  col2    id parent token          terminal text
##  1     1     1     1    24     1    -10 COMMENT        TRUE     #' @impor…
##  2     2     1     2    10     4    -10 COMMENT        TRUE     #' @export
##  3     3     1     3    12    10      0 expr           FALSE    ""
##  4     3     1     3     5     7     10 SYMBOL_PACKAGE TRUE     dplyr
##  5     3     6     3     7     8     10 NS_GET         TRUE     ::
##  6     3     8     3    12     9     10 SYMBOL         TRUE     `%>%`
##  7     5     1     5    49    51      0 expr           FALSE    ""
##  8     5     1     5     6    16     18 SYMBOL         TRUE     `%||%`
##  9     5     1     5     6    18     51 expr           FALSE    ""
## 10     5     8     5     9    17     51 LEFT_ASSIGN    TRUE     <-
## # ... with 118,232 more rows

We just need to find a sequence of tokens that make up a function definition, then whittle those down to ones that look like our % infix names:

pat <- c("SYMBOL", "expr", "LEFT_ASSIGN", "expr", "FUNCTION") # pattern for function definition # find all of ^^ sequences (there's a good twitter discussion on this abt a month ago) idx <- which(infix_parsed$token == pat[1]) # find location of match of start of seq # look for the rest of the sequences starting at each idx position map_lgl(idx, ~{ all(infix_parsed$token[.x:(.x+(length(pat)-1))] == pat) }) -> found f_defs <- idx[found] # starting indices of all the places where functions are defined # filter ^^ to only find infix ones infix_defs <- f_defs[stri_detect_regex(infix_parsed$text[f_defs], "^`\\%")] # there aren't too many, but remember we're just searching `util` functions length(infix_defs) ## [1] 106

Now, write it out to a file so we can peruse the infix functions:

# nuke a file and fill it with the function definition
cat("", sep="", file="infix_functions.R")

walk2(
  getParseText(infix_parsed, infix_parsed$id[infix_defs]),     # extract the infix name
  getParseText(infix_parsed, infix_parsed$id[infix_defs + 3]), # extract the function definition body
  ~{
    cat(.x, " <- ", .y, "\n\n", sep="", file="infix_functions.R", append=TRUE)
  }
)

There are 106 of them so you can find the extracted ones in this gist.

Here's an overview of what you can expect to find:

# A tibble: 39 x 2
   name              n
 1 `%||%`           47
 2 `%+%`             7
 3 `%AND%`           4
 4 `%notin%`         4
 5 `%:::%`           3
 6 `%==%`            3
 7 `%!=%`            2
 8 `%*diag%`         2
 9 `%diag*%`         2
10 `%nin%`           2
11 `%OR%`            2
12 `%::%`            1
13 `%??%`            1
14 `%.%`             1
15 `%@%`             1
16 `%&&%`            1
17 `%&%`             1
18 `%+&%`            1
19 `%++%`            1
20 `%+|%`            1
21 `%<<%`            1
22 `%>>%`            1
23 `%~~%`            1
24 `%assert_class%`  1
25 `%contains%`      1
26 `%din%`           1
27 `%fin%`           1
28 `%identical%`     1
29 `%In%`            1
30 `%inr%`           1
31 `%M%`             1
32 `%notchin%`       1
33 `%or%`            1
34 `%p%`             1
35 `%pin%`           1
36 `%R%`             1
37 `%s%`             1
38 `%sub_in%`        1
39 `%sub_nin%`       1

FIN
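The post doesn't show how this summary was produced; one plausible way to get it from the objects built above (an assumption, not necessarily the author's own code) is to tabulate the extracted names:

# Plausible sketch: count how often each extracted infix name occurs
tibble(name = getParseText(infix_parsed, infix_parsed$id[infix_defs])) %>%
  count(name, sort = TRUE)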

If any of those are useful, feel free to PR them in to https://github.com/hrbrmstr/freebase/blob/master/inst/templates/infix-helpers.R (and add yourself to the DESCRIPTION if you do).

Hopefully this provided some further inspiration to continue to use R not only as your language of choice but also as a fun data source.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

Daily Streamflow Trend Analysis

Tue, 05/29/2018 - 02:00

(This article was first published on The USGS OWI blog , and kindly contributed to R-bloggers)

Introduction

This document describes how to produce a set of graphics and perform the associated statistical tests that describe trends in daily streamflow at a single streamgage. The trends depicted cover the full range of quantiles of the streamflow distribution, from the lowest discharge of the year, through the median discharge, up through the highest discharge of the year. The method is built around the R package EGRET (Exploration and Graphics for RivEr Trends). It makes it possible to consider trends over any portion of the time period for which there are daily streamflow records, and any period of analysis (the portion of the year to be considered, e.g. the entire year or just the summer months).

Getting Started

First, the EGRET package needs to be installed and loaded. Then, you’ll need to create an eList object, which contains site information and daily discharge data for a streamgage. See page 13 of the EGRET user guide here.

For this post, we will use the Choptank River in Maryland as our first example. There is an example data set included in EGRET. The data set consists of site information, daily discharge data, and water quality data, but this application does not use the water quality data. Refer to the section near the end of this document called Downloading the data for your site of interest to see how you can set up an eList object for any USGS streamgage.

There are two limitations that users should know about this application before proceeding any further.

  1. The code was designed for discharge records that are complete (no gaps). This is usually not a problem with USGS discharge records unless there is a major gap (typically years in length) when the streamgage was not operating. If records have a number of small gaps, users could use some established method for filling in missing data to create a gap-free record.

  2. The discharge on every day should be a positive value (not zero or negative). The EGRET code that is used here to read in new data has a workaround for situations where there are a very small number of non-positive discharge values: it adds a small constant to all the discharge data so they will all be positive. This should have almost no impact on the results provided the number of non-positive days is very small, say less than 0.1% of all the days. That translates to about 11 days out of 30 years. For data sets with more zero or negative flow days, different code would need to be written (we would appreciate it if a user could work on developing such a set of code).
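To make the idea in point 2 concrete, here is an illustration-only sketch of such an adjustment; EGRET's data-reading code applies its own version of this internally, and the constant used here is made up:

# Illustration only, not EGRET's exact rule: shift the whole record by a small
# constant when a handful of non-positive values are present.
Q <- daily$Q                       # hypothetical vector of daily discharge values
if (any(Q <= 0)) {
  frac_bad <- mean(Q <= 0)         # should be well under 0.001 (0.1% of days)
  Q <- Q + 0.01 * mean(Q[Q > 0])   # small positive shift so all values are > 0
}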

To start, the following R commands are needed.

library(EGRET)
eList <- Choptank_eList

Just to get some sense of the data, we will look at a portion of the metadata (gage ID number, name, and drainage area in square kilometers) and also see a summary of the discharge data (discharge in cubic meters per second).

print(eList$INFO$site.no)
## [1] "01491000"
print(eList$INFO$shortName)
## [1] "Choptank River"
print(eList$INFO$drainSqKm)
## [1] 292.6687
print(summary(eList$Daily$Date))
##         Min.      1st Qu.       Median         Mean      3rd Qu.
## "1979-10-01" "1987-09-30" "1995-09-30" "1995-09-30" "2003-09-30"
##         Max.
## "2011-09-30"
print(summary(eList$Daily$Q))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
##   0.00991   0.93446   2.40693   4.08658   4.61565 246.35656

Loading the necessary packages and other R code

To run the analysis and produce the graphs you will need a few R functions in addition to the EGRET package. You can copy the entire block of code shown below and paste it into your workspace (all as a single copy and paste), or you can create an .R file from the code that you will source each time you want to use it. Let’s say you call it flowTrends.R; then each time you want to use it you need to “source” the file flowTrends.R.

In addition to EGRET the functions use the packages rkt, zyp, lubridate, and Kendall. Make sure you have installed all of them.


library(rkt) library(zyp) library(lubridate) ######################################################################################### ########## this is the function you will use to make a single trend graph ############## ######################################################################################### plotFlowTrend <- function (eList, istat, startDate = NA, endDate = NA, paStart = 4, paLong = 12, window = 30, qMax = NA, printTitle = TRUE, tinyPlot = FALSE, customPar = FALSE, runoff = FALSE, qUnit = 2, printStaName = TRUE, printPA = TRUE, printIstat = TRUE, cex = 0.8, cex.axis = 1.1, cex.main = 1.1, lwd = 2, col = "black", ...){ localDaily <- getDaily(eList) localINFO <- getInfo(eList) localINFO$paStart <- paStart localINFO$paLong <- paLong localINFO$window <- window start <- as.Date(startDate) end <- as.Date(endDate) if(is.na(startDate)){ start <- as.Date(localDaily$Date[1]) } if(is.na(endDate)){ end <- as.Date(localDaily$Date[length(localDaily$Date)]) } localDaily <- subset(localDaily, Date >= start & Date <= end) eList <- as.egret(localINFO,localDaily) localAnnualSeries <- makeAnnualSeries(eList) qActual <- localAnnualSeries[2, istat, ] qSmooth <- localAnnualSeries[3, istat, ] years <- localAnnualSeries[1, istat, ] Q <- qActual time <- years LogQ <- log(Q) mktFrame <- data.frame(time,LogQ) mktFrame <- na.omit(mktFrame) mktOut <- rkt::rkt(mktFrame$time,mktFrame$LogQ) zypOut <- zyp::zyp.zhang(mktFrame$LogQ,mktFrame$time) slope <- mktOut$B slopePct <- 100 * (exp(slope)) - 100 slopePct <- format(slopePct,digits=2) pValue <- zypOut[6] pValue <- format(pValue,digits = 3) if (is.numeric(qUnit)) { qUnit <- qConst[shortCode = qUnit][[1]] } else if (is.character(qUnit)) { qUnit <- qConst[qUnit][[1]] } qFactor <- qUnit@qUnitFactor yLab <- qUnit@qUnitTiny if (runoff) { qActual <- qActual * 86.4/localINFO$drainSqKm qSmooth <- qSmooth * 86.4/localINFO$drainSqKm yLab <- "Runoff in mm/day" } else { qActual <- qActual * qFactor qSmooth <- qSmooth * qFactor } localSeries <- data.frame(years, qActual, qSmooth) yInfo <- generalAxis(x = qActual, maxVal = qMax, minVal = 0, tinyPlot = tinyPlot) xInfo <- generalAxis(x = localSeries$years, maxVal = decimal_date(end), minVal = decimal_date(start), padPercent = 0, tinyPlot = tinyPlot) line1 <- localINFO$shortName nameIstat <- c("minimum day", "7-day minimum", "30-day minimum", "median daily", "mean daily", "30-day maximum", "7-day maximum", "maximum day") line2 <- paste0("\n", setSeasonLabelByUser(paStartInput = paStart, paLongInput = paLong), " ", nameIstat[istat]) line3 <- paste0("\nSlope estimate is ",slopePct,"% per year, Mann-Kendall p-value is ",pValue) if(tinyPlot){ title <- paste(nameIstat[istat]) } else { title <- paste(line1, line2, line3) } if (!printTitle){ title <- "" } genericEGRETDotPlot(x = localSeries$years, y = localSeries$qActual, xlim = c(xInfo$bottom, xInfo$top), ylim = c(yInfo$bottom, yInfo$top), xlab = "", ylab = yLab, customPar = customPar, xTicks = xInfo$ticks, yTicks = yInfo$ticks, cex = cex, plotTitle = title, cex.axis = cex.axis, cex.main = cex.main, tinyPlot = tinyPlot, lwd = lwd, col = col, ...) 
lines(localSeries$years, localSeries$qSmooth, lwd = lwd, col = col) } ######################################################################################### ###### this the the function you will use to make the Quantile Kendall Plot ############# ######################################################################################### plotQuantileKendall <- function(eList, startDate = NA, endDate = NA, paStart = 4, paLong = 12, legendLocation = "topleft", legendSize = 1.0, yMax = NA, yMin = NA) { localDaily <- eList$Daily localINFO <- eList$INFO localINFO$paStart <- paStart localINFO$paLong <- paLong start <- as.Date(startDate) end <- as.Date(endDate) if(is.na(startDate)){ start <- as.Date(localDaily$Date[1]) } if(is.na(endDate)){ end <- as.Date(localDaily$Date[length(localDaily$Date)]) } localDaily <- subset(localDaily, Date >= start & Date <= end) eList <- as.egret(localINFO,localDaily) eList <- setPA(eList, paStart=paStart, paLong=paLong) v <- makeSortQ(eList) sortQ <- v[[1]] time <- v[[2]] results <- trendSortQ(sortQ, time) pvals <- c(0.001,0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99,0.999) zvals <- qnorm(pvals) name <- eList$INFO$shortName # ymax <- trunc(max(results$slopePct)*10) # ymax <- max(ymax + 2, 5) # ymin <- floor(min(results$slopePct)*10) # ymin <- min(ymin - 2, -5) # yrange <- c(ymin/10, ymax/10) # yticks <- axisTicks(yrange, log = FALSE) ymax <- max(results$slopePct + 0.5, yMax, na.rm = TRUE) ymin <- min(results$slopePct - 0.5, yMin, na.rm = TRUE) yrange <- c(ymin, ymax) yticks <- axisTicks(yrange, log = FALSE, nint =7) p <- results$pValueAdj color <- ifelse(p <= 0.1,"black","snow3") color <- ifelse(p < 0.05, "red", color) pvals <- c(0.001,0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99,0.999) zvals <- qnorm(pvals) name <- paste0("\n", eList$INFO$shortName,"\n", start," through ", end, "\n", setSeasonLabelByUser(paStartInput = paStart, paLongInput = paLong)) plot(results$z,results$slopePct,col = color, pch = 20, cex = 1.0, xlab = "Daily non-exceedance probability", ylab = "Trend slope in percent per year", xlim = c(-3.2, 3.2), ylim = yrange, yaxs = "i", las = 1, tck = 0.02, cex.lab = 1.2, cex.axis = 1.2, axes = FALSE, frame.plot=TRUE) mtext(name, side =3, line = 0.2, cex = 1.2) axis(1,at=zvals,labels=pvals, las = 1, tck = 0.02) axis(2,at=yticks,labels = TRUE, las = 1, tck = 0.02) axis(3,at=zvals,labels=FALSE, las = 1, tck=0.02) axis(4,at=yticks,labels = FALSE, tick = TRUE, tck = 0.02) abline(h=0,col="blue") legend(legendLocation,c("> 0.1","0.05 - 0.1","< 0.05"),col = c("snow3", "black","red"),pch = 20, title = "p-value", pt.cex=1.0, cex = legendSize * 1.5) } ######################################################################################### ############ This next function combines four individual trend graphs (for mimimum day, ########### median day, mean day, and maximum day) along with the quantile kendall graph ######################################################################################### plotFiveTrendGraphs <- function(eList, startDate = NA, endDate = NA, paStart = 4, paLong = 12, qUnit = 2, window = 30, legendLocation = "topleft", legendSize = 1.0) { localDaily <- eList$Daily localINFO <- eList$INFO localINFO$paStart <- paStart localINFO$paLong <- paLong localINFO$window <- window start <- as.Date(startDate) end <- as.Date(endDate) if(is.na(startDate)){ start <- as.Date(localDaily$Date[1]) } if(is.na(endDate)){ end <- as.Date(localDaily$Date[length(localDaily$Date)]) } localDaily <- subset(localDaily, Date >= start & Date <= end) eList <- 
as.egret(localINFO,localDaily) eList <- setPA(eList, paStart=paStart, paLong=paLong, window=window) # this next line of code is inserted so that when paLong = 12, we always use the # climate year when looking at the trends in the annual minimum flow paStart1 <- if(paLong == 12) 4 else paStart plotFlowTrend(eList, istat = 1, qUnit = qUnit, paStart = paStart1, paLong = paLong, window = window) plotFlowTrend(eList, istat = 4, qUnit = qUnit, paStart = paStart, paLong = paLong, window = window) plotFlowTrend(eList, istat = 8, qUnit = qUnit, paStart = paStart, paLong = paLong, window = window) plotFlowTrend(eList, istat = 5, qUnit = qUnit, paStart = paStart, paLong = paLong, window = window) # now the quantile kendall plotQuantileKendall(eList, startDate = startDate, endDate = endDate, paStart = paStart, paLong = paLong, legendLocation = legendLocation, legendSize = legendSize) } ######################################################################################### ########### makeSortQ creates a matrix called Qsort. ############It sorted from smallest to largest over dimDays ############(if working with full year dimDays=365), #############and also creates other vectors that contain information about this array. ######################################################################################### makeSortQ <- function(eList){ localINFO <- getInfo(eList) localDaily <- getDaily(eList) paStart <- localINFO$paStart paLong <- localINFO$paLong # determine the maximum number of days to put in the array numDays <- length(localDaily$DecYear) monthSeqFirst <- localDaily$MonthSeq[1] monthSeqLast <- localDaily$MonthSeq[numDays] # creating a data frame (called startEndSeq) of the MonthSeq values that go into each year Starts <- seq(paStart, monthSeqLast, 12) Ends <- Starts + paLong - 1 startEndSeq <- data.frame(Starts, Ends) # trim this list of Starts and Ends to fit the period of record startEndSeq <- subset(startEndSeq, Ends >= monthSeqFirst & Starts <= monthSeqLast) numYearsRaw <- length(startEndSeq$Ends) # set up some vectors to keep track of years good <- rep(0, numYearsRaw) numDays <- rep(0, numYearsRaw) midDecYear <- rep(0, numYearsRaw) Qraw <- matrix(nrow = 366, ncol = numYearsRaw) for(i in 1: numYearsRaw) { startSeq <- startEndSeq$Starts[i] endSeq <- startEndSeq$Ends[i] startJulian <- getFirstJulian(startSeq) # startJulian is the first julian day of the first month in the year being processed # endJulian is the first julian day of the month right after the last month in the year being processed endJulian <- getFirstJulian(endSeq + 1) fullDuration <- endJulian - startJulian yearDaily <- localDaily[localDaily$MonthSeq >= startSeq & (localDaily$MonthSeq <= endSeq), ] nDays <- length(yearDaily$Q) if(nDays == fullDuration) { good[i] <- 1 numDays[i] <- nDays midDecYear[i] <- (yearDaily$DecYear[1] + yearDaily$DecYear[nDays]) / 2 Qraw[1:nDays,i] <- yearDaily$Q } else { numDays[i] <- NA midDecYear[i] <- NA } } # now we compress the matrix down to equal number of values in each column j <- 0 numGoodYears <- sum(good) dayCounts <- ifelse(good==1, numDays, NA) lowDays <- min(dayCounts, na.rm = TRUE) highDays <- max(dayCounts, na.rm = TRUE) dimYears <- numGoodYears dimDays <- lowDays sortQ <- matrix(nrow = dimDays, ncol = dimYears) time <- rep(0,dimYears) for (i in 1:numYearsRaw){ if(good[i]==1) { j <- j + 1 numD <- numDays[i] x <- sort(Qraw[1:numD, i]) # separate odd numbers from even numbers of days if(numD == lowDays) { sortQ[1:dimDays,j] <- x } else { sortQ[1:dimDays,j] <- if(odd(numD)) leapOdd(x) 
else leapEven(x) } time[j] <- midDecYear[i] } } sortQList = list(sortQ,time) return(sortQList) } ######################################################################################### ########## Another function trendSortQ needed for Quantile Kendall ######################################################################################### trendSortQ <- function(Qsort, time){ # note requires packages zyp and rkt nFreq <- dim(Qsort)[1] nYears <- length(time) results <- as.data.frame(matrix(ncol=9,nrow=nFreq)) colnames(results) <- c("slopeLog","slopePct","pValue","pValueAdj","tau","rho1","rho2","freq","z") for(iRank in 1:nFreq){ mkOut <- rkt::rkt(time,log(Qsort[iRank,])) results$slopeLog[iRank] <- mkOut$B results$slopePct[iRank] <- 100 * (exp(mkOut$B) - 1) results$pValue[iRank] <- mkOut$sl outZYP <- zyp.zhang(log(Qsort[iRank,]),time) results$pValueAdj[iRank] <- outZYP[6] results$tau[iRank] <- mkOut$tau # I don't actually use this information in the current outputs, but the code is there # if one wanted to look at the serial correlation structure of the flow series serial <- acf(log(Qsort[iRank,]), lag.max = 2, plot = FALSE) results$rho1[iRank] <- serial$acf[2] results$rho2[iRank] <- serial$acf[3] frequency <- iRank / (nFreq + 1) results$freq[iRank] <- frequency results$z[iRank] <- qnorm(frequency) } return(results) } ######################################################################################### ################################## getFirstJulian finds the julian date of first day ################################## of a given month ######################################################################################### getFirstJulian <- function(monthSeq){ year <- 1850 + trunc((monthSeq - 1) / 12) month <- monthSeq - 12 * (trunc((monthSeq-1)/12)) charMonth <- ifelse(month<10, paste0("0",as.character(month)), as.character(month)) theDate <- paste0(year,"-",charMonth,"-01") Julian1 <- as.numeric(julian(as.Date(theDate),origin=as.Date("1850-01-01"))) return(Julian1) } ######################################################################################### ########### leapOdd is a function for deleting one value ############when the period that contains Februaries has a length that is an odd number ######################################################################################### leapOdd <- function(x){ n <- length(x) m <- n - 1 mid <- (n + 1) / 2 mid1 <- mid + 1 midMinus <- mid - 1 y <- rep(NA, m) y[1:midMinus] <- x[1:midMinus] y[mid:m] <- x[mid1:n] return(y)} ######################################################################################### ########### leapEven is a function for deleting one value ############when the period that contains Februaries has a length that is an even number ######################################################################################### leapEven <- function(x){ n <- length(x) m <- n - 1 mid <- n / 2 y <- rep(NA, m) mid1 <- mid + 1 mid2 <- mid + 2 midMinus <- mid - 1 y[1:midMinus] <- x[1:midMinus] y[mid] <- (x[mid] + x[mid1]) / 2 y[mid1:m] <- x[mid2 : n] return(y) } ######################################################################################### ####### determines if the length of a vector is an odd number ########################### ######################################################################################### odd <- function(x) {(!(x %% 2) == 0)} ######################################################################################### ########### calcWY calculates the water year and inserts it into a data frame 
######################################################################################### calcWY <- function (df) { df$WaterYear <- as.integer(df$DecYear) df$WaterYear[df$Month >= 10] <- df$WaterYear[df$Month >= 10] + 1 return(df) } ######################################################################################### ##### calcCY calculates the climate year and inserts it into a data frame ######################################################################################### calcCY <- function (df){ df$ClimateYear <- as.integer(df$DecYear) df$ClimateYear[df$Month >= 4] <- df$ClimateYear[df$Month >= 4] + 1 return(df) } ######################################################################################### ######## smoother is a function does the trend in real discharge units and not logs. ######## It is placed here so that users wanting to run this alternative have it available ######## but it is not actually called by any function in this document ######################################################################################### smoother <- function(xy, window){ edgeAdjust <- TRUE x <- xy$x y <- xy$y n <- length(y) z <- rep(0,n) x1 <- x[1] xn <- x[n] for (i in 1:n) { xi <- x[i] distToEdge <- min((xi - x1), (xn - xi)) close <- (distToEdge < window) thisWindow <- if (edgeAdjust & close) (2 * window) - distToEdge else window w <- triCube(x - xi, thisWindow) mod <- lm(xy$y ~ x, weights = w) new <- data.frame(x = x[i]) z[i] <- predict(mod, new) } return(z) } Making a graph of the trend in a single flow statistic

Now all we need to do is run the plotFlowTrend function that was the first part of the code we just read in. First we will run it in its simplest form: we will use the entire discharge record, and our period of analysis will be the full climatic year. A climatic year is the year that starts on April 1 and ends on March 31. It is our default approach because it tends to avoid breaking a long low-flow period into two segments, one in each of two adjacent years. We will first run it for the annual minimum daily discharge.

To use it you will need to know about the index of flow statistics used in EGRET. They are called istat, and the statistics represented by istat are: (1) 1-day minimum, (2) 7-day minimum, (3) 30-day minimum, (4) median, (5) mean, (6) 30-day maximum, (7) 7-day maximum, and (8) 1-day maximum. So to run the annual minimum daily discharge statistic we would set istat = 1. To run the plotFlowTrend function in its simplest form, the only arguments we need are the eList (which contains the metadata and the discharge data) and istat.
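Purely as a convenience (not part of EGRET itself), the istat codes can be kept in a lookup vector for labelling results:

# istat lookup, mirroring the list above (illustrative only)
istat_names <- c("1-day minimum", "7-day minimum", "30-day minimum",
                 "median", "mean",
                 "30-day maximum", "7-day maximum", "1-day maximum")
istat_names[1]   # "1-day minimum"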

plotFlowTrend(eList, istat = 1)

Explanation of the FlowTrend Plot, annual minimum day

The dots indicate the discharge on the minimum day of each climate year in the period of record. The solid curve is a smoothed representation of those data. It is specifically the smoother that is defined in the EGRET user guide (pages 16-18) with a 30-year window. For a record as short as this one (only 32 years) it will typically look like a straight line or a very smooth curve. For longer records it can display some substantial changes in slope and even be non-monotonic. At the top of the graph we see two pieces of information: a trend slope expressed in percent per year, and a p-value for the Mann-Kendall trend test of the data. The slope is computed using the Theil-Sen slope estimator. It is discussed on pages 266-274 of Helsel and Hirsch, 2002, which can be found here, although it is called the “Kendall-Theil Robust Line” in that text. It is calculated on the logarithms of the discharge data and then transformed to express the trend in percent per year. The p-value for the Mann-Kendall test is computed using the adjustment for serial correlation introduced in the zyp R package (David Bronaugh and Arelia Werner for the Pacific Climate Impacts Consortium (2013). zyp: Zhang + Yue-Pilon trends package. R package version 0.10-1. https://CRAN.R-project.org/package=zyp).
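For readers who want these two numbers outside of the plotting function, the same calculations can be reproduced directly with rkt and zyp; this mirrors what plotFlowTrend does internally, with q the annual series of the chosen flow statistic and t the corresponding decimal years:

# Sketch of the slope and p-value calculations used in the plot annotation
library(rkt)
library(zyp)

sen      <- rkt::rkt(t, log(q))           # Mann-Kendall test and Theil-Sen slope on log flows
slopePct <- 100 * (exp(sen$B) - 1)        # slope re-expressed as percent per year
pAdj     <- zyp::zyp.zhang(log(q), t)[6]  # p-value adjusted for serial correlation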

A few more FlowTrend Plots

We can do some other similar plots. In this case we will do the annual median day istat = 4, the annual maximum day istat = 8, and the annual mean day istat = 5.

The annual median day

The median day is computed for each year in the record. It is the middle day, that is, 182 values had discharges lower than it and 182 values had discharges greater than it (for a leap year it is the average of the 183rd and 184th ranked values). Everything else about it is the same as the first plot. Here is the annual median day discharge record.

plotFlowTrend(eList, istat = 4)

The annual maximum day

The third plot is for the maximum day of the year. Otherwise it is exactly the same in its construction as the first two plots. Note that this is the maximum daily average discharge, which in general will be smaller than the annual peak discharge (the maximum instantaneous discharge for the year). For very large rivers the annual maximum day discharge tends to be very close to the annual peak discharge and can serve as a rough surrogate for it in a trend study. For a small stream, where discharges may rise or fall by a factor of 2 or more in the course of a day, these maximum day values can be very different from the annual peak discharge.

At this point we will add one more concept, which is the period of analysis. In the first two plots the data were organized by climate year. But, when we look at annual maximum or annual mean discharge the standard approach is to use the water year. To do this we need to designate that we want our period of analysis to start with the month of October (month 10). To do this we need to add one more argument to our call to the function. paStart = 10. We did not need to specify the paStart in the first two plots, because the default is paStart = 4. Notice that this change in the period of analysis is noted in the text on the graphic.

plotFlowTrend(eList, istat = 8, paStart = 10)

The fourth plot is the annual mean discharge. It is constructed in exactly the same manner as the previous three, but it represents the mean streamflow for all of the days in the year rather than a single order statistic such as minimum, median, or maximum. We will also use the water year for this analysis.

plotFlowTrend(eList, istat = 5, paStart = 10)

Some observations about these figures

It is interesting to note that the minimum discharge values have been trending downwards, although not a statistically significant trend. The other three discharge statistics are all trending upwards and this is most notable in terms of the annual maximum daily discharge, both in terms of slope +4.7% per year, and a p-value that is well below 0.001. This raises many questions about how the variability of discharge might be changing over time and what the drivers are of the trends that are happening in various parts of the probability distribution of discharge (which we commonly call the flow duration curve). We will turn our focus to a way of summarizing these changes after we look at a few more variations on this plotFlowTrend graphic.

Variations on the simplest example

The call to the function plotFlowTrend has many arguments (you can see them all in the code shown above) but here we will just focus on a few that may be most useful to the data analyst:

plotFlowTrend(eList, istat, startDate = NA, endDate = NA, paStart = 4, paLong = 12, window = 30, qMax = NA, qUnit = 2, runoff = FALSE)

Here is a list of the optional arguments that are available to the user.

startDate If we want to evaluate the trend for a shorter period than what is contained in the entire Daily data frame, then we can specify a different starting date. For example, if we wanted the analysis to start with Water Year 1981, then we would say: startDate = "1980-10-01". By leaving out the startDate argument we are requesting the analysis to start where the data starts (which in this case is 1979-10-01).

endDate If we want to evaluate the trend for a period that ends before the end of the data set we can specify that with endDate. So if we wanted to end with Water Year 2009 we would say: endDate = "2009-09-30".

paStart is the starting month of the period of analysis. For example if we were interested in the trends in streamflow in a series of months starting with August, we would say paStart = 8. The default is paStart = 4, which starts in April.

paLong is the duration of the period of analysis. So, an analysis that covers all 12 months would have paLong = 12 (the default). A period that runs for the months of August – November would have paStart = 8 and paLong = 4. See pages 14 and 15 of the user guide for more detail on paStart and paLong.

window The smoothness of the curve plotted depends on the size of the window (which is measured in years). If window were much smaller (say 10) then the curve shown would be rather jagged, and as it becomes longer the curve converges to a straight line. The default value of 30 works well and is also roughly consistent with the convention used in the climate science community when discussing a 30-year normal period. The precise meaning of window is described in the User Guide (pages 16 – 18). It has no effect on the reported trend slope or p-value.

qMax is an argument that allows the user to set the maximum value on the vertical axis. If it is left out, the graphic scales itself. But for a set of similar graphics it may be useful to set qMax.

qUnit is an argument that allows the user to pick different discharge units. The default is qUnit = 2 which is cubic meters per second. The other most common unit to use is qUnit = 1 which is cubic feet per second.

runoff is a logical argument. The default is runoff = FALSE. In some studies it is convenient to express discharge in runoff terms (discharge per unit area), and setting runoff = TRUE causes it to be expressed that way.

Here is an example that shows the use of several of these arguments. Say we want to see what would happen if we only looked at annual maximum daily discharge from 1983 – 2010 (trimming off three low years at the start and one high year at the end), used a shorter smoothing window, and expressed our results as runoff in mm/day.

plotFlowTrend(eList, istat = 8, startDate = "1982-10-01", endDate = "2010-09-30", window = 15, runoff = TRUE)

It is worth noting here that compared to the previous plot of the annual maximum daily discharges, the removal of four particular years brought about a substantial change in the slope and the significance of the trend. This is a common issue in trend analysis with relatively short records.

The Quantile-Kendall Plot

Now we will look at a new kind of plot called the Quantile-Kendall plot. Each point on the plot is a trend slope (computed in the same manner as in the previous plots) for a given order statistic. The point at the far left edge is the first order statistic, which is the annual minimum daily discharge. This is the result described on the first plot. The next point to its right is the trend slope for the second order statistic (second lowest daily discharge for each year), and so on to the far right, which is the trend slope for the 365th order statistic (annual maximum daily discharge). Their placement with respect to the x-axis is based on the z-statistic (standard normal deviate) associated with that order statistic. It is called the daily non-exceedance probability. It is a scale used for convenience. It in no way assumes that the data follow any particular distribution. It is simply used to provide the greatest resolution near the tails of the daily discharge distribution. The color represents the p-value for the Mann-Kendall test for trend as described above. Red indicates a trend that is significant at alpha = 0.05. Black indicates an attained significance between 0.05 and 0.1. The grey dots are trends that are not significant at the alpha level of 0.1.
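As a rough illustration of how those x-axis positions arise, here is a short sketch using a common plotting-position convention; the exact formula EGRET uses may differ slightly, so treat this as conceptual only.

# Illustration of the Quantile-Kendall x-axis (a common plotting-position choice;
# EGRET's exact formula may differ slightly).
i <- 1:365              # rank of the order statistic within the year
p <- i / 366            # daily non-exceedance probability
z <- qnorm(p)           # standard normal deviate used for point placement
head(cbind(rank = i, prob = round(p, 4), z = round(z, 3)), 3)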

Here is what it looks like for the Choptank data set. What we see is that across the probability distribution the changes are generally not statistically significant, except for the highest and second highest days of the year. The trends are slightly negative at the low end of the distribution (the lowest 10%). Near the median they are positive and modestly large (around 1.5% per year). Then near the 90th percentile they drop to near zero slope, and only above about the 99th percentile do they seem to be particularly large.

plotQuantileKendall(eList)

There is one special manipulation of the data that is needed to account for leap years (this is a detail about the computation but is not crucial to understanding the plots). The 366 daily values observed in a leap year are reduced to 365 so that all years have 365 values. The elimination of one value is accomplished by replacing the two middle values in the ranked list of values with a single value that is their average. A similar approach is used when the period of analysis is any set of months that contains the month of February: the number of leap-year values is reduced by one, and the reduction takes place at the median value for the year.
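A minimal sketch of that leap-year adjustment is below; it is illustrative only, and the EGRET implementation may differ in detail.

# Reduce 366 ranked values to 365 by replacing the two middle values with their average.
collapse_leap_year <- function(q) {
  stopifnot(length(q) == 366)
  q <- sort(q)
  mid <- mean(q[183:184])          # average of the two middle ranked values
  c(q[1:182], mid, q[185:366])     # 365 values returned
}
length(collapse_leap_year(runif(366, 1, 100)))  # 365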

Variations on the Quantile Kendall

We might be interested in looking at trends in a particular season of the year. In the case of the Choptank we know that tropical storm generated floods have been particularly significant in the later part of this record, and these have been mostly in August, September and October. Or, if our interest were related to the flows happening in the spring and early summer, which is the period of time during which delivery to the Chesapeake Bay (into which the Choptank River flows) is most important, then we might want to run our analysis on the period March through June. To do that we can use paStart = 3 and paLong = 4.

plotQuantileKendall(eList, paStart = 3, paLong = 4)

What we can see is that for this portion of the year the discharge trends are rather minimal and certainly not statistically significant, except that the 5 or so highest flow days in the season do show rather large trends, and the annual maximum trend is the largest of all (in percent per year) and is the only one that is significant.

There are a few other arguments for the plotQuantileKendall function. As mentioned in the discussion of the plotFlowTrend function, it has the startDate and endDate arguments. If they aren’t specified then it is assumed that the analysis covers the whole period in the Daily data frame.

There are two arguments that relate to the appearance of the graph.

legendLocation this argument simply determines where in the graph the legend is placed. It sometimes happens that the legend obscures the data so we may need to move it to another part of the graph. The argument can take on names such as “bottomleft”, “topright” or “bottomright”. The default is “topleft”. If you don’t want a legend (say you are making many Quantile Kendall plots in a single job and can’t go in and adjust each one), then you can specify legendLocation = NA.

legendSize this argument determines the size of the legend. The default is legendSize = 1.0. If you want to decrease the dimensions of the legend you decrease this value. For example a 25% reduction would be achieved by setting legendSize = 0.75. Typically one wouldn’t want to go below about 0.4.

yMax and yMin are two more optional arguments to set up the limits on the y-axis. If you don’t specify them (just leave them out) then the data will dictate the minimum and maximum values on the y-axis (it will extend slightly beyond the highest and lowest values). But, if you are doing many graphs you may want them to have consistent axes, so a quick look at the graph will tell you if the trends are large or small; in that case you can set these two values. For example, say you want every graph to have a maximum value of 5 and a minimum value of -5; then the arguments would just be: yMax = 5, yMin = -5. There is a “fail safe” mechanism in the code if the trends are bigger than the specified values. Let’s say we set yMax = 5 but in one particular graph there is a trend slope of 7.2 % per year. In this case the y-axis would extend out slightly beyond a value of 7.2. Thus, no value is ever out of range. All of the slopes will be plotted.

Here is an example bringing all of the arguments into play. It is for a case where we want to consider the time period 1982-08-01 through 2010-11-30, for only the months of August through October, and we want the legend to go in the bottom right and be only half the size of what we saw in the first example.

plotQuantileKendall(eList, startDate = "1982-08-01", endDate = "2010-11-30", paStart = 8, paLong = 3, legendLocation = "bottomright", legendSize = 0.5, yMax = 4, yMin = -4)

What this graph shows us is that unlike the year as a whole, this late summer to fall period has been one of very substantially increasing discharge. From about the median of the distribution to the top, the trends are greater than 3% per year and many are significant at least at the alpha level of 0.1. There is one feature of this figure that may look odd: at the lower end of the distribution there are many trend slopes that are exactly zero. Using the gray dot at the far left as an example, the trend slope estimation method uses all pairwise comparisons of the lowest discharge of each year. Because of the USGS reporting conventions, many of these discharge values cluster around a few specific values, so that in the comparisons there are many ties. The slope estimate is the median of all the pairwise slopes, and since a significant number of these slopes are zero, the slope estimate is zero.
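A tiny, self-contained illustration of why ties lead to a slope of exactly zero (the values below are hypothetical, not the Choptank data):

# Theil-Sen estimate = median of all pairwise slopes; ties push that median to zero.
q  <- c(0.12, 0.12, 0.15, 0.12, 0.12, 0.15, 0.12)   # hypothetical annual minima with ties
yr <- 1:7
pairs  <- combn(7, 2)
slopes <- (q[pairs[2, ]] - q[pairs[1, ]]) / (yr[pairs[2, ]] - yr[pairs[1, ]])
median(slopes)   # 0: more than half of the pairwise slopes are exactly zero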

Putting it all together into a set of five graphs

This post also provides an additional function that produces a total of five graphics from a single command. It does the four trend plots and then the Quantile-Kendall. It is called plotFiveTrendGraphs. The arguments are all ones that we have seen before: eList, startDate, endDate, paStart, paLong, qUnit, window, legendLocation, and legendSize. In its simplest form it would look like this (output not plotted here).

plotFiveTrendGraphs(eList)

Downloading the data for your site of interest

The steps for downloading data from USGS web services (or obtaining it from user supplied information) and also for creating and saving the eList are described on pages 4-13 of the EGRET user guide. Here is a simple example:

Say we want to look at USGS station number 05436500 (Sugar River near Brodhead, WI) and we want to consider data from Climate Years 1921 through 2016. Note that this much longer period of record causes the computations to take a good deal more time than the previous cases, but it should still take well under a minute to complete the whole job. The following commands would do what is needed.

library(EGRET)
sta <- "05436500"
param <- "00060" # this is the parameter code for daily discharge
startDate <- "1920-04-01"
endDate <- "2016-03-31"
INFO <- readNWISInfo(siteNumber = sta, parameterCd = param, interactive = FALSE)
Daily <- readNWISDaily(siteNumber = sta, parameterCd = param, startDate = startDate, endDate = endDate, verbose = FALSE)
eList <- as.egret(INFO, Daily)
plotFiveTrendGraphs(eList, legendLocation = "bottomleft")

Final thoughts

This last set of plots is particularly interesting. What we see is that for the lowest 90 percent of the distribution, discharges have been rising over most of this record, and particularly so since about 1950. But in the highest 1% of the distribution discharges have been falling throughout the record. As a consequence the overall variability of streamflow in this agricultural watershed has generally been declining over time. This is generally thought to be related to the increasing use of conservation practices in the watershed.

It is worth noting that we can express the percentage changes in other ways than as percent per year, for example as percentage change per decade or percentage change over the entire period of record. Take, for example, the estimated trend slope for the median in this last example. It is 0.68% per year. If we were to express that as percentage change per decade it would be 7% per decade (the change can be computed as 1.0068^10, or 1.070, which is a 7% increase). If we expressed it for the entire 96 year record it would be about a 92% increase over that period (computed as 1.0068^96, which is 1.9167, an increase of about 92%).
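The arithmetic is easy to check in R:

slope <- 0.0068
(1 + slope)^10    # ~1.070 -> about a 7% increase per decade
(1 + slope)^96    # ~1.917 -> about a 92% increase over the 96-year record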

These graphical tools are one way of summarizing what we might call a signature of change for any given watershed. It shows the changes, or lack of changes, that have taken place through all parts of the probability distribution of streamflow. This has potential as a tool for evaluating hydrologic or linked climate and hydrologic models. Observing the patterns on these graphs for the actual data versus the patterns seen when simulated data are used, can provide insights on the ability of the models used to project hydrologic changes into the future.

If you have questions or comments on the concepts or the implementation please contact the author rhirsch@usgs.gov.

To leave a comment for the author, please follow the link and comment on their blog: The USGS OWI blog.

Native scoring in SQL Server 2017 using R

Mon, 05/28/2018 - 22:13

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

Native scoring is a much overlooked feature in SQL Server 2017 (available only under Windows and only on-prem) that provides scoring and prediction against pre-built and stored machine learning models in near real-time.

Icons made by Smashicons from www.flaticon.com, licensed under CC 3.0 BY.

The definition of real-time depends on what it means for your line of business, so I will not go into it here; but we can certainly say that scoring 10,000 rows in a second from a mediocre client computer (similar to mine) qualifies.

Native scoring in SQL Server 2017 comes with a couple of limitations, but also with a lot of benefits. The limitations are:

  • currently supports only SQL Server 2017 and the Windows platform
  • the trained model should not exceed 100 MiB in size
  • Native scoring with the PREDICT function supports only the following algorithms from the RevoScaleR library:
    • rxLinMod (linear model as linear regression)
    • rxLogit (logistic regression)
    • rxBTrees (Parallel external memory algorithm for Stochastic Gradient Boosted Decision Trees)
    • rxDtree (External memory algorithm for Classification and Regression Trees)
    • rxDForest (External memory algorithm for Classification and Regression Decision Trees)

Benefits of using PREDICT function for native scoring are:

  • No configuration of R or ML environment is needed (assuming that the trained models are already stored in the database),
  • Code is cleaner, more readable, and no additional R code is needed when performing scoring,
  • No R engine is called at run-time, so there is a tremendous reduction of CPU and I/O costs, as well as no external calls,
  • A client or server running native scoring with the PREDICT function does not need the R engine installed, because it uses C++ libraries from Microsoft that can read a serialized model stored in a table, deserialize it, and generate predictions, all without the need of R

Overall, if you are looking for faster predictions in your enterprise and would love faster code and solution deployment, especially integration with other applications or building an API in your ecosystem, native scoring with the PREDICT function will surely be an advantage. Although not all predictions/scores are supported, the majority can be done using regression models or decision tree models (it is estimated that both types of algorithms, counting derivatives of regression models and ensemble methods, are used in 85% of predictive analytics).

To put the PREDICT function to the test, I have deliberately taken a semi-large dataset available in the RevoScaleR package in R – AirlineDemoSmall.csv. Using a simple BULK INSERT, we get the data into the database:

BULK INSERT ArrDelay
FROM 'C:\Program Files\Microsoft SQL Server\140\R_SERVER\library\RevoScaleR\SampleData\AirlineDemoSmall.csv'
WITH ( FIELDTERMINATOR =',', ROWTERMINATOR = '0x0a', FIRSTROW = 2, CODEPAGE = 'RAW');

Once data is in the database, I will split the data into training and test sub-sets.

SELECT TOP 20000 *
  INTO ArrDelay_Train
  FROM ArrDelay
 ORDER BY NEWID()
-- (20000 rows affected)

SELECT *
  INTO ArrDelay_Test
  FROM ArrDelay AS AR
 WHERE NOT EXISTS (SELECT * FROM ArrDelay_Train AS ATR
                    WHERE ATR.arrDelay = AR.arrDelay
                      AND ATR.[DayOfWeek] = AR.[DayOfWeek]
                      AND ATR.CRSDepTime = AR.CRSDepTime)
-- (473567 rows affected)

And the layout of the dataset is relatively simple:

ArrDelay   CRSDepTime   DayOfWeek
       1     9,383332           3
       4    18,983334           4
       0    13,883333           4
      65    21,499998           7
      -3     6,416667           1

Creating models

So we will create two essentially identical models using the rxLinMod function with the same formula, but one with an additional parameter for real-time scoring set to TRUE.

-- regular model creation
DECLARE @model VARBINARY(MAX);
EXECUTE sp_execute_external_script
   @language = N'R'
  ,@script = N'
     arrDelay.LM <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = InputDataSet)
     model <- rxSerializeModel(arrDelay.LM)'
  ,@input_data_1 = N'SELECT * FROM ArrDelay_Train'
  ,@params = N'@model varbinary(max) OUTPUT'
  ,@model = @model OUTPUT
INSERT [dbo].arrModels([model_name], [native_model])
VALUES('arrDelay.LM.V1', @model);

-- Model for Native scoring
DECLARE @model VARBINARY(MAX);
EXECUTE sp_execute_external_script
   @language = N'R'
  ,@script = N'
     arrDelay.LM <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = InputDataSet)
     model <- rxSerializeModel(arrDelay.LM, realtimeScoringOnly = TRUE)'
  ,@input_data_1 = N'SELECT * FROM ArrDelay_Train'
  ,@params = N'@model varbinary(max) OUTPUT'
  ,@model = @model OUTPUT
INSERT [dbo].arrModels([model_name], [native_model])
VALUES('arrDelay.LM.NativeScoring.V1', @model);

Both models will have the same training set, and will be stored in a table for future scoring. Upon first inspection, we can see there is a difference in the model size:

Scoring Models

Both models took relatively the same amount of time to train and to store in the table. Both can also be created on R Machine Learning Server and stored in the same way (with or without the argument realtimeScoringOnly). The model size gives you an idea of why and how real-time scoring can be achieved: keep your model as small as possible. Both models will give you exactly the same prediction scores, just that the native scoring will be much faster. Note also, if you are planning to do any text analysis with real-time scoring, keep in mind the 100 MiB limitation, as text prediction models often exceed it.
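If you want to see that size difference from the R side, here is a hedged sketch using the same RevoScaleR calls as in the T-SQL above. It assumes RevoScaleR is available (e.g. on Microsoft ML Server), and airTrain is a stand-in name for your training data frame, not an object created in this post.

# R-side sketch of the serialized size difference (assumes RevoScaleR is available;
# airTrain is a hypothetical training data frame with the same columns as ArrDelay_Train).
library(RevoScaleR)
fit     <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = airTrain)
full    <- rxSerializeModel(fit)                              # full serialized model (raw vector)
scoring <- rxSerializeModel(fit, realtimeScoringOnly = TRUE)  # scoring-only payload
c(full_bytes = length(full), scoring_bytes = length(scoring))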

To compare the execution of the scoring models, I will score using the “traditional way” with the external procedure sp_execute_external_script and then using the PREDICT function.

------------------------------------
-- Using sp_execute_external_script
------------------------------------
DECLARE @model VARBINARY(MAX) = (SELECT native_model FROM arrModels WHERE model_name = 'arrDelay.LM.V1')
EXEC sp_execute_external_script
   @language = N'R'
  ,@script = N'
     modelLM <- rxUnserializeModel(model)
     OutputDataSet <- rxPredict(model = modelLM,
                                data = ArrDelay_Test,
                                type = "link",
                                predVarNames = "ArrDelay_Pred",
                                extraVarsToWrite = c("ArrDelay","CRSDepTime","DayOfWeek"))'
  ,@input_data_1 = N'SELECT * FROM dbo.ArrDelay_Test'
  ,@input_data_1_name = N'ArrDelay_Test'
  ,@params = N'@model VARBINARY(MAX)'
  ,@model = @model
WITH RESULT SETS ((
   AddDelay_Pred FLOAT
  ,ArrDelay INT
  ,CRSDepTime NUMERIC(16,5)
  ,[DayOfWeek] INT
))
-- (473567 rows affected)
-- Duration 00:00:08

---------------------------
-- Using Real Time Scoring
---------------------------
DECLARE @model varbinary(max) = (SELECT native_model FROM arrModels WHERE model_name = 'arrDelay.LM.NativeScoring.V1');
SELECT NewData.*, p.*
  FROM PREDICT(MODEL = @model, DATA = dbo.ArrDelay_Test AS NewData)
  WITH (ArrDelay_Pred FLOAT) AS p;
GO
-- (473567 rows affected)
-- Duration 00:00:04

The two examples differ, but the PREDICT function version looks much neater and more readable. Time performance is also on the PREDICT function's side, as it returns the predictions much faster.

In addition, I have mentioned that the PREDICT function does not need the R engine or the Launchpad Service to be running in the environment where the code is executed. To put this to the test, I will simply stop the SQL Server Launchpad Service:

After executing the first set of predictions using sp_execute_external_script, SQL Server or Machine Learning Server will notify you that the service is not running:

whereas, the PREDICT function will work flawlessly.

Verdict

For sure, faster predictions are something that will be very welcome in the gaming, transport, utility, metal and financial industries, and in any other field where real-time predictions against OLTP systems are appreciated. With light-weight models and good algorithm support, I would certainly give it additional thought, especially if you see good potential in faster, near real-time predictions.

As always, complete code and data sample are available at Github. Happy coding!

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

Take Screenshot of Webpage using R

Mon, 05/28/2018 - 21:12

(This article was first published on ListenData, and kindly contributed to R-bloggers)

Programmatically taking screenshots of a web page is very useful in a testing environment, to see how the page renders. The same capability can be used for automation, like getting a screenshot of a news website into your inbox every morning, or generating a report of candidates’ GitHub activities. But this wasn’t possible from the command line until the rise of headless browsers and the JavaScript libraries supporting them. Even when such JavaScript libraries were made available, R programmers did not have any option to integrate such functionality into their code.
That is where webshot comes in: an R package that helps R programmers take web screenshots programmatically, with the help of PhantomJS running in the backend.

Take Screenshot from R
What is PhantomJS?

PhantomJS is a headless webkit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.

PhantomJS is an optimal solution for the following:

  • Headless website testing
  • Screen Capture
  • Page Automation
  • Network Monitoring

Webshot : R Package 

The webshot package allows users to take screenshots of web pages from R with the help of PhantomJS. It also can take screenshots of R Shiny App and R Markdown Documents (both static and interactive).

Install and Load Package

The stable version of webshot is available on CRAN hence can be installed using the below code:

install.packages('webshot')
library('webshot')

Also, the latest development version of webshot is hosted on github and can be installed using the below code:

# install.packages('devtools')
devtools::install_github('wch/webshot')

Initial Setup

As we saw above, the R package webshot works with PhantomJS in the backend, hence it is essential to have PhantomJS installed on the local machine where webshot package is used. To assist with that, webshot itself has an easy function to get PhantomJS installed on your machine.

webshot::install_phantomjs()

The above function automatically downloads PhantomJS from its website and installs it. Please note this is a one-time setup; once both webshot and PhantomJS are installed, the two steps above can be skipped when using the package as described in the sections below.

Now the webshot package is installed, set up and ready to use. To start with, let us take a PDF copy of a web page.

Screenshot Function

The webshot package provides one simple function, webshot(), that takes a webpage URL as its first argument and saves the screenshot under the file name given as its second argument. It is important to note that the filename includes a file extension like '.jpg', '.png' or '.pdf', based on which the output file is rendered. Below is the basic structure of how the function goes:

library(webshot)

#webshot(url, filename.extension)
webshot("https://www.listendata.com/", "listendata.png")

If no folder path is specified along with the filename, the file is downloaded in the current working directory which can be checked with getwd().

Now that we understand the basics of the webshot() function, it is time to begin with our cases, starting with downloading/converting a webpage as a PDF copy.

Case #1: PDF Copy of WebPage

Let us assume, we would like to download Bill Gates’ notes on Best Books of 2017 as a PDF copy.

#loading the required library
 library(webshot)

#PDF copy of a web page / article
webshot("https://www.gatesnotes.com/About-Bill-Gates/Best-Books-2017",
        "billgates_book.pdf",
 delay = 2)

The above code generates a PDF whose (partial) screenshot is below:

Snapshot of PDF Copy

Dissecting the above code, we can see that the webshot( ) function has got 3 arguments supplied with it.

  1. URL from which the screenshot has to be taken. 
  2. Output Filename along with its file extensions. 
  3. Time to wait before taking screenshot, in seconds. Sometimes a longer delay is needed for all assets to display properly.

Thus, a webpage can be converted/downloaded as a PDF programmatically in R.

Case #2: Webpage Screenshot (Viewport Size)

Now, I’d like to get an automation script running to take a screenshot of a news website and probably send it to my inbox, so I can see the headlines without going to the browser. Here we will see how to get a simple screenshot of livemint.com, an Indian news website.

#Screenshot of Viewport
webshot('https://www.livemint.com/', 'livemint.png', cliprect = 'viewport')

While the first two arguments are similar to the above function, there’s a new third argument, cliprect, which specifies the size of the clipping rectangle.

If cliprect is unspecified, a screenshot of the complete web page is taken (like in the above case). Since we are interested only in the latest news (which is usually at the top of the website), we use cliprect with the value 'viewport', which clips only the viewport part of the browser, as below.

Screenshot of Viewport of Browser

Case #3: Multiple Selector Based Screenshots

So far we have taken simple screenshots of whole pages, dealing with one screenshot and one file, but that is not what usually happens when you automate something or work programmatically. In most cases we end up performing more than one action, so this case deals with taking multiple screenshots and saving multiple files. Instead of taking screenshots of multiple URLs (which is quite straightforward), we will take screenshots of different sections of the same web page using different CSS selectors and save them in respective files.

#Multiple Selector Based Screenshots
webshot("https://github.com/hadley",
        file = c("organizations.png", "contributions.png"),
        selector = list("div.border-top.py-3.clearfix", "div.js-contribution-graph"))

In the above code, we take screenshots of two CSS selectors from the GitHub profile page of Hadley Wickham and save them in two PNG files – organizations.png and contributions.png.

Contributions.png

Organizations.png

Thus, we have seen how to use the R package webshot for taking screenshots programmatically in R. Hope, this post helps fuel your automation needs and helps your organisation improve its efficiency.


About Author:

Deepanshu founded ListenData with a simple objective – Make analytics easy to understand and follow. He has over 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains.

Let's Get Connected: LinkedIn

To leave a comment for the author, please follow the link and comment on their blog: ListenData.

ShinyProxy 1.1.1 released!

Mon, 05/28/2018 - 15:00

(This article was first published on Open Analytics, and kindly contributed to R-bloggers)

ShinyProxy is a novel, open source platform to deploy Shiny apps for the enterprise
or larger organizations.

Theming

ShinyProxy 1.1.1 is in essence a maintenance release, but there is one new feature that has been on the wish list of our users for a long time: the possibility of theming the landing page of ShinyProxy which displays the overview of the Shiny apps.

The standard display when using the ShinyProxy demo image from the Getting Started guide is a plain listing:

With an HTML-template makeover, it can look like this:

or maybe you prefer to have the apps displayed in two columns?

As one can see, it is possible to use images or icons for the apps and the display and text can be entirely
customized with CSS. The templating technology used for the customization of the landing page is Thymeleaf, a mature and amply documented framework for natural HTML templates.

To help you get started customizing your own ShinyProxy instance, a fully worked theming example is available in our shinyproxy-configuration-examples repository on Github.

Miscellaneous fixes

In our previous release we made a major leap to run Shiny apps at datacenter scale by adding support for a Kubernetes back-end (see this blog post). We have used it for customers that roll out internet-facing Shiny apps with high numbers of concurrent users and needs for automated deployment. Following community feedback, this release has an important fix related to deletion of Kubernetes services when shutting down containers.

For the users that embed ShinyProxy in larger application frameworks, we released a fix for passing arguments to Shiny apps.
Some other small fixes include improved browser display of the apps and a more strict adherence to YAML for the ShinyProxy configuration file.

For the new features, documentation has been updated on https://shinyproxy.io and as always community support on this new release is available at

https://support.openanalytics.eu

Don’t hesitate to send in questions or suggestions and have fun with ShinyProxy!

To leave a comment for the author, please follow the link and comment on their blog: Open Analytics.

Statistics Sunday: Two R Packages to Check Out

Sun, 05/27/2018 - 16:27

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

I’m currently out of town and not spending as much time on my computer as I have over the last couple months. (It’s what happens when you’re the only one in your department at work and also most of your hobbies involve a computer.) But I wanted to write up something for Statistics Sunday and I recently discovered two R packages I need to check out in the near future.

The first is called echor, which allows you to search and download data directly from the US Environmental Protection Agency (EPA) Enforcement and Compliance History Online (ECHO), using the ECHO API. According to the vignette, linked above, “ECHO provides data for:

  • Stationary sources permitted under the Clean Air Act, including data from the National Emissions Inventory, Greenhouse Gas Reporting Program, Toxics Release Inventory, and Clean Air Markets Division Acid Rain Program and Clean Air Interstate Rule. 
  • Public drinking water systems permitted under the Safe Drinking Water Act, including data from the Safe Drinking Water Information System. 
  • Hazardous Waste Handlers permitted under the Resource Conservation and Recovery Act, with data drawn from the RCRAInfo data system. 
  • Facilities permitted under the Clean Water Act and the National Pollutant Discharge Elimination Systems (NPDES) program, including data from EPA’s ICIS-NPDES system and possibly waterbody information from EPA’s ATTAINS data system.”
The second package is papaja, or Preparing APA Journal Articles, which uses RStudio and R Markdown to create APA-formatted papers. Back when I wrote APA style papers regularly, I had Word styles set up to automatically format headers and subheaders, but properly formatting tables and charts was another story. This package promises to do all of that. It’s still in development, but you can find out more about it here and here.
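If you want to try papaja right away, here is a minimal sketch. It assumes the development version installs cleanly from GitHub, that the "apa6" template name matches papaja's current documentation, and that a TeX distribution is available for PDF output; adjust as needed.

# Install the development version and start an APA-formatted manuscript
# (assumptions: GitHub install works, "apa6" is the template name, TeX is available).
# install.packages("devtools")
devtools::install_github("crsh/papaja")

rmarkdown::draft("my_manuscript.Rmd", template = "apa6", package = "papaja",
                 create_dir = FALSE, edit = FALSE)
rmarkdown::render("my_manuscript.Rmd")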

I have some fun analysis projects in the works! Stay tuned.

To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.

Intersecting points and overlapping polygons

Sun, 05/27/2018 - 15:01

(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

I’ve been doing some spatial stuff of late and the next little step will involve intersecting points with possibly many overlapping polygons. The sp package has a function called over which returns the polygons that points intersect with. The catch, though, is that it only returns the last (highest numerical value) polygon a point overlaps with. So it’s not so useful if you have many overlapping polygons. A little playing, and I’ve overcome that problem…

Here’s a toy example.

Create a couple of polygons and put them into a SpatialPolygons object.

library(sp)
p1 <- matrix(c(1,1, 2,1, 4,2, 3,2), ncol = 2, byrow = TRUE)
p2 <- matrix(c(2.2,1, 3,1, 3,2, 3,3, 2.8,3), ncol = 2, byrow = TRUE)
p1s <- Polygons(list(Polygon(p1)), 3)
p2s <- Polygons(list(Polygon(p2)), 4)
sps <- SpatialPolygons(list(p1s, p2s))

Define a few points and put them in a SpatialPoints object

point <- matrix(c(2.5, 1.5, 3.2, 1.75, 2,3, 1.5, 1.25, 2.75, 2.5), ncol = 2, byrow = TRUE)
points <- SpatialPoints(point)

We can plot them…(not shown)

plot(sps, border = c("black", "blue"))
plot(points, add = TRUE)

As here we have the issue:

over(points, sps)
 1  2  3  4  5
 2  1 NA  1  2

only returning a single “hit” per point (point 1 overlaps with both polygons 1 and 2).

To get around this, we can rip the individual polygons out of the SpatialPolygons object and put them in a list, converting the individual polygons into SpatialPolygons along the way:

sps2 <- lapply(sps@polygons, function(x) SpatialPolygons(list(x)))

Then lapply-ing over sps2 we can see which polygons each point intersects…

lapply(sps2, function(x) over(points, x))
[[1]]
 1  2  3  4  5
 1  1 NA  1 NA

[[2]]
 1  2  3  4  5
 1 NA NA NA  1

And now we see that point one overlaps with both polygons.
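As a small follow-up, the same idea can be condensed into a points-by-polygons indicator matrix (continuing with the objects created above; the column names are just labels added here for readability):

# TRUE wherever a point falls inside a given polygon
hits <- sapply(sps2, function(x) !is.na(over(points, x)))
colnames(hits) <- c("poly1", "poly2")
hits
#   poly1 poly2
# 1  TRUE  TRUE
# 2  TRUE FALSE
# 3 FALSE FALSE
# 4  TRUE FALSE
# 5 FALSE  TRUE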

Code >>>here<<<

 

 

To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD.

The Power of Standards and Consistency

Sun, 05/27/2018 - 06:32

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I’m going to (eventually) write a full post on the package I’m mentioning in this one : osqueryr. The TLDR on osqueryr is that it is an R DBI wrapper (that has just enough glue to also be plugged into dbplyr) for osquery. The TLDR on osquery is that it “exposes an operating system as a high-performance relational database. This design allows you to write SQL-based queries efficiently and easily to explore operating systems.”

In short, osquery turns the metadata and state information of your local system (or remote system(s)) into a SQL-compliant database. It also works on Windows, Linux, BSD and macOS. This means you can query a fleet of systems with a (mostly) normalized set of tables and get aggregated results. Operations and information security staff use this to manage systems and perform incident response tasks, but you can use it to get just about anything and there are even more powerful modes of operation for osquery. But, more on all the features of osquery[r] in another post.

If you are skeptical, here’s some proof (which I need to show regardless of your skepticism state). First, a local “connection”:

library(DBI)
library(osqueryr)

con <- DBI::dbConnect(Osquery())

head(dbListTables(con), 10)
## [1] "account_policy_data" "acpi_tables"         "ad_config"
## [4] "alf"                 "alf_exceptions"      "alf_explicit_auths"
## [7] "alf_services"        "app_schemes"         "apps"
## [10] "apt_sources"

dbListFields(con, "processes")
## [1] "cmdline"            "cwd"                "disk_bytes_read"
## [4] "disk_bytes_written" "egid"               "euid"
## [7] "gid"                "name"               "nice"
## [10] "on_disk"            "parent"             "path"
## [13] "pgroup"             "pid"                "resident_size"
## [16] "root"               "sgid"               "start_time"
## [19] "state"              "suid"               "system_time"
## [22] "threads"            "total_size"         "uid"
## [25] "user_time"          "wired_size"

dbGetQuery(con, "SELECT name, system_time FROM processes WHERE name LIKE '%fire%'")
## # A tibble: 2 x 2
##   name     system_time
## 1 Firewall           3
## 2 firefox       517846

then, a remote "connection":

con2 <- osqueryr::dbConnect(Osquery(), host = "hrbrmstr@osq1")

head(dbListTables(con2), 10)
## [1] "account_policy_data" "acpi_tables"         "ad_config"
## [4] "alf"                 "alf_exceptions"      "alf_explicit_auths"
## [7] "alf_services"        "app_schemes"         "apps"
## [10] "apt_sources"

dbListFields(con2, "processes")
## [1] "cmdline"            "cwd"                "disk_bytes_read"
## [4] "disk_bytes_written" "egid"               "euid"
## [7] "gid"                "name"               "nice"
## [10] "on_disk"            "parent"             "path"
## [13] "pgroup"             "pid"                "resident_size"
## [16] "root"               "sgid"               "start_time"
## [19] "state"              "suid"               "system_time"
## [22] "threads"            "total_size"         "uid"
## [25] "user_time"          "wired_size"

dbGetQuery(con2, "SELECT name, system_time FROM processes WHERE name LIKE '%fire%'")
## # A tibble: 1 x 2
##   name    system_time
## 1 firefox     1071992

"You're talking an awful lot about the package when you said this was a post on 'standards' and 'consistency'."

True, but we needed that bit above for context. To explain what this post has to do with "standards" and "consistency" I also need to tell you a bit more about how both osquery and the osqueryr package are implemented.

You can read about osquery in-depth starting at the link at the top of this post, but the authors of the tool really wanted a consistent idiom for accessing system metadata with usable, normalized output. They chose (to use a word they didn't but one that works for an R audience) a "data frame" as the output format and picked the universal language of "data frames" -- SQL -- as the inquiry interface. So, right there are examples of both standards and consistency: using SQL vs coming up with yet-another-query-language and avoiding the chaos of the myriad of outputs from various system commands by making all results conform to a rectangular data structure.

Let's take this one-step further with a specific example. All modern operating systems have the concept of a "process" and said processes have (mostly) similar attributes. However, the commands used to get a detailed listing of those processes differ (sometimes wildly) from OS to OS. The authors of osquery came up with a set of schemas to ensure a common, rectangular output and naming conventions (note that some schemas are unique to a particular OS since some elements of operating systems have no useful counterparts on other operating systems).

The osquery authors also took consistency and standards to yet-another-level by taking advantage of a feature of SQLite called virtual tables. That enables them to have C/C++/Objective-C "glue" that gets called when a query is made so they can dispatch the intent to the proper functions or shell commands and then send all the results back -- or -- use the SQLite engine capabilities to do joining, filtering, UDF-calling, etc to produce rich, targeted rectangular output back.

By not reinventing the wheel and relying on well-accepted features like data frames, SQL and SQLite the authors could direct all their focus on solving the problem they posited.

"Um, you're talking alot about everything but R now."

We're getting to the good (i.e. "R") part now.

Because the authors didn't try to become SQL parser writer experts and relied on the standard SQL offerings of SQLite, the queries made are "real" SQL (if you've worked with more than one database engine, you know how they all implement different flavours of SQL).

Because these queries are "real" SQL, we can write an R DBI driver for it. The DBI package aims "[to define] a common interface between R and database management systems (DBMS). The interface defines a small set of classes and methods similar in spirit to Perl's DBI, Java's JDBC, Python's DB-API, and Microsoft's ODBC. It defines a set of classes and methods defines what operations are possible and how they are performed."

If you look at the osqueryr package source, you'll see a bunch of DBI boilerplate code (which is in the r-dbi organization example code) and only a handful of "touch points" for the actual calls to osqueryi (the command that processes SQL). No handling of anything but passing on SQL to the osqueryi engine and getting rectangular results back. By abstracting the system call details, R users can work with a familiar, consistent, standard interface and have full access to the power of osquery without firing up a terminal.

But it gets even better.

As noted above, one design aspect of osquery was to enable remote usage. Rather than come up with yet-another-daemon-and-custom-protocol, the osquery authors suggest ssh as one way of invoking the command on remote systems and getting the rectangular results back.

Because the osqueryr package used the sys package for making local system calls, there was only a tiny bit of extra effort required to switch from sys::exec_internal() to a sibling call in the ssh package -- ssh::ssh_exec_internal() when remote connections were specified. (Said effort could have been zero if I chose a slightly different function in sys, too.)

Relying on well-accepted standards made both osqueryi and the R DBI-driver work seamlessly without much code at all and definitely without a rats nest of nested if/else statements and custom httr functions.

But it gets even more better-er

Some folks like & grok SQL, others don't. (Humans have preferences, go figure.)

A few years ago, Hadley (do I even need to use his last name at this point in time?) came up with the idea to have a more expressive and consistent way to work with data frames. We now know this as the tidyverse but one core element of the tidyverse is dplyr, which can really level-up your data frame game (no comments about data.table, or the beauty of base R, please). Not too long after the birth of dplyr came the ability to work with remote, rectangular, SQL-based data sources with (mostly) the same idioms.

And, not too long after that, the remote dplyr interface (now, dbplyr) got up close and personal with DBI. Which ultimately means that if you make a near-fully-compliant DBI interface to a SQL back-end you can now do something like this:

library(DBI)
library(dplyr)
library(osqueryr)

con <- DBI::dbConnect(Osquery())

osqdb <- src_dbi(con)

procs <- tbl(osqdb, "processes")
listen <- tbl(osqdb, "listening_ports")

left_join(procs, listen, by = "pid") %>%
  filter(port != "", protocol == "17") %>% # 17 == UDP
  distinct(name, port, address, pid)
## # Source:   lazy query [?? x 4]
## # Database: OsqueryConnection
##    address name               pid   port
##  1 0.0.0.0 BetterTouchTool  46317  57183
##  2 0.0.0.0 Dropbox           1214  17500
##  3 0.0.0.0 SystemUIServer     429      0
##  4 0.0.0.0 SystemUIServer     429  62240
##  5 0.0.0.0 UserEventAgent     336      0
##  6 0.0.0.0 WiFiAgent          493      0
##  7 0.0.0.0 WiFiProxy          725      0
##  8 0.0.0.0 com.docker.vpnkit  732      0
##  9 0.0.0.0 identityservicesd  354      0
## 10 0.0.0.0 loginwindow        111      0
## # ... with more rows

The src_dbi() call wires up everything for us because d[b]plyr can rely on DBI doing its standard & consistent job and DBI can rely on the SQLite processing crunchy goodness of osqueryi to ultimately get us a list of really dangerous (if not firewalled off) processes that are listening on all network interfaces. (Note to self: find out why the BetterTouchTool and Dropbox authors feel the need to bind to 0.0.0.0…)

FIN

What did standards and consistency get us?

  • The osquery authors spent time solving a hard problem vs creating new data formats and protocols
  • Rectangular data (i.e. "data frame") provides consistency and structure which ends up causing more freedom
  • "Standard" SQL enables a consistent means to work with rectangular data
  • ssh normalizes (secure) access across systems with a consistent protocol
  • A robust, well-defined standard mechanism for working with SQL databases enabled nigh instantaneous wiring up of a whole new back-end to R
  • ssh and sys common idioms made working with the new back-end on remote systems as easy as is on a local system
  • Another robust, well-defined modern mechanism for working with rectangular data got wired up to this new back-end with (pretty much) one line of code because of the defined standard and expectation of consistency (and works for local and remote)

Standards and consistency are pretty darned cool.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

Extracting Specific Lines from a Large (Compressed) Text File

Sun, 05/27/2018 - 02:00

(This article was first published on R on Yixuan's Blog, and kindly contributed to R-bloggers)

A few days ago a friend asked me the following question: how do you efficiently extract some specific lines from a large text file, possibly compressed by Gzip? He mentioned that he tried some R functions such as read.table(skip = …), but found that reading the data was too slow. Hence he was looking for alternative ways of extracting the data.
This is a common task in preprocessing large data sets, since in data exploration, very often we want to peek at a small subset of the whole data to gain some insights.
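The post's own solution is on the author's blog (this feed only carries the introduction). As a baseline for comparison, here is one hedged sketch of the streaming approach such a task usually calls for: read the file through a connection in chunks and keep only the requested lines, so the whole file never has to sit in memory. The function name and chunk size are arbitrary choices for illustration.

# Baseline sketch (not the author's solution): stream through the (optionally
# gzipped) file and keep only the requested lines; returns them in sorted order.
extract_lines <- function(path, line_numbers, chunk_size = 100000L) {
  con <- if (grepl("\\.gz$", path)) gzfile(path, "r") else file(path, "r")
  on.exit(close(con))
  wanted <- sort(unique(line_numbers))
  out <- character(length(wanted))
  seen <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break
    idx <- wanted - seen                      # position of each wanted line in this chunk
    hit <- idx >= 1L & idx <= length(chunk)
    out[hit] <- chunk[idx[hit]]
    seen <- seen + length(chunk)
    if (seen >= max(wanted)) break            # nothing left to look for
  }
  out
}
# extract_lines("big_file.txt.gz", c(1, 500000, 1234567))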

To leave a comment for the author, please follow the link and comment on their blog: R on Yixuan's Blog.

RStudio:addins part 3 – View objects, files, functions and more with 1 keypress

Sat, 05/26/2018 - 16:00

(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)

Introduction

In this post in the RStudio:addins series we will try to make our work more efficient with an addin for better inspection of objects, functions and files within RStudio. RStudio already has a very useful View function and a Go To Function / File feature with F2 as the default keyboard shortcut. And yes, I know I promised automatic generation of @importFrom roxygen tags in the previous post; unfortunately we will have to wait a bit longer for that one, but I believe this one more than makes up for it in usefulness.

The addin we will create in this article will let us use RStudio to View and inspect a wide range of objects, functions and files with 1 keypress.

The addins in action

Contents
  1. Retrieving objects from sys.frames
  2. Viewing files, objects and functions and more efficiently
  3. The addin function, updating the .dcf file and key binding
  4. The addin in action
  5. TL;DR – Just give me the package
  6. References
Retrieving objects from sys.frames

As a first step, we need to be able to retrieve the value of the object we are looking for based on a character string from a frame within the currently present sys.frames() for our session. This may get tricky, as it is not sufficient to only look at parent frames, because we may easily have multiple sets of “parallel” call stacks, especially when executing addins.

An example can be seen in the following screenshot, where we have a browser() call executed during the Addin execution itself. We can see that our current frame is 18 and browsing through its parent would get us to frames 17 -> 16 -> 15 -> 14 -> 0 (0 being the .GlobalEnv). The object we are looking for is however most likely in one of the other frames (9 in this particular case):

Example of sys.frames

getFromSysframes <- function(x) {
  if (!(is.character(x) && length(x) == 1 && nchar(x) > 0)) {
    warning("Expecting a non-empty character of length 1. Returning NULL.")
    return(invisible(NULL))
  }
  validframes <- c(sys.frames()[-sys.nframe()], .GlobalEnv)
  res <- NULL
  for (i in validframes) {
    inherits <- identical(i, .GlobalEnv)
    res <- get0(x, i, inherits = inherits)
    if (!is.null(res)) {
      return(res)
    }
  }
  return(invisible(res))
}

Viewing files, objects, functions and more efficiently

As a second step, we write a function to actually view our object in RStudio. We have quite some flexibility here, so as a first shot we can do the following:

  1. Open a file if the selection (or the selection with quotes removed) is a path to an existing file. This is useful for viewing our scripts, data files, etc. even if they are not quoted, such as the links in your Rmd files
  2. Attempt to retrieve the object by the name and if found, try to use View to view it
  3. If we did not find the object, we can optionally still try to retrieve the value by evaluating the provided character string. This carries some pitfalls, but is very useful for example for
    • viewing elements of lists, vectors, etc. where we need to evaluate [, [[ or $ to do so.
    • viewing operation results directly in the viewer, as opposed to writing them out into the console, useful for example for wide matrices that (subjectively) look better in the RStudio viewer, compared to the console output
  4. If the View fails, we can still show useful information by trying to View its structure, enabling us to inspect objects that cannot be coerced to a data.frame and therefore would fail to be viewed.
viewObject <- function(chr, tryEval = getOption("jhaddins_view_tryeval", default = TRUE)) {
  if (!(is.character(chr) && length(chr) == 1 && nchar(chr) > 0)) {
    message("Invalid input, expecting a non-empty character of length 1")
    return(invisible(1L))
  }
  ViewWrap <- get("View", envir = as.environment("package:utils"))
  # maybe it is an unquoted filename - if so, open it
  if (file.exists(chr)) {
    rstudioapi::navigateToFile(chr)
    return(invisible(0L))
  }
  # or maybe it is a quoted filename - if so, open it
  if (file.exists(gsub("\"", "", chr, fixed = TRUE))) {
    rstudioapi::navigateToFile(gsub("\"", "", chr, fixed = TRUE))
    return(invisible(0L))
  }
  obj <- getFromSysframes(chr)
  if (is.null(obj)) {
    if (isTRUE(tryEval)) {
      # object not found, try evaluating
      try(obj <- eval(parse(text = chr)), silent = TRUE)
    }
    if (is.null(obj)) {
      message(sprintf("Object %s not found", chr))
      return(invisible(1L))
    }
  }
  # try to View, capturing output for potential errors
  Viewout <- utils::capture.output(ViewWrap(obj, title = chr))
  if (length(Viewout) > 0 && grepl("Error", Viewout)) {
    # could not view, try to at least View the str of the object
    strcmd <- sprintf("str(%s)", chr)
    message(paste(Viewout, "| trying to View", strcmd))
    ViewWrap(utils::capture.output(utils::str(obj)), title = strcmd)
  }
  return(invisible(0L))
}

This function can of course be improved and updated in many ways, for example using the summary method instead of str for selected object classes, or showing contents of .csv (or other data) files already read into a data.frame.
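As a small illustration of the first suggestion, a possible refinement could dispatch to summary for selected classes before falling back to str. This helper is hypothetical and not part of the original addin; the chosen classes are just an example.

viewFallback <- function(obj, title) {
  # hypothetical helper: prefer summary() over str() for model-type objects
  ViewWrap <- get("View", envir = as.environment("package:utils"))
  if (inherits(obj, c("lm", "glm"))) {
    ViewWrap(utils::capture.output(summary(obj)), title = paste("summary:", title))
  } else {
    ViewWrap(utils::capture.output(utils::str(obj)), title = paste("str:", title))
  }
}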

The addin function, updating the .dcf file and key binding

If you followed the previous posts in the series, you most likely already know what is coming up next. First, we need a function serving as a binding for the addin that will execute our viewObject function on the active document’s selections:

viewSelection <- function() {
  context <- rstudioapi::getActiveDocumentContext()
  lapply(X = context[["selection"]],
         FUN = function(thisSel) {
           viewObject(thisSel[["text"]])
         })
  return(invisible(NULL))
}

Secondly, we update the inst/rstudio/addins.dcf file by adding the binding for the newly created addin:

Name: viewSelection
Description: Tries to use View to View the object defined by a text selected in RStudio
Binding: viewSelection
Interactive: false

Finally, we re-install the package and assign the keyboard shortcut in the Tools -> Addins -> Browse Addins... -> Keyboard Shortcuts... menu. Personally I assigned a single F4 keystroke for this, as I use it very often:

Assigning a keyboard shortcut to use the Addin

The addin in action

Now, let’s view a few files, a data.frame, a function and a try-error class object by just pressing F4.

TL;DR – Just give me the package

References



Tips for great graphics

Sat, 05/26/2018 - 09:13

(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

R is a great program for generating top-notch graphics. But to get the best out of it, you need to put in a little more work. Here are a few tips for adapting your R graphics to make them look a little better.

1) Don't use the “File/Save as…/” menu.

If you set up your graphic properly in the first place then there's no need to post-process (e.g. crop, scale, etc.) the graphic in other software.

Use the graphic devices (jpeg(), tiff(), postscript(), etc), set your height and width to whatever you want the finished product to be and then create the graph.

tiff("~/Manuscript/Figs/Fig1.tiff", width =2, height =2, units ="in", res = 600) plot(dist ~ speed, cars) # cars is a base R dataset - data(cars) dev.off()

The first argument to a graphic device such as tiff or jpeg is the filename, to which you can include the path. So that I don't have to worry about the order of arguments, I include the argument names. Width and height specify the width and height of the graphic in the units provided, in this case inches, but pixels, cm or mm can also be used. The resolution tells R how high the quality of the graphic should be; the higher the better, but if you go too high, you could find that you have problems running out of space. I find 600 a nice compromise. Nice crisp lines, smooth curves and sharp letters. You can then import that straight into MS Word or whatever other word processor you use, or upload it to go with that manuscript you've worked so hard on. Although, you could find yourself with BIG files. A 4×6 inch figure I made recently was 17.1MB! And for the web, a resolution of 100 or 200 is probably enough.
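For instance, a lower-resolution PNG for a web page might look like the following; the filename and values here are just assumptions for illustration.

png("fig1_web.png", width = 4, height = 3, units = "in", res = 150)
plot(dist ~ speed, cars)
dev.off()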

This technique also provides you with the same output every time, which is not the case if you adjust the size of the default device window produced by plot()

2) Don't be afraid to change the default settings!

Personally, I find that a 1 inch margin at the base of the graphic is a bit generous. I also find the ticks a bit long and the gap between tick and the axis label a bit big. Luckily, these things are easy to change!

jpeg("t1.jpeg", width=3, height=3, units="in", res=100) plot(dist~speed, cars) dev.off()

The above code produces this figure.

See what I mean about the margins?

Here's how to change it!

par(mai = c(0.5, 0.5, 0.1, 0.1))
plot(dist ~ speed, cars, tck = -0.01, las = 1, mgp = c(1.4, 0.2, 0))

That call to par changes the “MArgin in Inches” setting. The tck argument to par deals with TiCK length (negative for outside, positive for inside) while mgp controls on which line certain things are printed (titles are the first argument, labels are the second and the axis itself is the third). The las argument controls the rotation of the labels (1 for horizontal, 2 for perpendicular and 3 for vertical).

This leads me nicely to number 3: Don't be afraid to have separate lines for different parts of your plot.

This allows far more freedom and flexibility than setting par arguments in the plot call. You can have different mgp settings for each axis, for instance.

par(mai = c(0.4, 0.5, 0.1, 0.1))
plot(dist ~ speed, cars, xaxt = "n", mgp = c(1.4, 0.2, 0), las = 1, tck = -0.01)
axis(side = 1, tck = -0.01, las = 1, mgp = c(0.5, 0.2, 0))
mtext("speed", side = 1, line = 1)

This plots the same graph, but allows different distances for the x and y axes, in terms of margin and where the title is situated. The axis function places an axis on the side determined by its side argument and mtext places Margin TEXT, again at the side in its argument and in this case on line 1.



How to plot with patchwork

Fri, 05/25/2018 - 15:18

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

INTRODUCTION

The goal of patchwork is to make it simple to combine separate ggplots into the same graphic. As such it tries to solve the same problem as gridExtra::grid.arrange() and cowplot::plot_grid but using an API that incites exploration and iteration.

Installation
You can install patchwork from github with:

# install.packages("devtools")
devtools::install_github("thomasp85/patchwork")

The usage of patchwork is simple: just add plots together!

library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars) + geom_point(aes(mpg, disp))
p2 <- ggplot(mtcars) + geom_boxplot(aes(gear, disp, group = gear))

p1 + p2

you are of course free to also add the plots together as part of the same plotting operation:

ggplot(mtcars) +
geom_point(aes(mpg, disp)) +
ggplot(mtcars) +
geom_boxplot(aes(gear, disp, group = gear))

layouts can be specified by adding a plot_layout() call to the assemble. This lets you define the dimensions of the grid and how much space to allocate to the different rows and columns

p1 + p2 + plot_layout(ncol = 1, heights = c(3, 1))

If you need to add a bit of space between your plots you can use plot_spacer() to fill a cell in the grid with nothing

p1 + plot_spacer() + p2

You can make nested plots layout by wrapping part of the plots in parentheses – in these cases the layout is scoped to the different nesting levels

p3 <- ggplot(mtcars) + geom_smooth(aes(disp, qsec))
p4 <- ggplot(mtcars) + geom_bar(aes(carb))

p4 + {
p1 + {
p2 +
p3 +
plot_layout(ncol = 1)
}
} +
plot_layout(ncol = 1)

Advanced features
In addition to adding plots and layouts together, patchwork defines some other operators that might be of interest. The - operator will behave like + but put the left and right side in the same nesting level (as opposed to putting the right side into the left side's nesting level). Observe:

p1 + p2 + p3 + plot_layout(ncol = 1)

this is basically the same as without braces (just like standard math arithmetic) – the plots are added sequentially to the same nesting level. Now look:

p1 + p2 - p3 + plot_layout(ncol = 1)

Now p1 + p2 and p3 are on the same nesting level.

Often you are interested in just putting plots besides or on top of each other. patchwork provides both | and / for horizontal and vertical layouts respectively. They can of course be combined for a very readable layout syntax:

(p1 | p2 | p3) /
p4

There are two additional operators that are used for a slightly different purpose, namely to reduce code repetition. Consider the case where you want to change the theme for all plots in an assemble. Instead of modifying all plots individually you can use & or * to add elements to all subplots. The two differ in that * will only affect the plots on the current nesting level:

(p1 + (p2 + p3) + p4 + plot_layout(ncol = 1)) * theme_bw()

whereas & will recurse into nested levels:

p1 + (p2 + p3) + p4 + plot_layout(ncol = 1) & theme_bw()

Note that parentheses are required in the former case due to the higher precedence of the * operator. The latter case is the most common, so it has been given the easiest syntax.

Related exercise sets:
  1. How to Display Multivariate Relationship Graphs With Lattice
  2. Lattice Exercises
  3. Explore all our (>1000) R exercises
  4. Find an R course using our R Course Finder directory


Programmatically creating text output in R – Exercises

Fri, 05/25/2018 - 14:01

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In the age of Rmarkdown and Shiny, or when making any custom output from your data, you want your output to look consistent and neat. Also, when writing your output you often want it to obtain a specific (decorative) format defined by the html or LaTeX engine. These exercises are an opportunity to refresh our memory of functions such as paste, sprintf, formatC and others that are convenient tools for achieving these ends. All of the solutions rely partly on the ultra-flexible sprintf(), but there are no doubt many ways to solve the exercises with other functions; feel free to share your solutions in the comment section.

Example solutions are available here.
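As a quick reminder of the kind of formatting these functions offer, here are a few generic illustrations of my own (not the official solutions):

sprintf("$%.2f", 14.3409)            # "$14.34"       - fixed number of decimals
sprintf("file_%03d.txt", 25)         # "file_025.txt" - zero-padded integers
formatC(25, width = 3, flag = "0")   # "025"          - the same padding via formatC
sprintf("%10s", "92.13%")            # "    92.13%"   - right-justified in a 10-character field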

Exercise 1

Print out the following vector as prices in dollars (to the nearest cent):
c(14.3409087337707, 13.0648270623048, 3.58504267621646, 18.5077076398145,
16.8279241011882). Example: $14.34

Exercise 2

Using these numbers c(25, 7, 90, 16) make a vector of filenames in the following format: file_025.txt. That is, left pad the numbers so they are all three digits.

Exercise 3

Actually, if we are only dealing with numbers less than a hundred, file_25.txt would have been enough. Change the code from the last exercise so that the padding is programmatically decided by the biggest number in the vector.

Exercise 4

Print out the following haiku on three lines, right aligned, with the help of cat. c("Stay the patient course.", "Of little worth is your ire.", "The network is down.").

Exercise 5

Write a function that converts a number to its hexadecimal representation. This is a useful skill when converting bmp colors from one representation to another. Example output:

tohex(12) [1] "12 is c in hexadecimal"

Exercise 6

Take a string and programmatically surround it with the html header tag h1

Exercise 7

Back to the poem from exercise 4, let R convert to html unordered list. So that it would appear like the following in a browser:

  • Stay the patient course.
  • Of little worth is your ire.
  • The network is down.

Exercise 8

Here is a list of the movies currently at the top of the imdb.com ratings: c("The Shawshank Redemption", "The Godfather", "The Godfather: Part II", "The Dark Knight", "12 Angry Men", "Schindler's List"). Convert them into a list compatible with written text.

Example output:

[1] "The top ranked films on imdb.com are The Shawshank Redemption, The Godfather, The Godfather: Part II, The Dark Knight, 12 Angry Men and Schindler's List"

Exercise 9

Now you should be able to solve this quickly: write a function that converts a proportion to a percentage that takes as input the number of decimal places. Input of 0.921313 and 2 decimal places should return "92.13%"

Exercise 10

Improve the function from the last exercise so that the percentage consistently takes up 10 characters by doing some left padding. Raise an error if the percentage already happens to be longer than 10 characters.

(Image by Daniel Friedman).

Related exercise sets:
  1. Parallel Computing Exercises: Foreach and DoParallel (Part-2)
  2. Parallel Computing Exercises: Snow and Rmpi (Part-3)
  3. Building Shiny App exercises part 4
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory


Slopegraphs and R – A pleasant diversion – May 26, 2018

Fri, 05/25/2018 - 02:00

(This article was first published on Chuck Powell, and kindly contributed to R-bloggers)

I try to at least scan the R-bloggers
feed every day. Not every article is of interest to me, but I often have
one of two different reactions to at least one article. Sometimes it is
an “ah ha” moment because the article is right on point for a problem
I have now or have had in the past and the article provides a (better)
solution. Other times my reaction is more of an “oh yeah”, because it
is something I have been meaning to investigate, or something I once
knew, but the article brings a different perspective to it.

The second case happened to me this week. I’ve been aware of slopegraphs
and bumpcharts for quite some time, and I certainly am aware of Tufte’s
work
. As an amateur military
historian I’ve always loved, for example, his
poster
depicting Napoleon’s
Russian Campaign. So when I saw the article from Murtaza
Haider
titled
“Edward Tufte’s Slopegraphs and political fortunes in Ontario” I
just had to take a peek and revisit the topic.

The article does a good job of looking at slopegraphs in both R (via
plotrix) and Stata, even providing the code to do the work. My
challenge was that even though I’m adequate at plotting in base R, I
much prefer using ggplot2 wherever and whenever possible. My memory
was that I had seen another article on the related topic of a
bumpchart on R-bloggers in the not too distant past. A little
sleuthing turned up this earlier
article
from Dominik
Koch
who wrote some code to
compare national performance at the Winter Olympics, “Bump Chart –
Track performance over time”
.

Finally, I wound up at this Github
repository
for a project called
“Edward Tufte-Inspired Slopegraphs” from Thomas J.
Leeper
who has been building code to make
slopegraphs using both base plotting functions and ggplot2.

My post today will draw a little bit from all their work and hopefully
provide some useful samples for others to draw on if they share some of
my quirks about data layout and a preference for ggplot2 versus base
plot. I’m going to focus almost exclusively on slopegraphs, although
much of the work could be extended to bumpcharts as well.

Setup and library loading

We’re going to make occasional use of dplyr to manipulate the data,
extensive use of ggplot2 to do the plotting and ggrepel to solve one
specific labeling problem. We’ll load them and I am suppressing the
message from dplyr about namespace overrides.

require(dplyr)
require(ggplot2)
require(ggrepel)
require(kableExtra)

Politics in Ontario

The original
post

is about plotting the data from some polling results in Ontario. For the
reader’s convenience I’ve made the data available via a structure
command. We have data about two different polling dates, for 5 political
parties, and the measured variable is percent of people supporting
expressed as x.x (i.e. already multiplied by
100).

data <- structure(list( Date = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("11-May-18", "18-May-18"), class = "factor"), Party = structure(c(5L, 3L, 2L, 1L, 4L, 5L, 3L, 2L, 1L, 4L), .Label = c("Green", "Liberal", "NDP", "Others", "PC"), class = "factor"), Pct = c(42.3, 28.4, 22.1, 5.4, 1.8, 41.9, 29.3, 22.3, 5, 1.4)), class = "data.frame", row.names = c(NA, -10L)) str(data) ## 'data.frame': 10 obs. of 3 variables: ## $ Date : Factor w/ 2 levels "11-May-18","18-May-18": 1 1 1 1 1 2 2 2 2 2 ## $ Party: Factor w/ 5 levels "Green","Liberal",..: 5 3 2 1 4 5 3 2 1 4 ## $ Pct : num 42.3 28.4 22.1 5.4 1.8 41.9 29.3 22.3 5 1.4 head(data) ## Date Party Pct ## 1 11-May-18 PC 42.3 ## 2 11-May-18 NDP 28.4 ## 3 11-May-18 Liberal 22.1 ## 4 11-May-18 Green 5.4 ## 5 11-May-18 Others 1.8 ## 6 18-May-18 PC 41.9

Let’s just take the data as we have it and feed it to ggplot in a nice
simple fashion and see what we get with very little effort.

ggplot(data = data, aes(x = Date, y = Pct, group = Party)) + geom_line(aes(color = Party, alpha = 1), size = 2) + geom_point(aes(color = Party, alpha = 1), size = 4) + # Labelling as desired labs( title = "Voter's stated preferences for June 7 elections in Ontario", subtitle = "(Mainstreet Research)", caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/" )

The nice thing about ggplot is once you get used to the syntax it
becomes very “readable”. We’ve identified our dataset, the x & y
variables and our grouping variable. Lines too big? An adjustment to
size = 2 does it. Don’t like colors? Pull the color = Party clause.

So we’re already pretty close to what we need. Things are scaled
properly and the basic labeling of titles etc. is accomplished. Our
biggest “problem” is that ggplot has been a little too helpful and
adding some things we’d like to remove to give it a more “Tuftesque”
look. So what we’ll do in the next few steps is add lines of code – but
they are mainly designed to remove unwanted elements. This is in
contrast to a base plot where we have to write the code to add elements.

So lets:

  • Move the x axis labels to the top with scale_x_discrete(position =
    "top")
  • Change to a nice clean black and white theme theme_bw()
  • Not display any legend(s) theme(legend.position = "none")
  • Remove the default border from our plot theme(panel.border =
    element_blank())

ggplot(data = data, aes(x = Date, y = Pct, group = Party)) + geom_line(aes(color = Party, alpha = 1), size = 2) + geom_point(aes(color = Party, alpha = 1), size = 4) + # move the x axis labels up top scale_x_discrete(position = "top") + theme_bw() + # Format tweaks # Remove the legend theme(legend.position = "none") + # Remove the panel border theme(panel.border = element_blank()) + # Labelling as desired labs( title = "Voter's stated preferences for June 7 elections in Ontario", subtitle = "(Mainstreet Research)", caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/" )

Nice progress! Continuing to remove things that can be considered
“clutter” we add some additional lines that all end in
element_blank() and are invoked to remove default plot items such as
the plot grid, the y axis text, etc.

ggplot(data = data, aes(x = Date, y = Pct, group = Party)) + geom_line(aes(color = Party, alpha = 1), size = 2) + geom_point(aes(color = Party, alpha = 1), size = 4) + # move the x axis labels up top scale_x_discrete(position = "top") + theme_bw() + # Format tweaks # Remove the legend theme(legend.position = "none") + # Remove the panel border theme(panel.border = element_blank()) + # Remove just about everything from the y axis theme(axis.title.y = element_blank()) + theme(axis.text.y = element_blank()) + theme(panel.grid.major.y = element_blank()) + theme(panel.grid.minor.y = element_blank()) + # Remove a few things from the x axis and increase font size theme(axis.title.x = element_blank()) + theme(panel.grid.major.x = element_blank()) + theme(axis.text.x.top = element_text(size=12)) + # Remove x & y tick marks theme(axis.ticks = element_blank()) + # Labelling as desired labs( title = "Voter's stated preferences for June 7 elections in Ontario", subtitle = "(Mainstreet Research)", caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/" )

Very nice! We’re almost there! The “almost” is because now that we
have removed both the legend and all scales and tick marks we no longer
know who is who, and what the numbers are! Plus, I’m a little unhappy
with the way the titles are formatted, so we’ll play with that. Later,
I’ll get fancy but for now let’s just add some simple text labels on
the left and right to show the party name and their percentage. The code
geom_text(aes(label = Party)) will place the party name right on top
of the points that anchor either end of the line. If we make that
geom_text(aes(label = paste0(Party, " - ", Pct, "%"))) then we’ll get
labels that have both the party and the percent all neatly formatted,
but still right on top of the points that anchor the ends of the line.
hjust controls horizontal justification so if we change it to
geom_text(aes(label = paste0(Party, " - ", Pct, "%")), hjust = 1.35)
both sets of labels will slide to the left which is exactly what we want
for the May 11 labels but not the May 18 labels. If we feed hjust a
negative number they’ll go the other way. So what we’ll do is filter the
data using the filter function from dplyr and place the left hand
labels differently than the right hand labels. While we’re at it we’ll
make it bold face font and a little larger…

ggplot(data = data, aes(x = Date, y = Pct, group = Party)) + geom_line(aes(color = Party, alpha = 1), size = 2) + geom_point(aes(color = Party, alpha = 1), size = 4) + geom_text(data = data %>% filter(Date == "11-May-18"), aes(label = paste0(Party, " - ", Pct, "%")) , hjust = 1.35, fontface = "bold", size = 4) + geom_text(data = data %>% filter(Date == "18-May-18"), aes(label = paste0(Party, " - ", Pct, "%")) , hjust = -.35, fontface = "bold", size = 4) + # move the x axis labels up top scale_x_discrete(position = "top") + theme_bw() + # Format tweaks # Remove the legend theme(legend.position = "none") + # Remove the panel border theme(panel.border = element_blank()) + # Remove just about everything from the y axis theme(axis.title.y = element_blank()) + theme(axis.text.y = element_blank()) + theme(panel.grid.major.y = element_blank()) + theme(panel.grid.minor.y = element_blank()) + # Remove a few things from the x axis and increase font size theme(axis.title.x = element_blank()) + theme(panel.grid.major.x = element_blank()) + theme(axis.text.x.top = element_text(size=12)) + # Remove x & y tick marks theme(axis.ticks = element_blank()) + # Format title & subtitle theme(plot.title = element_text(size=14, face = "bold", hjust = 0.5)) + theme(plot.subtitle = element_text(hjust = 0.5)) + # Labelling as desired labs( title = "Voter's stated preferences for June 7 elections in Ontario", subtitle = "(Mainstreet Research)", caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/" )

Eureka! Not perfect yet but definitely looking good.

Adding complexity

I’m feeling pretty good about the solution so far but there are three
things I’d like to make better.

  1. How well will this solution work when we have more than two time
    periods? Need to make sure it generalizes to a more complex case.
  2. As Murtaza
    Haider
    notes in
    his post we’ll have issues if the data points are identical or very
    close together. Our very neat little labels will overlap each other.
    In his post I believe he mentions that he manually moved them in
    some cases. Let’s try and fix that.
  3. Oh my, that’s a lot of code to keep cutting and pasting, can we
    simplify?

To test #1 and #2 I have “invented” a new dataset called moredata.
It is fictional: it’s labelled May 25th, but today is actually May
24th. I created it to add a third polling date and to make sure that
we had a chance to test what happens when we have two identical
datapoints on the same day. Notice that on May 25th the polling numbers
for the Liberals and the NDP are identical at
26.8%.

moredata <- structure(list(Date = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("11-May-18", "18-May-18", "25-May-18"), class = "factor"), Party = structure(c(5L, 3L, 2L, 1L, 4L, 5L, 3L, 2L, 1L, 4L, 5L, 3L, 2L, 1L, 4L), .Label = c("Green", "Liberal", "NDP", "Others", "PC"), class = "factor"), Pct = c(42.3, 28.4, 22.1, 5.4, 1.8, 41.9, 29.3, 22.3, 5, 1.4, 41.9, 26.8, 26.8, 5, 1.4)), class = "data.frame", row.names = c(NA, -15L)) tail(moredata) ## Date Party Pct ## 10 18-May-18 Others 1.4 ## 11 25-May-18 PC 41.9 ## 12 25-May-18 NDP 26.8 ## 13 25-May-18 Liberal 26.8 ## 14 25-May-18 Green 5.0 ## 15 25-May-18 Others 1.4

You’ll notice at the beginning of this post I loaded the ggrepel
library. ggrepel works with ggplot2 to repel things that overlap,
in this case our geom_text labels. The invocation is geom_text_repel
and it is very similar to geom_text but allows us to deconflict the
overlaps. We’ll use hjust = "left" and hjust = "right" to control
justifying the labels. We’ll use a fixed nudge left and right nudge_x =
-.45 and nudge_x = .5 to move the labels left and right off the
plotted data points and we will explicitly tell geom_text_repel to
only move the labels vertically to avoid overlap with direction = "y".
Everything else remains the same.

ggplot(data = moredata, aes(x = Date, y = Pct, group = Party)) + geom_line(aes(color = Party, alpha = 1), size = 2) + geom_point(aes(color = Party, alpha = 1), size = 4) + geom_text_repel(data = moredata %>% filter(Date == "11-May-18"), aes(label = paste0(Party, " - ", Pct, "%")) , hjust = "left", fontface = "bold", size = 4, nudge_x = -.45, direction = "y") + geom_text_repel(data = moredata %>% filter(Date == "25-May-18"), aes(label = paste0(Party, " - ", Pct, "%")) , hjust = "right", fontface = "bold", size = 4, nudge_x = .5, direction = "y") + # move the x axis labels up top scale_x_discrete(position = "top") + theme_bw() + # Format tweaks # Remove the legend theme(legend.position = "none") + # Remove the panel border theme(panel.border = element_blank()) + # Remove just about everything from the y axis theme(axis.title.y = element_blank()) + theme(axis.text.y = element_blank()) + theme(panel.grid.major.y = element_blank()) + theme(panel.grid.minor.y = element_blank()) + # Remove a few things from the x axis and increase font size theme(axis.title.x = element_blank()) + theme(panel.grid.major.x = element_blank()) + theme(axis.text.x.top = element_text(size=12)) + # Remove x & y tick marks theme(axis.ticks = element_blank()) + # Format title & subtitle theme(plot.title = element_text(size=14, face = "bold", hjust = 0.5)) + theme(plot.subtitle = element_text(hjust = 0.5)) + # Labelling as desired labs( title = "Bogus Data", subtitle = "(Chuck Powell)", caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/" )

Very nice! We have confirmed that our solution works for more than two
dates without any additional changes and we have found a solution to the
label overlap issue. In a little while we’ll talk about labeling the
data points in the center (if we want to).

Before we move on let’s make our life a little simpler. While the output
plot is good it’s a lot of code to produce one graph. Let’s see if we
can simplify…

Since ggplot2 objects are just regular R objects, you can put them in a
list. This means you can apply all of R’s great functional programming
tools. For example, if you wanted to add different geoms to the same
base plot, you could put them in a list and use lapply().
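A quick hypothetical illustration of that idea (my own example, with ggplot2 loaded as above; not code from the post):

base <- ggplot(mtcars, aes(mpg, disp))
geoms <- list(geom_point(), geom_smooth(method = "lm"), geom_rug())
plots <- lapply(geoms, function(g) base + g)
# plots is now a list of three ggplot objects sharing the same base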

But for now let’s at least take all the invariant lines of code and put
them in a list. Then when we go to plot we can just invoke the list and
remain confident we get the right formatting. For now let’s name this
list something quaint and obvious like MySpecial.

MySpecial <- list( # move the x axis labels up top scale_x_discrete(position = "top"), theme_bw(), # Format tweaks # Remove the legend theme(legend.position = "none"), # Remove the panel border theme(panel.border = element_blank()), # Remove just about everything from the y axis theme(axis.title.y = element_blank()), theme(axis.text.y = element_blank()), theme(panel.grid.major.y = element_blank()), theme(panel.grid.minor.y = element_blank()), # Remove a few things from the x axis and increase font size theme(axis.title.x = element_blank()), theme(panel.grid.major.x = element_blank()), theme(axis.text.x.top = element_text(size=12)), # Remove x & y tick marks theme(axis.ticks = element_blank()), # Format title & subtitle theme(plot.title = element_text(size=14, face = "bold", hjust = 0.5)), theme(plot.subtitle = element_text(hjust = 0.5)) ) summary(MySpecial) ## Length Class Mode ## [1,] 17 ScaleDiscretePosition environment ## [2,] 57 theme list ## [3,] 1 theme list ## [4,] 1 theme list ## [5,] 1 theme list ## [6,] 1 theme list ## [7,] 1 theme list ## [8,] 1 theme list ## [9,] 1 theme list ## [10,] 1 theme list ## [11,] 1 theme list ## [12,] 1 theme list ## [13,] 1 theme list ## [14,] 1 theme list

MySpecial is actually an incredibly complex structure so I used the
summary function. What’s important to us is that in the future all we
need to do is include it in the ggplot command and magic happens.
Perhaps another day I’ll make it a proper function but for now I can
change little things like line size or titles and labels without
worrying about the rest. So here it is with some little things changed.

ggplot(data = moredata, aes(x = Date, y = Pct, group = Party)) + geom_line(aes(color = Party, alpha = 1), size = 1) + geom_point(aes(color = Party, alpha = 1), size = 3) + geom_text_repel(data = moredata %>% filter(Date == "11-May-18"), aes(label = paste0(Party, " : ", Pct, "%")) , hjust = "left", fontface = "bold", size = 4, nudge_x = -.45, direction = "y") + geom_text_repel(data = moredata %>% filter(Date == "25-May-18"), aes(label = paste0(Party, " : ", Pct, "%")) , hjust = "right", fontface = "bold", size = 4, nudge_x = .5, direction = "y") + MySpecial + labs( title = "Bogus Data", subtitle = "(Chuck Powell)", caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/" )
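As an aside, here is a sketch of what the “proper function” mentioned above might look like (my own suggestion, not the author’s code; the function name and arguments are assumptions, and it assumes ggplot2 is loaded as above):

slopegraph_theme <- function(x_text_size = 12, title_size = 14) {
  # returns a list of ggplot2 components, in the same spirit as MySpecial above
  list(
    scale_x_discrete(position = "top"),
    theme_bw(),
    theme(legend.position = "none",
          panel.border = element_blank(),
          axis.title = element_blank(),
          axis.text.y = element_blank(),
          panel.grid = element_blank(),
          axis.text.x.top = element_text(size = x_text_size),
          axis.ticks = element_blank(),
          plot.title = element_text(size = title_size, face = "bold", hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))
  )
}
# usage: ggplot(...) + geom_line(...) + slopegraph_theme() + labs(...)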

Even more complex

Feeling good about the solution so far I decided to press on to a much
more complex problem. Thomas J. Leeper has
a nice plot of Tufte’s Cancer survival
slopegraph.

N.B. the original Tufte version is not accurate on the vertical scale:
look at Prostate and Thyroid, for example, since visually I would argue
they should cross to reflect the data.

Let’s grab the data as laid out by
Tufte.

cancer <- structure(list(Year.5 = c(99, 96, 95, 89, 86, 85, 84, 82, 71, 69, 63, 62, 62, 58, 57, 55, 43, 32, 30, 24, 15, 14, 8, 4), Year.10 = c(95, 96, 94, 87, 78, 80, 83, 76, 64, 57, 55, 54, 55, 46, 46, 49, 32, 29, 13, 19, 11, 8, 6, 3), Year.15 = c(87, 94, 91, 84, 71, 74, 81, 70, 63, 46, 52, 50, 54, 38, 38, 50, 30, 28, 7, 19, 7, 8, 6, 3), Year.20 = c(81, 95, 88, 83, 75, 67, 79, 68, 60, 38, 49, 47, 52, 34, 33, 50, 26, 26, 5, 15, 6, 5, 8, 3)), class = "data.frame", row.names = c("Prostate", "Thyroid", "Testis", "Melanomas", "Breast", "Hodgkin's", "Uterus", "Urinary", "Cervix", "Larynx", "Rectum", "Kidney", "Colon", "Non-Hodgkin's", "Oral", "Ovary", "Leukemia", "Brain", "Multiple myeloma", "Stomach", "Lung", "Esophagus", "Liver", "Pancreas")) str(cancer) ## 'data.frame': 24 obs. of 4 variables: ## $ Year.5 : num 99 96 95 89 86 85 84 82 71 69 ... ## $ Year.10: num 95 96 94 87 78 80 83 76 64 57 ... ## $ Year.15: num 87 94 91 84 71 74 81 70 63 46 ... ## $ Year.20: num 81 95 88 83 75 67 79 68 60 38 ... kable(head(cancer,10)) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

             Year.5  Year.10  Year.15  Year.20
Prostate         99       95       87       81
Thyroid          96       96       94       95
Testis           95       94       91       88
Melanomas        89       87       84       83
Breast           86       78       71       75
Hodgkin’s        85       80       74       67
Uterus           84       83       81       79
Urinary          82       76       70       68
Cervix           71       64       63       60
Larynx           69       57       46       38

There, we have it in a neat data frame but not organized as we need it.
Not unusual, and an opportunity to use some other tools from broom and
reshape2. Let’s do the following:

  1. Let’s transpose the data with t
  2. Let’s use broom::fix_data_frame to get valid column names and
    convert rownames to a proper column all in one function. Right now
    the types of cancer are nothing but rownames.
  3. Use reshape2::melt to take our transposed dataframe and convert it
    to long format so we can send it off to ggplot. Along the way
    we’ll rename the resulting dataframe newcancer with columns
    named Year, Type and Survival.

# stepping through for demonstration purposes t(cancer) # returns a matrix ## Prostate Thyroid Testis Melanomas Breast Hodgkin's Uterus Urinary ## Year.5 99 96 95 89 86 85 84 82 ## Year.10 95 96 94 87 78 80 83 76 ## Year.15 87 94 91 84 71 74 81 70 ## Year.20 81 95 88 83 75 67 79 68 ## Cervix Larynx Rectum Kidney Colon Non-Hodgkin's Oral Ovary ## Year.5 71 69 63 62 62 58 57 55 ## Year.10 64 57 55 54 55 46 46 49 ## Year.15 63 46 52 50 54 38 38 50 ## Year.20 60 38 49 47 52 34 33 50 ## Leukemia Brain Multiple myeloma Stomach Lung Esophagus Liver ## Year.5 43 32 30 24 15 14 8 ## Year.10 32 29 13 19 11 8 6 ## Year.15 30 28 7 19 7 8 6 ## Year.20 26 26 5 15 6 5 8 ## Pancreas ## Year.5 4 ## Year.10 3 ## Year.15 3 ## Year.20 3 broom::fix_data_frame( t(cancer), newcol = "Year") # make it a dataframe with Year as a proper column ## Year Prostate Thyroid Testis Melanomas Breast Hodgkin.s Uterus ## 1 Year.5 99 96 95 89 86 85 84 ## 2 Year.10 95 96 94 87 78 80 83 ## 3 Year.15 87 94 91 84 71 74 81 ## 4 Year.20 81 95 88 83 75 67 79 ## Urinary Cervix Larynx Rectum Kidney Colon Non.Hodgkin.s Oral Ovary ## 1 82 71 69 63 62 62 58 57 55 ## 2 76 64 57 55 54 55 46 46 49 ## 3 70 63 46 52 50 54 38 38 50 ## 4 68 60 38 49 47 52 34 33 50 ## Leukemia Brain Multiple.myeloma Stomach Lung Esophagus Liver Pancreas ## 1 43 32 30 24 15 14 8 4 ## 2 32 29 13 19 11 8 6 3 ## 3 30 28 7 19 7 8 6 3 ## 4 26 26 5 15 6 5 8 3 reshape2::melt( broom::fix_data_frame( t(cancer), newcol = "Year"), id="Year", variable.name="Type", value.name = "Survival") # melt it to long form ## Year Type Survival ## 1 Year.5 Prostate 99 ## 2 Year.10 Prostate 95 ## 3 Year.15 Prostate 87 ## 4 Year.20 Prostate 81 ## 5 Year.5 Thyroid 96 ## 6 Year.10 Thyroid 96 ## 7 Year.15 Thyroid 94 ## 8 Year.20 Thyroid 95 ## 9 Year.5 Testis 95 ## 10 Year.10 Testis 94 ## 11 Year.15 Testis 91 ## 12 Year.20 Testis 88 ## 13 Year.5 Melanomas 89 ## 14 Year.10 Melanomas 87 ## 15 Year.15 Melanomas 84 ## 16 Year.20 Melanomas 83 ## 17 Year.5 Breast 86 ## 18 Year.10 Breast 78 ## 19 Year.15 Breast 71 ## 20 Year.20 Breast 75 ## 21 Year.5 Hodgkin.s 85 ## 22 Year.10 Hodgkin.s 80 ## 23 Year.15 Hodgkin.s 74 ## 24 Year.20 Hodgkin.s 67 ## 25 Year.5 Uterus 84 ## 26 Year.10 Uterus 83 ## 27 Year.15 Uterus 81 ## 28 Year.20 Uterus 79 ## 29 Year.5 Urinary 82 ## 30 Year.10 Urinary 76 ## 31 Year.15 Urinary 70 ## 32 Year.20 Urinary 68 ## 33 Year.5 Cervix 71 ## 34 Year.10 Cervix 64 ## 35 Year.15 Cervix 63 ## 36 Year.20 Cervix 60 ## 37 Year.5 Larynx 69 ## 38 Year.10 Larynx 57 ## 39 Year.15 Larynx 46 ## 40 Year.20 Larynx 38 ## 41 Year.5 Rectum 63 ## 42 Year.10 Rectum 55 ## 43 Year.15 Rectum 52 ## 44 Year.20 Rectum 49 ## 45 Year.5 Kidney 62 ## 46 Year.10 Kidney 54 ## 47 Year.15 Kidney 50 ## 48 Year.20 Kidney 47 ## 49 Year.5 Colon 62 ## 50 Year.10 Colon 55 ## 51 Year.15 Colon 54 ## 52 Year.20 Colon 52 ## 53 Year.5 Non.Hodgkin.s 58 ## 54 Year.10 Non.Hodgkin.s 46 ## 55 Year.15 Non.Hodgkin.s 38 ## 56 Year.20 Non.Hodgkin.s 34 ## 57 Year.5 Oral 57 ## 58 Year.10 Oral 46 ## 59 Year.15 Oral 38 ## 60 Year.20 Oral 33 ## 61 Year.5 Ovary 55 ## 62 Year.10 Ovary 49 ## 63 Year.15 Ovary 50 ## 64 Year.20 Ovary 50 ## 65 Year.5 Leukemia 43 ## 66 Year.10 Leukemia 32 ## 67 Year.15 Leukemia 30 ## 68 Year.20 Leukemia 26 ## 69 Year.5 Brain 32 ## 70 Year.10 Brain 29 ## 71 Year.15 Brain 28 ## 72 Year.20 Brain 26 ## 73 Year.5 Multiple.myeloma 30 ## 74 Year.10 Multiple.myeloma 13 ## 75 Year.15 Multiple.myeloma 7 ## 76 Year.20 Multiple.myeloma 5 ## 77 Year.5 Stomach 24 ## 78 Year.10 Stomach 19 ## 79 Year.15 Stomach 19 ## 80 Year.20 
Stomach 15 ## 81 Year.5 Lung 15 ## 82 Year.10 Lung 11 ## 83 Year.15 Lung 7 ## 84 Year.20 Lung 6 ## 85 Year.5 Esophagus 14 ## 86 Year.10 Esophagus 8 ## 87 Year.15 Esophagus 8 ## 88 Year.20 Esophagus 5 ## 89 Year.5 Liver 8 ## 90 Year.10 Liver 6 ## 91 Year.15 Liver 6 ## 92 Year.20 Liver 8 ## 93 Year.5 Pancreas 4 ## 94 Year.10 Pancreas 3 ## 95 Year.15 Pancreas 3 ## 96 Year.20 Pancreas 3 # all those steps in one long line saved to a new dataframe newcancer <- reshape2::melt(broom::fix_data_frame(t(cancer), newcol = "Year"), id="Year", variable.name="Type", value.name = "Survival")

Now we have whipped the data into the shape we need it. 96 rows with the
three columns we want to plot, Year, Type, and Survival. If you
look at the data though, you’ll notice two small faults. First, Year
is not a factor. The plot will work but have an annoying limitation.
Since “Year.5” is a character string it will be ordered after all the
other years. We could fix that on the fly within our ggplot call but I
find it cleaner and more understandable if I take care of that first.
I’ll use the factor function from base R to accomplish that and
while I’m at it make the values nicer looking. Second in three cases R
changed cancer type names because they couldn’t be column names in a
dataframe. I’ll use forcats::fct_recode to make them look better.

newcancer$Year <- factor(newcancer$Year, levels = c("Year.5", "Year.10", "Year.15", "Year.20"), labels = c("5 Year","10 Year","15 Year","20 Year"), ordered = TRUE) newcancer$Type <- forcats::fct_recode(newcancer$Type, "Hodgkin's" = "Hodgkin.s", "Non-Hodgkin's" = "Non.Hodgkin.s", "Multiple myeloma" = "Multiple.myeloma") head(newcancer) ## Year Type Survival ## 1 5 Year Prostate 99 ## 2 10 Year Prostate 95 ## 3 15 Year Prostate 87 ## 4 20 Year Prostate 81 ## 5 5 Year Thyroid 96 ## 6 10 Year Thyroid 96

Now that we have the data the way we want it we can make our slopegraph.
Some of the necessary changes are obvious x = Year, y = Survival and
group = Type for example. Since there are a lot of plotted lines I’ve
reduced the weight or size of the individual lines. We no longer want to
plot the big round points, we’re going to substitute in the actual
numbers, so that line gets commented out. The left and right labels
require no change and geom_text_repel will keep them from overlapping
which is almost inevitable given the data. To put the actual survival
numbers on the plot we’ll turn to geom_label. It’s like geom_text
only it puts a label box around the text. We’ll choose a smallish size,
minimize the amount of padding, and make the border of the box
invisible. The end result is what we want. It overlays on top of the
lines we’ve already plotted and the invisible padding gives us just
enough room.

ggplot(data = newcancer, aes(x = Year, y = Survival, group = Type)) + geom_line(aes(color = Type, alpha = 1), size = 1) + # geom_point(aes(color = Type, alpha = .1), size = 4) + geom_text_repel(data = newcancer %>% filter(Year == "5 Year"), aes(label = Type) , hjust = "left", fontface = "bold", size = 3, nudge_x = -.45, direction = "y") + geom_text_repel(data = newcancer %>% filter(Year == "20 Year"), aes(label = Type) , hjust = "right", fontface = "bold", size = 3, nudge_x = .5, direction = "y") + geom_label(aes(label = Survival), size = 2.5, label.padding = unit(0.05, "lines"), label.size = 0.0) + MySpecial + labs( title = "Estimates of Percent Survival Rates", subtitle = "Based on: Edward Tufte, Beautiful Evidence, 174, 176.", caption = "https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003nk" )

Done for now

I hope you’ve found this useful. I am always open to comments,
corrections and suggestions.

Chuck (ibecav at gmail dot com)

License


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.



Safe Disposal of Unexploded WWII Bombs

Fri, 05/25/2018 - 00:00

(This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)

Abstract

Unexploded WWII bombs are ticking threats despite being dropped more than 70 years ago. In this post we explain how statistical methods are used to plan the search and disposal of unexploded WWII bombs. In particular we consider and exemplify the non-parametric nearest neighbour distance (NND) method implemented in the R package highriskzone. The method analyses the spatial pattern of exploded bombs to determine so called risk-zones, that is regions with a high likelihood of containing unexploded bombs. The coverage of such risk-zones is investigated through both non-parametric and parametric point process simulation.



NCAP aerial photo from 1944 showing the bombing of the V2 rocket facility at Peenemünde, Germany. Image is available under a custom NCAP license – higher resolution images are available from NCAP.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.

\[
\newcommand{\bm}[1]{\boldsymbol{\mathbf{#1}}}
\]

Introduction

During WWII Germany was pounded with about 1.5 mio tons of bombs of which about 10-15% did not explode. More than 70 years after the end of WWII these unexploded bombs (UXBs) still pose a threat and are the frequent cause of large scale evacuations to secure their safe disposal when found. Luckily, lethal incidents are rare thanks to a huge effort to localise and safely dismantle UXBs. As part of this effort, aerial photos taken by the allies after the attacks provide valuable information about the possible locations of UXBs. Some UXBs are directly visible in the photos – see for example the green circles in this NCAP image or p. 6 in the following information flyer by one of the companies offering such aerial identification services (featured in this news article). In other cases the photos only provide information about the location of the exploded bombs. This information can be used to identify areas where there is a high likelihood of UXBs. Such areas would then be carefully scrutinized using on-the-ground search methods, for example, electromagnetic and magnetic detectors.

The aim of Mahling, Höhle, and Küchenhoff (2013) was to develop statistical methods for the identification of such risk-zones in co-operation with Oberfinanzdirektion Niedersachsen, which supports the removal of unexploded ordnance in federal properties in Germany. In what follows we will discuss one of the methods used in the paper, the so called nearest neighbourhood distance method and illustrate its implementation in the R package highriskzone originally created by Heidi Seibold and now maintained by Felix Günther.

Mathematical Setup

Casting matters into mathematical notation: Let \(X\) be a point process denoting the spatial locations of all bombs dropped in the particular window of interest \(W \subseteq \mathbb{R}^2\). Furthermore, let \(Y\) denote the observed point process of exploded bombs and \(Z=X\backslash Y\) the point process of unexploded bombs. Note that only the process \(Y\) is observed; \(Z\) is not observed and is the target of our inference. We assume that the probability \(q\) of a dropped bomb not exploding is homogeneous in \(W\). Thus if \(X\) is an inhomogeneous Poisson point process with intensity function \(\lambda_X(\bm{s})\), \(\bm{s}\in W\), then

\[
\lambda_Y(\bm{s}) = (1-q) \lambda_X(\bm{s})
\quad \text{and}\quad
\lambda_Z(\bm{s}) = q \lambda_X(\bm{s})
\]

are the intensity functions of \(Y\) and \(Z\), respectively.

The figure below shows \(Y\) for an actual observed WWII bomb point pattern consisting of n=443 bombs available from the R package highriskzone (Seibold et al. 2017). The observation window contains a particular area of interest for which a risk assessment needs to be done – often these contain a known WWII military target, e.g., an airport, an arms factory or a military casern. In order not to disclose the exact location of the considered area, coordinates are given relative to an arbitrary origin. In reality one would closely link such digitized aerial images to other terrain features using a GIS system.

library(highriskzone)
library(spatstat)
data(craterA)
summary(craterA)
## Planar point pattern: 443 points
## Average intensity 0.0001082477 points per square unit
##
## Coordinates are given to 4 decimal places
##
## Window: polygonal boundary
## single connected closed polygon with 208 vertices
## enclosing rectangle: [0, 2334.3758] x [0, 2456.4061] units
## Window area = 4092470 square units
## Fraction of frame area: 0.714

The point pattern craterA corresponding to an instance of the process \(Y\) is provided in R as an object of class ppp from the R package spatstat (Baddeley, Rubak, and Turner 2015). Instead of inferring the locations in \(Z\) directly, we shall be interested in determining a region within \(W\), a so called high risk zone, which with a high likelihood contains the points of \(Z\). We shall consider two methods for this job: a non-parametric method based on nearest neighbour distances in \(Y\) and an intensity function based inhomogeneous Poisson process approach incorporating \(q\).

High Risk Zones

A heuristic way to determine a high-risk zone is the following: Determine the distribution function \(D\) of the nearest neighbour distance (NND) distribution based on the 443 points in the point pattern. Use the distribution to determine a desired quantile, say \(0 \leq p \leq 1\), of the NND distribution. Denoting the \(p\) sample quantile of the NND distribution by \(Q(p)\), a \(p\)-quantile NND based high-risk zone is then obtained as the union of putting a disc of radius \(Q(p)\) around each observed exploded bomb in \(Y\) – in mathematical terms:

\[
R_p = \left(\bigcup_{\bm{y} \in Y} B(\bm{y}, Q(p))\right) \bigcap W =
\left\{\bm{s} \in W : \min_{\bm{y}\in Y} || \bm{s} - \bm{y} || \leq Q_Y(p) \right\},
\]

where \(B(\bm{s}, r)\) denotes a disc of radius \(r\) around the point \(\bm{s}\) and \(||\bm{s} - \bm{y}||\) is the distance between the two points \(\bm{s}\) and \(\bm{y}\). The intersection with \(W\) is done in order to guarantee that the risk zone lies entirely within \(W\). As an example, we would determine the 99%-quantile NND zone for craterA using spatstat functionality as follows:

(Qp <- quantile(nndist(craterA), p = 0.99, type = 8))
## 99%
## 142.1391
dmap <- distmap(craterA)
zone_dist <- eval.im( dmap < Qp )

The above can also be done directly using highriskzone::det_hrz function:

zone_dist <- det_hrz(craterA, type="dist", criterion = "indirect", cutoff=0.99)

Either way, the resulting risk-zone is as follows:

summary(zone_dist)
## high-risk zone of type dist
## criterion: indirect
## cutoff: 0.99
##
## threshold: 142.1391
## area of the high-risk zone: 2574507

Mahling, Höhle, and Küchenhoff (2013) show that risk zones constructed by the NND method work surprisingly well despite lacking a clear theoretical justification. One theoretical issue is, for example, that the NND distribution function determined by the above method is for the \((1-q)\) thinned process \(Y\), even though the actual interest is in the process \(X=Y\cup Z\). Because of the thinning one would typically have that \(D_Y(r) \leq D_X(r)\) and thus \(Q_Y(p) > Q_X(p)\). Using \(Q_Y(p)\) to make statements about \(X\) (and thus \(Z\)) is therefore slightly wrong. However, this error cancels, because we then use the points in \(Y\) to add a buffer of radius \(Q_Y(p)\). Had we instead used the smaller, but true, \(Q_X(p)\) the risk zone would have become too small, because \(X\) would also have contained more points to form discs around than \(Y\). The method thus implicitly takes \(q\) non-parametrically into account, because its NND is determined based on \(Y\) and subsequently discs of radius \(Q_Y(p)\) are formed around the points of \(Y\).

Technical details you might want to skip: The above feature is most easily illustrated if \(X\) is a homogeneous Poisson process with intensity \(\lambda_X\). In this case we have that the NND distribution function is (p.68, Illian et al. 2008)

\[
D_X(r) = 1 - \exp(-\lambda_X \pi r^2), \quad r>0.
\]

Also note that \(D_Y(r) = 1 - \exp(-(1-q)\lambda_X \pi r^2)\) and therefore \(D_Y(r) < D_X(r)\). Now solving for the \(p\)-quantile of the NND in this homogeneous Poisson case means solving

\[
Q_Y(p) = \min \{ r \geq 0 : D_Y(r) \geq p \} \Leftrightarrow Q_Y(p) =
\sqrt{ \frac{-\log(1-p)}{\lambda_Y \pi}}.
\]

From this it becomes clear that in the homogeneous Poisson case \(Q_Y(p)\) is a factor \(\sqrt{1/(1-q)}\) larger than \(Q_X(p)\), which is the actual target of interest.
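A quick simulation can illustrate this factor (an illustrative check of my own, not code from the post; the chosen intensity, window, thinning probability and seed are arbitrary):

library(spatstat)
set.seed(1)
q <- 0.1
X <- rpoispp(lambda = 1e-4, win = owin(c(0, 2000), c(0, 2000)))  # homogeneous Poisson process
Y <- rthin(X, P = 1 - q)                                         # observed (exploded) bombs
quantile(nndist(Y), 0.99) / quantile(nndist(X), 0.99)
# should be roughly sqrt(1/(1 - q)) ~ 1.05 on average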

Assessing the coverage of a risk-zone

Two criteria appear natural in order to assess the coverage of a risk-zone \(R\):

  1. The probability \(p_{\text{out}}\) that there will be at least one bomb outside the risk zone, i.e. \(P( N( Z \backslash R) > 0)\), where \(N(A)\) denotes the number of events in the set \(A \subseteq W\). Note: this probability depends heavily on the number of points in \(Z\); the more points there are, the higher \(p_{\text{out}}\) is. However, it reflects the idea "one miss is all it takes to get in trouble".

  2. The proportion of events in \(Z\) not located in \(R\), i.e. \(N( Z \backslash R) / N(Z)\); we shall denote this criterion by \(p_{\text{miss}}\). Note: this criterion takes possibly different sizes of \(Z\) into account and gives a more relative view of how many bombs are not covered by the zone.

Under the assumption of independence between whether \(Z\)-events are within or outside the risk-zone one can convert back and forth between \(p_{\text{miss}}\) and \(p_{\text{out}}\) by

\[
p_{\text{out}} = P( N( Z \backslash R) > 0) = 1- P(N(Z
\backslash R) = 0) \approx 1 - (1-p_{\text{miss}})^{N(Z)},
\]

where one in a simulation setup would know \(Z\) and thus also \(N(Z)\). Note that for a \(p\)-quantile NND risk-zone we expect \(1-p_{\text{miss}}\) to be approximately equal to \(p\). We can investigate the behaviour of risk-zones according to the two criteria above through the use of simulation. One way is to simply \(q\)-thin the existing point pattern \(Y\) and then use this thinned pattern to determine a risk-zone, which is then evaluated. An alternative approach is to estimate the intensity surface from \(Y\), upscale it to get the intensity of \(X\), simulate \(X\) as an inhomogeneous Poisson point process with this intensity surface, thin this pattern to get a simulated instance of \(Y\), construct the risk-zone based on this pattern and then evaluate the coverage of the zone (Mahling, Höhle, and Küchenhoff 2013). Note that this type of simulation is based on more assumptions compared to the non-parametric thinning simulation approach.
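As a small numerical illustration of the approximate conversion above (the numbers here are made up for illustration only):

p_miss <- 0.001
nZ     <- 44                  # assumed number of unexploded bombs in Z
1 - (1 - p_miss)^nZ           # ~0.043, the implied chance of missing at least one bomb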

We generate 1,000 realizations of \(Y\) and \(Z\) through \(q\)-thinning of the original craterA pattern while computing the coverage measures for the NND method as follows:

suppressPackageStartupMessages(library(doParallel))
registerDoParallel(cores=4)
simulate_method <- "thinning" #"intensity" # "cond_intensity"
sim <- foreach(i=seq_len(getDoParWorkers()), .combine=rbind) %dopar% {
  tryCatch(
    eval_method(craterA, type=c("dist"), criterion=c("indirect"), cutoff=0.99,
                numit = 250, simulate=simulate_method, pbar=FALSE),
    error= function(e) return(NULL))
}
## # A tibble: 1 x 5
##   p_out  p_miss `1-p_miss` p_out_derived    nZ
##
## 1 0.051 0.00118      0.999        0.0509  44.3

The numbers state the average p_out and p_miss over the 1,000 simulations. Furthermore, nZ denotes the average number of events in \(Z\). We see that the NND method performs even a little better than intended, because \(1-p_{\text{miss}}\) is higher than the intended \(p\) = 99%. The probability that the risk-zone misses at least one bomb is as low as 0.051, which is quite close to the approximate conversion from \(p_{\text{miss}}\) described above (0.051 vs. 0.051). Changing the simulation method for \(X\) to that of an inhomogeneous Poisson process with intensity \(1/(1-q) \cdot \hat{\lambda}_Y(\bm{s})\) yields the following results:

## # A tibble: 1 x 5
##   p_out p_miss `1-p_miss` p_out_derived    nZ
##
## 1  0.47 0.0143      0.986         0.511  49.6

We note that the probability of missing at least one bomb is much higher under this parametric simulation method. Only a small fraction of this increase is explained by \(Z\) now consisting of more points. A more likely explanation is that the parametric model is only semi-adequate for describing how the point patterns form, so the simulated \(X\) has a somewhat different nearest neighbour distance distribution than anticipated.
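For intuition, here is a rough spatstat sketch of the parametric simulation route just described. It is not the highriskzone internals (eval_method() performs all of this for you), and the value of q below is an assumption chosen only for illustration:

library(spatstat)
library(highriskzone)
data(craterA)
q <- 0.1                                  # assumed thinning probability
lambda_Y_hat <- density(craterA)          # kernel estimate of the intensity of Y
lambda_X_hat <- lambda_Y_hat / (1 - q)    # upscale to the intensity of X
Xsim <- rpoispp(lambda_X_hat)             # simulate X as an inhomogeneous Poisson process
keep <- runif(Xsim$n) < 1 - q             # independent thinning of the simulated X
Ysim <- Xsim[keep]                        # simulated observed pattern Y
Zsim <- Xsim[!keep]                       # simulated unobserved pattern Z
## A risk-zone would then be determined from Ysim and its coverage of Zsim evaluated.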

To compare more specifically with the intensity-function-based risk-zone method of Mahling, Höhle, and Küchenhoff (2013), we use a specification where the risk-zones derived by the NND method and the intensity method have the same area (250 hectares).

## Determine zones of equal area (cutoff = 2,500,000 m^2 = 250 ha) with both methods.
sim_area <- foreach(i=seq_len(getDoParWorkers()), .combine=rbind) %dopar% {
  tryCatch(
    eval_method(craterA, type=c("dist","intens"), criterion=rep("area",2),
                cutoff=rep(2500000,2), numit = 100,
                simulate=simulate_method, pbar=FALSE),
    error= function(e) return(NULL))
}

## # A tibble: 2 x 6
##   Type   p_out  p_miss `1-p_miss` p_out_derived area_zone
##
## 1 dist   0.123 0.00278      0.997         0.117  2500009.
## 2 intens 0.55  0.0172       0.983         0.539  2499994.

For this particular example we see an advantage of using the NND method, because both p_out and p_miss are much lower than for the intensity-based method. Again, this might be because the intensity method rests on assumptions which do not appear to be adequate for this example. Results in Mahling (2013) were, however, much better for this example (cf. Tab. 2), which could indicate a problem in the highriskzone package implementation of this method.

Discussion

Being a statistician is fascinating, because the job is the entry ticket to so many diverse research fields. The proposed methods and evaluations helped the Oberfinanzdirektion obtain a quantitative framework for deciding which methods to use in their routine risk assessment. Further details on the above application can be found in Mahling, Höhle, and Küchenhoff (2013) as well as in Monia’s Ph.D. dissertation (Mahling 2013). Note also that the techniques are not limited to UXB detection: infer-unknown-points-from-a-thinned-process problems occur for both 1D and 2D point processes in a range of other fields, e.g., the under-reporting of infectious disease locations or the calculation of animal abundance in ecology.

As a personal anecdote: When finishing the work on Mahling, Höhle, and Küchenhoff (2013) I was in transition from university to working at a public health institute. The deal was to finish the UXB work partly in spare time and partly in the new work time. To honour this I added my new work place as second affiliation before submitting, but as part of the institution’s internal clearing procedure for the publication, higher management asked me to remove this affiliation again, because the work ‘had nothing to do with public health’. While it’s questionable whether exploding bombs really have no public health impact, a few months later I ended up using very similar statistical techniques to model occurred-but-not-yet-reported cases during a critical infectious disease outbreak (Höhle and an der Heiden 2014).



Literature

Baddeley, A., E. Rubak, and R. Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. London: Chapman; Hall/CRC Press.

Höhle, M., and M. an der Heiden. 2014. “Bayesian Nowcasting During the STEC O104:H4 Outbreak in Germany, 2011.” Biometrics 70 (4): 993–1002. doi:10.1111/biom.12194.

Illian, J., A. Penttinen, H. Stoyan, and D. Stoyan. 2008. Statistical Analysis and Modelling of Spatial Point Patterns. Wiley.

Mahling, M. 2013. “Determining High-Risk Zones by Using Spatial Point Process Methodology.” PhD thesis, Department of Statistics, University of Munich. https://edoc.ub.uni-muenchen.de/15886/1/Mahling_Monia.pdf.

Mahling, M., M. Höhle, and H. Küchenhoff. 2013. “Determining high-risk zones for unexploded World War II bombs by using point process methodology.” Journal of the Royal Statistical Society, Series C 62 (2): 181–99. doi:10.1111/j.1467-9876.2012.01055.x.

Seibold, H., M. Mahling, S. Linne, and F. Günther. 2017. Highriskzone: Determining and Evaluating High-Risk Zones. https://cran.r-project.org/web/packages/highriskzone/index.html.



2018-05 Selective Raster Graphics

Thu, 05/24/2018 - 23:57

(This article was first published on R – Stat Tech, and kindly contributed to R-bloggers)

This report explores ways to render specific components of an R plot in raster format, when the overall format of the plot is vector. For example, we demonstrate ways to draw raster data symbols within a PDF scatter plot. A general solution is provided by the grid.rasterize function from the R package ‘rasterize’.

Paul Murrell

Download


