Scico and the Colour Conundrum
(This article was first published on Data Imaginist, and kindly contributed to R-bloggers)
I’m happy to once again announce the release of a package. This time it happens to be a rather unplanned and quick new package, which is fun for a change. The package in question is scico, which provides access to the colour palettes developed by Fabio Crameri, as well as scale functions so they can be used with ggplot2. As there is not a lot to talk about in such a simple package, I’ll also spend some time discussing why the choice of colour is important beyond aesthetic considerations, and discuss how the quality of a palette might be assessed.
An overview of the package

scico provides a total of 17 different continuous palettes, all of which are available with the scico() function. For anyone who has used viridis() the scico() API is very familiar:
library(scico)
scico(15, palette = 'oslo')
##  [1] "#000000" "#09131E" "#0C2236" "#133352" "#19456F" "#24588E" "#3569AC"
##  [8] "#4D7CC6" "#668CCB" "#7C99CA" "#94A8C9" "#ABB6C7" "#C4C7CC" "#E1E1E1"
## [15] "#FFFFFF"

In order to get a quick overview of all the available palettes, use the scico_palette_show() function:
scico_palette_show()

As can be seen, the collection consists of both sequential and diverging palettes, both of which have their own uses depending on the data you want to show. A special mention goes to the oleron palette, which is intended for topographical height data in order to produce the well-known atlas look. Be sure to center this palette around 0, or else you will end up with very misleading maps.
ggplot2 support is provided with the scale_[colour|color|fill]_scico() functions, which work as expected:
library(ggplot2)

volcano <- data.frame(
  x = rep(seq_len(ncol(volcano)), each = nrow(volcano)),
  y = rep(seq_len(nrow(volcano)), ncol(volcano)),
  Altitude = as.vector(volcano)
)

ggplot(volcano, aes(x = x, y = y, fill = Altitude)) + 
  geom_raster() + 
  theme_void() + 
  scale_fill_scico(palette = 'turku')

This is more or less all there is to this package… Now, let’s get to the meat of the discussion.
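Following the centering advice for the oleron palette above, a small helper can compute limits that are symmetric around zero, which you could then pass to the scale's limits argument. The symmetric_limits() helper below is my own sketch, not part of scico:

```r
# Hypothetical helper: limits symmetric around 0, so that 0 maps to the
# midpoint of a diverging palette such as oleron
symmetric_limits <- function(x) {
  max(abs(x), na.rm = TRUE) * c(-1, 1)
}

symmetric_limits(c(-250, 1800))
## [1] -1800  1800
```

A call along the lines of scale_fill_scico(palette = 'oleron', limits = symmetric_limits(df$height)) would then keep 0 at the palette midpoint, assuming df$height is your height variable.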
What’s in a colour?

If you’ve ever wondered why we are not all just using rainbow colours in our plots (after all, rainbows are pretty…), it’s because our choice of colour scale has a deep impact on what changes in the underlying data our eyes can perceive. The rainbow colour scale is still very common and notoriously bad – see e.g. Borland & Taylor (2007), The Rainbow Colour Map (repeatedly) considered harmful, and How The Rainbow Color Map Misleads – due to two huge problems that are fundamental to designing good colour scales: perceptual uniformity and colour-blind safety. Both of these issues have been taken into account when designing the scico palettes, but let’s tackle them one by one:
Colour blindness

Up to 10% of North European males have the most common type of colour blindness (deuteranomaly, also known as red-green colour blindness), while the number is lower for other population groups. In addition, other, rarer, types of colour blindness exist as well. In any case, the chance that a person with a colour vision deficiency will look at your plots is pretty high.
As we have to assume that the plots we produce will be looked at by people with colour vision deficiency, we must make sure that the colours we use to encode data can be clearly read by them (ornamental colours are less important as they – hopefully – don’t impact the conclusion of the graphic). Thankfully, there are ways to simulate how colours are perceived by people with various types of colour blindness. Let’s look at the rainbow colour map:
library(pals)

pal.safe(rainbow, main = 'Rainbow scale')

As can be seen, there are huge areas of the scale where key tints disappear, making it impossible to correctly map colours back to their original data values. Put this in contrast to one of the scico palettes:
pal.safe(scico(100, palette = 'tokyo'), main = 'Tokyo scale')

While colour blindness certainly has an effect here, it is less detrimental, as changes along the scale can still be perceived and the same tint does not occur at multiple places.
Perceptual uniformity

While lack of colour-blind safety “only” affects a subgroup of your audience, lack of perceptual uniformity affects everyone – even you. Behind the slightly highbrow name lies the criterion that equal jumps in the underlying data should result in equal jumps in perceived colour difference. Put another way, every step along the palette should be perceived as giving the same amount of difference in colour.
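Uniformity can also be probed numerically: convert the palette to the (roughly perceptual) CIE Lab space and look at the distances between neighbouring colours. A base-R sketch, using plain Euclidean Lab distance as a crude stand-in for a proper colour-difference metric:

```r
# Step sizes between consecutive palette colours in CIE Lab space;
# for a perceptually uniform palette these should be roughly constant
lab_steps <- function(colours) {
  lab <- grDevices::convertColor(t(grDevices::col2rgb(colours)) / 255,
                                 from = "sRGB", to = "Lab")
  sqrt(rowSums(diff(lab)^2))
}

round(lab_steps(rainbow(10)), 1)  # rainbow step sizes swing wildly
```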
One way to assess perceptual uniformity is by looking at small oscillations inside the scale. Let’s return to our favourite worst rainbow scale:
pal.sineramp(rainbow, main = 'Rainbow scale')

We can see that there are huge differences in how clearly the oscillations appear along the scale, and around the green area they even disappear. In comparison, the scico palettes produce much more even results:
pal.sineramp(scico(100, palette = 'tokyo'), main = 'Tokyo scale')

But wait – there’s more!

This is just a very short overview of the world of colour perception and how it affects information visualisation. The pals package contains more functions to assess the quality of colour palettes, some of which have been collected in an ensemble function:
pal.test(scico(100, palette = 'broc'), main = 'Broc scale')

It also has a vignette that explains in more detail how the different plots can be used to look into different aspects of the palette.
scico is also not the only package that provides well-designed, safe colour palettes. RColorBrewer has been a beloved utility for a long time, as has the more recent viridis. Still, choice is good, and using the same palettes for a prolonged time can make them seem old and boring, so the more the merrier.
A last honourable mention is the overview of palettes in R that Emil Hvitfeldt has put together. Not all of the palettes in it (the lion’s share, actually) have been designed with the issues discussed above in mind, but sometimes that’s OK – at least you now know how to assess the impact of your choice and weigh it against the other considerations you have.
Always be wary of colours
To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist.
User authentication in R Shiny – sneak peek of shiny.users and shiny.admin packages
(This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers)
At Appsilon we frequently build advanced R/Shiny dashboards that need user authentication. I would like to share with you how we implement user management – user accounts and the authorization process – and how we gather usage statistics to better understand how the app is used.
For user authentication, we created the shiny.users package. For user administration and usage statistics, we created the shiny.admin package. We will show you a sneak peek of how these packages work. They are not released yet – but we opened an Early Adopter Waitlist – sign up and we will contact you as soon as these packages are ready for production testing!
Shiny.users – convenient authentication on the application level

Here is an example of how shiny.users works when you add it to your shiny application:
Below you can view the code of this example:
library(shiny)
library(shiny.users)

demo_users <- list(
  list(
    username = "demoappsilon",
    password = "QZt85C46Ab",
    roles = list("basic", "admin")
  ),
  list(
    username = "john",
    password = "C4SRx3xJcy",
    roles = list("basic")
  )
)

ui <- shinyUI(fluidPage(
  div(class = "container",
      style = "padding: 4em",
      login_screen_ui('login_screen'),
      uiOutput("authorized_content")
  )
))

server <- shinyServer(function(input, output) {
  users <- initialize_users(demo_users)
  callModule(login_screen, 'login_screen', users)
  output$authorized_content <- renderUI({
    if (!is.null(users$user())) {
      ... # application content
    }
  })
})

shinyApp(ui, server)

This is a simple demo with users defined in a list. In a real-world scenario, you would store user credentials in a database or authenticate users with an external API. This is also covered in shiny.users.
In bigger apps with more users, you will also need user management, authorization management (different roles, e.g. admin, writer, reader) and usage statistics. For such advanced use cases, we created the shiny.admin package.
Shiny.admin – user management and usage statistics

This package is an extension to the shiny.users package. When you add it to your Shiny app, you instantly get an admin panel under the URL http://application-url/#!/admin. This panel is only accessible to users with the “admin” role.
Features of shiny.admin:
 Adding and removing users
 Managing existing users – their passwords, roles and metadata
 Statistics – counting unique users and sessions (similar to Google Analytics)
 Session recording – recording defined actions, e.g. provided input values and clicked elements. You can view all session details (similar to Hotjar)
We observed huge gains from the above features. We understood exactly how our apps were used by our users. Here are some screenshots from shiny.admin:
I hope you like the idea of the shiny.users and shiny.admin packages. Let us know what you would expect from user accounts and admin panel features. Do you think there is a need for developing such packages? Tell us about your use cases and needs in the comments or contact us at dev@appsilondatascience.com.
Also, don’t forget to sign up to shiny.users and shiny.admin waitlist!
Subscribe to shiny.users & shiny.admin waitlist!
Read the original post at
Appsilon Data Science Blog.
 Follow us on Twitter!
 Check us out on Facebook page!
 Follow us on LinkedIn!
 Star our GitHub packages, including shiny.semantic, shiny.router, shiny.collections, shiny.i18n and semantic.dashboard!
 Sign up for our newsletter and get free ebook!
To leave a comment for the author, please follow the link and comment on their blog: Appsilon Data Science Blog.
A recipe for recipes
(This article was first published on That’s so Random, and kindly contributed to R-bloggers)
If you build statistical or machine learning models, the recipes package can be useful for data preparation. A recipe object is a container that holds all the steps that should be performed to go from the raw data set to the set that is fed into a model algorithm. Once your recipe is ready, it can be executed on a data set at once to perform all the steps. Not only on the train set on which the recipe was created, but also on new data, such as test sets and data that should be scored by the model. This ensures that new data gets the exact same preparation as the train set, and thus can be validly scored by the learned model. The author of recipes, Max Kuhn, has provided abundant material to familiarize yourself with the richness of the package, see here, here, and here. I will not dwell on how to use the package. Rather, I’d like to share what in my opinion is a good way to create new steps and checks for the package [1]. Use of the package is probably intuitive. Developing new steps and checks, however, does require a little more understanding of the package’s inner workings. With this procedure, or recipe if you like, I hope you will find that adding (and maybe even contributing) your own steps and checks becomes easier and more organized.
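The key property – statistics are learned once on the train set and then applied unchanged to any new data – can be mimicked by hand. A sketch of what a centring/scaling recipe effectively does under the hood (not actual recipes code):

```r
train <- data.frame(x = c(1, 3, 5))
new   <- data.frame(x = c(2, 4))

# "prep": learn the statistics from the train set only
ctr <- mean(train$x)  # 3
scl <- sd(train$x)    # 2

# "bake": apply the same learned statistics to the new data
new$x_scaled <- (new$x - ctr) / scl
new$x_scaled
## [1] -0.5  0.5
```

Had the new data been scaled with its own mean and standard deviation instead, the model would see a differently prepared variable; this is exactly what executing a prepped recipe on new data guards against.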
S3 classes in recipes

Let’s build a very simple recipe:
library(tidyverse)
library(recipes)

rec <- recipe(mtcars) %>% 
  step_center(everything()) %>% 
  step_scale(everything()) %>% 
  check_missing(everything())

rec %>% class()
## [1] "recipe"

As mentioned, a recipe is a container for the steps and checks. It is a list of class recipe, on which the prep, bake, print and tidy methods do the work as described in their respective documentation files. The steps and checks added to the recipe are stored inside this list. As you can see below, each of them is of its own subclass, as well as of the generic step or check class.
rec$steps %>% map(class)
## [[1]]
## [1] "step_center" "step"
## 
## [[2]]
## [1] "step_scale" "step"
## 
## [[3]]
## [1] "check_missing" "check"

Each subclass has the same four methods defined as the recipe class. As one of the methods is called on the recipe, it will call the same method on all the steps and checks that are added to the recipe. (The exception is prep.recipe, which not only calls prep on all the steps and checks, but also bake.) This means that when implementing a new step or check, you should provide these four methods. Additionally, we’ll need the function that is called by the user to add it to the recipe object, and a constructor (if you are not sure what this is, see 15.3.1 of Advanced R by Hadley Wickham).
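The dispatch machinery itself is ordinary S3. A toy illustration (the class and field names here are made up for the example, not the actual recipes internals):

```r
# A generic plus a method for a made-up step subclass
prep <- function(x, ...) UseMethod("prep")
prep.step_demo <- function(x, ...) {
  x$trained <- TRUE  # pretend we learned something from the train set
  x
}

demo_step <- structure(list(trained = FALSE),
                       class = c("step_demo", "step"))
prep(demo_step)$trained
## [1] TRUE
```

Calling prep() on a recipe similarly walks over rec$steps and dispatches to the right prep.step_* method for each element.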
The recipes for recipes

When writing a new step or check you will probably be inclined to copy-paste an existing step and start tweaking it from the top. Thus, first writing the function, then the constructor, and then the methods one by one. I think this is suboptimal and can get messy quickly. My preferred way is to start by not bothering about recipes at all, but to write a general function that does the preparation work on a single vector. Once you are happy with this function, you sit and think about which arguments to this function should be provided upfront. These should be added as arguments to the function that is called by the user. Next you think about which function arguments are statistics that should be learned on the train set in the prep part of the recipe. You’ll then go on and do the constructor that is called by both the main function and the prep method. Only then do you write the function that is called by the user. Your custom step or check is completed by writing the four methods; since the functionality is already created, these are mostly bookkeeping.
For both checks and steps I have made a skeleton available, in which all the “overhead” that should be in a new step or check is present. This is more convenient than copy-pasting an existing example and trying to figure out what is step specific and should be erased. You’ll find the skeletons on GitHub. Note that for creating your own steps and checks you should clone the package source code, since the helper functions that are used are not exported by the package.
Putting it to practice

We are going to do two examples in which the recipe for recipes is applied.
Example 1: A signed log

First up is a signed log (inspired by Practical Data Science with R), which takes the log over the absolute value of a variable, multiplied by its original sign. This enables us to take logs over negative values. If a variable is between -1 and 1, we’ll set it to 0, otherwise things get messy.
1) preparing the function

This is what the function on a vector could look like, if we did not bother about recipes:
signed_log <- function(x, base = exp(1)) {
  ifelse(abs(x) < 1,
         0,
         sign(x) * log(abs(x), base = base))
}

2) think about the arguments

The only argument of the function is base, which should be provided upfront when adding the step to a recipe object. There are no statistics to be learned in the prep step.
3) the constructor

Now we are going to write the first of the recipes functions. This is the constructor that produces new instances of an object of class step_signed_log. The first four arguments are present in each step or check; they are therefore part of the skeletons. The terms argument will hold the information on which columns the step should be performed. For role, trained and skip, please see the documentation in one of the skeletons. base is step_signed_log specific, as used in 1). In prep.step_signed_log the tidyselect arguments will be converted to a character vector holding the actual column names. columns will store these names in the step_signed_log object. This container argument would not be necessary if the column names were also present in another way. For instance, step_center has the argument means, which will be populated by the means of the variables of the train set by its prep method. In the bake method the names of the columns to be prepared are then already provided by the names of the means, and there is no need to use the columns argument.
step_signed_log_new <- function(terms   = NULL,
                                role    = NA,
                                skip    = FALSE,
                                trained = FALSE,
                                base    = NULL,
                                columns = NULL) {
  step(
    subclass = "signed_log",
    terms    = terms,
    role     = role,
    skip     = skip,
    trained  = trained,
    base     = base,
    columns  = columns
  )
}

4) the function to add it to the recipe

Next up is the function that will be called by the user to add the step to their recipe. You’ll see that the internal helper function add_step is called. It will expand the recipe with the step_signed_log object that is produced by the constructor we just created.
step_signed_log <- function(recipe,
                            ...,
                            role    = NA,
                            skip    = FALSE,
                            trained = FALSE,
                            base    = exp(1),
                            columns = NULL) {
  add_step(
    recipe,
    step_signed_log_new(
      terms   = ellipse_check(...),
      role    = role,
      skip    = skip,
      trained = trained,
      base    = base,
      columns = columns
    )
  )
}

5) the prep method

As recognized in 2), we don’t have to do much in the prep method of this particular step, since the preparation of new sets does not depend on statistics learned on the train set. The only thing we do here is apply the internal terms_select function to replace the tidyselect selections by the actual names of the columns on which step_signed_log should be applied. We call the constructor again, indicating that the step is trained, and supply the column names to the columns argument.
prep.step_signed_log <- function(x,
                                 training,
                                 info = NULL,
                                 ...) {
  col_names <- terms_select(x$terms, info = info)
  step_signed_log_new(
    terms   = x$terms,
    role    = x$role,
    skip    = x$skip,
    trained = TRUE,
    base    = x$base,
    columns = col_names
  )
}

6) the bake method

We are now ready to apply the baking function, designed in 1), inside the recipes framework. We loop through the variables, apply the function and return the updated data frame.
bake.step_signed_log <- function(object,
                                 newdata,
                                 ...) {
  col_names <- object$columns
  for (i in seq_along(col_names)) {
    col <- newdata[[ col_names[i] ]]
    newdata[, col_names[i]] <- ifelse(abs(col) < 1,
                                      0,
                                      sign(col) * log(abs(col), base = object$base))
  }
  as_tibble(newdata)
}

7) the print method

This assures pretty printing of the recipe object to which step_signed_log is added. You use the internal printer function with a message specific to the step.
print.step_signed_log <-
  function(x, width = max(20, options()$width - 30), ...) {
    cat("Taking the signed log for ", sep = "")
    printer(x$columns, x$terms, x$trained, width = width)
    invisible(x)
  }

8) the tidy method

Finally, tidy will add a line for this step to the data frame that results when the tidy method is called on a recipe.
tidy.step_signed_log <- function(x, ...) {
  if (is_trained(x)) {
    res <- tibble(terms = x$columns)
  } else {
    res <- tibble(terms = sel2char(x$terms))
  }
  res
}

Let’s do a quick check to see if it works as expected:
recipe(data_frame(x = 1)) %>% 
  step_signed_log(x) %>% 
  prep() %>% 
  bake(data_frame(x = -3:3))
## # A tibble: 7 x 1
##            x
##        <dbl>
## 1 -1.0986123
## 2 -0.6931472
## 3  0.0000000
## 4  0.0000000
## 5  0.0000000
## 6  0.6931472
## 7  1.0986123

Example 2: A range check

Model predictions might be invalid when the range of a variable in new data is shifted from the range of the variable in the train set. Let’s do a second example in which we check if the range of a numeric variable is approximately equal to the range of the variable in the train set. We do so by checking that the variable’s minimum value in the new data is not smaller than its minimum value in the train set. Likewise, the variable’s maximum value in the test set should not exceed the maximum in the train set. We allow for some slack (a proportion of the variable’s range in the train set) to account for natural variation.
1) preparing the function

As mentioned, checks are about throwing informative errors if assumptions are not met. This is a function we could apply on new variables, without bothering about recipes:
range_check_func <- function(x,
                             lower,
                             upper,
                             slack_prop = 0.05,
                             colname = "x") {
  min_x <- min(x)
  max_x <- max(x)
  slack <- (upper - lower) * slack_prop
  if (min_x < (lower - slack) & max_x > (upper + slack)) {
    stop("min ", colname, " is ", min_x, ", lower bound is ", lower - slack, "\n",
         "max x is ", max_x, ", upper bound is ", upper + slack, call. = FALSE)
  } else if (min_x < (lower - slack)) {
    stop("min ", colname, " is ", min_x, ", lower bound is ", lower - slack, call. = FALSE)
  } else if (max_x > (upper + slack)) {
    stop("max ", colname, " is ", max_x, ", upper bound is ", upper + slack, call. = FALSE)
  }
}

2) think about the arguments

slack_prop is a choice that the user of the check should make upfront; it is thus an argument of check_range. Then there are two statistics to be learned in the prep method: lower and upper. These should be arguments of the function and the constructor as well. However, when calling the function they are always NULL; they are container arguments that will be filled when calling prep.check_range.
3) the constructor

We start again with the four arguments present in every step or check. Subsequently we add the three arguments that we recognized to be part of the check.
check_range_new <- function(terms      = NULL,
                            role       = NA,
                            skip       = FALSE,
                            trained    = FALSE,
                            lower      = NULL,
                            upper      = NULL,
                            slack_prop = NULL) {
  check(subclass   = "range",
        terms      = terms,
        role       = role,
        trained    = trained,
        lower      = lower,
        upper      = upper,
        slack_prop = slack_prop)
}

4) the function to add it to the recipe

As we know by now, this just calls the constructor and adds the result to the recipe.
check_range <- function(recipe,
                        ...,
                        role       = NA,
                        skip       = FALSE,
                        trained    = FALSE,
                        lower      = NULL,
                        upper      = NULL,
                        slack_prop = 0.05) {
  add_check(
    recipe,
    check_range_new(
      terms      = ellipse_check(...),
      role       = role,
      skip       = skip,
      trained    = trained,
      lower      = lower,
      upper      = upper,
      slack_prop = slack_prop
    )
  )
}

5) the prep method

Here the method gets a lot more interesting, because we actually have work to do. We call vapply on each of the columns the check should be applied to, to derive the minimum and maximum. Again the constructor is called, and the learned statistics now populate the lower and upper arguments.
prep.check_range <- function(x,
                             training,
                             info = NULL,
                             ...) {
  col_names  <- terms_select(x$terms, info = info)
  lower_vals <- vapply(training[, col_names], min, c(min = 1), na.rm = TRUE)
  upper_vals <- vapply(training[, col_names], max, c(max = 1), na.rm = TRUE)
  check_range_new(
    x$terms,
    role       = x$role,
    trained    = TRUE,
    lower      = lower_vals,
    upper      = upper_vals,
    slack_prop = x$slack_prop
  )
}

6) the bake method

The hard work has been done already. We just get the columns on which to apply the check, and check them with the function we created in 1).
bake.check_range <- function(object,
                             newdata,
                             ...) {
  col_names <- names(object$lower)
  for (i in seq_along(col_names)) {
    colname <- col_names[i]
    range_check_func(newdata[[ colname ]],
                     object$lower[colname],
                     object$upper[colname],
                     object$slack_prop,
                     colname)
  }
  as_tibble(newdata)
}

7) the print method

print.check_range <-
  function(x, width = max(20, options()$width - 30), ...) {
    cat("Checking range of ", sep = "")
    printer(names(x$lower), x$terms, x$trained, width = width)
    invisible(x)
  }

8) the tidy method

tidy.check_range <- function(x, ...) {
  if (is_trained(x)) {
    res <- tibble(terms = x$columns)
  } else {
    res <- tibble(terms = sel2char(x$terms))
  }
  res
}

Again, we check quickly if it works:
cr1 <- data_frame(x = 0:100)
cr2 <- data_frame(x = 1:101)
cr3 <- data_frame(x = -6:100)
cr4 <- data_frame(x = 0:106)

recipe_cr <- recipe(cr1) %>% check_range(x) %>% prep()

cr2_baked <- recipe_cr %>% bake(cr2)
cr3_baked <- recipe_cr %>% bake(cr3)
## Error: min x is -6, lower bound is -5
cr4_baked <- recipe_cr %>% bake(cr4)
## Error: max x is 106, upper bound is 105

Conclusion

If you would like to add your own data preparation steps and data checks to the recipes package, I advise you to do this in a structured way so you are not distracted by the bookkeeping while implementing the functionality. I propose eight subsequent parts to develop a new step or check:
1) Create a function on a vector that could be applied in the bake method, but does not bother about recipes yet.
2) Recognize which function arguments should be provided upfront and which should be learned in the prep method.
3) Create a constructor in the form step_<name>_new or check_<name>_new.
4) Create the actual function to add the step or check to a recipe, in the form step_<name> or check_<name>.
5) Write the prep method.
6) Write the bake method.
7) Write the print method.
8) Write the tidy method.
As mentioned, the source code is maintained here. Make sure to clone the latest version to access the package internal functions.

[1] Together with Max I added the checks framework to the package. Where steps are transformations of variables, checks are assertions about them. If a check passes, nothing happens. If it fails, it will break the bake method of the recipe and throw an informative error.
To leave a comment for the author, please follow the link and comment on their blog: That’s so Random.
EARL Boston 2018 Keynote Speaker announcement: Bob Rudis
(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)
Mango Solutions are delighted to announce our Keynote Speaker for the EARL Boston Conference: Bob Rudis, [Master] Chief Data Scientist at Rapid7.
Bob has more than 20 years’ experience using data to help defend global Fortune 100 companies, and in his current role at Rapid7 he specializes in research on internet-scale exposure. He was formerly a Security Data Scientist and Managing Principal at Verizon, overseeing the team that produces the annual Data Breach Investigations Report.
If you’re on Twitter, you’ll know Bob as a serial tweeter (he’s even got the blue tick) – if you don’t already follow him, you can find him at @hrbrmstr. He’s also an avid blogger at rud.is, author (Data-Driven Security) and speaker.
As a prominent and avuncular participant in the Twitter #rstats conversation and a prolific package author, Bob is a respected and valued member of the global R community.
We’re looking forward to seeing him speak on 13 November in Boston!
Take the stage with Bob

Abstract submissions for the US Roadshow close on 30 April 2018. You could be on the agenda with Bob in Boston as one of our speakers if you would like to share the R successes in your organisation.
Early bird tickets now available

Tickets for all EARL Conferences are now available:
London: 11-13 September
Seattle: 7 November
Houston: 9 November
Boston: 13 November
To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.
The Fix Is In: Finding infix functions inside contributed R package “utilities” files
(This article was first published on R – rud.is, and kindly contributed to R-bloggers)
Regular readers will recall the “utility belt” post from back in April of this year. This is a follow-up to a request made asking for a list of all the %-infix functions in those files.
We’re going to:
 collect up all of the sources
 parse them
 find all the definitions of %-infix functions
 write them to a file
We’ll start by grabbing the data from the previous post and look at it as a refresher:
library(stringi)
library(tidyverse)

utils <- read_rds(url("https://rud.is/dl/utility-belt.rds"))

utils
## # A tibble: 1,746 x 13
##    permsissions links owner     group     size  month day   year_hr path   date       pkg   fil   file_src
##  1 -rw-r--r--       0 hornik    users      1658 Jun   05    2016    AHR/R… 2016-06-05 AHR   util… "## \\int f(x)dg(x) …
##  2 -rw-r--r--       0 ligges    users     12609 Dec   13    2016    ALA4R… 2016-12-13 ALA4R util… "## some utility fun…
##  3 -rw-r--r--       0 hornik    users         0 Feb   24    2017    AWR.K… 2017-02-24 AWR.… util… ""
##  4 -rw-r--r--       0 ligges    users      4127 Aug   30    2017    Alpha… 2017-08-30 Alph… util… "#\n#' Assign API ke…
##  5 -rw-r--r--       0 ligges    users       121 Jan   19    2017    Amylo… 2017-01-19 Amyl… util… "make_decision <- fu…
##  6 -rw-r--r--       0 herbrandt herbrandt    52 Aug   10    2017    BANES… 2017-08-10 BANE… util… "#' @importFrom dply…
##  7 -rw-r--r--       0 ripley    users     36977 Jan   06    2015    BEQI2… 2015-01-06 BEQI2 util… "#' \tRemove Redunda…
##  8 -rw-r--r--       0 hornik    users     34198 May   10    2017    BGDat… 2017-05-10 BGDa… util… "# A more memory-eff…
##  9 -rwxr-xr-x       0 ligges    users      3676 Aug   14    2016    BGLR/… 2016-08-14 BGLR  util… "\n readBinMat=funct…
## 10 -rw-r--r--       0 ripley    users      2547 Feb   04    2015    BLCOP… 2015-02-04 BLCOP util… "###################…
## # ... with 1,736 more rows

Note that we somewhat expected the file source to potentially come in handy at a later date and also expected the need to revisit that post, so the R data file [←direct link to RDS] included a file_src column.
Now, let’s find all the source files with at least one infix definition, collect them together and parse them so we can do more code spelunking:
filter(utils, stri_detect_fixed(file_src, "`%")) %>% # only find sources with infix definitions
  pull(file_src) %>% 
  paste0(collapse = "\n\n") %>% 
  parse(text = ., keep.source = TRUE) -> infix_src

str(infix_src, 1)
## length 1364 expression(dplyr::`%>%`, `%||%` <- function(a, b) if (is.null(a)) b else a, get_pkg_path <- function(ctx) { pkg_ __truncated__ ...
##  - attr(*, "srcref")=List of 1364
##  - attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile'
##  - attr(*, "wholeSrcref")= 'srcref' int [1:8] 1 0 15768 0 0 0 1 15768
##   .. - attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile'

We can now take all of that lovely parsed source and tokenize it to work with the discrete elements in a very tidy manner:
infix_parsed <- tbl_df(getParseData(infix_src)) # tbl_df() is mainly for pretty printing

infix_parsed
## # A tibble: 118,242 x 9
##    line1  col1 line2  col2    id parent token          terminal text      
##  1     1     1     1    24     1     10 COMMENT        TRUE     #' @impor…
##  2     2     1     2    10     4     10 COMMENT        TRUE     #' @export
##  3     3     1     3    12    10      0 expr           FALSE    ""        
##  4     3     1     3     5     7     10 SYMBOL_PACKAGE TRUE     dplyr     
##  5     3     6     3     7     8     10 NS_GET         TRUE     ::        
##  6     3     8     3    12     9     10 SYMBOL         TRUE     `%>%`     
##  7     5     1     5    49    51      0 expr           FALSE    ""        
##  8     5     1     5     6    16     18 SYMBOL         TRUE     `%||%`    
##  9     5     1     5     6    18     51 expr           FALSE    ""        
## 10     5     8     5     9    17     51 LEFT_ASSIGN    TRUE     <-        
## # ... with 118,232 more rows

We just need to find a sequence of tokens that makes up a function definition, then whittle those down to the ones that look like our %-infix names:
pat <- c("SYMBOL", "expr", "LEFT_ASSIGN", "expr", "FUNCTION") # pattern for function definition

# find all of ^^ sequences (there's a good twitter discussion on this abt a month ago)
idx <- which(infix_parsed$token == pat[1]) # find location of match of start of seq

# look for the rest of the sequences starting at each idx position
map_lgl(idx, ~{
  all(infix_parsed$token[.x:(.x + (length(pat) - 1))] == pat)
}) -> found

f_defs <- idx[found] # starting indices of all the places where functions are defined

# filter ^^ to only find infix ones
infix_defs <- f_defs[stri_detect_regex(infix_parsed$text[f_defs], "^`\\%")]

# there aren't too many, but remember we're just searching `util` functions
length(infix_defs)
## [1] 106
# nuke a file and fill it with the function definitions
cat("", sep="", file="infix_functions.R")

walk2(
  getParseText(infix_parsed, infix_parsed$id[infix_defs]),     # extract the infix name
  getParseText(infix_parsed, infix_parsed$id[infix_defs + 3]), # extract the function definition body
  ~{
    cat(.x, " <- ", .y, "\n\n", sep="", file="infix_functions.R", append=TRUE)
  }
)

There are 106 of them so you can find the extracted ones in this gist.
Here's an overview of what you can expect to find:
# A tibble: 39 x 2
   name                 n
 1 `%||%`              47
 2 `%+%`                7
 3 `%AND%`              4
 4 `%notin%`            4
 5 `%:::%`              3
 6 `%==%`               3
 7 `%!=%`               2
 8 `%*diag%`            2
 9 `%diag*%`            2
10 `%nin%`              2
11 `%OR%`               2
12 `%::%`               1
13 `%??%`               1
14 `%.%`                1
15 `%@%`                1
16 `%&&%`               1
17 `%&%`                1
18 `%+&%`               1
19 `%++%`               1
20 `%+%`                1
21 `%<<%`               1
22 `%>>%`               1
23 `%~~%`               1
24 `%assert_class%`     1
25 `%contains%`         1
26 `%din%`              1
27 `%fin%`              1
28 `%identical%`        1
29 `%In%`               1
30 `%inr%`              1
31 `%M%`                1
32 `%notchin%`          1
33 `%or%`               1
34 `%p%`                1
35 `%pin%`              1
36 `%R%`                1
37 `%s%`                1
38 `%sub_in%`           1
39 `%sub_nin%`          1

FIN

If any of those are useful, feel free to PR them in to https://github.com/hrbrmstr/freebase/blob/master/inst/templates/infixhelpers.R (and add yourself to the DESCRIPTION if you do).
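For readers who haven't written one before: any R function whose name is wrapped in `%` signs can be called infix. The null-default helper, by far the most common pattern in the table above, works like this (a minimal sketch, not taken from any one of the scanned packages):

```r
# Return the left operand unless it is NULL, in which case fall back
# to the right operand.
`%||%` <- function(a, b) if (is.null(a)) b else a

NULL %||% "fallback"  # the fallback wins
"x"  %||% "fallback"  # the left-hand value wins
```

This idiom is handy for supplying defaults for possibly-missing list elements or function arguments.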
Hopefully this provided some further inspiration to continue to use R not only as your language of choice but also as a fun data source.
To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.
Daily Streamflow Trend Analysis
(This article was first published on The USGS OWI blog, and kindly contributed to Rbloggers)
Introduction

This document describes how to produce a set of graphics and perform the associated statistical tests that describe trends in daily streamflow at a single streamgage. The trends depicted cover the full range of quantiles of the streamflow distribution, from the lowest discharge of the year, through the median discharge, up through the highest discharge of the year. The method is built around the R package EGRET (Exploration and Graphics for RivEr Trends). It makes it possible to consider trends over any portion of the time period for which there are daily streamflow records, and any period of analysis (the portion of the year to be considered, e.g. the entire year or just the summer months).
Getting Started

First, the EGRET package needs to be installed and loaded. Then, you'll need to create an eList object, which contains site information and daily discharge data for a streamgage. See page 13 of the EGRET user guide for details.
For this post, we will use the Choptank River in Maryland as our first example. There is an example data set included in EGRET. The data set consists of site information, daily discharge data, and water quality data, but this application does not use the water quality data. Refer to the section near the end of this document called Downloading the data for your site of interest to see how you can set up an eList object for any USGS streamgage.
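As a preview of that later section, a hedged sketch of how an eList can be assembled for an arbitrary USGS streamgage using EGRET's data-retrieval helpers (site number, parameter code "00060" for discharge, and date range shown here are illustrative; this sketch requires internet access to the NWIS web services):

```r
library(EGRET)

# Pull daily discharge and site metadata for a gage of interest,
# then combine them into the eList structure EGRET expects.
Daily <- readNWISDaily("01491000", "00060", "1979-10-01", "2011-09-30")
INFO  <- readNWISInfo("01491000", "", interactive = FALSE)
eList <- as.egret(INFO, Daily, NA, NA)  # no water-quality Sample data needed here
```

For this post we instead use the bundled Choptank_eList, which requires no download.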
There are two limitations that users should know about this application before proceeding any further.

The code was designed for discharge records that are complete (no gaps). This is usually not a problem with USGS discharge records unless there is a major gap (typically years in length) when the streamgage was not operating. If records have a number of small gaps, users could use some established method for filling in missing data to create a gap-free record.

The discharge on every day should be a positive value (not zero or negative). The EGRET code that is used here to read in new data has a "work around" for situations where there are a very small number of non-positive discharge values. It adds a small constant to all the discharge data so they will all be positive. This should have almost no impact on the results provided the number of non-positive days is very small, say less than 0.1% of all the days. That translates to about 11 days out of 30 years. For data sets with more zero or negative flow days, different code would need to be written (we would appreciate it if a user could work on developing such a set of code).
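The workaround described above can be sketched in a few lines of base R. This is a hedged illustration of the idea, not EGRET's exact internals; the function name, the 0.1% threshold argument, and the size of the added constant are all choices made for this example:

```r
# Shift a discharge series so every value is positive, but refuse to do so
# if more than maxFrac of the days are zero or negative.
adjustLowQ <- function(Q, maxFrac = 0.001) {
  nBad <- sum(Q <= 0)
  if (nBad == 0) return(Q)
  stopifnot(nBad / length(Q) <= maxFrac)  # too many bad days: needs different methods
  Q + 0.001 * mean(Q[Q > 0])  # add a small constant relative to typical flow
}

Q <- c(rep(1.5, 2000), 0)     # one zero-flow day in ~5.5 years of record
adjusted <- adjustLowQ(Q)
```

Because the constant is tiny relative to the mean flow, log-based trend statistics are essentially unaffected.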
To start, the following R commands are needed.
library(EGRET)
eList <- Choptank_eList

Just to get some sense about the data we will look at a portion of the metadata (gage ID number, name, and drainage area in square kilometers) and also see a summary of the discharge data (discharge in cubic meters per second).
print(eList$INFO$site.no)
## [1] "01491000"
print(eList$INFO$shortName)
## [1] "Choptank River"
print(eList$INFO$drainSqKm)
## [1] 292.6687
print(summary(eList$Daily$Date))
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
## "1979-10-01" "1987-09-30" "1995-09-30" "1995-09-30" "2003-09-30" "2011-09-30"
print(summary(eList$Daily$Q))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
##   0.00991   0.93446   2.40693   4.08658   4.61565 246.35656

Loading the necessary packages and other R code

To run the analysis and produce the graphs you will need a few R functions in addition to the EGRET package. You can copy the entire block of code shown below and paste it into your workspace (all as a single copy and paste), or you can save the code as an .R file that you source each time you want to use it. Let's say you call it flowTrends.R; then each time you want to use it you need to "source" the file flowTrends.R.
In addition to EGRET the functions use the packages rkt, zyp, lubridate, and Kendall. Make sure you have installed all of them.
library(rkt)
library(zyp)
library(lubridate)

#########################################################################################
########## this is the function you will use to make a single trend graph ###############
#########################################################################################
plotFlowTrend <- function (eList, istat, startDate = NA, endDate = NA,
                           paStart = 4, paLong = 12, window = 30, qMax = NA,
                           printTitle = TRUE, tinyPlot = FALSE, customPar = FALSE,
                           runoff = FALSE, qUnit = 2, printStaName = TRUE,
                           printPA = TRUE, printIstat = TRUE, cex = 0.8,
                           cex.axis = 1.1, cex.main = 1.1, lwd = 2, col = "black", ...){
  localDaily <- getDaily(eList)
  localINFO <- getInfo(eList)
  localINFO$paStart <- paStart
  localINFO$paLong <- paLong
  localINFO$window <- window
  start <- as.Date(startDate)
  end <- as.Date(endDate)
  if(is.na(startDate)){
    start <- as.Date(localDaily$Date[1])
  }
  if(is.na(endDate)){
    end <- as.Date(localDaily$Date[length(localDaily$Date)])
  }
  localDaily <- subset(localDaily, Date >= start & Date <= end)
  eList <- as.egret(localINFO, localDaily)
  localAnnualSeries <- makeAnnualSeries(eList)
  qActual <- localAnnualSeries[2, istat, ]
  qSmooth <- localAnnualSeries[3, istat, ]
  years <- localAnnualSeries[1, istat, ]
  Q <- qActual
  time <- years
  LogQ <- log(Q)
  mktFrame <- data.frame(time, LogQ)
  mktFrame <- na.omit(mktFrame)
  mktOut <- rkt::rkt(mktFrame$time, mktFrame$LogQ)
  zypOut <- zyp::zyp.zhang(mktFrame$LogQ, mktFrame$time)
  slope <- mktOut$B
  slopePct <- 100 * (exp(slope)) - 100
  slopePct <- format(slopePct, digits = 2)
  pValue <- zypOut[6]
  pValue <- format(pValue, digits = 3)
  if (is.numeric(qUnit)) {
    qUnit <- qConst[shortCode = qUnit][[1]]
  } else if (is.character(qUnit)) {
    qUnit <- qConst[qUnit][[1]]
  }
  qFactor <- qUnit@qUnitFactor
  yLab <- qUnit@qUnitTiny
  if (runoff) {
    qActual <- qActual * 86.4/localINFO$drainSqKm
    qSmooth <- qSmooth * 86.4/localINFO$drainSqKm
    yLab <- "Runoff in mm/day"
  } else {
    qActual <- qActual * qFactor
    qSmooth <- qSmooth * qFactor
  }
  localSeries <- data.frame(years, qActual, qSmooth)
  yInfo <- generalAxis(x = qActual, maxVal = qMax, minVal = 0, tinyPlot = tinyPlot)
  xInfo <- generalAxis(x = localSeries$years, maxVal = decimal_date(end),
                       minVal = decimal_date(start), padPercent = 0, tinyPlot = tinyPlot)
  line1 <- localINFO$shortName
  nameIstat <- c("minimum day", "7-day minimum", "30-day minimum", "median daily",
                 "mean daily", "30-day maximum", "7-day maximum", "maximum day")
  line2 <- paste0("\n", setSeasonLabelByUser(paStartInput = paStart,
                                             paLongInput = paLong), " ", nameIstat[istat])
  line3 <- paste0("\nSlope estimate is ", slopePct,
                  "% per year, Mann-Kendall p-value is ", pValue)
  if(tinyPlot){
    title <- paste(nameIstat[istat])
  } else {
    title <- paste(line1, line2, line3)
  }
  if (!printTitle){
    title <- ""
  }
  genericEGRETDotPlot(x = localSeries$years, y = localSeries$qActual,
                      xlim = c(xInfo$bottom, xInfo$top),
                      ylim = c(yInfo$bottom, yInfo$top),
                      xlab = "", ylab = yLab, customPar = customPar,
                      xTicks = xInfo$ticks, yTicks = yInfo$ticks, cex = cex,
                      plotTitle = title, cex.axis = cex.axis, cex.main = cex.main,
                      tinyPlot = tinyPlot, lwd = lwd, col = col, ...)
  lines(localSeries$years, localSeries$qSmooth, lwd = lwd, col = col)
}

#########################################################################################
###### this is the function you will use to make the Quantile Kendall Plot ##############
#########################################################################################
plotQuantileKendall <- function(eList, startDate = NA, endDate = NA,
                                paStart = 4, paLong = 12,
                                legendLocation = "topleft", legendSize = 1.0,
                                yMax = NA, yMin = NA) {
  localDaily <- eList$Daily
  localINFO <- eList$INFO
  localINFO$paStart <- paStart
  localINFO$paLong <- paLong
  start <- as.Date(startDate)
  end <- as.Date(endDate)
  if(is.na(startDate)){
    start <- as.Date(localDaily$Date[1])
  }
  if(is.na(endDate)){
    end <- as.Date(localDaily$Date[length(localDaily$Date)])
  }
  localDaily <- subset(localDaily, Date >= start & Date <= end)
  eList <- as.egret(localINFO, localDaily)
  eList <- setPA(eList, paStart = paStart, paLong = paLong)
  v <- makeSortQ(eList)
  sortQ <- v[[1]]
  time <- v[[2]]
  results <- trendSortQ(sortQ, time)
  pvals <- c(0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.999)
  zvals <- qnorm(pvals)
  name <- eList$INFO$shortName
  # ymax <- trunc(max(results$slopePct)*10)
  # ymax <- max(ymax + 2, 5)
  # ymin <- floor(min(results$slopePct)*10)
  # ymin <- min(ymin - 2, -5)
  # yrange <- c(ymin/10, ymax/10)
  # yticks <- axisTicks(yrange, log = FALSE)
  ymax <- max(results$slopePct + 0.5, yMax, na.rm = TRUE)
  ymin <- min(results$slopePct - 0.5, yMin, na.rm = TRUE)
  yrange <- c(ymin, ymax)
  yticks <- axisTicks(yrange, log = FALSE, nint = 7)
  p <- results$pValueAdj
  color <- ifelse(p <= 0.1, "black", "snow3")
  color <- ifelse(p < 0.05, "red", color)
  name <- paste0("\n", eList$INFO$shortName, "\n", start, " through ", end, "\n",
                 setSeasonLabelByUser(paStartInput = paStart, paLongInput = paLong))
  plot(results$z, results$slopePct, col = color, pch = 20, cex = 1.0,
       xlab = "Daily non-exceedance probability",
       ylab = "Trend slope in percent per year",
       xlim = c(-3.2, 3.2), ylim = yrange, yaxs = "i",
       las = 1, tck = 0.02, cex.lab = 1.2, cex.axis = 1.2,
       axes = FALSE, frame.plot = TRUE)
  mtext(name, side = 3, line = 0.2, cex = 1.2)
  axis(1, at = zvals, labels = pvals, las = 1, tck = 0.02)
  axis(2, at = yticks, labels = TRUE, las = 1, tck = 0.02)
  axis(3, at = zvals, labels = FALSE, las = 1, tck = 0.02)
  axis(4, at = yticks, labels = FALSE, tick = TRUE, tck = 0.02)
  abline(h = 0, col = "blue")
  legend(legendLocation, c("> 0.1", "0.05 - 0.1", "< 0.05"),
         col = c("snow3", "black", "red"), pch = 20, title = "p-value",
         pt.cex = 1.0, cex = legendSize * 1.5)
}

#########################################################################################
############ This next function combines four individual trend graphs (for minimum day,
############ median day, mean day, and maximum day) along with the quantile kendall graph
#########################################################################################
plotFiveTrendGraphs <- function(eList, startDate = NA, endDate = NA,
                                paStart = 4, paLong = 12, qUnit = 2, window = 30,
                                legendLocation = "topleft", legendSize = 1.0) {
  localDaily <- eList$Daily
  localINFO <- eList$INFO
  localINFO$paStart <- paStart
  localINFO$paLong <- paLong
  localINFO$window <- window
  start <- as.Date(startDate)
  end <- as.Date(endDate)
  if(is.na(startDate)){
    start <- as.Date(localDaily$Date[1])
  }
  if(is.na(endDate)){
    end <- as.Date(localDaily$Date[length(localDaily$Date)])
  }
  localDaily <- subset(localDaily, Date >= start & Date <= end)
  eList <- as.egret(localINFO, localDaily)
  eList <- setPA(eList, paStart = paStart, paLong = paLong, window = window)
  # this next line of code is inserted so that when paLong = 12, we always use the
  # climate year when looking at the trends in the annual minimum flow
  paStart1 <- if(paLong == 12) 4 else paStart
  plotFlowTrend(eList, istat = 1, qUnit = qUnit, paStart = paStart1,
                paLong = paLong, window = window)
  plotFlowTrend(eList, istat = 4, qUnit = qUnit, paStart = paStart,
                paLong = paLong, window = window)
  plotFlowTrend(eList, istat = 8, qUnit = qUnit, paStart = paStart,
                paLong = paLong, window = window)
  plotFlowTrend(eList, istat = 5, qUnit = qUnit, paStart = paStart,
                paLong = paLong, window = window)
  # now the quantile kendall
  plotQuantileKendall(eList, startDate = startDate, endDate = endDate,
                      paStart = paStart, paLong = paLong,
                      legendLocation = legendLocation, legendSize = legendSize)
}

#########################################################################################
########### makeSortQ creates a matrix called Qsort.
########### It is sorted from smallest to largest over dimDays
########### (if working with full year dimDays = 365),
########### and also creates other vectors that contain information about this array.
#########################################################################################
makeSortQ <- function(eList){
  localINFO <- getInfo(eList)
  localDaily <- getDaily(eList)
  paStart <- localINFO$paStart
  paLong <- localINFO$paLong
  # determine the maximum number of days to put in the array
  numDays <- length(localDaily$DecYear)
  monthSeqFirst <- localDaily$MonthSeq[1]
  monthSeqLast <- localDaily$MonthSeq[numDays]
  # creating a data frame (called startEndSeq) of the MonthSeq values that go into each year
  Starts <- seq(paStart, monthSeqLast, 12)
  Ends <- Starts + paLong - 1
  startEndSeq <- data.frame(Starts, Ends)
  # trim this list of Starts and Ends to fit the period of record
  startEndSeq <- subset(startEndSeq, Ends >= monthSeqFirst & Starts <= monthSeqLast)
  numYearsRaw <- length(startEndSeq$Ends)
  # set up some vectors to keep track of years
  good <- rep(0, numYearsRaw)
  numDays <- rep(0, numYearsRaw)
  midDecYear <- rep(0, numYearsRaw)
  Qraw <- matrix(nrow = 366, ncol = numYearsRaw)
  for(i in 1:numYearsRaw) {
    startSeq <- startEndSeq$Starts[i]
    endSeq <- startEndSeq$Ends[i]
    startJulian <- getFirstJulian(startSeq)
    # startJulian is the first julian day of the first month in the year being processed
    # endJulian is the first julian day of the month right after the last month in
    # the year being processed
    endJulian <- getFirstJulian(endSeq + 1)
    fullDuration <- endJulian - startJulian
    yearDaily <- localDaily[localDaily$MonthSeq >= startSeq &
                              (localDaily$MonthSeq <= endSeq), ]
    nDays <- length(yearDaily$Q)
    if(nDays == fullDuration) {
      good[i] <- 1
      numDays[i] <- nDays
      midDecYear[i] <- (yearDaily$DecYear[1] + yearDaily$DecYear[nDays]) / 2
      Qraw[1:nDays, i] <- yearDaily$Q
    } else {
      numDays[i] <- NA
      midDecYear[i] <- NA
    }
  }
  # now we compress the matrix down to equal number of values in each column
  j <- 0
  numGoodYears <- sum(good)
  dayCounts <- ifelse(good == 1, numDays, NA)
  lowDays <- min(dayCounts, na.rm = TRUE)
  highDays <- max(dayCounts, na.rm = TRUE)
  dimYears <- numGoodYears
  dimDays <- lowDays
  sortQ <- matrix(nrow = dimDays, ncol = dimYears)
  time <- rep(0, dimYears)
  for (i in 1:numYearsRaw){
    if(good[i] == 1) {
      j <- j + 1
      numD <- numDays[i]
      x <- sort(Qraw[1:numD, i])
      # separate odd numbers from even numbers of days
      if(numD == lowDays) {
        sortQ[1:dimDays, j] <- x
      } else {
        sortQ[1:dimDays, j] <- if(odd(numD)) leapOdd(x) else leapEven(x)
      }
      time[j] <- midDecYear[i]
    }
  }
  sortQList = list(sortQ, time)
  return(sortQList)
}

#########################################################################################
########## Another function, trendSortQ, needed for Quantile Kendall
#########################################################################################
trendSortQ <- function(Qsort, time){
  # note: requires packages zyp and rkt
  nFreq <- dim(Qsort)[1]
  nYears <- length(time)
  results <- as.data.frame(matrix(ncol = 9, nrow = nFreq))
  colnames(results) <- c("slopeLog", "slopePct", "pValue", "pValueAdj",
                         "tau", "rho1", "rho2", "freq", "z")
  for(iRank in 1:nFreq){
    mkOut <- rkt::rkt(time, log(Qsort[iRank, ]))
    results$slopeLog[iRank] <- mkOut$B
    results$slopePct[iRank] <- 100 * (exp(mkOut$B) - 1)
    results$pValue[iRank] <- mkOut$sl
    outZYP <- zyp.zhang(log(Qsort[iRank, ]), time)
    results$pValueAdj[iRank] <- outZYP[6]
    results$tau[iRank] <- mkOut$tau
    # I don't actually use this information in the current outputs, but the code is
    # there if one wanted to look at the serial correlation structure of the flow series
    serial <- acf(log(Qsort[iRank, ]), lag.max = 2, plot = FALSE)
    results$rho1[iRank] <- serial$acf[2]
    results$rho2[iRank] <- serial$acf[3]
    frequency <- iRank / (nFreq + 1)
    results$freq[iRank] <- frequency
    results$z[iRank] <- qnorm(frequency)
  }
  return(results)
}

#########################################################################################
################## getFirstJulian finds the julian date of the first day
################## of a given month
#########################################################################################
getFirstJulian <- function(monthSeq){
  year <- 1850 + trunc((monthSeq - 1) / 12)
  month <- monthSeq - 12 * (trunc((monthSeq - 1) / 12))
  charMonth <- ifelse(month < 10, paste0("0", as.character(month)), as.character(month))
  theDate <- paste0(year, "-", charMonth, "-01")
  Julian1 <- as.numeric(julian(as.Date(theDate), origin = as.Date("1850-01-01")))
  return(Julian1)
}

#########################################################################################
########### leapOdd is a function for deleting one value
########### when the period that contains Februaries has a length that is an odd number
#########################################################################################
leapOdd <- function(x){
  n <- length(x)
  m <- n - 1
  mid <- (n + 1) / 2
  mid1 <- mid + 1
  midMinus <- mid - 1
  y <- rep(NA, m)
  y[1:midMinus] <- x[1:midMinus]
  y[mid:m] <- x[mid1:n]
  return(y)
}

#########################################################################################
########### leapEven is a function for deleting one value
########### when the period that contains Februaries has a length that is an even number
#########################################################################################
leapEven <- function(x){
  n <- length(x)
  m <- n - 1
  mid <- n / 2
  y <- rep(NA, m)
  mid1 <- mid + 1
  mid2 <- mid + 2
  midMinus <- mid - 1
  y[1:midMinus] <- x[1:midMinus]
  y[mid] <- (x[mid] + x[mid1]) / 2
  y[mid1:m] <- x[mid2:n]
  return(y)
}

#########################################################################################
####### determines if the length of a vector is an odd number ###########################
#########################################################################################
odd <- function(x) {(!(x %% 2) == 0)}

#########################################################################################
########### calcWY calculates the water year and inserts it into a data frame
#########################################################################################
calcWY <- function (df) {
  df$WaterYear <- as.integer(df$DecYear)
  df$WaterYear[df$Month >= 10] <- df$WaterYear[df$Month >= 10] + 1
  return(df)
}

#########################################################################################
##### calcCY calculates the climate year and inserts it into a data frame
#########################################################################################
calcCY <- function (df){
  df$ClimateYear <- as.integer(df$DecYear)
  df$ClimateYear[df$Month >= 4] <- df$ClimateYear[df$Month >= 4] + 1
  return(df)
}

#########################################################################################
######## smoother is a function that does the trend in real discharge units and not logs.
######## It is placed here so that users wanting to run this alternative have it available,
######## but it is not actually called by any function in this document
#########################################################################################
smoother <- function(xy, window){
  edgeAdjust <- TRUE
  x <- xy$x
  y <- xy$y
  n <- length(y)
  z <- rep(0, n)
  x1 <- x[1]
  xn <- x[n]
  for (i in 1:n) {
    xi <- x[i]
    distToEdge <- min((xi - x1), (xn - xi))
    close <- (distToEdge < window)
    thisWindow <- if (edgeAdjust & close) (2 * window) - distToEdge else window
    w <- triCube(x - xi, thisWindow)
    mod <- lm(xy$y ~ x, weights = w)
    new <- data.frame(x = x[i])
    z[i] <- predict(mod, new)
  }
  return(z)
}

Making a graph of the trend in a single flow statistic

Now all we need to do is to run the plotFlowTrend function that was the first part of the code we just read in. First we will run it in its simplest form: we will use the entire discharge record, and our period of analysis will be the full climatic year. A climatic year is the year that starts on April 1 and ends on March 31. It is our default approach because it tends to avoid breaking a long low-flow period into two segments, one in each of two adjacent years. We will first run it for the annual minimum daily discharge.
To use it you will need to know about the index of flow statistics used in EGRET, called istat. The statistics represented by istat are: (1) 1-day minimum, (2) 7-day minimum, (3) 30-day minimum, (4) median, (5) mean, (6) 30-day maximum, (7) 7-day maximum, and (8) 1-day maximum. So to run the annual minimum daily discharge statistic we would set istat = 1. To run the plotFlowTrend function in its simplest form, the only arguments we need are the eList (which contains the metadata and the discharge data) and istat.
plotFlowTrend(eList, istat = 1)

Explanation of the FlowTrend Plot, annual minimum day

The dots indicate the discharge on the minimum day of each climate year in the period of record. The solid curve is a smoothed representation of those data. It is specifically the smoother that is defined in the EGRET user guide (pages 16-18) with a 30-year window. For a record as short as this one (only 32 years) it will typically look like a straight line or a very smooth curve. For longer records it can display some substantial changes in slope and even be non-monotonic. At the top of the graph we see two pieces of information: a trend slope expressed in percent per year, and a p-value for the Mann-Kendall trend test of the data. The slope is computed using the Theil-Sen slope estimator. It is discussed on pages 266-274 of Helsel and Hirsch, 2002, although it is called the "Kendall-Theil Robust Line" in that text. It is calculated on the logarithms of the discharge data and then transformed to express the trend in percent per year. The p-value for the Mann-Kendall test is computed using the adjustment for serial correlation introduced in the zyp R package (David Bronaugh and Arelia Werner for the Pacific Climate Impacts Consortium (2013). zyp: Zhang + Yue-Pilon trends package. R package version 0.10-1. https://CRAN.R-project.org/package=zyp).
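The slope computation described above can be sketched in base R. This is a hedged illustration of the Theil-Sen idea (median of all pairwise slopes on log discharge, converted to percent per year); the post itself uses rkt::rkt(), which handles ties and seasonal blocking more carefully, and the function name here is ours:

```r
# Theil-Sen slope on log(Q), expressed as percent change per year.
theilSenPct <- function(year, Q) {
  logQ <- log(Q)
  ij <- utils::combn(length(year), 2)          # all pairs of years
  slopes <- (logQ[ij[2, ]] - logQ[ij[1, ]]) /
            (year[ij[2, ]] - year[ij[1, ]])    # pairwise slopes in log units
  b <- median(slopes, na.rm = TRUE)            # the Theil-Sen estimate
  100 * (exp(b) - 1)                           # back-transform to percent per year
}

yr <- 1980:2010
Q <- exp(0.02 * (yr - 1980))   # synthetic series growing exactly 2% per year in logs
pct <- theilSenPct(yr, Q)
```

The `100 * (exp(b) - 1)` step is the same back-transformation the plotFlowTrend code applies to the rkt slope.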
A few more FlowTrend Plots

We can do some other similar plots. In this case we will do the annual median day (istat = 4), the annual maximum day (istat = 8), and the annual mean day (istat = 5).
The annual median day

The median day is computed for each year in the record. It is the middle day: 182 values had discharges lower than it and 182 values had discharges greater than it (for a leap year it is the average of the 183rd and 184th ranked values). Everything else about it is the same as the first plot. Here is the annual median day discharge record.
plotFlowTrend(eList, istat = 4)

The annual maximum day

The third plot is for the maximum day of the year. Otherwise it is exactly the same in its construction as the first two plots. Note that this is the maximum daily average discharge; in general it will be smaller than the annual peak discharge, which is the maximum instantaneous discharge for the year. For very large rivers the annual maximum day discharge tends to be very close to the annual peak discharge and can serve as a rough surrogate for it in a trend study. For a small stream, where discharges may rise or fall by a factor of 2 or more in the course of a day, these maximum day values can be very different from the annual peak discharge.
At this point we will add one more concept, which is the period of analysis. In the first two plots the data were organized by climate year. But when we look at annual maximum or annual mean discharge, the standard approach is to use the water year. To do this we need to designate that we want our period of analysis to start with the month of October (month 10), which requires one more argument in our call to the function: paStart = 10. We did not need to specify paStart in the first two plots, because the default is paStart = 4. Notice that this change in the period of analysis is noted in the text on the graphic.
plotFlowTrend(eList, istat = 8, paStart = 10)

The fourth plot is the annual mean discharge. It is constructed in exactly the same manner as the previous three, but it represents the mean streamflow for all of the days in the year rather than a single order statistic such as minimum, median, or maximum. We will also use the water year for this analysis.
plotFlowTrend(eList, istat = 5, paStart = 10)

Some observations about these figures

It is interesting to note that the minimum discharge values have been trending downwards, although the trend is not statistically significant. The other three discharge statistics are all trending upwards, most notably the annual maximum daily discharge, both in terms of slope (+4.7% per year) and a p-value well below 0.001. This raises many questions about how the variability of discharge might be changing over time and what the drivers are of the trends happening in various parts of the probability distribution of discharge (which we commonly call the flow duration curve). We will turn our focus to a way of summarizing these changes after we look at a few more variations on this plotFlowTrend graphic.
Variations on the simplest example

The call to the function plotFlowTrend has many arguments (you can see them all in the code shown above), but here we will just focus on a few that may be most useful to the data analyst:
plotFlowTrend(eList, istat, startDate = NA, endDate = NA, paStart = 4, paLong = 12, window = 30, qMax = NA, qUnit = 2, runoff = FALSE)

Here is a list of the optional arguments that are available to the user.
startDate If we want to evaluate the trend for a shorter period than what is contained in the entire Daily data frame, then we can specify a different starting date. For example, if we wanted the analysis to start with Water Year 1981, then we would say: startDate = "1980-10-01". By leaving out the startDate argument we are requesting the analysis to start where the data starts (which in this case is 1979-10-01).
endDate If we want to evaluate the trend for a period that ends before the end of the data set we can specify that with endDate. So if we wanted to end with Water Year 2009 we would say: endDate = "2009-09-30".
paStart is the starting month of the period of analysis. For example if we were interested in the trends in streamflow in a series of months starting with August, we would say paStart = 8. The default is paStart = 4, which starts in April.
paLong is the duration of the period of analysis. So, an analysis that covers all 12 months would have paLong = 12 (the default). A period that runs for the months of August – November would have paStart = 8 and paLong = 4. See pages 14 and 15 of the user guide for more detail on paStart and paLong.
window The smoothness of the curve plotted depends on the size of the window (measured in years). If window were much smaller (say 10), the curve shown would be rather jagged; as it becomes longer, the curve converges to a straight line. The default value of 30 works well and is roughly consistent with the convention used in the climate science community when discussing a 30-year normal period. The precise meaning of window is described in the User Guide (pages 16-18). It has no effect on the reported trend slope or p-value.
qMax is an argument that allows the user to set the maximum value on the vertical axis. If it is left out, the graphic scales itself. But for a set of similar graphics it may be useful to set qMax.
qUnit is an argument that allows the user to pick different discharge units. The default is qUnit = 2 which is cubic meters per second. The other most common unit to use is qUnit = 1 which is cubic feet per second.
runoff is a logical argument. The default is runoff = FALSE. In some studies it is nice to express discharge in runoff terms (discharge per unit area), and setting runoff = TRUE causes it to be expressed in terms of runoff.
Here is an example that shows the use of several of these arguments. Say we want to see what would happen if we only looked at annual maximum daily discharge from 1983 – 2010 (trimming off three low years at the start and one high year at the end), used a shorter smoothing window, and expressed our results as runoff in mm/day.
plotFlowTrend(eList, istat = 8, startDate = "1982-10-01", endDate = "2010-09-30", window = 15, runoff = TRUE)

It is worth noting here that, compared to the previous plot of the annual maximum daily discharges, the removal of four particular years brought about a substantial change in the slope and the significance of the trend. This is a common issue in trend analysis with relatively short records.
The Quantile-Kendall Plot

Now we will look at a new kind of plot called the Quantile-Kendall plot. Each plotted point is a trend slope (computed in the same manner as in the previous plots) for a given order statistic. The point at the far left edge is the first order statistic, which is the annual minimum daily discharge; this is the result described in the first plot. The next point to its right is the trend slope for the second order statistic (the second lowest daily discharge for each year), and so on, with the point at the far right being the trend slope for the 365th order statistic (annual maximum daily discharge). Their placement with respect to the x-axis is based on the z-statistic (standard normal deviate) associated with that order statistic, called the daily non-exceedance probability. It is a scale used for convenience; it in no way assumes that the data follow any particular distribution. It is simply used to provide the greatest resolution near the tails of the daily discharge distribution. The color represents the p-value for the Mann-Kendall test for trend as described above. Red indicates a trend that is significant at alpha = 0.05. Black indicates an attained significance between 0.05 and 0.1. The grey dots are trends that are not significant at the alpha level of 0.1.
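The x-axis placement described above is easy to reproduce in base R (this mirrors the `qnorm(frequency)` step inside trendSortQ, shown as a standalone sketch):

```r
# Each of the n = 365 order statistics gets plotting position i/(n + 1),
# which is then mapped to a standard normal deviate for the x-axis.
n <- 365
p <- seq_len(n) / (n + 1)   # daily non-exceedance probabilities
z <- qnorm(p)               # x-coordinates on the Quantile-Kendall plot
```

Because the normal quantile function stretches the extremes, the lowest and highest few days of the year get far more horizontal room than the middle of the distribution, which is exactly the resolution-near-the-tails property the text describes.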
Here is what it looks like for the Choptank data set. Across most of the probability distribution the changes are not statistically significant, except for the highest and second highest days of the year. The trends are slightly negative at the low end of the distribution (the lowest 10%). Near the median they are positive and modestly large (around 1.5% per year). Near the 90th percentile they drop to near zero slope, and only above about the 99th percentile do they seem to be particularly large.
plotQuantileKendall(eList)

There is one special manipulation of the data needed to account for leap years (this is a detail of the computation but is not crucial to understanding the plots). The 366 daily values observed in a leap year are reduced by one so that all years have 365 values. This is accomplished by replacing the two middle values in the ranked list of values with a single value that is the average of the two. A similar approach is used when the period of analysis is any set of months that contains the month of February: the number of leap-year values is reduced by one, and the reduction takes place at the median value for the year.
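The leap-year reduction just described can be sketched in base R (a sketch of the stated rule, not the package's actual code):

```r
# Reduce 366 ranked daily values to 365 by replacing the two middle
# values in the ranked list with their average.
drop_leap_day <- function(q) {
  stopifnot(length(q) == 366)
  s <- sort(q)
  mid <- mean(s[183:184])       # average of the two middle ranked values
  c(s[1:182], mid, s[185:366])  # 365 values, still in ranked order
}

length(drop_leap_day(1:366))  # 365
```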
Variations on the Quantile-Kendall

We might be particularly interested in trends in a certain season of the year. In the case of the Choptank, we know that tropical-storm-generated floods have been particularly significant in the later part of this record, and these have mostly been in August, September, and October. Or, if our interest were related to the flows in spring and early summer, which is the critical period for delivery to the Chesapeake Bay (into which the Choptank River flows), then we might want to run our analysis on the period March through June. To do that we can use paStart = 3 and paLong = 4.
plotQuantileKendall(eList, paStart = 3, paLong = 4)

What we can see is that for this portion of the year the discharge trends are rather minimal and certainly not statistically significant, except that the 5 or so highest-flow days in the season do show rather large trends, and the annual maximum trend is the largest of all (in percent per year) and is the only one that is significant.
There are a few other arguments for the plotQuantileKendall function. As mentioned in the discussion of the plotFlowTrend function, it has the startDate and endDate arguments. If they aren’t specified then it is assumed that the analysis covers the whole period in the Daily data frame.
There are two arguments that relate to the appearance of the graph.
legendLocation: determines where in the graph the legend is placed. It sometimes happens that the legend obscures the data, so we may need to move it to another part of the graph. The argument can take on names such as "bottomleft", "topright" or "bottomright". The default is "topleft". If you don't want a legend (say you are making many Quantile-Kendall plots in a single job and can't go in and adjust each one), then you can specify legendLocation = NA.
legendSize: determines the size of the legend. The default is legendSize = 1.0. To decrease the dimensions of the legend, decrease this value; for example, a 25% reduction would be achieved by setting legendSize = 0.75. Typically one wouldn't want to go below about 0.4.
yMax and yMin are two more optional arguments that set the limits on the y-axis. If you don't specify them (just leave them out), the data will dictate the minimum and maximum values on the y-axis (it will extend slightly beyond the highest and lowest values). But if you are doing many graphs you may want them to have consistent axes, so a quick look at a graph will tell you whether the trends are large or small; in that case you can set these two values. For example, say you want every graph to have a maximum value of 5 and a minimum value of -5; the arguments would just be: yMax = 5, yMin = -5. There is a "fail safe" mechanism in the code for when the trends are bigger than the specified values. Let's say we set yMax = 5 but one particular graph has a trend slope of 7.2% per year. In this case the y-axis would extend slightly beyond a value of 7.2. Thus no value is ever out of range; all of the slopes will be plotted.
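The "fail safe" behaviour can be illustrated with a small sketch (my own illustration of the rule, not the function's actual code):

```r
# Requested y-axis limits are honored unless a slope falls outside them,
# in which case the axis extends slightly beyond that slope.
axis_limits <- function(slopes, yMin = -5, yMax = 5, pad = 0.05) {
  lo <- min(yMin, min(slopes))
  hi <- max(yMax, max(slopes))
  span <- hi - lo
  c(lo - pad * span, hi + pad * span)
}

axis_limits(c(-1.2, 0.4, 7.2))  # upper limit extends beyond 7.2
```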
Here is an example bringing all of the arguments into play. It is for a case where we want to consider the time period 1982-08-01 through 2010-11-30, for only the months of August through October, and we want the legend in the bottom right at only half the size of what we saw in the first example.
plotQuantileKendall(eList, startDate = "1982-08-01", endDate = "2010-11-30", paStart = 8, paLong = 3, legendLocation = "bottomright", legendSize = 0.5, yMax = 4, yMin = -4)

What this graph shows us is that, unlike the year as a whole, this late-summer-to-fall period has been one of very substantially increasing discharge. From about the median of the distribution to the top, the trends are greater than 3% per year and many are significant at least at the alpha level of 0.1. There is one feature of this figure that may look odd: at the lower end of the distribution there are many trend slopes that are exactly zero. Using the gray dot at the far left as an example, the trend slope estimation method uses all pairwise comparisons of the lowest discharge of each year. Because of the USGS reporting conventions, many of these discharge values cluster around a few specific values, so the comparisons contain many ties. The slope estimate is the median of all the pairwise slopes, and since a significant number of these slopes are zero, the slope estimate is zero.
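Why ties produce a slope of exactly zero is easy to see with a toy median-of-pairwise-slopes computation in base R (illustrative only; EGRET's own slope computation may differ in detail):

```r
# Median of all pairwise slopes; with heavily tied values most pairwise
# slopes are exactly zero, so the median slope is zero.
pairwise_median_slope <- function(year, value) {
  idx <- combn(length(value), 2)
  slopes <- (value[idx[2, ]] - value[idx[1, ]]) /
            (year[idx[2, ]] - year[idx[1, ]])
  median(slopes)
}

year  <- 1:8
low_q <- c(0.5, 0.5, 0.5, 0.6, 0.5, 0.5, 0.5, 0.5)  # many reported ties
pairwise_median_slope(year, low_q)  # 0
```

Of the 28 pairwise slopes here, 21 are exactly zero, so the median is zero even though one year differs.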
Putting it all together into a set of five graphs

This post also provides an additional function, plotFiveTrendGraphs, that produces a total of five graphics from a single command: the four trend plots and then the Quantile-Kendall. The arguments are all ones we have seen before: eList, startDate, endDate, paStart, paLong, qUnit, window, legendLocation, and legendSize. In its simplest form it would look like this (output not plotted here).
plotFiveTrendGraphs(eList)

Downloading the data for your site of interest

The steps for downloading data from USGS web services (or obtaining it from user-supplied information), and for creating and saving the eList, are described on pages 4-13 of the EGRET user guide. Here is a simple example:
Say we want to look at USGS station number 05436500 (Sugar River near Brodhead, WI) and we want to consider data from Climate Years 1921 through 2016. Note that this much longer period of record causes the computations to take a good deal more time than the previous cases, but it should still take well under a minute to complete the whole job. The following commands would do what is needed.
library(EGRET)
sta <- "05436500"
param <- "00060"  # this is the parameter code for daily discharge
startDate <- "1920-04-01"
endDate <- "2016-03-31"
INFO <- readNWISInfo(siteNumber = sta, parameterCd = param, interactive = FALSE)
Daily <- readNWISDaily(siteNumber = sta, parameterCd = param, startDate = startDate, endDate = endDate, verbose = FALSE)
eList <- as.egret(INFO, Daily)
plotFiveTrendGraphs(eList, legendLocation = "bottomleft")

Final thoughts

This last set of plots is particularly interesting. What we see is that for the lowest 90 percent of the distribution, discharges have been rising over most of this record, and particularly so since about 1950. But in the highest 1% of the distribution, discharges have been falling throughout the record. As a consequence, the overall variability of streamflow in this agricultural watershed has generally been declining over time. This is generally thought to be related to the increasing use of conservation practices in the watershed.
It is worth noting that we can express the changes in other ways than as percent per year, for example as percentage change per decade or percentage change over the entire period of record. Take, for example, the estimated trend slope for the median in this last example: 0.68% per year. Expressed as percentage change per decade it would be 7% per decade (computed as 1.0068^10, or 1.070, which is a 7% increase). Expressed over the entire 96-year record it would be about a 92% increase over that period (computed as 1.0068^96, which is 1.9167, an increase of about 92%).
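The conversion is simple compounding, e.g. in R:

```r
slope <- 0.0068                    # 0.68% per year
per_decade <- (1 + slope)^10 - 1   # ~0.070, i.e. about 7% per decade
per_record <- (1 + slope)^96 - 1   # ~0.917, i.e. about a 92% increase
round(100 * c(per_decade, per_record), 1)
```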
These graphical tools are one way of summarizing what we might call a signature of change for any given watershed: the changes, or lack of changes, that have taken place through all parts of the probability distribution of streamflow. This has potential as a tool for evaluating hydrologic, or linked climate and hydrologic, models. Comparing the patterns on these graphs for actual data against the patterns seen when simulated data are used can provide insights into the ability of the models to project hydrologic changes into the future.
If you have questions or comments on the concepts or the implementation please contact the author rhirsch@usgs.gov.
To leave a comment for the author, please follow the link and comment on their blog: The USGS OWI blog. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Native scoring in SQL Server 2017 using R
(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)
Native scoring is a much-overlooked feature in SQL Server 2017 (available only under Windows and only on-premises) that provides scoring and prediction from prebuilt, stored machine learning models in near real time.
Icons made by Smashicons from www.flaticon.com are licensed under CC BY 3.0
I will not go into the definition of real time, since what it means depends on your line of business; but we can certainly say that scoring 10,000 rows in a second from a mediocre client computer (similar to mine) qualifies.
Native scoring in SQL Server 2017 comes with a couple of limitations, but also with a lot of benefits. The limitations are:
- it currently supports only SQL Server 2017 and the Windows platform
- the trained model must not exceed 100 MiB in size
- native scoring with the PREDICT function supports only the following algorithms from the RevoScaleR library:
  - rxLinMod (linear model as linear regression)
  - rxLogit (logistic regression)
  - rxBTrees (parallel external memory algorithm for Stochastic Gradient Boosted Decision Trees)
  - rxDTree (external memory algorithm for Classification and Regression Trees)
  - rxDForest (external memory algorithm for Classification and Regression Decision Trees)
The benefits of using the PREDICT function for native scoring are:
- no configuration of R or an ML environment is needed (assuming the trained models are already stored in the database),
- the code is cleaner and more readable, and no additional R code is needed when performing scoring,
- no R engine is called at runtime, so there is a tremendous reduction of CPU and I/O costs, as well as no external calls,
- the client or server running native scoring with the PREDICT function does not need an R engine installed, because it uses C++ libraries from Microsoft that can read a serialized model stored in a table, deserialize it, and generate predictions, all without R.
Overall, if you are looking for faster predictions in your enterprise and would love faster code and solution deployment, especially for integration with other applications or for building an API in your ecosystem, native scoring with the PREDICT function will surely be an advantage to you. Although not all predictions/scores are supported, the majority can be done using regression or decision tree models (it is estimated that these two families of algorithms, including derivatives of regression models and ensemble methods, are used in 85% of predictive analytics).
To put the PREDICT function to the test, I have deliberately taken a semi-large dataset available in the RevoScaleR package in R – AirlineDemoSmall.csv. Using a simple BULK INSERT, we get the data into the database:
BULK INSERT ArrivalDelay
FROM 'C:\Program Files\Microsoft SQL Server\140\R_SERVER\library\RevoScaleR\SampleData\AirlineDemoSmall.csv'
WITH (
    FIELDTERMINATOR = ','
   ,ROWTERMINATOR = '0x0a'
   ,FIRSTROW = 2
   ,CODEPAGE = 'RAW');

Once the data is in the database, I will split it into training and test subsets.
SELECT TOP 20000 *
  INTO ArrDelay_Train
  FROM ArrDelay
 ORDER BY NEWID()
-- (20000 rows affected)

SELECT *
  INTO ArrDelay_Test
  FROM ArrDelay AS AR
 WHERE NOT EXISTS
       (SELECT * FROM ArrDelay_Train AS ATR
         WHERE ATR.arrDelay = AR.arrDelay
           AND ATR.[DayOfWeek] = AR.[DayOfWeek]
           AND ATR.CRSDepTime = AR.CRSDepTime)
-- (473567 rows affected)

And the outlook of the dataset is relatively simple:
ArrDelay  CRSDepTime  DayOfWeek
       1    9,383332          3
       4   18,983334          4
       0   13,883333          4
      65   21,499998          7
       3    6,416667          1

Creating models

So we will create two essentially identical models using the rxLinMod function with the same formula, but one with an additional parameter for real-time scoring set to TRUE.
-- regular model creation
DECLARE @model VARBINARY(MAX);
EXECUTE sp_execute_external_script
  @language = N'R'
 ,@script = N'
     arrDelay.LM <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = InputDataSet)
     model <- rxSerializeModel(arrDelay.LM)'
 ,@input_data_1 = N'SELECT * FROM ArrDelay_Train'
 ,@params = N'@model varbinary(max) OUTPUT'
 ,@model = @model OUTPUT
INSERT [dbo].arrModels([model_name], [native_model])
VALUES ('arrDelay.LM.V1', @model);

-- Model for Native scoring
DECLARE @model VARBINARY(MAX);
EXECUTE sp_execute_external_script
  @language = N'R'
 ,@script = N'
     arrDelay.LM <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = InputDataSet)
     model <- rxSerializeModel(arrDelay.LM, realtimeScoringOnly = TRUE)'
 ,@input_data_1 = N'SELECT * FROM ArrDelay_Train'
 ,@params = N'@model varbinary(max) OUTPUT'
 ,@model = @model OUTPUT
INSERT [dbo].arrModels([model_name], [native_model])
VALUES ('arrDelay.LM.NativeScoring.V1', @model);

Both models use the same training set and are stored in a table for future scoring. Upon first inspection, we can see there is a difference in the model size:
Scoring Models

Both models took relatively the same amount of time to train and to store in the table. Both can also be created on R Machine Learning Server and stored in the same way (with or without the realtimeScoringOnly argument). The model size gives you an idea of why and how real-time scoring can be achieved: keep your model as small as possible. Both models will give you exactly the same prediction scores; the native scoring will just be much faster. Note also, if you are planning to do any text analysis with real-time scoring, keep in mind the 100 MiB limitation, as text prediction models often exceed it.
To compare the execution of the scoring models, I will use the "traditional" external procedure sp_execute_external_script and the PREDICT function.
-- Using sp_execute_external_script
DECLARE @model VARBINARY(MAX) = (SELECT native_model FROM arrModels WHERE model_name = 'arrDelay.LM.V1')
EXEC sp_execute_external_script
  @language = N'R'
 ,@script = N'
     modelLM <- rxUnserializeModel(model)
     OutputDataSet <- rxPredict(
        model = modelLM,
        data = ArrDelay_Test,
        type = "link",
        predVarNames = "ArrDelay_Pred",
        extraVarsToWrite = c("ArrDelay", "CRSDepTime", "DayOfWeek"))'
 ,@input_data_1 = N'SELECT * FROM dbo.ArrDelay_Test'
 ,@input_data_1_name = N'ArrDelay_Test'
 ,@params = N'@model VARBINARY(MAX)'
 ,@model = @model
WITH RESULT SETS ((
    ArrDelay_Pred FLOAT
   ,ArrDelay INT
   ,CRSDepTime NUMERIC(16,5)
   ,[DayOfWeek] INT
));
-- (473567 rows affected)
-- Duration 00:00:08

-- Using Real Time Scoring
DECLARE @model VARBINARY(MAX) = (SELECT native_model FROM arrModels WHERE model_name = 'arrDelay.LM.NativeScoring.V1');
SELECT
    newData.*
   ,p.*
FROM PREDICT(MODEL = @model, DATA = dbo.ArrDelay_Test AS newData)
WITH (ArrDelay_Pred FLOAT) AS p;
GO
-- (473567 rows affected)
-- Duration 00:00:04

The two examples differ from each other, but the PREDICT function looks much more readable and neater. Time performance is also on the PREDICT function's side, as it returns the predictions much faster.
In addition, I have mentioned that the PREDICT function does not need the R engine or the Launchpad service to be running in the environment where the code is executed. To put this to the test, I will simply stop the SQL Server Launchpad service:
After executing the first set of predictions using sp_execute_external_script, SQL Server or Machine Learning Server will notify you that the service is not running:
whereas, the PREDICT function will work flawlessly.
Verdict

For sure, faster predictions are something very welcome in the gaming, transport, utility, metal, and financial industries, as well as any other where real-time predictions against OLTP systems are much appreciated. With lightweight models and good algorithm support, I would certainly give it additional thought, especially if you see good potential in faster, near-real-time predictions.
As always, complete code and data samples are available on GitHub. Happy coding!
To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.
Take Screenshot of Webpage using R
(This article was first published on ListenData, and kindly contributed to R-bloggers)
Programmatically taking screenshots of a web page is essential in a testing environment, to verify how the page renders. The same can be used for automation, like getting a screenshot of a news website into your inbox every morning, or generating a report of candidates' GitHub activities. But this wasn't possible from the command line until the rise of headless browsers and the JavaScript libraries supporting them. Even when such JavaScript libraries were made available, R programmers did not have any option to integrate such functionality into their code. That is where webshot comes in: an R package that helps R programmers take web screenshots programmatically, with PhantomJS running in the backend.

Take Screenshot from R
What is PhantomJS?
PhantomJS is a headless webkit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.
PhantomJS is an optimal solution for the following:
 Headless website testing
 Screen Capture
 Page Automation
 Network Monitoring
The webshot package allows users to take screenshots of web pages from R with the help of PhantomJS. It also can take screenshots of R Shiny App and R Markdown Documents (both static and interactive).
Install and Load Package

The stable version of webshot is available on CRAN and can be installed using the code below:

install.packages("webshot")
library(webshot)

Also, the latest development version of webshot is hosted on GitHub and can be installed using the code below:

# install.packages("devtools")
devtools::install_github("wch/webshot")
Initial Setup
As we saw above, the R package webshot works with PhantomJS in the backend, hence it is essential to have PhantomJS installed on the local machine where the webshot package is used. To assist with that, webshot itself has an easy function to get PhantomJS installed on your machine.
webshot::install_phantomjs()
The above function automatically downloads PhantomJS from its website and installs it. Please note this is only a first-time setup; once both webshot and PhantomJS are installed, the above two steps can be skipped when using the package as described in the sections below.
Now the webshot package is installed, set up, and ready to use. To start, let us take a PDF copy of a web page.
Screenshot Function
The webshot package provides one simple function, webshot(), that takes a webpage URL as its first argument and saves it under the file name given as its second argument. It is important to note that the filename includes a file extension like '.jpg', '.png', or '.pdf', based on which the output file is rendered. Below is the basic structure of the function:
library(webshot)
# webshot(url, filename.extension)
webshot("https://www.listendata.com/", "listendata.png")
If no folder path is specified along with the filename, the file is downloaded in the current working directory which can be checked with getwd().
Now that we understand the basics of the webshot() function, it is time to begin with our cases, starting with downloading/converting a webpage as a PDF copy.
Case #1: PDF Copy of WebPage
Let us assume, we would like to download Bill Gates’ notes on Best Books of 2017 as a PDF copy.
# loading the required library
library(webshot)

# PDF copy of a web page / article
webshot("https://www.gatesnotes.com/About-Bill-Gates/Best-Books-2017",
        "billgates_book.pdf",
        delay = 2)
The above code generates a PDF whose (partial) screenshot is below:
Snapshot of PDF Copy

Dissecting the above code, we can see that the webshot() function has 3 arguments supplied:
- the URL from which the screenshot has to be taken,
- the output filename along with its file extension,
- the time to wait before taking the screenshot, in seconds. Sometimes a longer delay is needed for all assets to display properly.
Thus, a webpage can be converted/downloaded as a PDF programmatically in R.
Case #2: Webpage Screenshot (Viewport Size)
Now, I’d like an automation script to get a screenshot of a news website and send it to my inbox, so I can see the headlines without opening the browser. Here we will see how to get a simple screenshot of livemint.com, an Indian news website.
# Screenshot of Viewport
webshot("https://www.livemint.com/", "livemint.png", cliprect = "viewport")
If cliprect is unspecified, a screenshot of the complete web page is taken (as in the earlier case). Since we are interested only in the latest news (which is usually at the top of the website), we use cliprect with the value 'viewport', which clips only the viewport part of the browser, as below.
Screenshot of Viewport of Browser

Case #3: Multiple Selector Based Screenshots
So far we have taken simple screenshots of whole pages, dealing with one screenshot and one file, but that is not what usually happens in automation. In most cases we end up performing more than one action, so this case deals with taking multiple screenshots and saving multiple files. Instead of taking screenshots of different URLs (which is quite straightforward), we will take screenshots of different sections of the same web page, using different CSS selectors, and save them in respective files.
# Multiple Selector Based Screenshots
webshot("https://github.com/hadley",
        file = c("organizations.png", "contributions.png"),
        selector = list("div.border-top.py-3.clearfix", "div.js-contribution-graph"))
Thus, we have seen how to use the R package webshot to take screenshots programmatically in R. Hope this post helps fuel your automation needs and helps your organisation improve its efficiency.
Deepanshu founded ListenData with a simple objective – Make analytics easy to understand and follow. He has over 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains.
To leave a comment for the author, please follow the link and comment on their blog: ListenData.
ShinyProxy 1.1.1 released!
(This article was first published on Open Analytics, and kindly contributed to R-bloggers)
ShinyProxy is a novel, open source platform to deploy Shiny apps for the enterprise
or larger organizations.
ShinyProxy 1.1.1 is in essence a maintenance release, but there is one new feature that has been on the wish list of our users for a long time: the possibility of theming the landing page of ShinyProxy which displays the overview of the Shiny apps.
The standard display when using the ShinyProxy demo image from the Getting Started guide is a plain listing:
With an HTML template makeover, it can look like this:
or maybe you prefer to have the apps displayed in two columns?
As one can see, it is possible to use images or icons for the apps and the display and text can be entirely
customized with CSS. The templating technology used for the customization of the landing page is Thymeleaf, a mature and amply documented framework for natural HTML templates.
To help you get started customizing your own ShinyProxy instance, a fully worked theming example is available in our shinyproxy-configuration-examples repository on GitHub.
Miscellaneous fixes

In our previous release we made a major leap toward running Shiny apps at datacenter scale by adding support for a Kubernetes back-end (see this blog post). We have used it for customers that roll out internet-facing Shiny apps with high numbers of concurrent users and needs for automated deployment. Following community feedback, this release has an important fix related to the deletion of Kubernetes services when shutting down containers.
For the users that embed ShinyProxy in larger application frameworks, we released a fix for passing arguments to Shiny apps.
Some other small fixes include improved browser display of the apps and a more strict adherence to YAML for the ShinyProxy configuration file.
For the new features, documentation has been updated on https://shinyproxy.io and as always community support on this new release is available at
https://support.openanalytics.eu
Don’t hesitate to send in questions or suggestions and have fun with ShinyProxy!
To leave a comment for the author, please follow the link and comment on their blog: Open Analytics.
Statistics Sunday: Two R Packages to Check Out
(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)
I’m currently out of town and not spending as much time on my computer as I have over the last couple months. (It’s what happens when you’re the only one in your department at work and also most of your hobbies involve a computer.) But I wanted to write up something for Statistics Sunday and I recently discovered two R packages I need to check out in the near future.
The first is called echor, which allows you to search and download data directly from the US Environmental Protection Agency (EPA) Environmental Compliance and History Online (ECHO), using the ECHO API. According to the vignette, linked above, “ECHO provides data for:
- Stationary sources permitted under the Clean Air Act, including data from the National Emissions Inventory, Greenhouse Gas Reporting Program, Toxics Release Inventory, and Clean Air Markets Division Acid Rain Program and Clean Air Interstate Rule.
- Public drinking water systems permitted under the Safe Drinking Water Act, including data from the Safe Drinking Water Information System.
- Hazardous Waste Handlers permitted under the Resource Conservation and Recovery Act, with data drawn from the RCRAInfo data system.
- Facilities permitted under the Clean Water Act and the National Pollutant Discharge Elimination System (NPDES) program, including data from EPA’s ICIS-NPDES system and possibly waterbody information from EPA’s ATTAINS data system.”
I have some fun analysis projects in the works! Stay tuned.
To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.
Intersecting points and overlapping polygons
(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)
I’ve been doing some spatial stuff of late, and the next little step will involve intersecting points with possibly many overlapping polygons. The sp package has a function called over which returns the polygons that points intersect with. The catch, though, is that it only returns the last (highest numerical value) polygon a point overlaps with, so it’s not so useful if you have many overlapping polygons. A little playing, and I’ve overcome that problem…
Here’s a toy example.
Create a couple of polygons and put them into a SpatialPolygons object.
library(sp)

p1 <- matrix(c(1,1, 2,1, 4,2, 3,2), ncol = 2, byrow = TRUE)
p2 <- matrix(c(2.2,1, 3,1, 3,2, 3,3, 2.8,3), ncol = 2, byrow = TRUE)
p1s <- Polygons(list(Polygon(p1)), 3)
p2s <- Polygons(list(Polygon(p2)), 4)
sps <- SpatialPolygons(list(p1s, p2s))

Define a few points and put them in a SpatialPoints object
point <- matrix(c(2.5, 1.5, 3.2, 1.75, 2,3, 1.5, 1.25, 2.75, 2.5), ncol = 2, byrow = TRUE)
points <- SpatialPoints(point)

We can plot them… (not shown)
plot(sps, border = c("black", "blue"))
plot(points, add = TRUE)

As here we have the issue:
over(points, sps)
 1  2  3  4  5 
 2  1 NA  1  2 

only returning a single “hit” per point (point 1 overlaps with both polygons 1 and 2).
To get around this, we can rip the individual polygons out of the SpatialPolygons object and put them in a list, converting the individual polygons into SpatialPolygons along the way:
sps2 <- lapply(sps@polygons, function(x) SpatialPolygons(list(x)))
lapply(sps2, function(x) over(points, x))
[[1]]
 1  2  3  4  5 
 1  1 NA  1 NA 

[[2]]
 1  2  3  4  5 
 1 NA NA NA  1 

And now we see that point one overlaps with both polygons.
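The key idea is that testing each polygon separately is what recovers all overlaps. As a self-contained illustration of the same principle, here is the first point checked against both toy polygons with a plain base-R ray-casting test (my own sketch, independent of sp):

```r
# Standard ray-casting point-in-polygon test (pnpoly): count how many
# polygon edges a horizontal ray from the point crosses; odd = inside.
in_polygon <- function(px, py, vx, vy) {
  n <- length(vx); j <- n; inside <- FALSE
  for (i in seq_len(n)) {
    crosses <- (vy[i] > py) != (vy[j] > py)
    if (crosses &&
        px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i]) {
      inside <- !inside
    }
    j <- i
  }
  inside
}

p1 <- matrix(c(1,1, 2,1, 4,2, 3,2), ncol = 2, byrow = TRUE)
p2 <- matrix(c(2.2,1, 3,1, 3,2, 3,3, 2.8,3), ncol = 2, byrow = TRUE)

# Point 1 of the example, (2.5, 1.5), is inside both polygons:
sapply(list(p1, p2), function(p) in_polygon(2.5, 1.5, p[, 1], p[, 2]))  # TRUE TRUE
```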
Code >>>here<<<
To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD.
The Power of Standards and Consistency
(This article was first published on R – rud.is, and kindly contributed to Rbloggers)
I’m going to (eventually) write a full post on the package I’m mentioning in this one: osqueryr. The TL;DR on osqueryr is that it is an R DBI wrapper (with just enough glue to also be plugged into dbplyr) for osquery. The TL;DR on osquery is that it “exposes an operating system as a high-performance relational database. This design allows you to write SQL-based queries efficiently and easily to explore operating systems.”
In short, osquery turns the metadata and state information of your local system (or remote systems) into a SQL-compliant database. It also works on Windows, Linux, BSD and macOS. This means you can query a fleet of systems with a (mostly) normalized set of tables and get aggregated results. Operations and information-security staff use this to manage systems and perform incident-response tasks, but you can use it to get just about anything, and there are even more powerful modes of operation for osquery. But, more on all the features of osquery[r] in another post.
If you are skeptical, here’s some proof (which I need to show regardless of your skepticism state). First, a local “connection”:
library(DBI)
library(osqueryr)

con <- DBI::dbConnect(Osquery())

head(dbListTables(con), 10)
##  [1] "account_policy_data" "acpi_tables"         "ad_config"
##  [4] "alf"                 "alf_exceptions"      "alf_explicit_auths"
##  [7] "alf_services"        "app_schemes"         "apps"
## [10] "apt_sources"

dbListFields(con, "processes")
##  [1] "cmdline"            "cwd"                "disk_bytes_read"
##  [4] "disk_bytes_written" "egid"               "euid"
##  [7] "gid"                "name"               "nice"
## [10] "on_disk"            "parent"             "path"
## [13] "pgroup"             "pid"                "resident_size"
## [16] "root"               "sgid"               "start_time"
## [19] "state"              "suid"               "system_time"
## [22] "threads"            "total_size"         "uid"
## [25] "user_time"          "wired_size"

dbGetQuery(con, "SELECT name, system_time FROM processes WHERE name LIKE '%fire%'")
## # A tibble: 2 x 2
##   name     system_time
## 1 Firewall           3
## 2 firefox       517846

Then, a remote "connection":
con2 <- osqueryr::dbConnect(Osquery(), host = "hrbrmstr@osq1")

head(dbListTables(con2), 10)
##  [1] "account_policy_data" "acpi_tables"         "ad_config"
##  [4] "alf"                 "alf_exceptions"      "alf_explicit_auths"
##  [7] "alf_services"        "app_schemes"         "apps"
## [10] "apt_sources"

dbListFields(con2, "processes")
##  [1] "cmdline"            "cwd"                "disk_bytes_read"
##  [4] "disk_bytes_written" "egid"               "euid"
##  [7] "gid"                "name"               "nice"
## [10] "on_disk"            "parent"             "path"
## [13] "pgroup"             "pid"                "resident_size"
## [16] "root"               "sgid"               "start_time"
## [19] "state"              "suid"               "system_time"
## [22] "threads"            "total_size"         "uid"
## [25] "user_time"          "wired_size"

dbGetQuery(con2, "SELECT name, system_time FROM processes WHERE name LIKE '%fire%'")
## # A tibble: 1 x 2
##   name    system_time
## 1 firefox     1071992

"You're talking an awful lot about the package when you said this was a post on 'standards' and 'consistency'."

True, but we needed that bit above for context. To explain what this post has to do with "standards" and "consistency", I also need to tell you a bit more about how both osquery and the osqueryr package are implemented.
You can read about osquery in-depth starting at the link at the top of this post, but the authors of the tool really wanted a consistent idiom for accessing system metadata with usable, normalized output. They chose (to use a word they didn't, but one that works for an R audience) a "data frame" as the output format, and picked the universal language of "data frames", SQL, as the inquiry interface. So, right there are examples of both standards and consistency: using SQL vs coming up with yet-another-query-language, and avoiding the chaos of the myriad outputs of various system commands by making all results conform to a rectangular data structure.
Let's take this one step further with a specific example. All modern operating systems have the concept of a "process", and said processes have (mostly) similar attributes. However, the commands used to get a detailed listing of those processes differ (sometimes wildly) from OS to OS. The authors of osquery came up with a set of schemas to ensure a common, rectangular output and consistent naming conventions (note that some schemas are unique to a particular OS, since some elements of an operating system have no useful counterparts on other operating systems).
The osquery authors also took consistency and standards to yet another level by taking advantage of a feature of SQLite called virtual tables. That enables them to have C/C++/Objective-C "glue" that gets called when a query is made, so they can dispatch the intent to the proper functions or shell commands and then send all the results back, or use the SQLite engine's capabilities to do joining, filtering, UDF-calling, etc. to produce rich, targeted rectangular output.
By not reinventing the wheel and relying on well-accepted features like data frames, SQL and SQLite, the authors could direct all their focus on solving the problem they posited.
"Um, you're talking a lot about everything but R now."

We're getting to the good (i.e. "R") part now.
Because the authors didn't try to become expert SQL-parser writers and relied on the standard SQL offerings of SQLite, the queries made are "real" SQL (if you've worked with more than one database engine, you know how they all implement different flavours of SQL).
Because these queries are "real" SQL, we can write an R DBI driver for it. The DBI package aims "[to define] a common interface between R and database management systems (DBMS). The interface defines a small set of classes and methods similar in spirit to Perl's DBI, Java's JDBC, Python's DB-API, and Microsoft's ODBC. It defines a set of classes and methods, what operations are possible, and how they are performed."
If you look at the osqueryr package source, you'll see a bunch of DBI boilerplate code (which is in the r-dbi organization example code) and only a handful of "touch points" for the actual calls to osqueryi (the command-line tool that processes SQL). There is no handling of anything but passing SQL on to the osqueryi engine and getting rectangular results back. By abstracting the system-call details, R users can work with a familiar, consistent, standard interface and have full access to the power of osquery without firing up a terminal.
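Part of the payoff of that standard interface is that code written against DBI doesn't care which backend sits underneath. A minimal illustration with RSQLite (a backend most R users already have; the table and values below are made up to mirror the queries shown earlier, they are not osquery output):

```r
library(DBI)

# Any compliant DBI backend accepts the same verbs; here, an in-memory SQLite db
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "processes",
             data.frame(name = c("Firewall", "firefox", "loginwindow"),
                        system_time = c(3, 517846, 9)))

# The exact dbGetQuery() call from the osqueryr examples works unchanged
dbGetQuery(con, "SELECT name, system_time FROM processes WHERE name LIKE '%fire%'")

dbDisconnect(con)
```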
But it gets even better.
As noted above, one design aspect of osquery was to enable remote usage. Rather than come up with yet another daemon and custom protocol, the osquery authors suggest ssh as one way of invoking the command on remote systems and getting the rectangular results back.
Because the osqueryr package used the sys package for making local system calls, there was only a tiny bit of extra effort required to switch from sys::exec_internal() to a sibling call in the ssh package, ssh::ssh_exec_internal(), when remote connections were specified. (Said effort could have been zero if I had chosen a slightly different function in sys, too.)
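A rough sketch of that local-vs-remote dispatch (the function and argument names here are mine, not osqueryr's, and the details are illustrative only):

```r
# Hypothetical sketch, not osqueryr's actual code: run a query locally via
# sys, or remotely via ssh, and parse osqueryi's JSON output either way
run_osquery <- function(sql, host = NULL) {
  if (is.null(host)) {
    # local: invoke osqueryi directly
    res <- sys::exec_internal("osqueryi", c("--json", sql))
  } else {
    # remote: same command, shipped over ssh instead
    session <- ssh::ssh_connect(host)
    on.exit(ssh::ssh_disconnect(session))
    res <- ssh::ssh_exec_internal(session, paste("osqueryi --json", shQuote(sql)))
  }
  jsonlite::fromJSON(rawToChar(res$stdout))
}
```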
Relying on well-accepted standards made both osqueryi and the R DBI driver work seamlessly, without much code at all, and definitely without a rat's nest of nested if/else statements and custom httr functions.
But it gets even more betterer

Some folks like & grok SQL; others don't. (Humans have preferences, go figure.)
A few years ago, Hadley (do I even need to use his last name at this point in time?) came up with the idea of a more expressive and consistent way to work with data frames. We now know this as the tidyverse, and one core element of the tidyverse is dplyr, which can really level up your data frame game (no comments about data.table, or the beauty of base R, please). Not too long after the birth of dplyr came the ability to work with remote, rectangular, SQL-based data sources with (mostly) the same idioms.
And, not too long after that, the remote dplyr interface (now, dbplyr) got up close and personal with DBI. Which ultimately means that if you make a near-fully-compliant DBI interface to a SQL backend, you can now do something like this:
library(DBI)
library(dplyr)
library(osqueryr)

con <- DBI::dbConnect(Osquery())
osqdb <- src_dbi(con)

procs <- tbl(osqdb, "processes")
listen <- tbl(osqdb, "listening_ports")

left_join(procs, listen, by = "pid") %>%
  filter(port != "", protocol == "17") %>% # 17 == UDP
  distinct(name, port, address, pid)
## # Source:   lazy query [?? x 4]
## # Database: OsqueryConnection
##    address name              pid   port
##  1 0.0.0.0 BetterTouchTool   46317 57183
##  2 0.0.0.0 Dropbox           1214  17500
##  3 0.0.0.0 SystemUIServer    429   0
##  4 0.0.0.0 SystemUIServer    429   62240
##  5 0.0.0.0 UserEventAgent    336   0
##  6 0.0.0.0 WiFiAgent         493   0
##  7 0.0.0.0 WiFiProxy         725   0
##  8 0.0.0.0 com.docker.vpnkit 732   0
##  9 0.0.0.0 identityservicesd 354   0
## 10 0.0.0.0 loginwindow       111   0
## # ... with more rows

The src_dbi() call wires up everything for us because d[b]plyr can rely on DBI doing its standard & consistent job, and DBI can rely on the SQLite processing crunchy goodness of osqueryi, to ultimately get us a list of really dangerous (if not firewalled-off) processes that are listening on all network interfaces. (Note to self: find out why the BetterTouchTool and Dropbox authors feel the need to bind to 0.0.0.0…)
FIN
What did standards and consistency get us?
- The osquery authors spent time solving a hard problem vs creating new data formats and protocols
- Rectangular data (i.e. a "data frame") provides consistency and structure, which ends up creating more freedom
- "Standard" SQL enables a consistent means to work with rectangular data
- ssh normalizes (secure) access across systems with a consistent protocol
- A robust, well-defined standard mechanism for working with SQL databases enabled nigh-instantaneous wiring up of a whole new backend to R
- Common ssh and sys idioms made working with the new backend on remote systems as easy as on a local system
- Another robust, well-defined modern mechanism for working with rectangular data got wired up to this new backend with (pretty much) one line of code, because of the defined standard and expectation of consistency (and it works both locally and remotely)
Standards and consistency are pretty darned cool.
To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.
Extracting Specific Lines from a Large (Compressed) Text File
(This article was first published on R on Yixuan's Blog, and kindly contributed to Rbloggers)
A few days ago a friend asked me the following question: how do you efficiently extract specific lines from a large text file, possibly compressed by gzip? He mentioned that he had tried some R functions such as read.table(skip = …), but found that reading the data was too slow. Hence he was looking for some alternative way of extracting the data.
This is a common task in preprocessing large data sets, since in data exploration, very often we want to peek at a small subset of the whole data to gain some insights.
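One straightforward way to attack the task in base R (a sketch only, not necessarily the author's approach nor the fastest) is to stream the file through a gzfile() connection in chunks and keep only the wanted lines, so the whole file never has to sit in memory:

```r
# Pull specific (1-based) line numbers out of a possibly gzip-compressed
# text file without reading it all at once. gzfile() also reads plain
# text files transparently, so one function covers both cases.
extract_lines <- function(path, wanted, chunk_size = 10000L) {
  con <- gzfile(path, open = "r")
  on.exit(close(con))
  out <- character(0)
  seen <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break
    hit <- wanted[wanted > seen & wanted <= seen + length(chunk)]
    out <- c(out, chunk[hit - seen])
    seen <- seen + length(chunk)
    if (all(wanted <= seen)) break # stop early once everything is found
  }
  out
}
```

Note that the result comes back in file order, not in the order of the requested line numbers.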
To leave a comment for the author, please follow the link and comment on their blog: R on Yixuan's Blog.
RStudio:addins part 3 – View objects, files, functions and more with 1 keypress
(This article was first published on Jozef's Rblog, and kindly contributed to Rbloggers)
Introduction

In this post in the RStudio:addins series we will try to make our work more efficient with an addin for better inspection of objects, functions and files within RStudio. RStudio already has a very useful View function and a Go To Function / File feature with F2 as the default keyboard shortcut. And yes, I know I promised automatic generation of @importFrom roxygen tags in the previous post; unfortunately we will have to wait a bit longer for that one, but I believe this one more than makes up for it in usefulness.
The addin we will create in this article will let us use RStudio to View and inspect a wide range of objects, functions and files with one keypress.

Contents

- Retrieving objects from sys.frames
- Viewing files, objects, functions and more efficiently
- The addin function, updating the .dcf file and key binding
- The addin in action
- TL;DR – Just give me the package
- References
Retrieving objects from sys.frames

As a first step, we need to be able to retrieve the value of the object we are looking for, based on a character string, from a frame within the currently present sys.frames() of our session. This may get tricky, as it is not sufficient to only look at parent frames, because we may easily have multiple sets of “parallel” call stacks, especially when executing addins.
An example can be seen in the following screenshot, where we have a browser() call executed during the addin execution itself. We can see that our current frame is 18 and browsing through its parents would get us to frames 17 -> 16 -> 15 -> 14 -> 0 (0 being the .GlobalEnv). The object we are looking for is, however, most likely in one of the other frames (9 in this particular case):
getFromSysframes <- function(x) {
  if (!(is.character(x) && length(x) == 1 && nchar(x) > 0)) {
    warning("Expecting a non-empty character of length 1. Returning NULL.")
    return(invisible(NULL))
  }
  validframes <- c(sys.frames()[-sys.nframe()], .GlobalEnv)
  res <- NULL
  for (i in validframes) {
    inherits <- identical(i, .GlobalEnv)
    res <- get0(x, i, inherits = inherits)
    if (!is.null(res)) {
      return(res)
    }
  }
  return(invisible(res))
}

Viewing files, objects, functions and more efficiently

As a second step, we write a function to actually view our object in RStudio. We have quite some flexibility here, so as a first shot we can do the following:
 Open a file if the selection (or the selection with quotes added) is a path to an existing file. This is useful for viewing our scripts, data files, etc. even if they are not quoted, such as the links in your Rmd files
 Attempt to retrieve the object by the name and if found, try to use View to view it
 If we did not find the object, we can optionally still try to retrieve the value by evaluating the provided character string. This carries some pitfalls, but is very useful for example for
 viewing elements of lists, vectors, etc. where we need to evaluate [, [[ or $ to do so.
 viewing operation results directly in the viewer, as opposed to writing them out into the console, useful for example for wide matrices that (subjectively) look better in the RStudio viewer, compared to the console output
 If the View fails, we can still show useful information by trying to View its structure, enabling us to inspect objects that cannot be coerced to a data.frame and therefore would fail to be viewed.
This function can of course be improved and updated in many ways, for example using the summary method instead of str for selected object classes, or showing contents of .csv (or other data) files already read into a data.frame.
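The steps above can be sketched roughly as follows. This is an assumed reconstruction from the description in the post, not the author's actual code, so expect the real viewObject to differ:

```r
# Rough sketch of the viewObject() logic described above (assumptions:
# function body is mine; only the behaviour follows the post)
viewObject <- function(sel) {
  # 1. if the selection (with any quotes stripped) is an existing file, open it
  unquoted <- gsub("^[\"']|[\"']$", "", sel)
  if (file.exists(unquoted) && !dir.exists(unquoted)) {
    return(rstudioapi::navigateToFile(unquoted))
  }
  # 2. try to retrieve the object by name from the session's frames
  obj <- getFromSysframes(sel)
  # 3. optionally fall back to evaluating the string (handles x[[1]], df$col, ...)
  if (is.null(obj)) {
    obj <- tryCatch(eval(parse(text = sel)), error = function(e) NULL)
  }
  if (is.null(obj)) return(invisible(NULL))
  # 4. View the object; if that fails, View its structure instead
  tryCatch(View(obj, title = sel),
           error = function(e) View(utils::capture.output(str(obj)), title = sel))
}
```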
The addin function, updating the .dcf file and key binding

If you followed the previous posts in the series, you most likely already know what is coming up next. First, we need a function serving as a binding for the addin that will execute our viewObject function on the active document’s selections:
viewSelection <- function() {
  context <- rstudioapi::getActiveDocumentContext()
  lapply(X = context[["selection"]],
         FUN = function(thisSel) {
           viewObject(thisSel[["text"]])
         })
  return(invisible(NULL))
}

Secondly, we update the inst/rstudio/addins.dcf file by adding the binding for the newly created addin:
Name: viewSelection
Description: Tries to use View to View the object defined by a text selected in RStudio
Binding: viewSelection
Interactive: false

Finally, we reinstall the package and assign the keyboard shortcut in the Tools > Addins > Browse Addins... > Keyboard Shortcuts... menu. Personally I assigned a single F4 keystroke for this, as I use it very often:
The addin in action

Now, let’s view a few files, a data.frame, a function and a try-error class object just by pressing F4.
TL;DR – Just give me the package

- Get the status of the package after this article
- Or use git clone from https://gitlab.com/jozefhajnala/jhaddins.git

References

- Environments chapter of Advanced R
- Using RStudio's Data Viewer
To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog.
Tips for great graphics
(This article was first published on R – Insights of a PhD, and kindly contributed to Rbloggers)
R is a great program for generating top-notch graphics. But to get the best out of it, you need to put in a little more work. Here are a few tips for adapting your R graphics to make them look a little better.
1) Don't use the "File/Save as…" menu.
If you set up your graphic properly in the first place, then there's no need to post-process (e.g. crop, scale, etc.) the graphic in other software. Use the graphics devices (jpeg(), tiff(), postscript(), etc.), set your height and width to whatever you want the finished product to be, and then create the graph.
tiff("~/Manuscript/Figs/Fig1.tiff", width = 2, height = 2, units = "in", res = 600)
plot(dist ~ speed, cars) # cars is a base R dataset - data(cars)
dev.off()

The first argument to a graphics device such as tiff or jpeg is the filename, to which you can include the path. So that I don't have to worry about the order of arguments, I include the argument names. Width and height specify the width and height of the graphic in the units provided, in this case inches, but pixels, cm or mm can also be used. The resolution tells R what quality the graphic should be: the higher the better, but if you go too high, you could find that you have problems running out of space. I find 600 a nice compromise: nice crisp lines, smooth curves and sharp letters. You can then import that straight into MS Word or whatever other word processor you use, or upload it to go with that manuscript you've worked so hard on. Although, you could find yourself with BIG files; a 4×6 inch figure I made recently was 17.1 MB! For the web, a resolution of 100 or 200 is probably enough.
This technique also provides you with the same output every time, which is not the case if you adjust the size of the default device window produced by plot().
2) Don't be afraid to change the default settings!

Personally, I find that a 1 inch margin at the base of the graphic is a bit generous. I also find the ticks a bit long and the gap between the ticks and the axis labels a bit big. Luckily, these things are easy to change!
jpeg("t1.jpeg", width = 3, height = 3, units = "in", res = 100)
plot(dist ~ speed, cars)
dev.off()

The above code produces this figure.
See what I mean about the margins?
Here's how to change it!
par(mai = c(0.5, 0.5, 0.1, 0.1))
plot(dist ~ speed, cars, tck = 0.01, las = 1, mgp = c(1.4, 0.2, 0))

That call to par changes the "MArgin in Inches" setting. The tck argument deals with TiCK length (negative for outside, positive for inside the plot region), while mgp controls on which line certain things are printed (titles are the first element, labels are the second and the axis itself is the third). The las argument controls the rotation of the labels (1 for horizontal, 2 for perpendicular and 3 for vertical).
This leads me nicely to number 3: Don't be afraid to have separate lines for different parts of your plot.

This allows far more freedom and flexibility than setting par arguments in the plot call. You can have different mgp settings for each axis, for instance.
par(mai = c(0.4, 0.5, 0.1, 0.1))
plot(dist ~ speed, cars, xaxt = "n", mgp = c(1.4, 0.2, 0), las = 1, tck = 0.01)
axis(side = 1, tck = 0.01, las = 1, mgp = c(0.5, 0.2, 0))
mtext("speed", side = 1, line = 1)

This plots the same graph, but allows different distances for the x and y axes, in terms of margins and where the title is situated. The axis function places an axis on the side determined by its side argument, and mtext places Margin TEXT, again at the side given in its argument and, in this case, on line 1.
To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD.
How to plot with patchwork
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
INTRODUCTION
The goal of patchwork is to make it simple to combine separate ggplots into the same graphic. As such, it tries to solve the same problem as gridExtra::grid.arrange() and cowplot::plot_grid(), but using an API that incites exploration and iteration.
Installation
You can install patchwork from github with:
# install.packages("devtools")
devtools::install_github("thomasp85/patchwork")
The usage of patchwork is simple: just add plots together!
library(ggplot2)
library(patchwork)
p1 <- ggplot(mtcars) + geom_point(aes(mpg, disp))
p2 <- ggplot(mtcars) + geom_boxplot(aes(gear, disp, group = gear))
p1 + p2
You are of course free to also add the plots together as part of the same plotting operation:
ggplot(mtcars) +
geom_point(aes(mpg, disp)) +
ggplot(mtcars) +
geom_boxplot(aes(gear, disp, group = gear))
Layouts can be specified by adding a plot_layout() call to the assemble. This lets you define the dimensions of the grid and how much space to allocate to the different rows and columns.
p1 + p2 + plot_layout(ncol = 1, heights = c(3, 1))
If you need to add a bit of space between your plots you can use plot_spacer() to fill a cell in the grid with nothing
p1 + plot_spacer() + p2
You can make nested plot layouts by wrapping part of the plots in parentheses; in these cases the layout is scoped to the different nesting levels.
p3 <- ggplot(mtcars) + geom_smooth(aes(disp, qsec))
p4 <- ggplot(mtcars) + geom_bar(aes(carb))
p4 + {
p1 + {
p2 +
p3 +
plot_layout(ncol = 1)
}
} +
plot_layout(ncol = 1)
Advanced features
In addition to adding plots and layouts together, patchwork defines some other operators that might be of interest. - will behave like + but put the left and right side in the same nesting level (as opposed to putting the right side into the left side's nesting level). Observe:
p1 + p2 + p3 + plot_layout(ncol = 1)
This is basically the same as without braces (just like standard math arithmetic): the plots are added sequentially to the same nesting level. Now look:
p1 + p2 - p3 + plot_layout(ncol = 1)
Now p1 + p2 and p3 are on the same level.
Often you are interested in just putting plots beside or on top of each other. patchwork provides both | and / for horizontal and vertical layouts, respectively. They can of course be combined for a very readable layout syntax:
(p1 | p2 | p3) /
p4
There are two additional operators that are used for a slightly different purpose, namely to reduce code repetition. Consider the case where you want to change the theme for all plots in an assemble. Instead of modifying all plots individually you can use & or * to add elements to all subplots. The two differ in that * will only affect the plots on the current nesting level:
(p1 + (p2 + p3) + p4 + plot_layout(ncol = 1)) * theme_bw()
whereas & will recurse into nested levels:
p1 + (p2 + p3) + p4 + plot_layout(ncol = 1) & theme_bw()
Note that parentheses are required in the former case due to the higher precedence of the * operator. The latter case is the most common, so it has deserved the easiest use.
Related exercise sets:

- How to Display Multivariate Relationship Graphs With Lattice
- Lattice Exercises
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
Programmatically creating text output in R – Exercises
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
In the age of R Markdown and Shiny, or when making any custom output from your data, you want your output to look consistent and neat. Also, when writing your output you often want it to take on a specific (decorative) format defined by the HTML or LaTeX engine. These exercises are an opportunity to refresh our memory of functions such as paste, sprintf, formatC and others that are convenient tools to achieve these ends. All of the solutions rely partly on the ultra-flexible sprintf(), but there are no doubt many ways to solve the exercises with other functions; feel free to share your solutions in the comment section.
Example solutions are available here.
Exercise 1
Print out the following vector as prices in dollars (to the nearest cent):
c(14.3409087337707, 13.0648270623048, 3.58504267621646, 18.5077076398145, 16.8279241011882)
Example: $14.34
Exercise 2
Using these numbers c(25, 7, 90, 16) make a vector of filenames in the following format: file_025.txt. That is, left pad the numbers so they are all three digits.
Exercise 3
Actually, if we are only dealing with numbers less than a hundred, file_25.txt would have been enough. Change the code from the last exercise so that the padding is programmatically decided by the biggest number in the vector.
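One possible approach to exercises 2 and 3 (a sketch only; the official example solutions linked above may differ): sprintf's zero-padding flag does the work, and * takes the field width from an argument:

```r
nums <- c(25, 7, 90, 16)

# Exercise 2: fixed three-digit padding
sprintf("file_%03d.txt", nums)
# "file_025.txt" "file_007.txt" "file_090.txt" "file_016.txt"

# Exercise 3: width decided programmatically from the biggest number
width <- nchar(max(nums))
sprintf("file_%0*d.txt", width, nums)
# "file_25.txt" "file_07.txt" "file_90.txt" "file_16.txt"
```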
Exercise 4
Print out the following haiku on three lines, right aligned, with the help of cat. c("Stay the patient course.", "Of little worth is your ire.", "The network is down.").
Exercise 5
Write a function that converts a number to its hexadecimal representation. This is a useful skill when converting bmp colors from one representation to another. Example output:
tohex(12)
[1] "12 is c in hexadecimal"

Exercise 6
Take a string and programmatically surround it with the html header tag h1
Exercise 7
Back to the poem from exercise 4: let R convert it to an HTML unordered list, so that it would appear like the following in a browser:
 Stay the patient course.
 Of little worth is your ire.
 The network is down.
Exercise 8
Here is a list of some of the current top-rated movies on imdb.com: c("The Shawshank Redemption", "The Godfather", "The Godfather: Part II", "The Dark Knight", "12 Angry Men", "Schindler's List"). Convert them into a list compatible with written text.
Example output:
[1] "The top ranked films on imdb.com are The Shawshank Redemption, The Godfather, The Godfather: Part II, The Dark Knight, 12 Angry Men and Schindler's List"
Exercise 9
Now you should be able to solve this quickly: write a function that converts a proportion to a percentage, taking the number of decimal places as input. An input of 0.921313 with 2 decimal places should return "92.13%".
Exercise 10
Improve the function from the last exercise so the percentage consistently takes up 10 characters by doing some left padding. Raise an error if the percentage already happens to be longer than 10 characters.
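A sketch of one way to do exercises 9 and 10 (the function names are mine; the official solutions may differ):

```r
# Exercise 9: proportion -> percentage with a chosen number of decimals
to_pct <- function(x, digits) {
  sprintf(paste0("%.", digits, "f%%"), 100 * x)
}

# Exercise 10: left-pad to a consistent width of 10, error if too long
to_pct10 <- function(x, digits) {
  out <- to_pct(x, digits)
  if (any(nchar(out) > 10)) stop("percentage longer than 10 characters")
  formatC(out, width = 10) # right-justifies, i.e. pads on the left
}

to_pct(0.921313, 2)   # "92.13%"
to_pct10(0.921313, 2) # "    92.13%"
```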
Related exercise sets:

- Parallel Computing Exercises: Foreach and DoParallel (Part-2)
- Parallel Computing Exercises: Snow and Rmpi (Part-3)
- Building Shiny App exercises part 4
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
Slopegraphs and R – A pleasant diversion – May 26, 2018
(This article was first published on Chuck Powell, and kindly contributed to Rbloggers)
I try to at least scan the R-bloggers feed every day. Not every
article is of interest to me, but I often have
one of two different reactions to at least one article. Sometimes it is
an “ah ha” moment because the article is right on point for a problem
I have now or have had in the past and the article provides a (better)
solution. Other times my reaction is more of an “oh yeah”, because it
is something I have been meaning to investigate, or something I once
knew, but the article brings a different perspective to it.
The second case happened to me this week. I’ve been aware of slopegraphs
and bumpcharts for quite some time, and I certainly am aware of Tufte’s
work. As an amateur military
historian I’ve always loved, for example, his
poster depicting Napoleon’s
Russian Campaign. So when I saw the article from Murtaza
Haider titled
“Edward Tufte’s Slopegraphs and political fortunes in Ontario” I
just had to take a peek and revisit the topic.
The article does a good job of looking at slopegraphs in both R (via
plotrix) and Stata, even providing the code to do the work. My
challenge was that even though I’m adequate at plotting in base R, I
much prefer using ggplot2 wherever and whenever possible. My memory
was that I had seen another article on the related topic of a
bumpchart on Rbloggers in the not too distant past. A little
sleuthing turned up this earlier
article from Dominik
Koch who wrote some code to
compare national performance at the Winter Olympics, “Bump Chart –
Track performance over time”.
Finally, I wound up at this Github
repository for a project called
“Edward Tufte-Inspired Slopegraphs” from Thomas J.
Leeper who has been building code to make
slopegraphs using both base plotting functions and ggplot2.
My post today will draw a little bit from all their work and hopefully
provide some useful samples for others to draw on if they share some of
my quirks about data layout and a preference for ggplot2 versus base
plot. I’m going to focus almost exclusively on slopegraphs, although
much of the work could be extended to bumpcharts as well.
We’re going to make occasional use of dplyr to manipulate the data,
extensive use of ggplot2 to do the plotting and ggrepel to solve one
specific labeling problem. We’ll load them and I am suppressing the
message from dplyr about namespace overrides.
The original
post
is about plotting the data from some polling results in Ontario. For the
reader’s convenience I’ve made the data available via a structure
command. We have data about two different polling dates, for 5 political
parties, and the measured variable is percent of people supporting
expressed as x.x (i.e. already multiplied by
100).
Let’s just take the data as we have it and feed it to ggplot in a nice
simple fashion and see what we get with very little effort.
The nice thing about ggplot is once you get used to the syntax it
becomes very “readable”. We’ve identified our dataset, the x & y
variables and our grouping variable. Lines too big? An adjustment to
size = 2 does it. Don’t like colors? Pull the color = Party clause.
So we’re already pretty close to what we need. Things are scaled
properly and the basic labeling of titles etc. is accomplished. Our
biggest “problem” is that ggplot has been a little too helpful, adding
some things we’d like to remove to give it a more “Tuftesque”
look. So what we’ll do in the next few steps is add lines of code, but
they are mainly designed to remove unwanted elements. This is in
contrast to a base plot where we have to write the code to add elements.
So let’s:

- Move the x axis labels to the top with scale_x_discrete(position = "top")
- Change to a nice clean black and white theme with theme_bw()
- Not display any legend(s) with theme(legend.position = "none")
- Remove the default border from our plot with theme(panel.border = element_blank())
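Putting those four tweaks together, here is a minimal sketch. The column names (Date, Party, Pct) and the toy polling numbers are my stand-ins for the post's actual data, which isn't reproduced here:

```r
library(ggplot2)

# Toy stand-in for the Ontario polling data (columns assumed from the post)
polls <- data.frame(
  Date  = rep(c("11-May-18", "18-May-18"), each = 3),
  Party = rep(c("PC", "NDP", "Liberal"), 2),
  Pct   = c(40.8, 29.8, 19.1, 38.9, 35.9, 16.4)
)

p <- ggplot(polls, aes(x = Date, y = Pct, group = Party)) +
  geom_line(size = 2) +
  geom_point(size = 4) +
  scale_x_discrete(position = "top") +   # x axis labels on top
  theme_bw() +                           # clean black and white theme
  theme(legend.position = "none",        # no legend
        panel.border = element_blank())  # no default border
```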
Nice progress! Continuing to remove things that can be considered
“clutter” we add some additional lines that all end in
element_blank() and are invoked to remove default plot items such as
the plot grid, the y axis text, etc.
Very nice! We’re almost there! The “almost” is because now that we
have removed both the legend and all scales and tick marks we no longer
know who is who, and what the numbers are! Plus, I’m a little unhappy
with the way the titles are formatted, so we’ll play with that. Later,
I’ll get fancy but for now let’s just add some simple text labels on
the left and right to show the party name and their percentage. The code
geom_text(aes(label = Party)) will place the party name right on top
of the points that anchor either end of the line. If we make that
geom_text(aes(label = paste0(Party, " - ", Pct, "%"))) then we’ll get
labels that have both the party and the percent all neatly formatted,
but still right on top of the points that anchor the ends of the line.
hjust controls horizontal justification so if we change it to
geom_text(aes(label = paste0(Party, " - ", Pct, "%")), hjust = 1.35)
both sets of labels will slide to the left which is exactly what we want
for the May 11 labels but not the May 18 labels. If we feed hjust a
negative number they’ll go the other way. So what we’ll do is filter the
data using the filter function from dplyr and place the left hand
labels differently than the right hand labels. While we’re at it we’ll
make it bold face font and a little larger…
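In code, that left/right split might look like the following sketch (toy data again; the date strings and percentages are placeholders, not the post's actual numbers):

```r
library(ggplot2)
library(dplyr)

polls <- data.frame(
  Date  = rep(c("11-May-18", "18-May-18"), each = 2),
  Party = rep(c("PC", "NDP"), 2),
  Pct   = c(40.8, 29.8, 38.9, 35.9)
)

p <- ggplot(polls, aes(x = Date, y = Pct, group = Party)) +
  geom_line() +
  geom_point() +
  # left-hand labels: a positive hjust slides the text left of the point
  geom_text(data = filter(polls, Date == "11-May-18"),
            aes(label = paste0(Party, " - ", Pct, "%")),
            hjust = 1.35, fontface = "bold", size = 4) +
  # right-hand labels: a negative hjust slides the text right of the point
  geom_text(data = filter(polls, Date == "18-May-18"),
            aes(label = paste0(Party, " - ", Pct, "%")),
            hjust = -0.35, fontface = "bold", size = 4)
```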
Eureka! Not perfect yet but definitely looking good.
Adding complexity
I’m feeling pretty good about the solution so far, but there are three
things I’d like to make better.
1. How well will this solution work when we have more than two time
periods? We need to make sure it generalizes to a more complex case.
2. As Murtaza Haider notes in his post, we’ll have issues if the data
points are identical or very close together. Our very neat little
labels will overlap each other. In his post I believe he mentions that
he manually moved them in some cases. Let’s try and fix that.
3. Oh my, that’s a lot of code to keep cutting and pasting. Can we
simplify?
To test #1 and #2 I have “invented” a new dataset called moredata.
It is fictional: it’s labelled May 25th, but today is actually May
24th. But I created it to add a third polling date and to make sure that
we had a chance to test what happens when we have two identical
datapoints on the same day. Notice that on May 25th the polling numbers
for the Liberals and the NDP are identical at
26.8%.
You’ll notice at the beginning of this post I loaded the ggrepel
library. ggrepel works with ggplot2 to repel things that overlap,
in this case our geom_text labels. The invocation is geom_text_repel
and it is very similar to geom_text but allows us to deconflict the
overlaps. We’ll use hjust = "left" and hjust = "right" to control
justifying the labels. We’ll use a fixed nudge left and right (nudge_x =
-.45 and nudge_x = .5) to move the labels left and right off the
plotted data points and we will explicitly tell geom_text_repel to
only move the labels vertically to avoid overlap with direction = "y".
Everything else remains the same.
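As a concrete sketch of those geom_text_repel settings (toy data; which hjust pairs with which side is my reading of the description above, so treat the pairing as an assumption):

```r
library(ggplot2)
library(dplyr)
library(ggrepel)

polls <- data.frame(
  Date  = rep(c("11-May-18", "18-May-18", "25-May-18"), each = 2),
  Party = rep(c("Liberal", "NDP"), 3),
  Pct   = c(26.0, 29.8, 25.0, 35.9, 26.8, 26.8)  # identical values on May 25
)

p <- ggplot(polls, aes(x = Date, y = Pct, group = Party)) +
  geom_line() +
  geom_point() +
  # left side: right-justified text, nudged left, repelled vertically only
  geom_text_repel(data = filter(polls, Date == "11-May-18"),
                  aes(label = paste0(Party, " - ", Pct, "%")),
                  hjust = "right", nudge_x = -.45, direction = "y") +
  # right side: left-justified text, nudged right, repelled vertically only
  geom_text_repel(data = filter(polls, Date == "25-May-18"),
                  aes(label = paste0(Party, " - ", Pct, "%")),
                  hjust = "left", nudge_x = .5, direction = "y")
```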
Very nice! We have confirmed that our solution works for more than two
dates without any additional changes and we have found a solution to the
label overlap issue. In a little while we’ll talk about labeling the
data points in the center (if we want to).
Before we move on let’s make our life a little simpler. While the output
plot is good it’s a lot of code to produce one graph. Let’s see if we
can simplify…
Since ggplot2 objects are just regular R objects, you can put them in a
list. This means you can apply all of R’s great functional programming
tools. For example, if you wanted to add different geoms to the same
base plot, you could put them in a list and use lapply().
But for now let’s at least take all the invariant lines of code and put
them in a list. Then when we go to plot we can just invoke the list and
remain confident we get the right formatting. For now let’s name this
list something quaint and obvious like MySpecial.
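As a sketch of that idea, here is one possible MySpecial; its exact contents are my guess at the invariant formatting lines used above:

```r
library(ggplot2)

# Invariant formatting collected once, reused across plots
MySpecial <- list(
  scale_x_discrete(position = "top"),
  theme_bw(),
  theme(legend.position = "none",
        panel.border    = element_blank(),
        panel.grid      = element_blank(),
        axis.text.y     = element_blank(),
        axis.ticks      = element_blank())
)

polls <- data.frame(
  Date  = rep(c("11-May-18", "18-May-18"), each = 2),
  Party = rep(c("PC", "NDP"), 2),
  Pct   = c(40.8, 29.8, 38.9, 35.9)
)

# Adding a list of components works just like adding them one by one
p <- ggplot(polls, aes(x = Date, y = Pct, group = Party)) +
  geom_line(size = 2) +
  MySpecial
```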
MySpecial is actually an incredibly complex structure so I used the
summary function. What’s important to us is that in the future all we
need to do is include it in the ggplot command and magic happens.
Perhaps another day I’ll make it a proper function but for now I can
change little things like line size or titles and labels without
worrying about the rest. So here it is with some little things changed.
Feeling good about the solution so far I decided to press on to a much
more complex problem. Thomas J. Leeper has
a nice plot of Tufte’s Cancer survival slopegraph.
N.B. the original Tufte plot is not accurate on the vertical scale:
look at Prostate and Thyroid, for example; visually I would argue
they should cross to reflect the data.
Let’s grab the data as laid out by
Tufte.
            Year.5  Year.10  Year.15  Year.20
Prostate        99       95       87       81
Thyroid         96       96       94       95
Testis          95       94       91       88
Melanomas       89       87       84       83
Breast          86       78       71       75
Hodgkin’s       85       80       74       67
Uterus          84       83       81       79
Urinary         82       76       70       68
Cervix          71       64       63       60
Larynx          69       57       46       38
There, we have it in a neat data frame but not organized as we need it.
Not unusual, and an opportunity to use some other tools from broom and
reshape2. Let’s do the following:
- Transpose the data with t
- Use broom::fix_data_frame to get valid column names and convert
rownames to a proper column, all in one function. Right now the types
of cancer are nothing but rownames.
- Use reshape2::melt to take our transposed dataframe and convert it
to long format so we can send it off to ggplot. Along the way we’ll
rename the resulting dataframe newcancer, with columns named Year,
Type and Survival.
Now we have whipped the data into the shape we need it. 96 rows with the
three columns we want to plot, Year, Type, and Survival. If you
look at the data though, you’ll notice two small faults. First, Year
is not a factor. The plot will work but have an annoying limitation.
Since “Year.5” is a character string it will be ordered after all the
other years. We could fix that on the fly within our ggplot call but I
find it cleaner and more understandable if I take care of that first.
I’ll use the factor function from base R to accomplish that and
while I’m at it make the values nicer looking. Second, in three cases R
changed cancer type names because they couldn’t be column names in a
dataframe. I’ll use forcats::fct_recode to make them look better.
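A condensed sketch of that pipeline, run on just the first two rows of the table above. I substitute a base-R rownames step for broom::fix_data_frame, and the “nicer looking” year labels are my guess:

```r
library(reshape2)

# Two rows of the Tufte survival table, cancer types as rownames
cancer <- data.frame(Year.5  = c(99, 96),
                     Year.10 = c(95, 96),
                     Year.15 = c(87, 94),
                     Year.20 = c(81, 95),
                     row.names = c("Prostate", "Thyroid"))

# Transpose so the years become rows, then promote rownames to a
# proper, ordered factor column
tcancer <- as.data.frame(t(cancer))
tcancer$Year <- factor(rownames(tcancer),
                       levels = c("Year.5", "Year.10", "Year.15", "Year.20"),
                       labels = c("5 Year", "10 Year", "15 Year", "20 Year"))

# Long format: one row per (Year, Type) combination with its Survival value
newcancer <- melt(tcancer, id.vars = "Year",
                  variable.name = "Type", value.name = "Survival")
nrow(newcancer)  # 2 types x 4 years = 8 rows
```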
Now that we have the data the way we want it we can make our slopegraph.
Some of the necessary changes are obvious x = Year, y = Survival and
group = Type for example. Since there are a lot of plotted lines I’ve
reduced the weight or size of the individual lines. We no longer want to
plot the big round points, we’re going to substitute in the actual
numbers, so that line gets commented out. The left and right labels
require no change and geom_text_repel will keep them from overlapping
which is almost inevitable given the data. To put the actual survival
numbers on the plot we’ll turn to geom_label. It’s like geom_text
only it puts a label box around the text. We’ll choose a smallish size,
minimize the amount of padding, and make the border of the box
invisible. The end result is what we want. It overlays on top of the
lines we’ve already plotted and the invisible padding gives us just
enough room.
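A sketch of that geom_label layer on a two-type subset of the data; the specific values for “smallish” size and minimal padding are my guesses, not the post's:

```r
library(ggplot2)

newcancer <- data.frame(
  Year = factor(rep(c("5 Year", "10 Year", "15 Year", "20 Year"), 2),
                levels = c("5 Year", "10 Year", "15 Year", "20 Year")),
  Type = rep(c("Prostate", "Thyroid"), each = 4),
  Survival = c(99, 95, 87, 81, 96, 96, 94, 95)
)

p <- ggplot(newcancer, aes(x = Year, y = Survival, group = Type)) +
  geom_line(size = 0.5) +
  # survival numbers drawn on top of the lines in borderless label boxes
  geom_label(aes(label = Survival), size = 2.5,
             label.padding = grid::unit(0.05, "lines"),
             label.size = 0)   # label.size = 0 makes the border invisible
```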
I hope you’ve found this useful. I am always open to comments,
corrections and suggestions.
Chuck (ibecav at gmail dot
com)
This
work is licensed under a
Creative
Commons Attribution-ShareAlike 4.0 International License.
To leave a comment for the author, please follow the link and comment on their blog: Chuck Powell. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Safe Disposal of Unexploded WWII Bombs
(This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)
Abstract
Unexploded WWII bombs are ticking threats despite being dropped more than 70 years ago. In this post we explain how statistical methods are used to plan the search and disposal of unexploded WWII bombs. In particular we consider and exemplify the nonparametric nearest neighbour distance (NND) method implemented in the R package highriskzone. The method analyses the spatial pattern of exploded bombs to determine so-called risk zones, that is, regions with a high likelihood of containing unexploded bombs. The coverage of such risk zones is investigated through both nonparametric and parametric point process simulation.
NCAP aerial photo from 1944 showing the bombing of the V2 rocket facility at Peenemünde, Germany. Image is available under a custom NCAP license – higher resolution images are available from NCAP.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.
\[
\newcommand{\bm}[1]{\boldsymbol{\mathbf{#1}}}
\]
During WWII Germany was pounded with about 1.5 mio tons of bombs of which about 10-15% did not explode. More than 70 years after the end of WWII these unexploded bombs (UXBs) still pose a threat and are the frequent cause of large scale evacuations to secure their safe disposal when found. Luckily, lethal incidents are rare thanks to a huge effort to localise and safely dismantle UXBs. As part of this effort, aerial photos taken by the allies after the attacks provide valuable information about the possible locations of UXBs. Some UXBs are directly visible in the photos – see for example the green circles in this NCAP image or p. 6 in the following information flyer by one of the companies offering such aerial identification services (featured in this news article). In other cases the photos only provide information about the location of the exploded bombs. This information can be used to identify areas where there is a high likelihood of UXBs. Such areas would then be carefully scrutinized using on-the-ground search methods, for example, electromagnetic and magnetic detectors.
The aim of Mahling, Höhle, and Küchenhoff (2013) was to develop statistical methods for the identification of such risk zones in cooperation with Oberfinanzdirektion Niedersachsen, which supports the removal of unexploded ordnance in federal properties in Germany. In what follows we will discuss one of the methods used in the paper, the so-called nearest neighbour distance method, and illustrate its implementation in the R package highriskzone originally created by Heidi Seibold and now maintained by Felix Günther.
Mathematical Setup
Casting matters into mathematical notation: Let \(X\) be a point process denoting the spatial locations of all bombs dropped in the particular window of interest \(W \subseteq \mathbb{R}^2\). Furthermore, let \(Y\) denote the observed point process of exploded bombs and \(Z=X\backslash Y\) the point process of unexploded bombs. Note that only the process \(Y\) is observed; \(Z\) is not observed and is the target of our inference. We assume that the probability \(q\) of a dropped bomb not exploding is homogeneous in \(W\). Thus if \(X\) is an inhomogeneous Poisson point process with intensity function \(\lambda_X(\bm{s})\), \(\bm{s}\in W\), then
\[
\lambda_Y(\bm{s}) = (1-q) \lambda_X(\bm{s})
\quad \text{and}\quad
\lambda_Z(\bm{s}) = q \lambda_X(\bm{s})
\]
are the intensity functions of \(Y\) and \(Z\), respectively.
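The thinning relation can be checked numerically: independently deleting each point of \(X\) with probability \(q\) scales the expected number of points by \(1-q\). A small base-R sketch (the dud probability and intensity are arbitrary illustrative values, not estimates from the data):

```r
set.seed(1)
q  <- 0.12                                    # assumed dud probability
nX <- rpois(10000, lambda = 500)              # counts of X in W, many replays
nY <- rbinom(10000, size = nX, prob = 1 - q)  # independent (1-q)-thinning -> Y

# Empirical means should be close to 500 and (1 - q) * 500 = 440
c(mean(nX), mean(nY))
```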
The figure below shows \(Y\) for an actual observed WWII bomb point pattern consisting of n=443 bombs available from the R package highriskzone (Seibold et al. 2017). The observation window contains a particular area of interest for which a risk assessment needs to be done – often these contain a known WWII military target, e.g., an airport, an arms factory or a military casern. In order to not disclose the exact location of the considered area, coordinates are given relative to an arbitrary origin. In reality one would closely link such digitized aerial images to other terrain features using a GIS system.
library(highriskzone)
library(spatstat)
data(craterA)
summary(craterA)
## Planar point pattern: 443 points
## Average intensity 0.0001082477 points per square unit
##
## Coordinates are given to 4 decimal places
##
## Window: polygonal boundary
## single connected closed polygon with 208 vertices
## enclosing rectangle: [0, 2334.3758] x [0, 2456.4061] units
## Window area = 4092470 square units
## Fraction of frame area: 0.714
The point pattern craterA, corresponding to an instance of the process \(Y\), is provided in R as an object of class ppp from the R package spatstat (Baddeley, Rubak, and Turner 2015). Instead of inferring the locations in \(Z\) directly, we shall be interested in determining a region within \(W\), a so-called high-risk zone, which with a high likelihood contains the points of \(Z\). We shall consider two methods for this job: a nonparametric method based on nearest neighbour distances in \(Y\) and an intensity function based inhomogeneous Poisson process approach incorporating \(q\).
High Risk Zones
A heuristic way to determine a high-risk zone is the following: Determine the distribution function \(D\) of the nearest neighbour distance (NND) distribution based on the 443 points in the point pattern. Use the distribution to determine a desired quantile, say \(0 \leq p \leq 1\), of the NND distribution. Denoting the \(p\) sample quantile of the NND distribution by \(Q(p)\), a \(p\)-quantile NND based high-risk zone is then obtained as the union of discs of radius \(Q(p)\) around each observed exploded bomb in \(Y\) – in mathematical terms:
\[
R_p = \left(\bigcup_{\bm{y} \in Y} B(\bm{y}, Q(p))\right) \bigcap W =
\left\{\bm{s} \in W : \min_{\bm{y}\in Y} \|\bm{s} - \bm{y}\| \leq Q(p) \right\},
\]
where \(B(\bm{s}, r)\) denotes a disc of radius \(r\) around the point \(\bm{s}\) and \(\|\bm{s} - \bm{y}\|\) is the distance between the two points \(\bm{s}\) and \(\bm{y}\). The intersection with \(W\) is done in order to guarantee that the risk zone lies entirely within \(W\). As an example, we would determine the 99%-quantile NND zone for craterA using spatstat functionality as follows:
(Qp <- quantile(nndist(craterA), p = 0.99, type = 8))
##      99%
## 142.1391
dmap <- distmap(craterA)
zone_dist <- eval.im( dmap < Qp )
The above can also be done directly using the highriskzone::det_hrz function:
zone_dist <- det_hrz(craterA, type = "dist", criterion = "indirect", cutoff = 0.99)
Either way, the resulting risk zone is as follows:
summary(zone_dist)
## high-risk zone of type dist
## criterion: indirect
## cutoff: 0.99
##
## threshold: 142.1391
## area of the high-risk zone: 2574507
Mahling, Höhle, and Küchenhoff (2013) show that risk zones constructed by the NND method work surprisingly well despite lacking a clear theoretical justification. One theoretical issue is, for example, that the NND distribution function determined by the above method is for the \((1-q)\)-thinned process \(Y\), even though the actual interest is in the process \(X=Y\cup Z\). Because of the thinning one would typically have that \(D_Y(r) \leq D_X(r)\) and thus \(Q_Y(p) > Q_X(p)\). Using \(Q_Y(p)\) to make statements about \(X\) (and thus \(Z\)) is therefore slightly wrong. However, this error cancels, because we then use the points in \(Y\) to add a buffer of radius \(Q_Y(p)\). Had we instead used the smaller, but true, \(Q_X(p)\) the risk zone would have gotten too small, because \(X\) would also have contained more points to form discs around than \(Y\). The method thus implicitly takes \(q\) nonparametrically into account, because its NND is determined based on \(Y\) and subsequently discs of radius \(Q_Y(p)\) are formed around the points of \(Y\).
Technical details you might want to skip: The above feature is most easily illustrated if \(X\) is a homogeneous Poisson process with intensity \(\lambda_X\). In this case we have that the NND distribution function is (p.68, Illian et al. 2008)
\[
D_X(r) = 1 - \exp(-\lambda_X \pi r^2), \quad r>0.
\]
Also note that \(D_Y(r) = 1 - \exp(-(1-q)\lambda_X \pi r^2)\) and therefore \(D_Y(r) < D_X(r)\). Now solving for the \(p\)-quantile of the NND in this homogeneous Poisson case means solving
\[
Q_Y(p) = \min_{r\geq 0} \{ D_Y(r) \geq p \} \Leftrightarrow Q_Y(p) =
\sqrt{-\frac{\log(1-p)}{\lambda_Y \pi}}.
\]
From this it becomes clear that in the homogeneous Poisson case \(Q_Y(p)\) is a factor \(\sqrt{1/(1-q)}\) larger than \(Q_X(p)\), which is the actual target of interest.
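This factor is easy to verify numerically from the closed-form quantile (the dud probability and intensity below are arbitrary illustrative values):

```r
# p-quantile of the NND distribution of a homogeneous Poisson process
Q <- function(p, lambda) sqrt(-log(1 - p) / (lambda * pi))

q       <- 0.1      # dud probability (illustrative)
lambdaX <- 1e-4     # intensity of X (illustrative)
lambdaY <- (1 - q) * lambdaX

# The ratio Q_Y(p) / Q_X(p) equals sqrt(1/(1-q)) for any p
Q(0.99, lambdaY) / Q(0.99, lambdaX)
sqrt(1 / (1 - q))
```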
Assessing the coverage of a risk zone
Two criteria appear immediate in order to assess the coverage of a risk zone \(R\):

1. The probability \(p_{\text{out}}\) that there will be at least one bomb outside the risk zone, i.e. \(P( N( Z \backslash R) > 0)\), where \(N(A)\) denotes the number of events in the set \(A \subseteq W\). Note: this probability depends heavily on the number of points in \(Z\); the more points there are, the higher is \(p_{\text{out}}\). However, it reflects the idea "one miss is all it takes to get in trouble".

2. The proportion of events in \(Z\) not located in \(R\), i.e. \(N( Z \backslash R) / N(Z)\); we shall denote this criterion by \(p_{\text{miss}}\). Note: this criterion takes possibly different sizes of \(Z\) into account and is a more relative measure of how many bombs are not covered by the zone.
Under the assumption of independence between whether \(Z\)-events are within or outside the risk zone, one can convert back and forth between \(p_{\text{miss}}\) and \(p_{\text{out}}\) by
\[
p_{\text{out}} = P( N( Z \backslash R) > 0) = 1 - P(N(Z
\backslash R) = 0) \approx 1 - (1-p_{\text{miss}})^{N(Z)},
\]
where one in a simulation setup would know \(Z\) and thus also \(N(Z)\). Note that for a \(p\)-quantile NND risk zone we expect \(1-p_{\text{miss}}\) to be approximately equal to \(p\). We can investigate the behaviour of risk zones according to the two above criteria through simulation: either by simply \(q\)-thinning the existing point pattern \(Y\) and then using this thinned pattern to determine a risk zone, which is then evaluated. An alternative approach is to estimate the intensity surface from \(Y\), upscale it to get the intensity of \(X\), simulate \(X\) as an inhomogeneous Poisson point process with this intensity surface, thin this pattern to get a simulated instance of \(Y\), construct the risk zone based on this pattern and then evaluate the coverage of the zone (Mahling, Höhle, and Küchenhoff 2013). Note that this type of simulation is based on more assumptions compared to the nonparametric thinning simulation approach.
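As a quick sanity check of this approximate conversion, one can plug in the averages reported by the thinning simulation further down (\(p_{\text{miss}} \approx 0.00118\), \(N(Z) \approx 44.3\)):

```r
p_miss <- 0.00118   # average proportion of missed bombs (from the simulation)
nZ     <- 44.3      # average number of unexploded bombs (from the simulation)

# Approximate probability of missing at least one bomb
p_out_derived <- 1 - (1 - p_miss)^nZ
p_out_derived  # roughly 0.051
```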
We generate 1,000 realizations of \(Y\) and \(Z\) through \(q\)thinning of the original craterA pattern while computing the coverage measures for the NND method as follows:
suppressPackageStartupMessages(library(doParallel))
registerDoParallel(cores = 4)
simulate_method <- "thinning" # "intensity" # "cond_intensity"
sim <- foreach(i = seq_len(getDoParWorkers()), .combine = rbind) %dopar% {
  tryCatch(
    eval_method(craterA, type = c("dist"), criterion = c("indirect"),
                cutoff = 0.99, numit = 250,
                simulate = simulate_method, pbar = FALSE),
    error = function(e) return(NULL))
}
## # A tibble: 1 x 5
##   p_out  p_miss `1-p_miss` p_out_derived    nZ
## 1 0.051 0.00118      0.999        0.0509  44.3
The numbers state the average p_out and p_miss in the 1,000 simulations. Furthermore, nZ denotes the average number of events in \(Z\). We see that the NND method performs even a little better than intended, because \(1-p_{\text{miss}}\) is even higher than the intended \(p\)=99%. The probability that the risk zone misses at least one bomb is as low as 0.051. This is quite close to the above described approximate conversion from \(p_{\text{miss}}\) (0.051 vs. 0.0509). Changing the simulation method for \(X\) to that of an inhomogeneous Poisson process with intensity \(1/(1-q) \cdot \hat{\lambda}_Y(\bm{s})\) yields similar results:
## # A tibble: 1 x 5
##   p_out p_miss `1-p_miss` p_out_derived    nZ
## 1  0.47 0.0143      0.986         0.511  49.6
We note that the probability of missing at least one bomb is much higher under this parametric simulation method. Only a small fraction of this is explained by \(Z\) now consisting of more points. A likely explanation is that the parametric model is only semi-adequate to describe how the point patterns form. Therefore, the new \(X\) might have a somewhat different neighbourhood distribution than anticipated.
To compare more specifically with the intensity function based risk-zone method of Mahling, Höhle, and Küchenhoff (2013), we use a specification where the risk zones derived by the NND method and the intensity method have the same area (250 hectare).
sim_area <- foreach(i = seq_len(getDoParWorkers()), .combine = rbind) %dopar% {
  tryCatch(
    eval_method(craterA, type = c("dist", "intens"), criterion = rep("area", 2),
                cutoff = rep(2500000, 2), numit = 100,
                simulate = simulate_method, pbar = FALSE),
    error = function(e) return(NULL))
}
## # A tibble: 2 x 6
##   Type   p_out  p_miss `1-p_miss` p_out_derived area_zone
## 1 dist   0.123 0.00278      0.997         0.117  2500009.
## 2 intens 0.55  0.0172       0.983         0.539  2499994.
For the particular example we see an advantage of using the NND method, because both p_out and p_miss are much lower than for the intensity based method. Again, this might be because the intensity method relies on assumptions which for the particular example do not appear to be so adequate. Results in Mahling (2013) were, however, much better for this example (c.f. Tab 2), which could be an indication that there is a problem in the highriskzone package implementation of this method?
Discussion
Being a statistician is fascinating, because the job is the entry ticket to so many diverse research fields. The proposed methods and evaluations helped the Oberfinanzdirektion obtain a quantitative framework to decide which methods to use in their routine risk assessment. Further details on the above application can be found in Mahling, Höhle, and Küchenhoff (2013) as well as in Monia’s Ph.D. dissertation (Mahling 2013). Note also that the techniques are not limited to UXB detection: infer-unknown-points-from-a-thinned-process problems occur both in 1D and 2D point processes in a range of other fields, e.g., underreporting of infectious disease locations or in the calculation of animal abundance in ecology.
As a personal anecdote: When finishing the work on Mahling, Höhle, and Küchenhoff (2013) I was in transition from university to working at a public health institute. The deal was to finish the UXB work partly in spare time and partly in the new work time. To honour this I added my new work place as second affiliation before submitting, but as part of the institution’s internal clearing procedure of the publication, I was asked by higher management to remove this affiliation again, because the work ‘had nothing to do with public health’. While it’s questionable whether exploding bombs really do not have a public health impact, a few months later I ended up using very similar statistical techniques to model occurred-but-not-yet-reported cases during a critical infectious disease outbreak (Höhle and an der Heiden 2014).
Literature
Baddeley, A., E. Rubak, and R. Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. London: Chapman; Hall/CRC Press.
Höhle, M., and M. an der Heiden. 2014. “Bayesian Nowcasting During the STEC O104:H4 Outbreak in Germany, 2011.” Biometrics 70 (4): 993–1002. doi:10.1111/biom.12194.
Illian, J., A. Penttinen, H. Stoyan, and D. Stoyan. 2008. Statistical Analysis and Modelling of Spatial Point Patterns. Wiley.
Mahling, M. 2013. “Determining High-Risk Zones by Using Spatial Point Process Methodology.” PhD thesis, Department of Statistics, University of Munich. https://edoc.ub.uni-muenchen.de/15886/1/Mahling_Monia.pdf.
Mahling, M., M. Höhle, and H. Küchenhoff. 2013. “Determining high-risk zones for unexploded World War II bombs by using point process methodology.” Journal of the Royal Statistical Society, Series C 62 (2): 181–99. doi:10.1111/j.1467-9876.2012.01055.x.
Seibold, H., M. Mahling, S. Linne, and F. Günther. 2017. highriskzone: Determining and Evaluating High-Risk Zones. https://cran.r-project.org/web/packages/highriskzone/index.html.
To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice....
2018-05 Selective Raster Graphics
(This article was first published on R – Stat Tech, and kindly contributed to R-bloggers)
This report explores ways to render specific components of an R plot in raster format, when the overall format of the plot is vector. For example, we demonstrate ways to draw raster data symbols within a PDF scatter plot. A general solution is provided by the grid.rasterize function from the R package ‘rasterize’.
Paul Murrell
To leave a comment for the author, please follow the link and comment on their blog: R – Stat Tech.