R-bloggers
R news and tutorials contributed by hundreds of R bloggers

The Power of Standards and Consistency

Sun, 05/27/2018 - 06:32

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I’m going to (eventually) write a full post on the package I’m mentioning in this one: osqueryr. The TLDR on osqueryr is that it is an R DBI wrapper (that has just enough glue to also be plugged into dbplyr) for osquery. The TLDR on osquery is that it “exposes an operating system as a high-performance relational database. This design allows you to write SQL-based queries efficiently and easily to explore operating systems.”

In short, osquery turns the metadata and state information of your local system (or remote system(s)) into a SQL-compliant database. It also works on Windows, Linux, BSD and macOS. This means you can query a fleet of systems with a (mostly) normalized set of tables and get aggregated results. Operations and information security staff use this to manage systems and perform incident response tasks, but you can use it to get just about anything and there are even more powerful modes of operation for osquery. But, more on all the features of osquery[r] in another post.

If you are skeptical, here’s some proof (which I need to show regardless of your skepticism state). First, a local “connection”:

library(DBI)
library(osqueryr)

con <- DBI::dbConnect(Osquery())

head(dbListTables(con), 10)
##  [1] "account_policy_data" "acpi_tables"         "ad_config"
##  [4] "alf"                 "alf_exceptions"      "alf_explicit_auths"
##  [7] "alf_services"        "app_schemes"         "apps"
## [10] "apt_sources"

dbListFields(con, "processes")
##  [1] "cmdline"            "cwd"                "disk_bytes_read"
##  [4] "disk_bytes_written" "egid"               "euid"
##  [7] "gid"                "name"               "nice"
## [10] "on_disk"            "parent"             "path"
## [13] "pgroup"             "pid"                "resident_size"
## [16] "root"               "sgid"               "start_time"
## [19] "state"              "suid"               "system_time"
## [22] "threads"            "total_size"         "uid"
## [25] "user_time"          "wired_size"

dbGetQuery(con, "SELECT name, system_time FROM processes WHERE name LIKE '%fire%'")
## # A tibble: 2 x 2
##   name     system_time
## 1 Firewall           3
## 2 firefox       517846

then, a remote "connection":

con2 <- osqueryr::dbConnect(Osquery(), host = "hrbrmstr@osq1")

head(dbListTables(con2), 10)
##  [1] "account_policy_data" "acpi_tables"         "ad_config"
##  [4] "alf"                 "alf_exceptions"      "alf_explicit_auths"
##  [7] "alf_services"        "app_schemes"         "apps"
## [10] "apt_sources"

dbListFields(con2, "processes")
##  [1] "cmdline"            "cwd"                "disk_bytes_read"
##  [4] "disk_bytes_written" "egid"               "euid"
##  [7] "gid"                "name"               "nice"
## [10] "on_disk"            "parent"             "path"
## [13] "pgroup"             "pid"                "resident_size"
## [16] "root"               "sgid"               "start_time"
## [19] "state"              "suid"               "system_time"
## [22] "threads"            "total_size"         "uid"
## [25] "user_time"          "wired_size"

dbGetQuery(con2, "SELECT name, system_time FROM processes WHERE name LIKE '%fire%'")
## # A tibble: 1 x 2
##   name    system_time
## 1 firefox     1071992

"You're talking an awful lot about the package when you said this was a post on 'standards' and 'consistency'."

True, but we needed that bit above for context. To explain what this post has to do with "standards" and "consistency" I also need to tell you a bit more about how both osquery and the osqueryr package are implemented.

You can read about osquery in-depth starting at the link at the top of this post, but the authors of the tool really wanted a consistent idiom for accessing system metadata with usable, normalized output. They chose (to use a word they didn't but one that works for an R audience) a "data frame" as the output format and picked the universal language of "data frames" -- SQL -- as the inquiry interface. So, right there are examples of both standards and consistency: using SQL vs coming up with yet-another-query-language and avoiding the chaos of the myriad of outputs from various system commands by making all results conform to a rectangular data structure.

Let's take this one step further with a specific example. All modern operating systems have the concept of a "process" and said processes have (mostly) similar attributes. However, the commands used to get a detailed listing of those processes differ (sometimes wildly) from OS to OS. The authors of osquery came up with a set of schemas to ensure a common, rectangular output and naming conventions (note that some schemas are unique to a particular OS since some elements of operating systems have no useful counterparts on other operating systems).

The osquery authors also took consistency and standards to yet-another-level by taking advantage of a feature of SQLite called virtual tables. That enables them to have C/C++/Objective-C "glue" that gets called when a query is made so they can dispatch the intent to the proper functions or shell commands and then send all the results back -- or -- use the SQLite engine capabilities to do joining, filtering, UDF-calling, etc to produce rich, targeted rectangular output back.

By not reinventing the wheel and relying on well-accepted features like data frames, SQL and SQLite, the authors could direct all their focus on solving the problem they posited.

"Um, you're talking alot about everything but R now."

We're getting to the good (i.e. "R") part now.

Because the authors didn't try to become experts at writing SQL parsers and instead relied on the standard SQL offerings of SQLite, the queries made are "real" SQL (if you've worked with more than one database engine, you know how they all implement different flavours of SQL).

Because these queries are "real" SQL, we can write an R DBI driver for it. The DBI package aims "[to define] a common interface between R and database management systems (DBMS). The interface defines a small set of classes and methods similar in spirit to Perl's DBI, Java's JDBC, Python's DB-API, and Microsoft's ODBC. It defines a set of classes and methods that define what operations are possible and how they are performed."

If you look at the osqueryr package source, you'll see a bunch of DBI boilerplate code (which is in the r-dbi organization example code) and only a handful of "touch points" for the actual calls to osqueryi (the command that processes SQL). No handling of anything but passing on SQL to the osqueryi engine and getting rectangular results back. By abstracting the system call details, R users can work with a familiar, consistent, standard interface and have full access to the power of osquery without firing up a terminal.
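To make that "touch point" idea concrete, here is a rough sketch of how SQL can be handed to osqueryi and turned back into a data frame. This is my own illustration of the concept, not the osqueryr package's actual internals; it assumes osqueryi is on the PATH, uses its --json output mode, and leans on the jsonlite package for parsing.

# a sketch, not osqueryr's code: pass SQL to the osqueryi binary and
# convert its JSON output (an array of row objects) into a data frame
query_osqueryi <- function(sql, osqueryi = "osqueryi") {
  out <- sys::exec_internal(osqueryi, c("--json", sql))
  jsonlite::fromJSON(rawToChar(out$stdout))
}
# query_osqueryi("SELECT name, pid FROM processes LIMIT 5")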

But it gets even better.

As noted above, one design aspect of osquery was to enable remote usage. Rather than come up with yet-another-daemon-and-custom-protocol, the osquery authors suggest ssh as one way of invoking the command on remote systems and getting the rectangular results back.

Because the osqueryr package used the sys package for making local system calls, there was only a tiny bit of extra effort required to switch from sys::exec_internal() to a sibling call in the ssh package -- ssh::ssh_exec_internal() when remote connections were specified. (Said effort could have been zero if I chose a slightly different function in sys, too.)
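The local/remote switch can be as small as the sketch below (again an illustration with made-up names, not the package's code): the SQL handling stays identical and only the transport changes, because both sys::exec_internal() and ssh::ssh_exec_internal() return a list with status, stdout and stderr.

# hypothetical dispatcher: run osqueryi locally or over ssh
exec_osquery <- function(sql, host = NULL, osqueryi = "osqueryi") {
  if (is.null(host)) {
    # local call via the sys package
    out <- sys::exec_internal(osqueryi, c("--json", sql))
  } else {
    # remote call via the ssh package, e.g. host = "hrbrmstr@osq1"
    session <- ssh::ssh_connect(host)
    on.exit(ssh::ssh_disconnect(session))
    out <- ssh::ssh_exec_internal(session, paste(osqueryi, "--json", shQuote(sql)))
  }
  rawToChar(out$stdout) # JSON text, ready to be parsed as above
}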

Relying on well-accepted standards made both osqueryi and the R DBI-driver work seamlessly without much code at all and definitely without a rat's nest of nested if/else statements and custom httr functions.

But it gets even more better-er

Some folks like & grok SQL, others don't. (Humans have preferences, go figure.)

A few years ago, Hadley (do I even need to use his last name at this point in time?) came up with the idea to have a more expressive and consistent way to work with data frames. We now know this as the tidyverse but one core element of the tidyverse is dplyr, which can really level-up your data frame game (no comments about data.table, or the beauty of base R, please). Not too long after the birth of dplyr came the ability to work with remote, rectangular, SQL-based data sources with (mostly) the same idioms.

And, not too long after that, the remote dplyr interface (now, dbplyr) got up close and personal with DBI. Which ultimately means that if you make a near-fully-compliant DBI interface to a SQL back-end you can now do something like this:

library(DBI)
library(dplyr)
library(osqueryr)

con <- DBI::dbConnect(Osquery())

osqdb <- src_dbi(con)

procs <- tbl(osqdb, "processes")
listen <- tbl(osqdb, "listening_ports")

left_join(procs, listen, by = "pid") %>%
  filter(port != "", protocol == "17") %>% # 17 == UDP
  distinct(name, port, address, pid)
## # Source: lazy query [?? x 4]
## # Database: OsqueryConnection
##    address name              pid   port
##  1 0.0.0.0 BetterTouchTool   46317 57183
##  2 0.0.0.0 Dropbox           1214  17500
##  3 0.0.0.0 SystemUIServer    429   0
##  4 0.0.0.0 SystemUIServer    429   62240
##  5 0.0.0.0 UserEventAgent    336   0
##  6 0.0.0.0 WiFiAgent         493   0
##  7 0.0.0.0 WiFiProxy         725   0
##  8 0.0.0.0 com.docker.vpnkit 732   0
##  9 0.0.0.0 identityservicesd 354   0
## 10 0.0.0.0 loginwindow       111   0
## # ... with more rows

The src_dbi() call wires up everything for us because d[b]plyr can rely on DBI doing its standard & consistent job and DBI can rely on the SQLite processing crunchy goodness of osqueryi to ultimately get us a list of really dangerous (if not firewalled off) processes that are listening on all network interfaces. (Note to self: find out why the BetterTouchTool and Dropbox authors feel the need to bind to 0.0.0.0…)

FIN

What did standards and consistency get us?

  • The osquery authors spent time solving a hard problem vs creating new data formats and protocols
  • Rectangular data (i.e. "data frame") provides consistency and structure which ends up causing more freedom
  • "Standard" SQL enables a consistent means to work with rectangular data
  • ssh normalizes (secure) access across systems with a consistent protocol
  • A robust, well-defined standard mechanism for working with SQL databases enabled nigh instantaneous wiring up of a whole new back-end to R
  • The common idioms shared by ssh and sys made working with the new back-end on remote systems as easy as it is on a local system
  • Another robust, well-defined modern mechanism for working with rectangular data got wired up to this new back-end with (pretty much) one line of code because of the defined standard and expectation of consistency (and works for local and remote)

Standards and consistency are pretty darned cool.



Extracting Specific Lines from a Large (Compressed) Text File

Sun, 05/27/2018 - 02:00

(This article was first published on R on Yixuan's Blog, and kindly contributed to R-bloggers)

A few days ago a friend asked me the following question: how to efficiently extract some specific lines from a large text file, possibly compressed by Gzip? He mentioned that he tried some R functions such as read.table(skip = …), but found that reading the data was too slow. Hence he was looking for some alternative ways to extract the data.
This is a common task in preprocessing large data sets, since in data exploration, very often we want to peek at a small subset of the whole data to gain some insights.
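One possible approach (my own minimal sketch, not necessarily the solution the original post arrives at) is to stream the file through a gzfile() connection and keep only the requested line numbers, so the whole file never has to fit in memory:

# read only the lines at the given (sorted) positions from a (gzip) text file
read_lines_at <- function(path, wanted, chunk_size = 100000L) {
  con <- gzfile(path, open = "r")   # gzfile connections also read plain text
  on.exit(close(con))
  wanted <- sort(wanted)
  out <- character(length(wanted))
  offset <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break                   # end of file
    hit <- wanted > offset & wanted <= offset + length(chunk)
    out[hit] <- chunk[wanted[hit] - offset]          # lines falling in this chunk
    offset <- offset + length(chunk)
    if (all(wanted <= offset)) break                 # nothing left to find
  }
  out
}
# read_lines_at("big_data.txt.gz", c(1000, 250000))  # hypothetical file name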



RStudio:addins part 3 – View objects, files, functions and more with 1 keypress

Sat, 05/26/2018 - 16:00

(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)

Introduction

In this post in the RStudio:addins series we will try to make our work more efficient with an addin for better inspection of objects, functions and files within RStudio. RStudio already has a very useful View function and a Go To Function / File feature with F2 as the default keyboard shortcut. And yes, I know I promised automatic generation of @importFrom roxygen tags in the previous post; unfortunately we will have to wait a bit longer for that one, but I believe this one more than makes up for it in usefulness.

The addin we will create in this article will let us use RStudio to View and inspect a wide range of objects, functions and files with 1 keypress.

The addins in action

Contents
  1. Retrieving objects from sys.frames
  2. Viewing files, objects and functions and more efficiently
  3. The addin function, updating the .dcf file and key binding
  4. The addin in action
  5. TL;DR – Just give me the package
  6. References
Retrieving objects from sys.frames

As a first step, we need to be able to retrieve the value of the object we are looking for based on a character string from a frame within the currently present sys.frames() for our session. This may get tricky, as it is not sufficient to only look at parent frames, because we may easily have multiple sets of “parallel” call stacks, especially when executing addins.

An example can be seen in the following screenshot, where we have a browser() call executed during the Addin execution itself. We can see that our current frame is 18 and browsing through its parent would get us to frames 17 -> 16 -> 15 -> 14 -> 0 (0 being the .GlobalEnv). The object we are looking for is however most likely in one of the other frames (9 in this particular case):

Example of sys.frames

getFromSysframes <- function(x) {
  if (!(is.character(x) && length(x) == 1 && nchar(x) > 0)) {
    warning("Expecting a non-empty character of length 1. Returning NULL.")
    return(invisible(NULL))
  }
  validframes <- c(sys.frames()[-sys.nframe()], .GlobalEnv)
  res <- NULL
  for (i in validframes) {
    inherits <- identical(i, .GlobalEnv)
    res <- get0(x, i, inherits = inherits)
    if (!is.null(res)) {
      return(res)
    }
  }
  return(invisible(res))
}

Viewing files, objects, functions and more efficiently

As a second step, we write a function to actually view our object in RStudio. We have quite some flexibility here, so as a first shot we can do the following:

  1. Open a file if the selection (or the selection with quotes added) is a path to an existing file. This is useful for viewing our scripts, data files, etc. even if they are not quoted, such as the links in your Rmd files
  2. Attempt to retrieve the object by the name and if found, try to use View to view it
  3. If we did not find the object, we can optionally still try to retrieve the value by evaluating the provided character string. This carries some pitfalls, but is very useful for example for
    • viewing elements of lists, vectors, etc. where we need to evaluate [, [[ or $ to do so.
    • viewing operation results directly in the viewer, as opposed to writing them out into the console, useful for example for wide matrices that (subjectively) look better in the RStudio viewer, compared to the console output
  4. If the View fails, we can still show useful information by trying to View its structure, enabling us to inspect objects that cannot be coerced to a data.frame and therefore would fail to be viewed.
viewObject <- function(chr,
                       tryEval = getOption("jhaddins_view_tryeval", default = TRUE)) {
  if (!(is.character(chr) && length(chr) == 1 && nchar(chr) > 0)) {
    message("Invalid input, expecting a non-empty character of length 1")
    return(invisible(1L))
  }
  ViewWrap <- get("View", envir = as.environment("package:utils"))
  # maybe it is an unquoted filename - if so, open it
  if (file.exists(chr)) {
    rstudioapi::navigateToFile(chr)
    return(invisible(0L))
  }
  # or maybe it is a quoted filename - if so, open it
  if (file.exists(gsub("\"", "", chr, fixed = TRUE))) {
    rstudioapi::navigateToFile(gsub("\"", "", chr, fixed = TRUE))
    return(invisible(0L))
  }
  obj <- getFromSysframes(chr)
  if (is.null(obj)) {
    if (isTRUE(tryEval)) {
      # object not found, try evaluating
      try(obj <- eval(parse(text = chr)), silent = TRUE)
    }
    if (is.null(obj)) {
      message(sprintf("Object %s not found", chr))
      return(invisible(1L))
    }
  }
  # try to View capturing output for potential errors
  Viewout <- utils::capture.output(ViewWrap(obj, title = chr))
  if (length(Viewout) > 0 && grepl("Error", Viewout)) {
    # could not view, try to at least View the str of the object
    strcmd <- sprintf("str(%s)", chr)
    message(paste(Viewout, "| trying to View", strcmd))
    ViewWrap(utils::capture.output(utils::str(obj)), title = strcmd)
  }
  return(invisible(0L))
}

This function can of course be improved and updated in many ways, for example using the summary method instead of str for selected object classes, or showing contents of .csv (or other data) files already read into a data.frame.
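For instance, a class-aware fallback could look something like the following (a hypothetical tweak of my own, not part of the actual package), reusing the same View-via-capture.output trick as above:

# hypothetical helper: show summary() for model objects, str() for everything else
viewFallback <- function(obj, title) {
  ViewWrap <- get("View", envir = as.environment("package:utils"))
  if (inherits(obj, c("lm", "glm"))) {
    ViewWrap(utils::capture.output(summary(obj)), title = paste0("summary(", title, ")"))
  } else {
    ViewWrap(utils::capture.output(utils::str(obj)), title = paste0("str(", title, ")"))
  }
}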

The addin function, updating the .dcf file and key binding

If you followed the previous posts in the series, you most likely already know what is coming up next. First, we need a function serving as a binding for the addin that will execute our viewObject function on the active document’s selections:

viewSelection <- function() {
  context <- rstudioapi::getActiveDocumentContext()
  lapply(X = context[["selection"]],
         FUN = function(thisSel) {
           viewObject(thisSel[["text"]])
         })
  return(invisible(NULL))
}

Secondly, we update the inst/rstudio/addins.dcf file by adding the binding for the newly created addin:

Name: viewSelection
Description: Tries to use View to View the object defined by a text selected in RStudio
Binding: viewSelection
Interactive: false

Finally, we re-install the package and assign the keyboard shortcut in the Tools -> Addins -> Browse Addins... -> Keyboard Shortcuts... menu. Personally I assigned a single F4 keystroke for this, as I use it very often:

Assigning a keyboard shortcut to use the Addin

The addin in action

Now, let’s view a few files, a data.frame, a function and a try-error class object, just by pressing F4.

TL;DR – Just give me the package

References



Tips for great graphics

Sat, 05/26/2018 - 09:13

(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

R is a great program for generating top-notch graphics. But to get the best out of it, you need to put in a little more work. Here are a few tips for adapting your R graphics to make them look a little better.

1) Don't use the “File/Save as…/” menu.

If you set up your graphic properly in the first place, then there's no need to post-process (e.g. crop, scale, etc.) the graphic in other software.

Use the graphic devices (jpeg(), tiff(), postscript(), etc), set your height and width to whatever you want the finished product to be and then create the graph.

tiff("~/Manuscript/Figs/Fig1.tiff", width =2, height =2, units ="in", res = 600) plot(dist ~ speed, cars) # cars is a base R dataset - data(cars) dev.off()

The first argument to a graphic device such as tiff or jpeg is the filename, which can include the path. So that I don't have to worry about the order of arguments, I include the argument names. Width and height specify the width and height of the graphic in the units provided, in this case inches, but pixels, cm or mm can also be used. The resolution tells R how high quality the graphic should be: the higher the better, but if you go too high, you could find that you have problems running out of space. I find 600 a nice compromise: nice crisp lines, smooth curves and sharp letters. You can then import that straight into MS Word or whatever other word processor you use, or upload it to go with that manuscript you've worked so hard on. Although, you could find yourself with BIG files. A 4×6 inch figure I made recently was 17.1MB! And for the web, 100 or 200 is probably enough.

This technique also provides you with the same output every time, which is not the case if you adjust the size of the default device window produced by plot().

2) Don't be afraid to change the default settings!

Personally, I find that a 1 inch margin at the base of the graphic is a bit generous. I also find the ticks a bit long and the gap between tick and the axis label a bit big. Luckily, these things are easy to change!

jpeg("t1.jpeg", width=3, height=3, units="in", res=100) plot(dist~speed, cars) dev.off()

The above code produces this figure.

See what I mean about the margins?

Here's how to change it!

par(mai = c(0.5, 0.5, 0.1, 0.1))
plot(dist ~ speed, cars, tck = -0.01, las = 1, mgp = c(1.4, 0.2, 0))

That call to par changes the “MArgin in Inches” setting. The tck argument to par deals with TiCK length (negative for outside, positive for inside) while mgp controls on which line certain things are printed (titles are the first argument, labels are the second and the axis itself is the third). The las argument controls the rotation of the labels (1 for horizontal, 2 for perpendicular and 3 for vertical).

This leads me nicely to number 3: Don't be afraid to have separate lines for different parts of your plot.

This allows far more freedom and flexibility than setting par arguments in the plot argument. You can have different mgp settings for each axis for instance.

par(mai = c(0.4, 0.5, 0.1, 0.1))
plot(dist ~ speed, cars, xaxt = "n", mgp = c(1.4, 0.2, 0), las = 1, tck = -0.01)
axis(side = 1, tck = -0.01, las = 1, mgp = c(0.5, 0.2, 0))
mtext("speed", side = 1, line = 1)

This plots the same graph, but allows different distances for the x and y axes, in terms of margin and where the title is situated. The axis function places an axis on the side determined by its side argument and mtext places Margin TEXT, again at the side in its argument and in this case on line 1.
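Putting the pieces above together, a complete start-to-finish version might look like this. It simply combines the snippets already shown; the filename and the 4×3 inch size are arbitrary choices for the example.

tiff("~/Manuscript/Figs/Fig2.tiff", width = 4, height = 3, units = "in", res = 600)  # open the device
par(mai = c(0.4, 0.5, 0.1, 0.1))                                                     # tighter margins
plot(dist ~ speed, cars, xaxt = "n", mgp = c(1.4, 0.2, 0), las = 1, tck = -0.01)     # plot without the x axis
axis(side = 1, tck = -0.01, las = 1, mgp = c(0.5, 0.2, 0))                           # add the x axis separately
mtext("speed", side = 1, line = 1)                                                   # and its title
dev.off()                                                                            # write the file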



How to plot with patchwork

Fri, 05/25/2018 - 15:18

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

INTRODUCTION

The goal of patchwork is to make it simple to combine separate ggplots into the same graphic. As such it tries to solve the same problem as gridExtra::grid.arrange() and cowplot::plot_grid but using an API that incites exploration and iteration.

Installation
You can install patchwork from github with:

# install.packages("devtools")
devtools::install_github("thomasp85/patchwork")

The usage of patchwork is simple: just add plots together!

library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars) + geom_point(aes(mpg, disp))
p2 <- ggplot(mtcars) + geom_boxplot(aes(gear, disp, group = gear))

p1 + p2

You are of course free to also add the plots together as part of the same plotting operation:

ggplot(mtcars) +
geom_point(aes(mpg, disp)) +
ggplot(mtcars) +
geom_boxplot(aes(gear, disp, group = gear))

Layouts can be specified by adding a plot_layout() call to the assemble. This lets you define the dimensions of the grid and how much space to allocate to the different rows and columns.

p1 + p2 + plot_layout(ncol = 1, heights = c(3, 1))

If you need to add a bit of space between your plots you can use plot_spacer() to fill a cell in the grid with nothing

p1 + plot_spacer() + p2

You can make nested plot layouts by wrapping part of the plots in parentheses – in these cases the layout is scoped to the different nesting levels

p3 <- ggplot(mtcars) + geom_smooth(aes(disp, qsec))
p4 <- ggplot(mtcars) + geom_bar(aes(carb))

p4 + {
p1 + {
p2 +
p3 +
plot_layout(ncol = 1)
}
} +
plot_layout(ncol = 1)

Advanced features
In addition to adding plots and layouts together, patchwork defines some other operators that might be of interest. - will behave like + but put the left and right side in the same nesting level (as opposed to putting the right side into the left side's nesting level). Observe:

p1 + p2 + p3 + plot_layout(ncol = 1)

This is basically the same as without braces (just like standard math arithmetic) – the plots are added sequentially to the same nesting level. Now look:

p1 + p2 - p3 + plot_layout(ncol = 1)

Now p1 + p2 and p3 are on the same level.

Often you are interested in just putting plots besides or on top of each other. patchwork provides both | and / for horizontal and vertical layouts respectively. They can of course be combined for a very readable layout syntax:

(p1 | p2 | p3) /
p4

There are two additional operators that are used for a slightly different purpose, namely to reduce code repetition. Consider the case where you want to change the theme for all plots in an assemble. Instead of modifying all plots individually you can use & or * to add elements to all subplots. The two differ in that * will only affect the plots on the current nesting level:

(p1 + (p2 + p3) + p4 + plot_layout(ncol = 1)) * theme_bw()

whereas & will recurse into nested levels:

p1 + (p2 + p3) + p4 + plot_layout(ncol = 1) & theme_bw()

Note that parentheses are required in the former case due to the higher precedence of the * operator. The latter case is the most common so it has deserved the easiest use.



Programmatically creating text output in R – Exercises

Fri, 05/25/2018 - 14:01

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In the age of Rmarkdown and Shiny, or when making any custom output from your data, you want your output to look consistent and neat. Also, when writing your output you often want it to obtain a specific (decorative) format defined by the html or LaTeX engine. These exercises are an opportunity to refresh our memory on functions such as paste, sprintf, formatC and others that are convenient tools to achieve these ends. All of the solutions rely partly on the ultra-flexible sprintf(), but there are no doubt many ways to solve the exercises with other functions; feel free to share your solutions in the comment section.
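As a quick refresher before you start (these are my own illustrative snippets, not the exercise solutions), the workhorse functions behave like this:

sprintf("$%.2f", 14.3409087337707)                           # "$14.34" -- fixed number of decimals
formatC(25, width = 3, flag = "0")                           # "025"    -- zero-padded to width 3
paste0("file_", formatC(25, width = 3, flag = "0"), ".txt")  # "file_025.txt"
sprintf("<h1>%s</h1>", "A header")                           # "<h1>A header</h1>" -- wrap a string in an html tag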

Example solutions are available here.

Exercise 1

Print out the following vector as prices in dollars (to the nearest cent):
c(14.3409087337707, 13.0648270623048, 3.58504267621646, 18.5077076398145,
16.8279241011882). Example: $14.34

Exercise 2

Using these numbers c(25, 7, 90, 16) make a vector of filenames in the following format: file_025.txt. That is, left pad the numbers so they are all three digits.

Exercise 3

Actually, if we are only dealing with numbers less than a hundred, file_25.txt would have been enough. Change the code from the last exercise so that the padding is programmatically decided by the biggest number in the vector.

Exercise 4

Print out the following haiku on three lines, right aligned, with the help of cat. c("Stay the patient course.", "Of little worth is your ire.", "The network is down.").

Exercise 5

Write a function that converts a number to its hexadecimal representation. This is a useful skill when converting bmp colors from one representation to another. Example output:

tohex(12)
[1] "12 is c in hexadecimal"

Exercise 6

Take a string and programmatically surround it with the html header tag h1

Exercise 7

Back to the poem from exercise 4: let R convert it to an html unordered list, so that it would appear like the following in a browser:

  • Stay the patient course.
  • Of little worth is your ire.
  • The network is down.

Exercise 8

Here is a list of the currently top 5 movies on imdb.com in terms of rating c("The Shawshank Redemption", "The Godfather", "The Godfather: Part II", "The Dark Knight", "12 Angry Men", "Schindler's List") convert them into a list compatible with written text.

Example output:

[1] "The top ranked films on imdb.com are The Shawshank Redemption, The Godfather, The Godfather: Part II, The Dark Knight, 12 Angry Men and Schindler's List"

Exercise 9

Now you should be able to solve this quickly: write a function that converts a proportion to a percentage that takes as input the number of decimal places. Input of 0.921313 and 2 decimal places should return "92.13%"

Exercise 10

Improve the function from last exercise so the percentage take consistently 10 characters by doing some left padding. Raise an error if percentage already happens to be longer than 10.

(Image by Daniel Friedman).



Slopegraphs and R – A pleasant diversion – May 26, 2018

Fri, 05/25/2018 - 02:00

(This article was first published on Chuck Powell, and kindly contributed to R-bloggers)

I try to at least scan the R-bloggers
feed every day. Not every article is of interest to me, but I often have
one of two different reactions to at least one article. Sometimes it is
an “ah ha” moment because the article is right on point for a problem
I have now or have had in the past and the article provides a (better)
solution. Other times my reaction is more of an “oh yeah”, because it
is something I have been meaning to investigate, or something I once
knew, but the article brings a different perspective to it.

The second case happened to me this week. I’ve been aware of slopegraphs
and bumpcharts for quite some time, and I certainly am aware of Tufte’s
work
. As an amateur military
historian I’ve always loved, for example, his
poster
depicting Napoleon’s
Russian Campaign. So when I saw the article from Murtaza
Haider
titled
“Edward Tufte’s Slopegraphs and political fortunes in Ontario” I
just had to take a peek and revisit the topic.

The article does a good job of looking at slopegraphs in both R (via
plotrix) and Stata, even providing the code to do the work. My
challenge was that even though I’m adequate at plotting in base R, I
much prefer using ggplot2 wherever and whenever possible. My memory
was that I had seen another article on the related topic of a
bumpchart on R-bloggers in the not too distant past. A little
sleuthing turned up this earlier
article
from Dominik
Koch
who wrote some code to
compare national performance at the Winter Olympics, “Bump Chart –
Track performance over time”
.

Finally, I wound up at this Github
repository
for a project called
“Edward Tufte-Inspired Slopegraphs” from Thomas J.
Leeper
who has been building code to make
slopegraphs using both base plotting functions and ggplot2.

My post today will draw a little bit from all their work and hopefully
provide some useful samples for others to draw on if they share some of
my quirks about data layout and a preference for ggplot2 versus base
plot. I’m going to focus almost exclusively on slopegraphs, although
much of the work could be extended to bumpcharts as well.

Setup and library loading

We’re going to make occasional use of dplyr to manipulate the data,
extensive use of ggplot2 to do the plotting and ggrepel to solve one
specific labeling problem. We’ll load them and I am suppressing the
message from dplyr about namespace overrides.

require(dplyr)
require(ggplot2)
require(ggrepel)
require(kableExtra)

Politics in Ontario

The original
post

is about plotting the data from some polling results in Ontario. For the
reader’s convenience I’ve made the data available via a structure
command. We have data about two different polling dates, for 5 political
parties, and the measured variable is percent of people supporting
expressed as x.x (i.e. already multiplied by
100).

data <- structure(list(
  Date = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
                   .Label = c("11-May-18", "18-May-18"), class = "factor"),
  Party = structure(c(5L, 3L, 2L, 1L, 4L, 5L, 3L, 2L, 1L, 4L),
                    .Label = c("Green", "Liberal", "NDP", "Others", "PC"), class = "factor"),
  Pct = c(42.3, 28.4, 22.1, 5.4, 1.8, 41.9, 29.3, 22.3, 5, 1.4)),
  class = "data.frame", row.names = c(NA, -10L))

str(data)
## 'data.frame': 10 obs. of 3 variables:
##  $ Date : Factor w/ 2 levels "11-May-18","18-May-18": 1 1 1 1 1 2 2 2 2 2
##  $ Party: Factor w/ 5 levels "Green","Liberal",..: 5 3 2 1 4 5 3 2 1 4
##  $ Pct  : num 42.3 28.4 22.1 5.4 1.8 41.9 29.3 22.3 5 1.4

head(data)
##        Date   Party  Pct
## 1 11-May-18      PC 42.3
## 2 11-May-18     NDP 28.4
## 3 11-May-18 Liberal 22.1
## 4 11-May-18   Green  5.4
## 5 11-May-18  Others  1.8
## 6 18-May-18      PC 41.9

Let’s just take the data as we have it and feed it to ggplot in a nice
simple fashion and see what we get with very little effort.

ggplot(data = data, aes(x = Date, y = Pct, group = Party)) +
  geom_line(aes(color = Party, alpha = 1), size = 2) +
  geom_point(aes(color = Party, alpha = 1), size = 4) +
  # Labelling as desired
  labs(
    title = "Voter's stated preferences for June 7 elections in Ontario",
    subtitle = "(Mainstreet Research)",
    caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/"
  )

The nice thing about ggplot is once you get used to the syntax it
becomes very “readable”. We’ve identified our dataset, the x & y
variables and our grouping variable. Lines too big? An adjustment to
size = 2 does it. Don’t like colors? Pull the color = Party clause.

So we’re already pretty close to what we need. Things are scaled
properly and the basic labeling of titles etc. is accomplished. Our
biggest “problem” is that ggplot has been a little too helpful and
adding some things we’d like to remove to give it a more “Tuftesque”
look. So what we’ll do in the next few steps is add lines of code – but
they are mainly designed to remove unwanted elements. This is in
contrast to a base plot where we have to write the code to add elements.

So lets:

  • Move the x axis labels to the top with scale_x_discrete(position =
    "top")
  • Change to a nice clean black and white theme theme_bw()
  • Not display any legend(s) theme(legend.position = "none")
  • Remove the default border from our plot theme(panel.border =
    element_blank())

ggplot(data = data, aes(x = Date, y = Pct, group = Party)) +
  geom_line(aes(color = Party, alpha = 1), size = 2) +
  geom_point(aes(color = Party, alpha = 1), size = 4) +
  # move the x axis labels up top
  scale_x_discrete(position = "top") +
  theme_bw() +
  # Format tweaks
  # Remove the legend
  theme(legend.position = "none") +
  # Remove the panel border
  theme(panel.border = element_blank()) +
  # Labelling as desired
  labs(
    title = "Voter's stated preferences for June 7 elections in Ontario",
    subtitle = "(Mainstreet Research)",
    caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/"
  )

Nice progress! Continuing to remove things that can be considered
“clutter”, we add some additional lines that all end in
element_blank() and are invoked to remove default plot items such as
the plot grid, the y axis text, etc.

ggplot(data = data, aes(x = Date, y = Pct, group = Party)) +
  geom_line(aes(color = Party, alpha = 1), size = 2) +
  geom_point(aes(color = Party, alpha = 1), size = 4) +
  # move the x axis labels up top
  scale_x_discrete(position = "top") +
  theme_bw() +
  # Format tweaks
  # Remove the legend
  theme(legend.position = "none") +
  # Remove the panel border
  theme(panel.border = element_blank()) +
  # Remove just about everything from the y axis
  theme(axis.title.y = element_blank()) +
  theme(axis.text.y = element_blank()) +
  theme(panel.grid.major.y = element_blank()) +
  theme(panel.grid.minor.y = element_blank()) +
  # Remove a few things from the x axis and increase font size
  theme(axis.title.x = element_blank()) +
  theme(panel.grid.major.x = element_blank()) +
  theme(axis.text.x.top = element_text(size = 12)) +
  # Remove x & y tick marks
  theme(axis.ticks = element_blank()) +
  # Labelling as desired
  labs(
    title = "Voter's stated preferences for June 7 elections in Ontario",
    subtitle = "(Mainstreet Research)",
    caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/"
  )

Very nice! We’re almost there! The “almost” is because now that we
have removed both the legend and all scales and tick marks we no longer
know who is who, and what the numbers are! Plus, I’m a little unhappy
with the way the titles are formatted, so we’ll play with that. Later,
I’ll get fancy but for now let’s just add some simple text labels on
the left and right to show the party name and their percentage. The code
geom_text(aes(label = Party)) will place the party name right on top
of the points that anchor either end of the line. If we make that
geom_text(aes(label = paste0(Party, " - ", Pct, "%"))) then we’ll get
labels that have both the party and the percent all neatly formatted,
but still right on top of the points that anchor the ends of the line.
hjust controls horizontal justification so if we change it to
geom_text(aes(label = paste0(Party, " - ", Pct, "%")), hjust = 1.35)
both sets of labels will slide to the left which is exactly what we want
for the May 11 labels but not the May 18 labels. If we feed hjust a
negative number they’ll go the other way. So what we’ll do is filter the
data using the filter function from dplyr and place the left hand
labels differently than the right hand labels. While we’re at it we’ll
make it bold face font and a little larger…

ggplot(data = data, aes(x = Date, y = Pct, group = Party)) +
  geom_line(aes(color = Party, alpha = 1), size = 2) +
  geom_point(aes(color = Party, alpha = 1), size = 4) +
  geom_text(data = data %>% filter(Date == "11-May-18"),
            aes(label = paste0(Party, " - ", Pct, "%")),
            hjust = 1.35, fontface = "bold", size = 4) +
  geom_text(data = data %>% filter(Date == "18-May-18"),
            aes(label = paste0(Party, " - ", Pct, "%")),
            hjust = -.35, fontface = "bold", size = 4) +
  # move the x axis labels up top
  scale_x_discrete(position = "top") +
  theme_bw() +
  # Format tweaks
  # Remove the legend
  theme(legend.position = "none") +
  # Remove the panel border
  theme(panel.border = element_blank()) +
  # Remove just about everything from the y axis
  theme(axis.title.y = element_blank()) +
  theme(axis.text.y = element_blank()) +
  theme(panel.grid.major.y = element_blank()) +
  theme(panel.grid.minor.y = element_blank()) +
  # Remove a few things from the x axis and increase font size
  theme(axis.title.x = element_blank()) +
  theme(panel.grid.major.x = element_blank()) +
  theme(axis.text.x.top = element_text(size = 12)) +
  # Remove x & y tick marks
  theme(axis.ticks = element_blank()) +
  # Format title & subtitle
  theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5)) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  # Labelling as desired
  labs(
    title = "Voter's stated preferences for June 7 elections in Ontario",
    subtitle = "(Mainstreet Research)",
    caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/"
  )

Eureka! Not perfect yet but definitely looking good.

Adding complexity

I’m feeling pretty good about the solution so far but there are three
things I’d like to make better.

  1. How well will this solution work when we have more than two time
    periods? Need to make sure it generalizes to a more complex case.
  2. As Murtaza
    Haider
    notes in
    his post we’ll have issues if the data points are identical or very
    close together. Our very neat little labels will overlap each other.
    In his post I believe he mentions that he manually moved them in
    some cases. Let’s try and fix that.
  3. Oh my, that’s a lot of code to keep cutting and pasting, can we
    simplify?

To test #1 and #2 I have “invented” a new dataset called moredata.
It is fictional: it’s labelled May 25th, but today is actually May
24th. But I created it to add a third polling date and to make sure that
we had a chance to test what happens when we have two identical
datapoints on the same day. Notice that on May 25th the polling numbers
for the Liberals and the NDP are identical at
26.8%.

moredata <- structure(list(
  Date = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
                   .Label = c("11-May-18", "18-May-18", "25-May-18"), class = "factor"),
  Party = structure(c(5L, 3L, 2L, 1L, 4L, 5L, 3L, 2L, 1L, 4L, 5L, 3L, 2L, 1L, 4L),
                    .Label = c("Green", "Liberal", "NDP", "Others", "PC"), class = "factor"),
  Pct = c(42.3, 28.4, 22.1, 5.4, 1.8, 41.9, 29.3, 22.3, 5, 1.4, 41.9, 26.8, 26.8, 5, 1.4)),
  class = "data.frame", row.names = c(NA, -15L))

tail(moredata)
##         Date   Party  Pct
## 10 18-May-18  Others  1.4
## 11 25-May-18      PC 41.9
## 12 25-May-18     NDP 26.8
## 13 25-May-18 Liberal 26.8
## 14 25-May-18   Green  5.0
## 15 25-May-18  Others  1.4

You’ll notice at the beginning of this post I loaded the ggrepel
library. ggrepel works with ggplot2 to repel things that overlap,
in this case our geom_text labels. The invocation is geom_text_repel
and it is very similar to geom_text but allows us to deconflict the
overlaps. We’ll use hjust = "left" and hjust = "right" to control
justifying the labels. We’ll use a fixed nudge left and right nudge_x =
-.45 and nudge_x = .5 to move the labels left and right off the
plotted data points and we will explicitly tell geom_text_repel to
only move the labels vertically to avoid overlap with direction = "y".
Everything else remains the same.

ggplot(data = moredata, aes(x = Date, y = Pct, group = Party)) +
  geom_line(aes(color = Party, alpha = 1), size = 2) +
  geom_point(aes(color = Party, alpha = 1), size = 4) +
  geom_text_repel(data = moredata %>% filter(Date == "11-May-18"),
                  aes(label = paste0(Party, " - ", Pct, "%")),
                  hjust = "left", fontface = "bold", size = 4,
                  nudge_x = -.45, direction = "y") +
  geom_text_repel(data = moredata %>% filter(Date == "25-May-18"),
                  aes(label = paste0(Party, " - ", Pct, "%")),
                  hjust = "right", fontface = "bold", size = 4,
                  nudge_x = .5, direction = "y") +
  # move the x axis labels up top
  scale_x_discrete(position = "top") +
  theme_bw() +
  # Format tweaks
  # Remove the legend
  theme(legend.position = "none") +
  # Remove the panel border
  theme(panel.border = element_blank()) +
  # Remove just about everything from the y axis
  theme(axis.title.y = element_blank()) +
  theme(axis.text.y = element_blank()) +
  theme(panel.grid.major.y = element_blank()) +
  theme(panel.grid.minor.y = element_blank()) +
  # Remove a few things from the x axis and increase font size
  theme(axis.title.x = element_blank()) +
  theme(panel.grid.major.x = element_blank()) +
  theme(axis.text.x.top = element_text(size = 12)) +
  # Remove x & y tick marks
  theme(axis.ticks = element_blank()) +
  # Format title & subtitle
  theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5)) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  # Labelling as desired
  labs(
    title = "Bogus Data",
    subtitle = "(Chuck Powell)",
    caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/"
  )

Very nice! We have confirmed that our solution works for more than two
dates without any additional changes and we have found a solution to the
label overlap issue. In a little while we’ll talk about labeling the
data points in the center (if we want to).

Before we move on let’s make our life a little simpler. While the output
plot is good it’s a lot of code to produce one graph. Let’s see if we
can simplify…

Since ggplot2 objects are just regular R objects, you can put them in a
list. This means you can apply all of R’s great functional programming
tools. For example, if you wanted to add different geoms to the same
base plot, you could put them in a list and use lapply().
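For example, a list of geoms applied over one base plot might look like the toy sketch below (my own illustration using the built-in mtcars data, not data from this post):

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg))
geoms <- list(geom_point(), geom_line(), geom_smooth(method = "lm"))
plots <- lapply(geoms, function(g) base_plot + g)  # a list of three ggplot objects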

But for now let’s at least take all the invariant lines of code and put
them in a list. Then when we go to plot we can just invoke the list and
remain confident we get the right formatting. For now let’s name this
list something quaint and obvious like MySpecial.

MySpecial <- list(
  # move the x axis labels up top
  scale_x_discrete(position = "top"),
  theme_bw(),
  # Format tweaks
  # Remove the legend
  theme(legend.position = "none"),
  # Remove the panel border
  theme(panel.border = element_blank()),
  # Remove just about everything from the y axis
  theme(axis.title.y = element_blank()),
  theme(axis.text.y = element_blank()),
  theme(panel.grid.major.y = element_blank()),
  theme(panel.grid.minor.y = element_blank()),
  # Remove a few things from the x axis and increase font size
  theme(axis.title.x = element_blank()),
  theme(panel.grid.major.x = element_blank()),
  theme(axis.text.x.top = element_text(size = 12)),
  # Remove x & y tick marks
  theme(axis.ticks = element_blank()),
  # Format title & subtitle
  theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5)),
  theme(plot.subtitle = element_text(hjust = 0.5))
)

summary(MySpecial)
##       Length Class                 Mode
##  [1,] 17     ScaleDiscretePosition environment
##  [2,] 57     theme                 list
##  [3,]  1     theme                 list
##  [4,]  1     theme                 list
##  [5,]  1     theme                 list
##  [6,]  1     theme                 list
##  [7,]  1     theme                 list
##  [8,]  1     theme                 list
##  [9,]  1     theme                 list
## [10,]  1     theme                 list
## [11,]  1     theme                 list
## [12,]  1     theme                 list
## [13,]  1     theme                 list
## [14,]  1     theme                 list

MySpecial is actually an incredibly complex structure so I used the
summary function. What’s important to us is that in the future all we
need to do is include it in the ggplot command and magic happens.
Perhaps another day I’ll make it a proper function but for now I can
change little things like line size or titles and labels without
worrying about the rest. So here it is with some little things changed.

ggplot(data = moredata, aes(x = Date, y = Pct, group = Party)) +
  geom_line(aes(color = Party, alpha = 1), size = 1) +
  geom_point(aes(color = Party, alpha = 1), size = 3) +
  geom_text_repel(data = moredata %>% filter(Date == "11-May-18"),
                  aes(label = paste0(Party, " : ", Pct, "%")),
                  hjust = "left", fontface = "bold", size = 4,
                  nudge_x = -.45, direction = "y") +
  geom_text_repel(data = moredata %>% filter(Date == "25-May-18"),
                  aes(label = paste0(Party, " : ", Pct, "%")),
                  hjust = "right", fontface = "bold", size = 4,
                  nudge_x = .5, direction = "y") +
  MySpecial +
  labs(
    title = "Bogus Data",
    subtitle = "(Chuck Powell)",
    caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/"
  )

Even more complex

Feeling good about the solution so far I decided to press on to a much
more complex problem. Thomas J. Leeper has
a nice plot of Tufte’s Cancer survival
slopegraph

N.B. that the original Tufte is not accurate on the vertical scale.
Look at Prostate and Thyroid for example since visually I would argue
they should cross to reflect the data
.

Let’s grab the data as laid out by
Tufte.

cancer <- structure(list(
  Year.5 = c(99, 96, 95, 89, 86, 85, 84, 82, 71, 69, 63, 62, 62, 58, 57, 55, 43, 32, 30, 24, 15, 14, 8, 4),
  Year.10 = c(95, 96, 94, 87, 78, 80, 83, 76, 64, 57, 55, 54, 55, 46, 46, 49, 32, 29, 13, 19, 11, 8, 6, 3),
  Year.15 = c(87, 94, 91, 84, 71, 74, 81, 70, 63, 46, 52, 50, 54, 38, 38, 50, 30, 28, 7, 19, 7, 8, 6, 3),
  Year.20 = c(81, 95, 88, 83, 75, 67, 79, 68, 60, 38, 49, 47, 52, 34, 33, 50, 26, 26, 5, 15, 6, 5, 8, 3)),
  class = "data.frame",
  row.names = c("Prostate", "Thyroid", "Testis", "Melanomas", "Breast", "Hodgkin's", "Uterus",
                "Urinary", "Cervix", "Larynx", "Rectum", "Kidney", "Colon", "Non-Hodgkin's",
                "Oral", "Ovary", "Leukemia", "Brain", "Multiple myeloma", "Stomach", "Lung",
                "Esophagus", "Liver", "Pancreas"))

str(cancer)
## 'data.frame': 24 obs. of 4 variables:
##  $ Year.5 : num 99 96 95 89 86 85 84 82 71 69 ...
##  $ Year.10: num 95 96 94 87 78 80 83 76 64 57 ...
##  $ Year.15: num 87 94 91 84 71 74 81 70 63 46 ...
##  $ Year.20: num 81 95 88 83 75 67 79 68 60 38 ...

kable(head(cancer, 10)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

            Year.5  Year.10  Year.15  Year.20
Prostate        99       95       87       81
Thyroid         96       96       94       95
Testis          95       94       91       88
Melanomas       89       87       84       83
Breast          86       78       71       75
Hodgkin's       85       80       74       67
Uterus          84       83       81       79
Urinary         82       76       70       68
Cervix          71       64       63       60
Larynx          69       57       46       38

There, we have it in a neat data frame but not organized as we need it.
Not unusual, and an opportunity to use some other tools from broom and
reshape2. Let’s do the following:

  1. Let’s transpose the data with t
  2. Let’s use broom::fix_data_frame to get valid column names and
    convert rownames to a proper column all in one function. Right now
    the types of cancer are nothing but rownames.
  3. Use reshape2::melt to take our transposed dataframe and convert it
    to long format so we can send it off to ggplot. Along the way
    we’ll rename the resulting dataframe newcancer with columns
    named Year, Type and Survival.

# stepping through for demonstration purposes
t(cancer) # returns a matrix
##         Prostate Thyroid Testis Melanomas Breast Hodgkin's Uterus Urinary
## Year.5        99      96     95        89     86        85     84      82
## Year.10       95      96     94        87     78        80     83      76
## Year.15       87      94     91        84     71        74     81      70
## Year.20       81      95     88        83     75        67     79      68
##         Cervix Larynx Rectum Kidney Colon Non-Hodgkin's Oral Ovary
## Year.5      71     69     63     62    62            58   57    55
## Year.10     64     57     55     54    55            46   46    49
## Year.15     63     46     52     50    54            38   38    50
## Year.20     60     38     49     47    52            34   33    50
##         Leukemia Brain Multiple myeloma Stomach Lung Esophagus Liver
## Year.5        43    32               30      24   15        14     8
## Year.10       32    29               13      19   11         8     6
## Year.15       30    28                7      19    7         8     6
## Year.20       26    26                5      15    6         5     8
##         Pancreas
## Year.5         4
## Year.10        3
## Year.15        3
## Year.20        3

broom::fix_data_frame(t(cancer), newcol = "Year") # make it a dataframe with Year as a proper column
##      Year Prostate Thyroid Testis Melanomas Breast Hodgkin.s Uterus
## 1  Year.5       99      96     95        89     86        85     84
## 2 Year.10       95      96     94        87     78        80     83
## 3 Year.15       87      94     91        84     71        74     81
## 4 Year.20       81      95     88        83     75        67     79
##   Urinary Cervix Larynx Rectum Kidney Colon Non.Hodgkin.s Oral Ovary
## 1      82     71     69     63     62    62            58   57    55
## 2      76     64     57     55     54    55            46   46    49
## 3      70     63     46     52     50    54            38   38    50
## 4      68     60     38     49     47    52            34   33    50
##   Leukemia Brain Multiple.myeloma Stomach Lung Esophagus Liver Pancreas
## 1       43    32               30      24   15        14     8        4
## 2       32    29               13      19   11         8     6        3
## 3       30    28                7      19    7         8     6        3
## 4       26    26                5      15    6         5     8        3

reshape2::melt(broom::fix_data_frame(t(cancer), newcol = "Year"),
               id = "Year", variable.name = "Type", value.name = "Survival") # melt it to long form
##       Year             Type Survival
## 1   Year.5         Prostate       99
## 2  Year.10         Prostate       95
## 3  Year.15         Prostate       87
## 4  Year.20         Prostate       81
## 5   Year.5          Thyroid       96
## 6  Year.10          Thyroid       96
## 7  Year.15          Thyroid       94
## 8  Year.20          Thyroid       95
## 9   Year.5           Testis       95
## 10 Year.10           Testis       94
## 11 Year.15           Testis       91
## 12 Year.20           Testis       88
## 13  Year.5        Melanomas       89
## 14 Year.10        Melanomas       87
## 15 Year.15        Melanomas       84
## 16 Year.20        Melanomas       83
## 17  Year.5           Breast       86
## 18 Year.10           Breast       78
## 19 Year.15           Breast       71
## 20 Year.20           Breast       75
## 21  Year.5        Hodgkin.s       85
## 22 Year.10        Hodgkin.s       80
## 23 Year.15        Hodgkin.s       74
## 24 Year.20        Hodgkin.s       67
## 25  Year.5           Uterus       84
## 26 Year.10           Uterus       83
## 27 Year.15           Uterus       81
## 28 Year.20           Uterus       79
## 29  Year.5          Urinary       82
## 30 Year.10          Urinary       76
## 31 Year.15          Urinary       70
## 32 Year.20          Urinary       68
## 33  Year.5           Cervix       71
## 34 Year.10           Cervix       64
## 35 Year.15           Cervix       63
## 36 Year.20           Cervix       60
## 37  Year.5           Larynx       69
## 38 Year.10           Larynx       57
## 39 Year.15           Larynx       46
## 40 Year.20           Larynx       38
## 41  Year.5           Rectum       63
## 42 Year.10           Rectum       55
## 43 Year.15           Rectum       52
## 44 Year.20           Rectum       49
## 45  Year.5           Kidney       62
## 46 Year.10           Kidney       54
## 47 Year.15           Kidney       50
## 48 Year.20           Kidney       47
## 49  Year.5            Colon       62
## 50 Year.10            Colon       55
## 51 Year.15            Colon       54
## 52 Year.20            Colon       52
## 53  Year.5    Non.Hodgkin.s       58
## 54 Year.10    Non.Hodgkin.s       46
## 55 Year.15    Non.Hodgkin.s       38
## 56 Year.20    Non.Hodgkin.s       34
## 57  Year.5             Oral       57
## 58 Year.10             Oral       46
## 59 Year.15             Oral       38
## 60 Year.20             Oral       33
## 61  Year.5            Ovary       55
## 62 Year.10            Ovary       49
## 63 Year.15            Ovary       50
## 64 Year.20            Ovary       50
## 65  Year.5         Leukemia       43
## 66 Year.10         Leukemia       32
## 67 Year.15         Leukemia       30
## 68 Year.20         Leukemia       26
## 69  Year.5            Brain       32
## 70 Year.10            Brain       29
## 71 Year.15            Brain       28
## 72 Year.20            Brain       26
## 73  Year.5 Multiple.myeloma       30
## 74 Year.10 Multiple.myeloma       13
## 75 Year.15 Multiple.myeloma        7
## 76 Year.20 Multiple.myeloma        5
## 77  Year.5          Stomach       24
## 78 Year.10          Stomach       19
## 79 Year.15          Stomach       19
## 80 Year.20          Stomach       15
## 81  Year.5             Lung       15
## 82 Year.10             Lung       11
## 83 Year.15             Lung        7
## 84 Year.20             Lung        6
## 85  Year.5        Esophagus       14
## 86 Year.10        Esophagus        8
## 87 Year.15        Esophagus        8
## 88 Year.20        Esophagus        5
## 89  Year.5            Liver        8
## 90 Year.10            Liver        6
## 91 Year.15            Liver        6
## 92 Year.20            Liver        8
## 93  Year.5         Pancreas        4
## 94 Year.10         Pancreas        3
## 95 Year.15         Pancreas        3
## 96 Year.20         Pancreas        3

# all those steps in one long line saved to a new dataframe
newcancer <- reshape2::melt(broom::fix_data_frame(t(cancer), newcol = "Year"),
                            id = "Year", variable.name = "Type", value.name = "Survival")

Now we have whipped the data into the shape we need: 96 rows with the
three columns we want to plot, Year, Type, and Survival. If you
look at the data, though, you’ll notice two small faults. First, Year
is not a factor. The plot will work but will have an annoying limitation:
since “Year.5” is a character string, it will be ordered after all the
other years. We could fix that on the fly within our ggplot call, but I
find it cleaner and more understandable to take care of it first.
I’ll use the factor function from base R to accomplish that and,
while I’m at it, make the values nicer looking. Second, in three cases R
changed the cancer type names because they couldn’t be used as column names in a
dataframe. I’ll use forcats::fct_recode to make them look better.
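As a quick illustration of the ordering problem (a throwaway base R check, not part of the plotting code), sorting the raw character values puts “Year.5” last:

sort(c("Year.5", "Year.10", "Year.15", "Year.20"))
## [1] "Year.10" "Year.15" "Year.20" "Year.5"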

newcancer$Year <- factor(newcancer$Year, levels = c("Year.5", "Year.10", "Year.15", "Year.20"), labels = c("5 Year","10 Year","15 Year","20 Year"), ordered = TRUE) newcancer$Type <- forcats::fct_recode(newcancer$Type, "Hodgkin's" = "Hodgkin.s", "Non-Hodgkin's" = "Non.Hodgkin.s", "Multiple myeloma" = "Multiple.myeloma") head(newcancer) ## Year Type Survival ## 1 5 Year Prostate 99 ## 2 10 Year Prostate 95 ## 3 15 Year Prostate 87 ## 4 20 Year Prostate 81 ## 5 5 Year Thyroid 96 ## 6 10 Year Thyroid 96

Now that we have the data the way we want it, we can make our slopegraph.
Some of the necessary changes are obvious: x = Year, y = Survival and
group = Type, for example. Since there are a lot of plotted lines I’ve
reduced the weight or size of the individual lines. We no longer want to
plot the big round points; we’re going to substitute in the actual
numbers, so that line gets commented out. The left and right labels
require no change, and geom_text_repel will keep them from overlapping,
which is almost inevitable given the data. To put the actual survival
numbers on the plot we’ll turn to geom_label. It’s like geom_text,
only it puts a label box around the text. We’ll choose a smallish size,
minimize the amount of padding, and make the border of the box
invisible. The end result is what we want: the labels overlay the
lines we’ve already plotted, and the minimal padding gives us just
enough room.

ggplot(data = newcancer, aes(x = Year, y = Survival, group = Type)) + geom_line(aes(color = Type, alpha = 1), size = 1) + # geom_point(aes(color = Type, alpha = .1), size = 4) + geom_text_repel(data = newcancer %>% filter(Year == "5 Year"), aes(label = Type) , hjust = "left", fontface = "bold", size = 3, nudge_x = -.45, direction = "y") + geom_text_repel(data = newcancer %>% filter(Year == "20 Year"), aes(label = Type) , hjust = "right", fontface = "bold", size = 3, nudge_x = .5, direction = "y") + geom_label(aes(label = Survival), size = 2.5, label.padding = unit(0.05, "lines"), label.size = 0.0) + MySpecial + labs( title = "Estimates of Percent Survival Rates", subtitle = "Based on: Edward Tufte, Beautiful Evidence, 174, 176.", caption = "https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003nk" )

Done for now

I hope you’ve found this useful. I am always open to comments,
corrections and suggestions.

Chuck (ibecav at gmail dot com)

License


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

To leave a comment for the author, please follow the link and comment on their blog: Chuck Powell.

Safe Disposal of Unexploded WWII Bombs

Fri, 05/25/2018 - 00:00

(This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)

Abstract

Unexploded WWII bombs are ticking threats despite being dropped more than 70 years ago. In this post we explain how statistical methods are used to plan the search and disposal of unexploded WWII bombs. In particular, we consider and exemplify the non-parametric nearest neighbour distance (NND) method implemented in the R package highriskzone. The method analyses the spatial pattern of exploded bombs to determine so-called risk-zones, that is, regions with a high likelihood of containing unexploded bombs. The coverage of such risk-zones is investigated through both non-parametric and parametric point process simulation.



NCAP aerial photo from 1944 showing the bombing of the V2 rocket facility at Peenemünde, Germany. Image is available under a custom NCAP license – higher resolution images are available from NCAP.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.

\[
\newcommand{\bm}[1]{\boldsymbol{\mathbf{#1}}}
\]

Introduction

During WWII Germany was pounded with about 1.5 million tons of bombs, of which about 10-15% did not explode. More than 70 years after the end of WWII these unexploded bombs (UXBs) still pose a threat and are a frequent cause of large-scale evacuations to secure their safe disposal when found. Luckily, lethal incidents are rare thanks to a huge effort to localise and safely dismantle UXBs. As part of this effort, aerial photos taken by the allies after the attacks provide valuable information about the possible locations of UXBs. Some UXBs are directly visible in the photos (see for example the green circles in this NCAP image, or p. 6 in the following information flyer by one of the companies offering such aerial identification services, featured in this news article). In other cases the photos only provide information about the location of the exploded bombs. This information can be used to identify areas where there is a high likelihood of UXBs. Such areas would then be carefully scrutinized using on-the-ground search methods, for example, electromagnetic and magnetic detectors.

The aim of Mahling, Höhle, and Küchenhoff (2013) was to develop statistical methods for the identification of such risk-zones in co-operation with Oberfinanzdirektion Niedersachsen, which supports the removal of unexploded ordnance in federal properties in Germany. In what follows we will discuss one of the methods used in the paper, the so called nearest neighbourhood distance method and illustrate its implementation in the R package highriskzone originally created by Heidi Seibold and now maintained by Felix Günther.

Mathematical Setup

Casting matters into mathematical notation: Let \(X\) be a point process denoting the spatial locations of all bombs dropped in the particular window of interest \(W \subseteq \mathbb{R}^2\). Furthermore, let \(Y\) denote the observed point process of exploded bombs and \(Z=X\backslash Y\) the point process of unexploded bombs. Note that only the process \(Y\) is observed; \(Z\) is not observed and is the target of our inference. We assume that the probability \(q\) of a dropped bomb not exploding is homogeneous in \(W\). Thus if \(X\) is an inhomogeneous Poisson point process with intensity function \(\lambda_X(\bm{s})\), \(\bm{s}\in W\), then

\[
\lambda_Y(\bm{s}) = (1-q) \lambda_X(\bm{s})
\quad \text{and}\quad
\lambda_Z(\bm{s}) = q \lambda_X(\bm{s})
\]

are the intensity functions of \(Y\) and \(Z\), respectively.

The figure below shows \(Y\) for an actual observed WWII bomb point pattern consisting of n=443 bombs, available from the R package highriskzone (Seibold et al. 2017). The observation window contains a particular area of interest for which a risk assessment needs to be done; such areas often contain a known WWII military target, e.g., an airport, an arms factory or a military casern. In order not to disclose the exact location of the considered area, coordinates are given relative to an arbitrary origin. In reality one would closely link such digitized aerial images to other terrain features using a GIS system.

library(highriskzone) library(spatstat) data(craterA) summary(craterA) ## Planar point pattern: 443 points ## Average intensity 0.0001082477 points per square unit ## ## Coordinates are given to 4 decimal places ## ## Window: polygonal boundary ## single connected closed polygon with 208 vertices ## enclosing rectangle: [0, 2334.3758] x [0, 2456.4061] units ## Window area = 4092470 square units ## Fraction of frame area: 0.714

The point pattern craterA corresponding to an instance of the process \(Y\) is provided in R as an object of class ppp from the R package spatstat (Baddeley, Rubak, and Turner 2015). Instead of inferring the locations in \(Z\) directly, we shall be interested in determining a region within \(W\), a so called high risk zone, which with a high likelihood contains the points of \(Z\). We shall consider two methods for this job: a non-parametric method based on nearest neighbour distances in \(Y\) and an intensity function based inhomogeneous Poisson process approach incorporating \(q\).

High Risk Zones

A heuristic way to determine a high-risk zone is the following: Determine the distribution function \(D\) of the nearest neighbour distance (NND) distribution based on the 443 points in the point pattern. Use this distribution to determine a desired quantile, say for \(0 \leq p \leq 1\), of the NND distribution. Denoting the \(p\) sample quantile of the NND distribution by \(Q(p)\), a \(p\)-quantile NND based high-risk zone is then obtained as the union of discs of radius \(Q(p)\) placed around each observed exploded bomb in \(Y\); in mathematical terms:

\[
R_p = \left(\bigcup_{\bm{y} \in Y} B(\bm{y}, Q(p))\right) \bigcap W =
\left\{\bm{s} \in W : \min_{\bm{y}\in Y} || \bm{s} - \bm{y} || \leq Q(p) \right\},
\]

where \(B(\bm{s}, r)\) denotes a disc of radius \(r\) around the point \(\bm{s}\) and \(||\bm{s} - \bm{y}||\) is the distance between the two points \(\bm{s}\) and \(\bm{y}\). The intersection with \(W\) is done in order to guarantee that the risk zone lies entirely within \(W\). As an example, we would determine the 99%-quantile NND zone for craterA using spatstat functionality as follows:

(Qp <- quantile(nndist(craterA), p = 0.99, type = 8)) ## 99% ## 142.1391 dmap <- distmap(craterA) zone_dist <- eval.im( dmap < Qp )

The above can also be done directly using highriskzone::det_hrz function:

zone_dist <- det_hrz(craterA, type="dist", criterion = "indirect", cutoff=0.99)

Either way, the resulting risk-zone is as follows:

summary(zone_dist) ## high-risk zone of type dist ## criterion: indirect ## cutoff: 0.99 ## ## threshold: 142.1391 ## area of the high-risk zone: 2574507

Mahling, Höhle, and Küchenhoff (2013) show that risk zones constructed by the NND method work surprisingly well despite lacking a clear theoretical justification. One theoretical issue is, for example, that the NND distribution function determined by the above method is for the \((1-q)\)-thinned process \(Y\), even though the actual interest is in the process \(X=Y\cup Z\). Because of the thinning one would typically have that \(D_Y(r) \leq D_X(r)\) and thus \(Q_Y(p) > Q_X(p)\). Using \(Q_Y(p)\) to make statements about \(X\) (and thus \(Z\)) is therefore slightly wrong. However, this error cancels, because we then use the points in \(Y\) to add a buffer of radius \(Q_Y(p)\). Had we instead used the smaller, but true, \(Q_X(p)\), the risk zone would have been too small, because \(X\) would also have contained more points to form discs around than \(Y\). The method thus implicitly takes \(q\) non-parametrically into account, because its NND is determined based on \(Y\) and subsequently discs of radius \(Q_Y(p)\) are formed around the points of \(Y\).

Technical details you might want to skip: The above feature is most easily illustrated if \(X\) is a homogeneous Poisson process with intensity \(\lambda_X\). In this case we have that the NND distribution function is (p.68, Illian et al. 2008)

\[
D_X(r) = 1 - \exp(-\lambda_X \pi r^2), \quad r>0.
\]

Also note that \(D_Y(r) = 1 - \exp(-(1-q)\lambda_X \pi r^2)\) and therefore \(D_Y(r) < D_X(r)\). Now solving for the \(p\)-quantile of the NND in this homogeneous Poisson case means solving

\[
Q_Y(p) = \min_{r\geq 0} \{ D_Y(r) \geq p \} \Leftrightarrow Q_Y(p) =
\sqrt{ -\frac{\log(1-p)}{\lambda_Y \pi}}.
\]

From this it becomes clear that in the homogeneous Poisson case \(Q_Y(p)\) is a factor \(\sqrt{1/(1-q)}\) larger than \(Q_X(p)\), which is the actual target of interest.
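To see the cancellation numerically, here is a small spatstat sketch (the values of \(\lambda_X\) and \(q\) are arbitrary illustration choices, not taken from the paper): simulate a homogeneous Poisson process, thin it, and compare the empirical ratio of the 99% NND quantiles with the theoretical factor \(\sqrt{1/(1-q)}\).

library(spatstat)
set.seed(1)
q <- 0.1                                   # assumed non-explosion probability (illustrative)
X <- rpoispp(lambda = 2000, win = owin())  # "all bombs": homogeneous Poisson in the unit square
Y <- rthin(X, P = 1 - q)                   # observed (exploded) bombs
c(empirical   = unname(quantile(nndist(Y), 0.99) / quantile(nndist(X), 0.99)),
  theoretical = sqrt(1 / (1 - q)))

The empirical ratio fluctuates around the theoretical value of about 1.054.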

Assessing the coverage of a risk-zone

Two criteria seem natural for assessing the coverage of a risk-zone \(R\):

  1. The probability \(p_{\text{out}}\) that there will be at least one bomb outside the risk zone, i.e. \(P( N( Z \backslash R) > 0)\), where \(N(A)\) denotes the number of events in the set \(A \subseteq W\). Note: this probability depends heavily on the number of points in \(Z\): the more points there are, the higher \(p_{\text{out}}\) becomes. However, it reflects the idea "one miss is all it takes to get in trouble".

  2. The proportion of events in \(Z\) not located in \(R\), i.e. \(N( Z \backslash R) / N(Z)\); we shall denote this criterion by \(p_{\text{miss}}\). Note: this proportion takes the possibly different sizes of \(Z\) into account and gives a more relative view of how many bombs are not covered by the zone.

Under the assumption of independence between whether \(Z\)-events are within or outside the risk-zone one can convert back and forth between \(p_{\text{miss}}\) and \(p_{\text{out}}\) by

\[
p_{\text{out}} = P( N( Z \backslash R) > 0) = 1- P(N(Z
\backslash R) = 0) \approx 1 - (1-p_{\text{miss}})^{N(Z)},
\]

where, in a simulation setup, one would know \(Z\) and thus also \(N(Z)\). Note that for a \(p\)-quantile NND risk-zone we expect \(1-p_{\text{miss}}\) to be approximately equal to \(p\). We can investigate the behaviour of risk-zones according to the two criteria above through simulation. One approach is to simply \(q\)-thin the existing point pattern \(Y\) and then use the thinned pattern to determine a risk-zone, which is then evaluated on the removed points. An alternative approach is to estimate the intensity surface from \(Y\), upscale it to get the intensity of \(X\), simulate \(X\) as an inhomogeneous Poisson point process with this intensity surface, thin this pattern to get a simulated instance of \(Y\), construct the risk-zone based on this pattern and then evaluate the coverage of the zone (Mahling, Höhle, and Küchenhoff 2013). Note that this type of simulation is based on more assumptions compared to the non-parametric thinning simulation approach.
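As a side note, the approximate conversion above is easy to do by hand. A small throwaway helper (not part of highriskzone) could look like this, here anticipating the averages reported from the simulations below:

p_out_from_p_miss <- function(p_miss, nZ) 1 - (1 - p_miss)^nZ
p_out_from_p_miss(p_miss = 0.00118, nZ = 44.3)  # approximately 0.051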

We generate 1,000 realizations of \(Y\) and \(Z\) through \(q\)-thinning of the original craterA pattern while computing the coverage measures for the NND method as follows:

suppressPackageStartupMessages(library(doParallel)) registerDoParallel(cores=4) simulate_method <- "thinning" #"intensity" # "cond_intensity" sim <- foreach(i=seq_len(getDoParWorkers()), .combine=rbind) %dopar% { tryCatch( eval_method(craterA, type=c("dist"), criterion=c("indirect"), cutoff=0.99, numit = 250, simulate=simulate_method, pbar=FALSE), error= function(e) return(NULL)) } ## # A tibble: 1 x 5 ## p_out p_miss `1-p_miss` p_out_derived nZ ## ## 1 0.051 0.00118 0.999 0.0509 44.3

The numbers are the averages of p_out and p_miss over the 1,000 simulations. Furthermore, nZ denotes the average number of events in \(Z\). We see that the NND method performs a little better than intended, because \(1-p_{\text{miss}}\) is even higher than the intended \(p\)=99%. The probability that the risk-zone misses at least one bomb is as low as 0.051. This is quite close to the approximate conversion from \(p_{\text{miss}}\) described above (0.051 vs. 0.051). Changing the simulation method for \(X\) to that of an inhomogeneous Poisson process with intensity \(1/(1-q) \cdot \hat{\lambda}_Y(\bm{s})\) yields the following results:

## # A tibble: 1 x 5 ## p_out p_miss `1-p_miss` p_out_derived nZ ## ## 1 0.47 0.0143 0.986 0.511 49.6

We note that the probability of missing at least one bomb is much higher under this parametric simulation method. Only a small fraction of this is explained by \(Z\) now consisting of more points. A likely explanation is that the parametric model is only semi-adequate to describe how the point patterns form. Therefore, the new \(X\) might have a somewhat different neighbourhood distribution than anticipated.

To compare more specifically with the intensity function based risk-zone method of Mahling, Höhle, and Küchenhoff (2013) we use a specification, where the risk-zone derived by the NND method or the intensity method have the same area (250 hectare).

sim_area <- foreach(i=seq_len(getDoParWorkers()), .combine=rbind) %dopar% { tryCatch( eval_method(craterA, type=c("dist","intens"), criterion=rep("area",2), cutoff=rep(2500000,2), numit = 100, simulate=simulate_method, pbar=FALSE), error= function(e) return(NULL)) } ## # A tibble: 2 x 6 ## Type p_out p_miss `1-p_miss` p_out_derived area_zone ## ## 1 dist 0.123 0.00278 0.997 0.117 2500009. ## 2 intens 0.55 0.0172 0.983 0.539 2499994.

For this particular example we see an advantage of using the NND method, because both p_out and p_miss are much lower than for the intensity based method. Again, this might be due to the intensity method being based on assumptions which, for this particular example, do not appear to be so adequate. Results in Mahling (2013) were, however, much better for this example (cf. Tab 2), which could be an indication that there is a problem in the highriskzone package implementation of this method?

Discussion

Being a statistician is fascinating, because the job is the entry ticket to so many diverse research fields. The proposed methods and evaluations helped the Oberfinanzdirektion obtain a quantitative framework to decide which methods to use in their routine risk-assessment. Further details on the above application can be found in Mahling, Höhle, and Küchenhoff (2013) as well as in Monia’s Ph.D. dissertation (Mahling 2013). Note also that the techniques are not limited to UXB detection: Infer-unknown-points-from-a-thinned-process problems occur both in 1D and 2D point processes in a range of other fields, e.g., under-reporting of infectious disease locations or in the calculation of animal abundance in ecology.

As a personal anecdote: When finishing the work on Mahling, Höhle, and Küchenhoff (2013) I was in transition from university to working at a public health institute. The deal was to finish the UXB work partly in spare time and partly in the new work time. To honour this I added my new work place as a second affiliation before submitting, but as part of the institution’s internal clearing procedure for the publication, I was asked by higher management to remove this affiliation again, because the work ‘had nothing to do with public health’. While it’s questionable whether exploding bombs really do not have a public health impact, a few months later I ended up using very similar statistical techniques to model occurred-but-not-yet-reported cases during a critical infectious disease outbreak (Höhle and an der Heiden 2014).



Literature

Baddeley, A., E. Rubak, and R. Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. London: Chapman; Hall/CRC Press.

Höhle, M., and M. an der Heiden. 2014. “Bayesian Nowcasting During the STEC O104:H4 Outbreak in Germany, 2011.” Biometrics 70 (4): 993–1002. doi:10.1111/biom.12194.

Illian, J., A. Penttinen, H. Stoyan, and D. Stoyan. 2008. Statistical Analysis and Modelling of Spatial Point Patterns. Wiley.

Mahling, M. 2013. “Determining High-Risk Zones by Using Spatial Point Process Methodology.” PhD thesis, Department of Statistics, University of Munich. https://edoc.ub.uni-muenchen.de/15886/1/Mahling_Monia.pdf.

Mahling, M., M. Höhle, and H. Küchenhoff. 2013. “Determining high-risk zones for unexploded World War II bombs by using point process methodology.” Journal of the Royal Statistical Society, Series C 62 (2): 181–99. doi:10.1111/j.1467-9876.2012.01055.x.

Seibold, H., M. Mahling, S. Linne, and F. Günther. 2017. Highriskzone: Determining and Evaluating High-Risk Zones. https://cran.r-project.org/web/packages/highriskzone/index.html.

To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice....

2018-05 Selective Raster Graphics

Thu, 05/24/2018 - 23:57

(This article was first published on R – Stat Tech, and kindly contributed to R-bloggers)

This report explores ways to render specific components of an R plot in raster format, when the overall format of the plot is vector. For example, we demonstrate ways to draw raster data symbols within a PDF scatter plot. A general solution is provided by the grid.rasterize function from the R package ‘rasterize’.

Paul Murrell

Download

To leave a comment for the author, please follow the link and comment on their blog: R – Stat Tech.

A Comparative Review of the RKWard GUI for R

Thu, 05/24/2018 - 15:24

(This article was first published on R – r4stats.com, and kindly contributed to R-bloggers)

Introduction

RKWard is a free and open source Graphical User Interface for the R software, one that supports beginners looking to point-and-click their way through analyses, as well as advanced programmers. You can think of it as a blend of the menus and dialog boxes that R Commander offers combined with the programming support that RStudio provides. RKWard is available on Windows, Mac, and Linux.

This review is one of a series which aims to help non-programmers choose the Graphical User Interface (GUI) that is best for them. However, I do include a cursory overview of how RKWard helps you work with code. In most sections, I’ll begin with a brief description of the topic’s functionality and how GUIs differ in implementing it. Then I’ll cover how RKWard does it.

Figure 1. RKWard’s main control screen containing an open data editor window (big one), an open dialog box (right) and its output window (lower left).

 

Terminology

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface specifically using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So GUI users are people who prefer using a GUI to perform their analyses. They often don’t have the time required to become good programmers.

IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

 

Installation

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi or BlueSky Statistics, install in a single step. Others install in multiple steps, such as R Commander and Deducer. Advanced computer users often don’t appreciate how lost beginners can become while attempting even a single-step installation. I work at the University of Tennessee, and our HelpDesk is flooded with such calls at the beginning of each semester!

Installing RKWard on Windows is done in a single step since its installation file contains both R and RKWard. However, Mac and Linux users have a two-step process: installing R first, then downloading RKWard, which links up to the most recent version of R that it finds. Regardless of their operating system, RKWard users never need to learn how to start R, then execute the install.packages function, and then load a library. Installers for all three operating systems are available here.

The RKWard installer obtains the appropriate version of R, simplifying the installation and ensuring complete compatibility. However, if you already had a copy of R installed, depending on its version, you could end up with a second copy.

RKWard minimizes the size of its download by waiting to install some R packages until you actually try to use them for the first time. Then it prompts you, offering default settings that will get the package you need.

On Windows, the installation file is 136 megabytes in size.

 

Plug-ins

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling section of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, BlueSky, Deducer) through moderate (jamovi) to very active (R Commander).

Currently all plug-ins are included with the initial installation.  You can see them using the menu selection Settings> Configure Packages> Manage RKWard Plugins. There are only brief descriptions of what they do, but once installed, you can access the help files with a single click.

RKWard add-on modules are part of standard R packages and are distributed on CRAN. Their package descriptions include a field labeled, “enhances: rkward”. You can sort packages by that field in RKWard’s package installation dialog where they are displayed with the RKWard icon.

Continued here…

To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com.

EARL London 2018 – Agenda announced

Thu, 05/24/2018 - 11:50

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

The EARL London 2018 agenda is now available. Explore the schedule of keynotes, presentations, and lightning talks that cover a huge range of topics, including the best uses of Shiny, R in healthcare, using R for good, and R in finance. The brilliant lineup of speakers, who represent a wide range of industries, is sure to provide inspiration for R users of all levels.

We have surveyed the Mango team to find out what talks they are most looking forward to:

Lung cancer detection with deep learning in R – David Smith, Microsoft
David will be taking us through an end-to-end example of building a deep learning model to predict lung cancer from image data. Anything that helps to improve healthcare is a fascinating subject.

Using network analysis of colleague relationships to find interesting new investment managers – Robin Penfold, Willis Towers Watson
Using the tidyverse, tidygraph and ggraph, Robin has established a way to save time and money by researching the backgrounds of all institutional investors in one go, as opposed to one at a time.

As right as rain: developing a sales forecasting model for Dyson using R – Dan Erben, Dyson
We love hearing about how organisations have adopted a data-driven approach using R. Dan’s talk will summarise how Dyson developed a statistical model for sales forecasting and the path they have taken to adopt it in the business.

Interpretable Machine Learning with LIME: now and tomorrow – Kasia Kulma, Aviva
Explaining models is a really valuable tool in communicating with the business. Kasia will be using breast cancer data as a specific case scenario.

Text mining in child social care – James Lawrence, The Behavioural Insight Team
James will be sharing how The Behavioural Insight Team have been able to use R’s text mining tools on unstructured data to improve child social services actions, significantly reducing the potential for harm without overburdening an already stretched social service provider.

Experience of using R in the productive environment of Boehringer Ingelheim – Matthias Trampisch, Boehringer Ingelheim
Boehringer Ingelheim have been making headway with R and Matthias will be sharing their journey.

Using R to Drive Revenue for your Online Business – Catherine Gamble, Marks and Spencer
Online business is fast-paced and competitive. Catherine shares how Marks and Spencer have been using R to gain the upper hand and make a difference to their bottom line.

Don’t miss out on early bird tickets

Buy your tickets before 31 July to make the most of the early bird ticket prices: a full conference pass, which includes workshops, access to the conference presentations, and a ticket to the evening event at the Imperial War Museum, is only £800. Buy tickets now.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.

sample_n_of(): a useful helper function

Thu, 05/24/2018 - 07:00

(This article was first published on Higher Order Functions, and kindly contributed to R-bloggers)

Here’s the problem: I have some data with nested time series. Lots of them. It’s
like there’s many, many little datasets inside my data. There are too many
groups to plot all of the time series at once, so I just want to preview a
handful of them.

For a working example, suppose we want to visualize the top 50 American female
baby names over time. I start by adding up the total number of births for each
name, finding the overall top 50 most populous names, and then keeping just the
time series from those top names.

library(ggplot2) library(dplyr, warn.conflicts = FALSE) babynames <- babynames::babynames %>% filter(sex == "F") top50 <- babynames %>% group_by(name) %>% summarise(total = sum(n)) %>% top_n(50, total) # keep just rows in babynames that match a row in top50 top_names <- babynames %>% semi_join(top50, by = "name")

Hmm, so what does this look like?

ggplot(top_names) + aes(x = year, y = n) + geom_line() + facet_wrap("name")

Aaack, I can’t read anything! Can’t I just see a few of them?

This is a problem I face frequently, so frequently that I wrote a helper
function to handle this problem: sample_n_of(). This is not a very clever
name, but it works. Below I call the function from my personal R package
and plot just the data from four names.

# For reproducible blogging set.seed(20180524) top_names %>% tjmisc::sample_n_of(4, name) %>% ggplot() + aes(x = year, y = n) + geom_line() + facet_wrap("name")

In this post, I walk through how this function works. It’s not very
complicated: It relies on some light tidy evaluation plus one obscure dplyr
function.

Working through the function

As usual, let’s start by sketching out the function we want to write:

sample_n_of <- function(data, size, ...) { # quote the dots dots <- quos(...) # ...now make things happen... }

where size is the number of groups to sample and ... are the column names
that define the groups. We use quos(...) to capture and quote those column
names. (As I wrote before,
quotation is how we bottle up R code so we can deploy it for later.)

For interactive testing, suppose our dataset is the time series from the top 50
names and we want data from a sample of 5 names. In this case, the values for
the arguments would be:

data <- top_names size <- 5 dots <- quos(name)

A natural way to think about this problem is that we want to sample subgroups of
the dataframe. First, we create a grouped version of the dataframe using
group_by(). The function group_by() also takes a ... argument where the
dots are typically names of columns in the dataframe. We want to take the
names inside of our dots, unquote them and plug them in to where the ...
goes in group_by(). This is what the tidy evaluation world calls
splicing.

Think of splicing as doing this:

# Demo function that counts the number of arguments in the dots count_args <- function(...) length(quos(...)) example_dots <- quos(var1, var2, var2) # Splicing turns the first form into the second one count_args(!!! example_dots) #> [1] 3 count_args(var1, var2, var2) #> [1] 3

So, we create a grouped dataframe by splicing our dots into the group_by()
function.

grouped <- data %>% group_by(!!! dots)

There is a helper function buried in dplyr called group_indices() which
returns the grouping index for each row in a grouped dataframe.

grouped %>% tibble::add_column(group_index = group_indices(grouped)) #> # A tibble: 6,407 x 6 #> # Groups: name [50] #> year sex name n prop group_index #> #> 1 1880 F Mary 7065 0.0724 33 #> 2 1880 F Anna 2604 0.0267 4 #> 3 1880 F Emma 2003 0.0205 19 #> 4 1880 F Elizabeth 1939 0.0199 17 #> 5 1880 F Margaret 1578 0.0162 32 #> 6 1880 F Sarah 1288 0.0132 45 #> 7 1880 F Laura 1012 0.0104 29 #> 8 1880 F Catherine 688 0.00705 11 #> 9 1880 F Helen 636 0.00652 21 #> 10 1880 F Frances 605 0.00620 20 #> # ... with 6,397 more rows

We can randomly sample five of the group indices and keep the rows for just
those groups.

unique_groups <- unique(group_indices(grouped)) sampled_groups <- sample(unique_groups, size) sampled_groups #> [1] 4 25 43 20 21 subset_of_the_data <- data %>% filter(group_indices(grouped) %in% sampled_groups) subset_of_the_data #> # A tibble: 674 x 5 #> year sex name n prop #> #> 1 1880 F Anna 2604 0.0267 #> 2 1880 F Helen 636 0.00652 #> 3 1880 F Frances 605 0.00620 #> 4 1880 F Samantha 21 0.000215 #> 5 1881 F Anna 2698 0.0273 #> 6 1881 F Helen 612 0.00619 #> 7 1881 F Frances 586 0.00593 #> 8 1881 F Samantha 12 0.000121 #> 9 1881 F Karen 6 0.0000607 #> 10 1882 F Anna 3143 0.0272 #> # ... with 664 more rows # Confirm that only five names are in the dataset subset_of_the_data %>% distinct(name) #> # A tibble: 5 x 1 #> name #> #> 1 Anna #> 2 Helen #> 3 Frances #> 4 Samantha #> 5 Karen

Putting these steps together, we get:

sample_n_of <- function(data, size, ...) { dots <- quos(...) group_ids <- data %>% group_by(!!! dots) %>% group_indices() sampled_groups <- sample(unique(group_ids), size) data %>% filter(group_ids %in% sampled_groups) }

We can test that the function works as we might expect. Sampling 10 names
returns the data for 10 names.

ten_names <- top_names %>% sample_n_of(10, name) %>% print() #> # A tibble: 1,326 x 5 #> year sex name n prop #> #> 1 1880 F Sarah 1288 0.0132 #> 2 1880 F Frances 605 0.00620 #> 3 1880 F Rachel 166 0.00170 #> 4 1880 F Samantha 21 0.000215 #> 5 1880 F Deborah 12 0.000123 #> 6 1880 F Shirley 8 0.0000820 #> 7 1880 F Carol 7 0.0000717 #> 8 1880 F Jessica 7 0.0000717 #> 9 1881 F Sarah 1226 0.0124 #> 10 1881 F Frances 586 0.00593 #> # ... with 1,316 more rows ten_names %>% distinct(name) #> # A tibble: 10 x 1 #> name #> #> 1 Sarah #> 2 Frances #> 3 Rachel #> 4 Samantha #> 5 Deborah #> 6 Shirley #> 7 Carol #> 8 Jessica #> 9 Patricia #> 10 Sharon

We can sample based on multiple columns too. Ten combinations of names and years
should return just ten rows.

top_names %>% sample_n_of(10, name, year) #> # A tibble: 10 x 5 #> year sex name n prop #> #> 1 1907 F Jessica 17 0.0000504 #> 2 1932 F Catherine 5446 0.00492 #> 3 1951 F Nicole 94 0.0000509 #> 4 1953 F Janet 17761 0.00921 #> 5 1970 F Sharon 9174 0.00501 #> 6 1983 F Melissa 23473 0.0131 #> 7 1989 F Brenda 2270 0.00114 #> 8 1989 F Pamela 1334 0.000670 #> 9 1994 F Samantha 22817 0.0117 #> 10 2014 F Kimberly 2891 0.00148

Next steps

There are a few tweaks we could make to this function. For example, in my
package’s version, I warn the user when the number of groups is too large.
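For reference, here is one way such a guard could be added to the version we built above (a sketch of the idea only; the actual tjmisc implementation may differ):

library(dplyr, warn.conflicts = FALSE)

sample_n_of <- function(data, size, ...) {
  dots <- quos(...)
  group_ids <- data %>% group_by(!!! dots) %>% group_indices()
  n_groups <- n_distinct(group_ids)
  if (size > n_groups) {
    warning("Sample size (", size, ") is larger than number of groups (",
            n_groups, "). Using size = ", n_groups, ".")
    size <- n_groups
  }
  sampled_groups <- sample(unique(group_ids), size)
  data %>%
    filter(group_ids %in% sampled_groups)
}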

too_many <- top_names %>% tjmisc::sample_n_of(100, name) #> Warning: Sample size (100) is larger than number of groups (50). Using size #> = 50.

My version also randomly samples n of the rows when there are no grouping
variables provided.

top_names %>% tjmisc::sample_n_of(2) #> # A tibble: 2 x 5 #> year sex name n prop #> #> 1 1934 F Stephanie 128 0.000118 #> 2 2007 F Mary 3674 0.00174

One open question is how to handle data that’s already grouped. The function we
wrote above fails.

top_names %>% group_by(name) %>% sample_n_of(2, year) #> Error in filter_impl(.data, quo): Result must have length 136, not 6407

Is this a problem?

Here I think failure is okay, because what do we think should happen? It’s not
obvious. We asked it to randomly choose 2 of the years within data grouped by name.
Should it be the same two years for every name? Then this should be fine.

top_names %>% sample_n_of(2, year) #> # A tibble: 100 x 5 #> year sex name n prop #> #> 1 1970 F Jennifer 46160 0.0252 #> 2 1970 F Lisa 38965 0.0213 #> 3 1970 F Kimberly 34141 0.0186 #> 4 1970 F Michelle 34053 0.0186 #> 5 1970 F Amy 25212 0.0138 #> 6 1970 F Angela 24926 0.0136 #> 7 1970 F Melissa 23742 0.0130 #> 8 1970 F Mary 19204 0.0105 #> 9 1970 F Karen 16701 0.00912 #> 10 1970 F Laura 16497 0.00901 #> # ... with 90 more rows

Or, should those two years be randomly selected for each name? Then, we should
let do() handle that. do() takes some code that returns a dataframe, applies
it to each group, and returns the combined result.

top_names %>% group_by(name) %>% do(sample_n_of(., 2, year)) #> # A tibble: 100 x 5 #> # Groups: name [50] #> year sex name n prop #> #> 1 1913 F Amanda 346 0.000528 #> 2 1953 F Amanda 428 0.000222 #> 3 1899 F Amy 281 0.00114 #> 4 1964 F Amy 9579 0.00489 #> 5 1916 F Angela 715 0.000659 #> 6 2005 F Angela 2893 0.00143 #> 7 1999 F Anna 9092 0.00467 #> 8 2011 F Anna 5649 0.00292 #> 9 1952 F Ashley 24 0.0000126 #> 10 2006 F Ashley 12340 0.00591 #> # ... with 90 more rows

I think raising an error and forcing the user to clarify their code is better
than choosing one of these options and not doing what the user expects.

To leave a comment for the author, please follow the link and comment on their blog: Higher Order Functions.

Simple Spatial Modelling – Part 3: Exercises

Thu, 05/24/2018 - 06:39

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

So far, we have learned how to account for spatial variability in our model. Please look at the two previous exercises here and here if you haven’t tried them yet. However, they only represent a 1-dimensional model. In this exercise, we will try to expand our spatial consideration into a 2-dimensional model.

Have a look at the plan view below to get an illustration of how the 2-dimensional model will work.

The water levels are stored in a 2-D array. They are numbered as follows:

The water flows are stored in 2 different 2-D arrays:
1. qv : defines water flows between buckets down the screen (in plan view)
2. qh : defines water flows between buckets across the screen.
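A minimal sketch of how these objects might be declared in R (illustrative only: the names and the exact dimensions of the flow arrays are assumptions, and the solutions page may organise things differently):

n  <- 25                                  # 25 x 25 water tanks
H  <- matrix(1, nrow = n, ncol = n)       # water level in each tank
qv <- matrix(0, nrow = n + 1, ncol = n)   # flows between tanks down the screen (incl. boundaries)
qh <- matrix(0, nrow = n, ncol = n + 1)   # flows between tanks across the screen (incl. boundaries)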


Let’s get into the modelling by cracking the exercises below. Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Set all the required settings for the model:
a. set the number of timesteps. Here we use 1000 (this is how many timesteps the model will run for; you can change it as you want)
b. set the total number of cells; here we have 25 x 25 water tanks
c. set the timestep in seconds; here it is 1
d. set the time at the start of the simulation
e. set k between each water tank. Here we use a uniform value of k: 0.01

Exercise 2
Create a matrix H for the initial water levels in the water tanks

Exercise 3
Set the boundary conditions for the model; here we have water flow (qh) into the water tanks from three sides (top, left and right) and water flowing out at the bottom (qv; see the plan view). Water flows to the right and towards the bottom are considered positive. Don’t forget to declare the matrices for qh and qv.

Exercise 4
Create model output for every 100 timesteps

Exercise 5
Run the model by creating a loop for qh, qv, the water storage update, and the model output (remember the threshold loop from the previous exercise?)

Exercise 6
Plot the model output using a contour plot

Related exercise sets:
  1. Simple Spatial Modeling – Part 1: Exercises
  2. Descriptive Analytics-Part 6: Interactive dashboard ( 2/2)
  3. Practice Your ggplot Skills: Exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

A little function to help generate ICCs in simple clustered data

Thu, 05/24/2018 - 02:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

In health services research, experiments are often conducted at the provider or site level rather than the patient level. However, we might still be interested in the outcome at the patient level. For example, we could be interested in understanding the effect of a training program for physicians on their patients. It would be very difficult to randomize patients to be exposed or not to the training if a group of patients all see the same doctor. So the experiment is set up so that only some doctors get the training and others serve as the control; we still compare the outcome at the patient level.

Typically, when conducting an experiment we assume that individual outcomes are not related to each other (other than the common effect of the exposure). With site-level randomization, we can’t make that assumption – groups of patients are all being treated by the same doctor. In general, even before the intervention, there might be variation across physicians. At the same time, patients within a practice will vary. So, we have two sources of variation: between practice and within practice variation that explain overall variation.

I touched on this when I discussed issues related to Gamma distributed clustered data. A key concept is the intra-class correlation coefficient, or ICC, which is a measure of how the between variation relates to the overall variation. The ICC ranges from 0 (where there is no between variation: all site averages are the same) to 1 (where there is no variation within a site: all patients within the site have the same outcomes). Take a look at the earlier post for a bit more detail.
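In code, that definition is a one-liner; here is a tiny helper (not part of simstudy, just for illustration) that computes the ICC from the two variance components:

icc <- function(var_between, var_within) var_between / (var_between + var_within)
icc(0.2105, 4)  # approximately 0.05, the first target ICC used in the example below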

My goal here is to highlight a little function recently added to simstudy (v0.1.9, now available on CRAN). In the course of exploring study designs for cluster randomized trials, it is often useful to understand what happens (to sample size requirements, for example) when the ICC changes. When generating the data, it is difficult to control the ICC directly – we do this by controlling the variation. With normally distributed data, the ICC is an obvious function of the variances used to generate the data, so the connection is pretty clear. But, when the outcomes have binary, Poisson, or Gamma distributions (or anything else really), the connection between variation and the ICC is not always so obvious. Figuring out how to specify the data to generate a particular ICC might require quite a bit of trial and error.

The new function, iccRE (short for ICC random effects), allows users to specify target ICCs for a desired distribution (along with relevant parameters). The function returns the corresponding random effect variances that would be specified at the cluster level to generate the desired ICC(s).

Here’s an example for three possible ICCs in the context of the normal distribution:

library(simstudy) targetICC <- c(0.05, 0.075, 0.10) setVars <- iccRE(ICC = targetICC, dist = "normal", varWithin = 4) round(setVars, 4) ## [1] 0.2105 0.3243 0.4444

In the case when the target ICC is 0.075:

\[ ICC = \frac{\sigma_b^2}{\sigma_b ^2 + \sigma_w ^2} = \frac{0.324}{0.324 + 4} \approx 0.075\]

Simulating from the normal distribution

If we specify the variance for the site-level random effect to be 0.2105 in conjunction with the individual-level (within) variance of 4, the observed ICC from the simulated data will be approximately 0.05:

set.seed(73632) # specify between site variation d <- defData(varname = "a", formula = 0, variance = 0.2105, id = "grp") d <- defData(d, varname = "size", formula = 1000, dist = "nonrandom") a <- defDataAdd(varname = "y1", formula = "30 + a", variance = 4, dist = "normal") dT <- genData(10000, d) # add patient level data dCn05 <- genCluster(dtClust = dT, cLevelVar = "grp", numIndsVar = "size", level1ID = "id") dCn05 <- addColumns(a, dCn05) dCn05 ## grp a size id y1 ## 1: 1 -0.3255465 1000 1 32.08492 ## 2: 1 -0.3255465 1000 2 27.21180 ## 3: 1 -0.3255465 1000 3 28.37411 ## 4: 1 -0.3255465 1000 4 27.70485 ## 5: 1 -0.3255465 1000 5 32.11814 ## --- ## 9999996: 10000 0.3191311 1000 9999996 30.15837 ## 9999997: 10000 0.3191311 1000 9999997 32.66302 ## 9999998: 10000 0.3191311 1000 9999998 28.34583 ## 9999999: 10000 0.3191311 1000 9999999 28.56443 ## 10000000: 10000 0.3191311 1000 10000000 30.06957

The between variance can be roughly estimated as the variance of the group means, and the within variance can be estimated as the average of the variances calculated for each group (this works well here, because we have so many clusters and patients per cluster):

between <- dCn05[, mean(y1), keyby = grp][, var(V1)] within <- dCn05[, var(y1), keyby = grp][, mean(V1)] total <- dCn05[, var(y1)] round(c(between, within, total), 3) ## [1] 0.212 3.996 4.203

The ICC is the ratio of the between variance to the total, which is also the sum of the two component variances:

round(between/(total), 3) ## [1] 0.05 round(between/(between + within), 3) ## [1] 0.05

Setting the site-level variance at 0.4444 gives us the ICC of 0.10:

d <- defData(varname = "a", formula = 0, variance = 0.4444, id = "grp") d <- defData(d, varname = "size", formula = 1000, dist = "nonrandom") a <- defDataAdd(varname = "y1", formula = "30 + a", variance = 4, dist = "normal") dT <- genData(10000, d) dCn10 <- genCluster(dtClust = dT, cLevelVar = "grp", numIndsVar = "size", level1ID = "id") dCn10 <- addColumns(a, dCn10) between <- dCn10[, mean(y1), keyby = grp][, var(V1)] within <- dCn10[, var(y1), keyby = grp][, mean(V1)] round(between / (between + within), 3) ## [1] 0.102

Other distributions

The ICC is a bit more difficult to interpret using other distributions where the variance is a function of the mean, such as with the binomial, Poisson, or Gamma distributions. However, we can still use the notion of between and within, but it may need to be transformed to another scale.

In the case of binary outcomes, we have to imagine an underlying or latent continuous process that takes place on the logistic scale. (I talked a bit about this here.)
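For the binary case, the arithmetic behind the returned variance can be sketched directly: on the latent logistic scale the within variance is fixed at \(\pi^2/3\), so the between variance needed for a target ICC follows from rearranging the ICC formula (a sketch of the idea, not the package source):

target_icc <- 0.05
var_within <- pi^2 / 3                       # residual variance of the standard logistic distribution
target_icc / (1 - target_icc) * var_within   # approximately 0.173, matching iccRE below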

### binary (setVar <- iccRE(ICC = 0.05, dist = "binary")) ## [1] 0.173151 d <- defData(varname = "a", formula = 0, variance = 0.1732, id = "grp") d <- defData(d, varname = "size", formula = 1000, dist = "nonrandom") a <- defDataAdd(varname = "y1", formula = "-1 + a", dist = "binary", link = "logit") dT <- genData(10000, d) dCb05 <- genCluster(dtClust = dT, cLevelVar = "grp", numIndsVar = "size", level1ID = "id") dCb05 <- addColumns(a, dCb05) dCb05 ## grp a size id y1 ## 1: 1 -0.20740274 1000 1 0 ## 2: 1 -0.20740274 1000 2 0 ## 3: 1 -0.20740274 1000 3 0 ## 4: 1 -0.20740274 1000 4 1 ## 5: 1 -0.20740274 1000 5 0 ## --- ## 9999996: 10000 -0.05448775 1000 9999996 0 ## 9999997: 10000 -0.05448775 1000 9999997 1 ## 9999998: 10000 -0.05448775 1000 9999998 0 ## 9999999: 10000 -0.05448775 1000 9999999 0 ## 10000000: 10000 -0.05448775 1000 10000000 0

The ICC for the binary distribution is on the logistic scale, and the within variance is constant. The between variance is estimated on the log-odds scale:

within <- (pi ^ 2) / 3 means <- dCb05[,mean(y1), keyby = grp] between <- means[, log(V1/(1-V1)), keyby = grp][abs(V1) != Inf, var(V1)] round(between / (between + within), 3) ## [1] 0.051

The ICC for the Poisson distribution is interpreted on the scale of the count measurements, even though the random effect variance is on the log scale. If you want to see the details behind the random effect variance derivation, see this paper by Austin et al., which was based on original work by Stryhn et al. that can be found here.
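As a rough check of the Poisson case (a sketch based on the standard result for a Poisson outcome with a log-normal random intercept; see the Austin et al. paper for the formal derivation), plugging the random effect variance returned by iccRE into the between and within components on the count scale recovers the target ICC:

sigma2 <- 0.0017513                                        # random effect variance from iccRE below
lambda <- 30
var_between <- lambda^2 * (exp(2 * sigma2) - exp(sigma2))  # variance of the conditional means
mean_within <- lambda * exp(sigma2 / 2)                    # expected conditional (Poisson) variance
var_between / (var_between + mean_within)                  # approximately 0.05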

(setVar <- iccRE(ICC = 0.05, dist = "poisson", lambda = 30)) ## [1] 0.0017513 d <- defData(varname = "a", formula = 0, variance = 0.0018, id = "grp") d <- defData(d, varname = "size", formula = 1000, dist = "nonrandom") a <- defDataAdd(varname = "y1", formula = "log(30) + a", dist = "poisson", link = "log") dT <- genData(10000, d) dCp05 <- genCluster(dtClust = dT, cLevelVar = "grp", numIndsVar = "size", level1ID = "id") dCp05 <- addColumns(a, dCp05) dCp05 ## grp a size id y1 ## 1: 1 0.035654485 1000 1 26 ## 2: 1 0.035654485 1000 2 36 ## 3: 1 0.035654485 1000 3 31 ## 4: 1 0.035654485 1000 4 34 ## 5: 1 0.035654485 1000 5 21 ## --- ## 9999996: 10000 0.002725561 1000 9999996 26 ## 9999997: 10000 0.002725561 1000 9999997 25 ## 9999998: 10000 0.002725561 1000 9999998 27 ## 9999999: 10000 0.002725561 1000 9999999 28 ## 10000000: 10000 0.002725561 1000 10000000 37

The variance components and ICC for the Poisson can be estimated using the same approach as the normal distribution:

between <- dCp05[, mean(y1), keyby = grp][, var(V1)] within <- dCp05[, var(y1), keyby = grp][, mean(V1)] round(between / (between + within), 3) ## [1] 0.051

Finally, here are the results for the Gamma distribution, which I talked about in great length in an earlier post:

(setVar <- iccRE(ICC = 0.05, dist = "gamma", disp = 0.25 )) ## [1] 0.01493805 d <- defData(varname = "a", formula = 0, variance = 0.0149, id = "grp") d <- defData(d, varname = "size", formula = 1000, dist = "nonrandom") a <- defDataAdd(varname = "y1", formula = "log(30) + a", variance = 0.25, dist = "gamma", link = "log") dT <- genData(10000, d) dCg05 <- genCluster(dtClust = dT, cLevelVar = "grp", numIndsVar = "size", level1ID = "id") dCg05 <- addColumns(a, dCg05) dCg05 ## grp a size id y1 ## 1: 1 0.09466305 1000 1 14.31268 ## 2: 1 0.09466305 1000 2 39.08884 ## 3: 1 0.09466305 1000 3 28.08050 ## 4: 1 0.09466305 1000 4 53.27853 ## 5: 1 0.09466305 1000 5 37.93855 ## --- ## 9999996: 10000 0.25566417 1000 9999996 14.16145 ## 9999997: 10000 0.25566417 1000 9999997 42.54838 ## 9999998: 10000 0.25566417 1000 9999998 76.33642 ## 9999999: 10000 0.25566417 1000 9999999 34.16727 ## 10000000: 10000 0.25566417 1000 10000000 21.06282

The ICC for the Gamma distribution is on the log scale:

between <- dCg05[, mean(log(y1)), keyby = grp][, var(V1)] within <- dCg05[, var(log(y1)), keyby = grp][, mean(V1)] round(between / (between + within), 3) ## [1] 0.05

It is possible to think about the ICC in the context of covariates, but interpretation is less straightforward. The ICC itself will likely vary across different levels of the covariates. For this reason, I like to think of the ICC in the marginal context.

I leave you with some visuals of clustered binary data with ICCs ranging from 0 to 0.075, both on the log-odds and probability scales.

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

Create your Machine Learning library from scratch with R ! (3/5) – KNN

Wed, 05/23/2018 - 13:48

(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

This is the third post of the “Create your Machine Learning library from scratch with R!” series. Today, we will see how you can implement K nearest neighbors (KNN) using only the linear algebra available in R. Previously, we managed to implement PCA and next time we will deal with SVM and decision trees.

K-nearest neighbors (KNN) is a simple yet efficient classification and regression algorithm. KNN assumes that an observation will be similar to its K closest neighbors. For instance, if most of the neighbors of a given point belong to a given class, it seems reasonable to assume that the point will belong to the same class.

The mathematics of KNN

Now, let’s quickly derive the mathematics used for KNN regression (they are similar for classification).

Let \((x_1, y_1), \ldots, (x_n, y_n)\) be the observations of our training dataset. The points \(x_i\) are in \(\mathbb{R}^d\). We denote by \(y_i\) the variable we seek to estimate. We know its value for the train dataset.
Let \(x_{new}\) be a new point in \(\mathbb{R}^d\). We do not know \(y_{new}\) and will estimate it using our train dataset.

Let \(k\) be a positive and non-zero integer (the number of neighbors used for estimation). We want to select the \(k\) points from the dataset which are the closest to \(x_{new}\). To do so, we compute the euclidean distances \(\|x_{new} - x_i\|_2\). From all the distances, we can compute \(r_k\), the smallest radius of the circle centered on \(x_{new}\) which includes exactly \(k\) points from the training sample.

An estimation of \(y_{new}\) is now easy to construct. It is the mean of the \(y_i\) of the \(k\) closest points to \(x_{new}\):

\[
\hat{y}_{new} = \frac{1}{k} \sum_{i \, : \, \|x_i - x_{new}\|_2 \leq r_k} y_i
\]

KNN regression in R

First, we build a “my_knn_regressor” object which stores all the training points, the value of the target variable and the number of neighbors to use.

my_knn_regressor <- function(x,y,k=5) { if (!is.matrix(x)) { x <- as.matrix(x) } if (!is.matrix(y)) { y <- as.matrix(y) } my_knn <- list() my_knn[['points']] <- x my_knn[['value']] <- y my_knn[['k']] <- k attr(my_knn, "class") <- "my_knn_regressor" return(my_knn) }

The tricky part of KNN is computing the distances efficiently. We will use the function we created in our previous post on vectorization. The function and the mathematical derivations are specified in that post.
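For reference, the identity exploited by compute_pairwise_distance() below is the usual expansion of the squared euclidean distance,

\[
\|x - y\|_2^2 = \|x\|_2^2 + \|y\|_2^2 - 2 \, x^\top y,
\]

which lets all pairwise squared distances be computed with a single matrix product.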

gramMatrix <- function(X, Y) {
  tcrossprod(X, Y)
}

compute_pairwise_distance <- function(X, Y) {
  xn <- rowSums(X ** 2)
  yn <- rowSums(Y ** 2)
  outer(xn, yn, '+') - 2 * tcrossprod(X, Y)
}

Now we can build our predictor:

predict.my_knn_regressor <- function(my_knn, x) {
  if (!is.matrix(x)) {
    x <- as.matrix(x)
  }
  ## Pairwise squared distances: rows are new points, columns are training points
  dist_pair <- compute_pairwise_distance(x, my_knn[['points']])
  ## M[i, j] is TRUE if x_j is one of the k closest training points to x_i
  ## (rank(), not order(), is needed to pick the k smallest distances in each row)
  M <- t(apply(dist_pair, 1, rank, ties.method = "first")) <= my_knn[['k']]
  ## Sum the values of the k closest points and normalise by k
  M %*% my_knn[['value']] / my_knn[['k']]
}

The last lines may seem complicated:

  1. compute_pairwise_distance(x, my_knn[['points']]) gives the squared distance between every new point and every training point
  2. apply(dist_pair, 1, rank, ties.method = "first") ranks, for each new point, the training points by distance
  3. Comparing these ranks to my_knn[['k']] and transposing yields a one-hot matrix M: M[i, j] is 1 if x_j is one of the k closest training points to x_i, and zero otherwise
  4. M %*% my_knn[['value']] / my_knn[['k']] sums the values of the k closest points and normalises the result by k

A quick brute-force sanity check of the predictor is sketched below.
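Here is that check, a small sketch of my own (not from the original post): it compares the vectorised predictor against a naive loop on random data, so any disagreement would point to a bug in the matrix version.

# Sanity check (my own sketch): vectorised KNN vs. a naive loop
set.seed(1)
x_train <- matrix(rnorm(40), ncol = 2)
y_train <- rowSums(x_train) + rnorm(20, sd = 0.1)
knn <- my_knn_regressor(x_train, y_train, k = 3)

naive_knn <- function(x_new) {
  d <- colSums((t(x_train) - x_new) ^ 2)   # squared distances to all training points
  mean(y_train[order(d)[1:3]])             # average of the 3 nearest target values
}

max(abs(predict(knn, x_train) - apply(x_train, 1, naive_knn)))
## should be (numerically) zero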
KNN Binary Classification in R

The previous code can be reused as is for binary classification. Your outcome should simply be encoded as 0/1. If the estimated output is greater (resp. less) than 0.5, you can assume that your point belongs to the class encoded as one (resp. zero). We will use the classical iris dataset and classify the setosa versus the virginica species.

iris_class <- iris[iris[["Species"]] != "versicolor", ]
iris_class[["Species"]] <- as.numeric(iris_class[["Species"]] != "setosa")
knn_class <- my_knn_regressor(iris_class[, 1:2], iris_class[, 5])
predict(knn_class, iris_class[, 1:2])
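If you want hard class labels instead of the averaged 0/1 outcomes, a small helper (my own sketch, not part of the original post) makes the 0.5 threshold explicit:

# Hypothetical helper (not from the post): turn averaged 0/1 outcomes into labels
predict_class <- function(my_knn, x, threshold = 0.5) {
  prob <- predict(my_knn, x)          # mean of the k nearest 0/1 outcomes
  as.integer(prob > threshold)        # 1 = class encoded as one, 0 otherwise
}

table(predicted = predict_class(knn_class, iris_class[, 1:2]),
      actual = iris_class[["Species"]])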

Since we only used 2 variables, we can easily plot the decision boundaries on a 2D plot.

# Build grid
x_coord <- seq(min(iris_class[, 1]) - 0.2, max(iris_class[, 1]) + 0.2, length.out = 200)
y_coord <- seq(min(iris_class[, 2]) - 0.2, max(iris_class[, 2]) + 0.2, length.out = 200)
coord <- expand.grid(x = x_coord, y = y_coord)

# Predict probabilities on the grid
coord[['prob']] <- predict(knn_class, coord[, 1:2])

library(ggplot2)
ggplot() +
  ## Add tiles coloured according to the predicted probabilities
  geom_tile(data = coord, mapping = aes(x, y, fill = prob)) +
  scale_fill_gradient(low = "lightblue", high = "red") +
  ## Add the observed points (Species is 0/1 here, so convert it to a factor for the shape scale)
  geom_point(data = iris_class,
             mapping = aes(Sepal.Length, Sepal.Width, shape = factor(Species)), size = 3) +
  ## Add the labels to the plot
  xlab('Sepal length') +
  ylab('Sepal width') +
  ggtitle('Decision boundaries of KNN') +
  ## Remove the grey border around the tiles
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0))

And this gives us this cool plot:

Possible extensions

Our current KNN is basic, but you can improve and test it in several ways:

  • What is the influence of the number of neighbors? (You should see some overfitting/underfitting.)
  • Can you implement metrics other than the Euclidean distance? Can you create kernel KNNs? (A sketch of one option follows this list.)
  • Instead of doing estimations using only the mean, could you use a more complex mapping?
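As a starting point for the kernel idea, here is a minimal sketch of my own (not from the post): instead of averaging the k closest values, weight every training point with a Gaussian kernel of the squared distance.

# Hypothetical extension (my sketch): Gaussian-kernel-weighted KNN regression.
# Note that it weights all training points, so k is no longer needed.
kernel_knn_predict <- function(my_knn, x, bandwidth = 1) {
  if (!is.matrix(x)) x <- as.matrix(x)
  d2 <- compute_pairwise_distance(x, my_knn[['points']])   # squared distances
  w <- exp(-d2 / (2 * bandwidth ^ 2))                      # Gaussian weights
  (w %*% my_knn[['value']]) / rowSums(w)                   # weighted average
}

kernel_knn_predict(knn_class, iris_class[, 1:2], bandwidth = 0.5)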

Thanks for reading! To find more posts on Machine Learning, Python and R, you can follow us on Facebook or Twitter.

The post Create your Machine Learning library from scratch with R ! (3/5) – KNN appeared first on Enhance Data Science.


To leave a comment for the author, please follow the link and comment on their blog: Enhance Data Science.

Introducing Datazar Paper

Wed, 05/23/2018 - 12:08

(This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to R-bloggers)

We’re finally here. Datazar Paper is the most exciting tool we’ve created to date. Up until now, I’d like to believe we’ve only made par with the status quo in terms of the tools we have developed for our beloved community of industry researchers, hackers and academics.

We have a vision at Datazar where researchers will eventually be able to create and consume research in one continuous cycle. We introduced tools like Replication (the ability to replicate a research project simply by clicking a button), Discussions (real-time chat where you can exchange code snippets etc…) and Metrics (research tracking and feedback on views, usage etc…). These tools help users create research with their colleagues without juggling different tools.

Datazar Paper is the tool that ties all of these together. A common workflow on Datazar is (1) uploading data, (2) creating an analysis based on the data using, for example, an R notebook and (3) exporting the results/visualizations for reporting or for a research paper. That last step was always awkward to deal with because there are so many ways to go about it. At some point we created a versatile JavaScript notebook to try to tie all of it together and make reporting easier. Before that, we created a new language to simplify creating a report or paper while maintaining interactivity. Reports and papers are everywhere; the problem is they are not interactive and/or reproducible. This article* on the death of traditional papers explains it really well.

In short, we believe the traditional, static paper is not good enough anymore in today’s world to convey today’s complex ideas.

Datazar Paper Editor

Medium (yes, this Medium) was a giant leap in terms of how easy it made creating articles. Datazar Paper was heavily influenced by Medium, specifically the editor. We took the same approach and created an editor that allows you to create an article/paper/report like a Lego house. No code involved whatsoever. This last bit is extremely important because the people in an organization who usually do the reporting are the people with the least experience in coding. And to be frank, you shouldn’t really need code to create a simple paper or report.

We recognize that the overall transition to an interactive and easily reproducible writing process will be long, so I’d like to take this opportunity to say that we still support LaTeX and Markdown and will continue to do so. In case you didn’t know, Datazar gives you the ability to create LaTeX and Markdown documents in your browser. In fact, LaTeX documents are the most popular files created on Datazar.

With Datazar Paper, we got one step closer to creating an ecosystem where research is created, shared and preserved automatically. Now you can store your data, analyze it using R & Python notebooks and publish everything from code to visualizations using Datazar Paper. All without switching between a dozen programs and without squeezing your CPU for computational power.

Link to Launch: https://www.producthunt.com/posts/datazar-paper

Some papers I picked out:

https://www.datazar.com/focus/f03b8705b-c0ba-454c-afa0-3a7729a6c96f

https://www.datazar.com/focus/f5abc508d-7091-40cd-a14b-c6f2da005c14

https://www.datazar.com/focus/f649aea35-953d-4eb2-8ad4-037d6c343225

*https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/

Introducing Datazar Paper was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.


To leave a comment for the author, please follow the link and comment on their blog: R Language in Datazar Blog on Medium.

taxize: seven years of taxonomy in R

Wed, 05/23/2018 - 02:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

taxize was seven years old this last Saturday!

What is taxize?

taxize is designed around making working with taxonomic names easier, abstracting away the details of what each of the 20 or so taxonomic data sources requires for a given use case.

A sampling of use cases covered in taxize (all of these across many different data sources; a minimal sketch of a couple of them follows the list):

  • Taxonomic identifier from a taxonomic name and vice versa
  • Taxonomic name from a vernacular (common) name and vice versa
  • Taxonomic hierarchy/classification from identifier or name
  • Taxonomic children of an identifier or name
  • All taxa downstream to a certain rank from identifier or name
  • Taxonomic name synonyms from identifier or name
  • Lowest common taxon and rank for an identifier or name
  • Resolve taxonomic names, i.e., fix spelling errors
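For readers new to the package, this is roughly what a couple of those use cases look like in practice (the example names below are my own, not from this post, and the exact calls may vary with the taxize version):

library(taxize)

# taxonomic identifier from a taxonomic name (here against the NCBI database)
uid <- get_uid("Chironomus riparius")

# taxonomic hierarchy/classification from that identifier
classification(uid)

# resolve a (misspelled) taxonomic name
gnr_resolve("Helianthus annus")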

History

taxize was one of our first packages. Our first commit was on 2011-05-19, uneventfully adding an empty README:

We’ve come a long way since May 2011. We’ve added a lot of new functionality and many new contributors.

Commit history

Get git commits for taxize using a few tidyverse packages as well as git2r, our R package for working with git repositories:

library(git2r)
library(ggplot2)
library(dplyr)

repo <- git2r::repository("~/github/ropensci/taxize")
res <- commits(repo)

A graph of commit history

dates <- vapply(res, function(z) {
  as.character(as.POSIXct(z@author@when@time, origin = "1970-01-01"))
}, character(1))

df <- tbl_df(data.frame(date = dates, stringsAsFactors = FALSE)) %>%
  group_by(date) %>%
  summarise(count = n()) %>%
  mutate(cumsum = cumsum(count)) %>%
  ungroup()

ggplot(df, aes(x = as.Date(date), y = cumsum)) +
  geom_line(size = 2) +
  theme_grey(base_size = 16) +
  scale_x_date(labels = scales::date_format("%Y/%m")) +
  labs(x = 'May 2011 to May 2018', y = 'Cumulative Git Commits')

Contributors

A graph of new contributors through time

date_name <- lapply(res, function(z) {
  data_frame(
    date = as.character(as.POSIXct(z@author@when@time, origin = "1970-01-01")),
    name = z@author@name
  )
})
date_name <- bind_rows(date_name)

firstdates <- date_name %>%
  group_by(name) %>%
  arrange(date) %>%
  filter(rank(date, ties.method = "first") == 1) %>%
  ungroup() %>%
  mutate(count = 1) %>%
  arrange(date) %>%
  mutate(cumsum = cumsum(count))

## plot
ggplot(firstdates, aes(as.Date(date), cumsum)) +
  geom_line(size = 2) +
  theme_grey(base_size = 18) +
  scale_x_date(labels = scales::date_format("%Y/%m")) +
  labs(x = 'May 2011 to May 2018', y = 'Cumulative New Contributors')

taxize contributors, including those that have opened issues (click to go to their GitHub profile):

afkoeppelahhurlbertalbndAlectoriaandzandz11antagomirarendseeashenkinashiklombomearabw4szcboettigcdetermanChrKoenigchuckrpclarson2191claudenozerescmzambranatdaattaliDanielGMeaddavharrisdavidvilanovadiogoprovdlebauerdlenz1dschlaepEDiLDemhartfdschneiderfgabriel1891fmichonneaugedankenstueckegimoyaGISKidgit-ogglarocgustavobioibartomeusjangoreckijarioksajebyrnesjohnbaumsjonmcalderJoStaerkjsgosnellkamapukarthikKevCazkgturnerkmeversonKoalhaljvillanuevaMarkus2015mcsipleMikkoVihtakarimillerjefmiriamgracempnelsenMUSEZOOLVERTnate-d-olsonnmatzkenpchpaternogbcphilippipmarchand1pssguyRodgerGrossmouncesariyascelmendorfsckottSimonGoringsnshethsnubianSquiercgtdjames1tmkurobetpaulson1tpoisotvijaybarvewcornwellwillpearsewpetryzachary-foster

taxize usage

Eduard Szöcs and I wrote a paper describing taxize back in 2013, published in F1000Research.

Scott Chamberlain and Eduard Szöcs (2013). taxize – taxonomic search and retrieval in R. F1000Research 2:191. https://doi.org/10.12688/f1000research.2-191.v1

The paper has probably made taxize users more likely to cite the package, though we have no direct proof of that.

The paper above and/or the package have been cited 69 times over the past 7 years.

The way taxize is used in research is often in “cleaning” taxonomic names in one way or another. In addition, many users use taxize to get taxonomic names for certain groups of interest.

One example comes from the paper

Weber, M. G., Porturas, L. D., & Taylor, S. A. (2016). Foliar nectar enhances plant–mite mutualisms: the effect of leaf sugar on the control of powdery mildew by domatia-inhabiting mites. Annals of Botany, 118(3), 459–466. doi:10.1093/aob/mcw118

Features coming down the road
  • Integration with the taxa package (maintained by one of our rOpenSci fellows, Zachary Foster) in all get_*() functions. This will make the outputs of all get_*() functions more consistent and easier to integrate into your downstream workflows.
  • Not taxize per se, but taxizedb is starting to really take shape thanks to help from Zebulun Arendsee. taxizedb will make taxonomic name work much faster for large datasets. It’s worth checking out.

Thanks!

A huge thanks goes to all taxize users and contributors. It’s awesome to see how useful taxize has been through the years, and we look forward to making it even better moving forward.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

Workshop in Cape Town: Web Scraping with R

Wed, 05/23/2018 - 02:00

(This article was first published on Digital Age Economist on Digital Age Economist, and kindly contributed to R-bloggers)

Join Andrew Collier and Hanjo Odendaal for a workshop on using R for Web Scraping.

Who should attend?

This workshop is aimed at beginner and intermediate R users who want to learn more about using R for data acquisition and management, with a specific focus on web scraping.

What will you learn?

You will learn:

  • data manipulation with dplyr, tidyr and purrr;
  • tools for accessing the DOM;
  • scraping static sites with rvest (a small taste of which is sketched after this list);
  • scraping dynamic sites with RSelenium; and
  • setting up an automated scraper in the cloud.
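To give a flavour of the rvest part, here is a minimal sketch of my own (the URL and CSS selector are placeholders, not workshop material):

library(rvest)

# Hypothetical example: pull the headlines off a static page
page <- read_html("https://example.com")   # placeholder URL
page %>%
  html_nodes("h2.headline") %>%            # placeholder CSS selector
  html_text(trim = TRUE)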

See programme below for further details.

Where – Rise, Floor 5, Woodstock Exchange, 66 Albert Road, Woodstock, Cape Town
When – 14-15 June 2018
Who – Andrew Collier and Hanjo Odendaal

There are just 20 seats available. A 10% discount is available for groups of 4 or more people from a single organisation attending both days.

Email training@exegetic.biz if you have any questions about the workshop.

Register

Programme Day 1
  • Motivating Example
  • R and the tidyverse
    • Vectors, Lists and Data Frames
    • Loading data from a file
    • Manipulating Data Frames with dplyr
    • Pivoting with tidyr
    • Functional programming with purrr
  • Introduction to scraping
    • Ethics
    • DOM
    • Developer Tools
    • CSS and XPath
    • robots.txt and site map
  • Scraping a static site with rvest
    • What happens under the hood
    • What the hell is curl?
    • Assisted Assignment: Movie information from IMDB
Day 2
  • Case Study: Investigating drug tests using rvest
  • Interacting with APIs
    • Using XHR to find an API
    • Building wrappers around APIs
  • Scraping a dynamic site with RSelenium
    • Why RSelenium is needed
    • Navigation around web-pages
    • Combining RSelenium with rvest
    • Useful JavaScript tools
    • Case Study
  • Deploying a Scraper in the Cloud
    • Launching and connecting to an EC2 instance
    • Headless browsers
    • Automation with cron

Register


To leave a comment for the author, please follow the link and comment on their blog: Digital Age Economist on Digital Age Economist.

Why you should regret not going to eRum 2018?

Wed, 05/23/2018 - 02:00

(This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers)

I spent 3 amazing days at the eRum conference in Budapest. The conference was a blast and the organizers (BIG thanks to them again) did a wonderful job compiling such a high-level event.

My favourite talks from the conference

Better than Deep Learning – Gradient Boosting Machines (GBM) in R – Szilard Pafka

Szilard gave a thorough overview of the currently available ML algorithms, showed that classical ML algorithms work better on tabular data than deep learning, and gave his advice on which packages to choose depending on your goals, like maximizing speed on GPU/CPU or going to production.

My takeaway from his talk: you should choose an algorithm based on the problem you have and take into account outside constraints like interpretability. Choosing a model is one thing, but a lot of prediction improvement can come from feature engineering, so domain knowledge and problem understanding matter a lot.

Thanks to Erin LeDell’s talk we know that the majority of ML tasks can be automated thanks to their awesome AutoML framework.

Harness the R condition system – Lionel Henry

Lionel talked about improving errors in R. Currently R offers error handling solely through the tryCatch() function. From the presentation we learn that errors are regular objects, which makes it possible for a user to provide custom classes and error metadata, and that in turn makes handling and reporting much easier to implement. Some of the ideas he shared will be available through the new release of the rlang package.
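As a small illustration of that idea (my own sketch, not code from the talk), base R already lets you build classed error objects with metadata and handle them by class:

# Minimal sketch: a classed condition with metadata, caught by its class
my_app_error <- function(msg, data = NULL) {
  structure(
    class = c("my_app_error", "error", "condition"),
    list(message = msg, call = sys.call(-1), data = data)
  )
}

tryCatch(
  stop(my_app_error("bad input", data = list(row = 42))),
  my_app_error = function(e) {
    cat("caught:", conditionMessage(e), "- offending row:", e$data$row, "\n")
  }
)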

Show me your model 2.0! – Przemysław Biecek

Przemek, together with Mateusz, gave both a workshop and a talk about the Dalex package, an impressive toolkit for understanding machine learning models. Dalex is being developed by a talented group of Przemek’s students in Poland. Thanks to Dalex you can produce single-variable explanations for more than one model at the same time, and it is also easier to understand how a single variable influences the prediction.
You may wonder why you should use Dalex if you are already familiar with Lime; the answer is that Dalex offers many methods for variable inspection (lime has one) and lets you compare many models using the selected method.

R-Ladies event

The cherry on the cake was the R-Ladies Budapest event, where we could hear 6 amazing presentations. Some of the R-Ladies were giving their second talk during those 3 days. One of them was Omayma Said, talking about her Shiny app in “Stringr Explorer: Tweet Driven Development for a Shiny App!”. It’s a really cool app that helps you navigate the stringr package, plus Omayma’s story of how it was created was entertaining and admirable.

Other conference perks

It’s always a pleasure to meet people from the R community and fellow R-Ladies in person.

A great thing about events like this is that there is always something extra you learn: the recipes package is a neat way to do data pre-processing for your model; thanks to Barbara I now know about 2 useful parameters, ignoreInit and once, in the observeEvent function; and Tobias explained when I would want to choose R vs. Python for deep learning: if you just need Keras, go for R, since everything you can do in Python is available in R!
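For reference, here is a minimal sketch of those two observeEvent() arguments (my own toy app, not from the conference): ignoreInit skips the run that fires when the observer is created, and once destroys the observer after its first real trigger.

library(shiny)

ui <- fluidPage(actionButton("go", "Run once"))

server <- function(input, output, session) {
  observeEvent(input$go, {
    message("Runs only on the first real click")
  }, ignoreInit = TRUE, once = TRUE)
}

# shinyApp(ui, server)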

Making Shiny shine brighter!

Finally, I had the pleasure of giving an invited talk about new Shiny packages: “Taking inspirations from proven frontend frameworks to add to Shiny with 6 new packages”. You can access the slides here and watch the video on YouTube. It was really valuable and motivating to get feedback on our open source packages. I’m proud to be part of such a great team!

If you like the idea of shiny.users and shiny.admin and would like to know when the packages are released, you can visit packages landing page or keep following our blog.

What next?

I hope the idea of eRum will continue and others will pick it up, so we can all meet again in 2 years’ time!
I’m already jealous of all the lucky people going to useR this year. I sadly won’t be there, but Marek will, and let me reveal a little secret: he will have shiny.semantic and semantic.dashboard cheat sheets and stickers to give away!

Read the original post at
Appsilon Data Science Blog.


To leave a comment for the author, please follow the link and comment on their blog: Appsilon Data Science Blog.

Finalfit, knitr and R Markdown for quick results

Tue, 05/22/2018 - 22:46

(This article was first published on R – DataSurg, and kindly contributed to R-bloggers)

Thank you for the many requests to provide some extra info on how best to get finalfit results out of RStudio, and particularly into Microsoft Word.

Here is how.

Make sure you are on the most up-to-date version of finalfit.

devtools::install_github("ewenharrison/finalfit")

What follows is for demonstration purposes and is not meant to illustrate model building.

Does a tumour characteristic (differentiation) predict 5-year survival?

Demographics table

First explore the variable of interest (exposure) by making it the dependent variable.

library(finalfit)
library(dplyr)

dependent = "differ.factor"

# Specify explanatory variables of interest
explanatory = c("age", "sex.factor", "extent.factor", "obstruct.factor", "nodes")

Note this useful alternative way of specifying explanatory variable lists:

colon_s %>% select(age, sex.factor, extent.factor, obstruct.factor, nodes) %>% names() -> explanatory

Look at associations between our exposure and other explanatory variables. Include missing data.

colon_s %>% summary_factorlist(dependent, explanatory, p=TRUE, na_include=TRUE)

label               levels                Well          Moderate      Poor          p
Age (years)         Mean (SD)             60.2 (12.8)   59.9 (11.7)   59 (12.8)     0.788
Sex                 Female                51 (11.6)     314 (71.7)    73 (16.7)     0.400
                    Male                  42 (9.0)      349 (74.6)    77 (16.5)
Extent of spread    Submucosa             5 (25.0)      12 (60.0)     3 (15.0)      0.081
                    Muscle                12 (11.8)     78 (76.5)     12 (11.8)
                    Serosa                76 (10.2)     542 (72.8)    127 (17.0)
                    Adjacent structures   0 (0.0)       31 (79.5)     8 (20.5)
Obstruction         No                    69 (9.7)      531 (74.4)    114 (16.0)    0.110
                    Yes                   19 (11.0)     122 (70.9)    31 (18.0)
                    Missing               5 (25.0)      10 (50.0)     5 (25.0)
nodes               Mean (SD)             2.7 (2.2)     3.6 (3.4)     4.7 (4.4)     <0.001

Warning messages:
1: In chisq.test(tab, correct = FALSE) :
  Chi-squared approximation may be incorrect
2: In chisq.test(tab, correct = FALSE) :
  Chi-squared approximation may be incorrect

Note the missing data in obstruct.factor. We will drop this variable for now (again, this is for demonstration only). Also note that nodes has not been labelled.
There are small numbers in some variables generating chisq.test warnings (expected counts less than 5 in some cells). Generate the final table.

Hmisc::label(colon_s$nodes) = "Lymph nodes involved"
explanatory = c("age", "sex.factor", "extent.factor", "nodes")

colon_s %>% 
  summary_factorlist(dependent, explanatory, 
    p=TRUE, na_include=TRUE, 
    add_dependent_label=TRUE) -> table1
table1

Dependent: Differentiation                            Well          Moderate      Poor          p
Age (years)                   Mean (SD)               60.2 (12.8)   59.9 (11.7)   59 (12.8)     0.788
Sex                           Female                  51 (11.6)     314 (71.7)    73 (16.7)     0.400
                              Male                    42 (9.0)      349 (74.6)    77 (16.5)
Extent of spread              Submucosa               5 (25.0)      12 (60.0)     3 (15.0)      0.081
                              Muscle                  12 (11.8)     78 (76.5)     12 (11.8)
                              Serosa                  76 (10.2)     542 (72.8)    127 (17.0)
                              Adjacent structures     0 (0.0)       31 (79.5)     8 (20.5)
Lymph nodes involved          Mean (SD)               2.7 (2.2)     3.6 (3.4)     4.7 (4.4)     <0.001

Logistic regression table

Now examine the explanatory variables against the outcome. Check that the plot runs OK.

explanatory = c("age", "sex.factor", "extent.factor", "nodes", "differ.factor") dependent = "mort_5yr" colon_s %>% finalfit(dependent, explanatory, dependent_label_prefix = "") -> table2

Mortality 5 year                            Alive         Died          OR (univariable)             OR (multivariable)
Age (years)           Mean (SD)             59.8 (11.4)   59.9 (12.5)   1.00 (0.99-1.01, p=0.986)    1.01 (1.00-1.02, p=0.195)
Sex                   Female                243 (47.6)    194 (48.0)    -                            -
                      Male                  268 (52.4)    210 (52.0)    0.98 (0.76-1.27, p=0.889)    0.98 (0.74-1.30, p=0.885)
Extent of spread      Submucosa             16 (3.1)      4 (1.0)       -                            -
                      Muscle                78 (15.3)     25 (6.2)      1.28 (0.42-4.79, p=0.681)    1.28 (0.37-5.92, p=0.722)
                      Serosa                401 (78.5)    349 (86.4)    3.48 (1.26-12.24, p=0.027)   3.13 (1.01-13.76, p=0.076)
                      Adjacent structures   16 (3.1)      26 (6.4)      6.50 (1.98-25.93, p=0.004)   6.04 (1.58-30.41, p=0.015)
Lymph nodes involved  Mean (SD)             2.7 (2.4)     4.9 (4.4)     1.24 (1.18-1.30, p<0.001)    1.23 (1.17-1.30, p<0.001)
Differentiation       Well                  52 (10.5)     40 (10.1)     -                            -
                      Moderate              382 (76.9)    269 (68.1)    0.92 (0.59-1.43, p=0.694)    0.70 (0.44-1.12, p=0.132)
                      Poor                  63 (12.7)     86 (21.8)     1.77 (1.05-3.01, p=0.032)    1.08 (0.61-1.90, p=0.796)

Odds ratio plot

colon_s %>% or_plot(dependent, explanatory, breaks = c(0.5, 1, 5, 10, 20, 30))

To MS Word via knitr/R Markdown

Important. In most R Markdown set-ups, environment objects need to be saved and then loaded into the R Markdown document.

# Save objects for knitr/markdown
save(table1, table2, dependent, explanatory, file = "out.rda")

We use RStudio Server Pro set-up on Ubuntu. But these instructions should work fine for most/all RStudio/Markdown default set-ups.

In RStudio, select File > New File > R Markdown.

A useful template file is produced by default. Try hitting Knit to Word on the Knit button at the top of the .Rmd script window.

Now paste this into the file:

--- title: "Example knitr/R Markdown document" author: "Ewen Harrison" date: "22/5/2018" output: word_document: default --- ```{r setup, include=FALSE} # Load data into global environment. library(finalfit) library(dplyr) library(knitr) load("out.rda") ``` ## Table 1 - Demographics ```{r table1, echo = FALSE, results='asis'} kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r")) ``` ## Table 2 - Association between tumour factors and 5 year mortality ```{r table2, echo = FALSE, results='asis'} kable(table2, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r")) ``` ## Figure 1 - Association between tumour factors and 5 year mortality ```{r figure1, echo = FALSE} colon_s %>% or_plot(dependent, explanatory) ```

It’s ok, but not great.

Create Word template file

Now, edit the Word template. Click on a table. The style should be compact. Right click > Modify... > font size = 9. Alter heading and text styles in the same way as desired. Save this as template.docx. Upload it to your project folder. Add this reference to the .Rmd YAML heading, as below. Make sure you get the spacing correct.

The plot also doesn’t look quite right and it prints with warning messages. Experiment with fig.width to get it looking right.

Now paste this into your .Rmd file and run:

--- title: "Example knitr/R Markdown document" author: "Ewen Harrison" date: "21/5/2018" output: word_document: reference_docx: template.docx --- ```{r setup, include=FALSE} # Load data into global environment. library(finalfit) library(dplyr) library(knitr) load("out.rda") ``` ## Table 1 - Demographics ```{r table1, echo = FALSE, results='asis'} kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r")) ``` ## Table 2 - Association between tumour factors and 5 year mortality ```{r table2, echo = FALSE, results='asis'} kable(table2, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r")) ``` ## Figure 1 - Association between tumour factors and 5 year mortality ```{r figure1, echo = FALSE, warning=FALSE, message=FALSE, fig.width=10} colon_s %>% or_plot(dependent, explanatory) ```

This is now looking good for me, and further tweaks can be made.

To PDF via knitr/R Markdown

Default settings for PDF:

--- title: "Example knitr/R Markdown document" author: "Ewen Harrison" date: "21/5/2018" output: pdf_document: default --- ```{r setup, include=FALSE} # Load data into global environment. library(finalfit) library(dplyr) library(knitr) load("out.rda") ``` ## Table 1 - Demographics ```{r table1, echo = FALSE, results='asis'} kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r")) ``` ## Table 2 - Association between tumour factors and 5 year mortality ```{r table2, echo = FALSE, results='asis'} kable(table2, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r")) ``` ## Figure 1 - Association between tumour factors and 5 year mortality ```{r figure1, echo = FALSE} colon_s %>% or_plot(dependent, explanatory) ```

Again, ok but not great.

We can fix the plot in exactly the same way. But the table is off the side of the page. For this we use the kableExtra package. Install this in the normal manner. You may also want to alter the margins of your page using geometry in the preamble.

--- title: "Example knitr/R Markdown document" author: "Ewen Harrison" date: "21/5/2018" output: pdf_document: default geometry: margin=0.75in --- ```{r setup, include=FALSE} # Load data into global environment. library(finalfit) library(dplyr) library(knitr) library(kableExtra) load("out.rda") ``` ## Table 1 - Demographics ```{r table1, echo = FALSE, results='asis'} kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"), booktabs=TRUE) ``` ## Table 2 - Association between tumour factors and 5 year mortality ```{r table2, echo = FALSE, results='asis'} kable(table2, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"), booktabs=TRUE) %>% kable_styling(font_size=8) ``` ## Figure 1 - Association between tumour factors and 5 year mortality ```{r figure1, echo = FALSE, warning=FALSE, message=FALSE, fig.width=10} colon_s %>% or_plot(dependent, explanatory) ```

This is now looking pretty good for me as well.

There you have it. A pretty quick workflow to get final results into Word and a PDF.


To leave a comment for the author, please follow the link and comment on their blog: R – DataSurg.
