The Power of Standards and Consistency
(This article was first published on R – rud.is, and kindly contributed to R-bloggers)
I'm going to (eventually) write a full post on the package I'm mentioning in this one: osqueryr. The TL;DR on osqueryr is that it is an R DBI wrapper (with just enough glue to also be plugged into dbplyr) for osquery. The TL;DR on osquery is that it "exposes an operating system as a high-performance relational database. This design allows you to write SQL-based queries efficiently and easily to explore operating systems."
In short, osquery turns the metadata and state information of your local system (or remote systems) into a SQL-compliant database. It also works on Windows, Linux, BSD and macOS. This means you can query a fleet of systems with a (mostly) normalized set of tables and get aggregated results. Operations and information security staff use this to manage systems and perform incident-response tasks, but you can use it to get just about anything, and there are even more powerful modes of operation for osquery. But, more on all the features of osquery[r] in another post.
If you are skeptical, here’s some proof (which I need to show regardless of your skepticism state). First, a local “connection”:
library(DBI)
library(osqueryr)

con <- DBI::dbConnect(Osquery())

head(dbListTables(con), 10)
##  [1] "account_policy_data" "acpi_tables"         "ad_config"
##  [4] "alf"                 "alf_exceptions"      "alf_explicit_auths"
##  [7] "alf_services"        "app_schemes"         "apps"
## [10] "apt_sources"

dbListFields(con, "processes")
##  [1] "cmdline"            "cwd"                "disk_bytes_read"
##  [4] "disk_bytes_written" "egid"               "euid"
##  [7] "gid"                "name"               "nice"
## [10] "on_disk"            "parent"             "path"
## [13] "pgroup"             "pid"                "resident_size"
## [16] "root"               "sgid"               "start_time"
## [19] "state"              "suid"               "system_time"
## [22] "threads"            "total_size"         "uid"
## [25] "user_time"          "wired_size"

dbGetQuery(con, "SELECT name, system_time FROM processes WHERE name LIKE '%fire%'")
## # A tibble: 2 x 2
##   name     system_time
## 1 Firewall           3
## 2 firefox       517846

Then, a remote "connection":
con2 <- osqueryr::dbConnect(Osquery(), host = "hrbrmstr@osq1")

head(dbListTables(con2), 10)
##  [1] "account_policy_data" "acpi_tables"         "ad_config"
##  [4] "alf"                 "alf_exceptions"      "alf_explicit_auths"
##  [7] "alf_services"        "app_schemes"         "apps"
## [10] "apt_sources"

dbListFields(con2, "processes")
##  [1] "cmdline"            "cwd"                "disk_bytes_read"
##  [4] "disk_bytes_written" "egid"               "euid"
##  [7] "gid"                "name"               "nice"
## [10] "on_disk"            "parent"             "path"
## [13] "pgroup"             "pid"                "resident_size"
## [16] "root"               "sgid"               "start_time"
## [19] "state"              "suid"               "system_time"
## [22] "threads"            "total_size"         "uid"
## [25] "user_time"          "wired_size"

dbGetQuery(con2, "SELECT name, system_time FROM processes WHERE name LIKE '%fire%'")
## # A tibble: 1 x 2
##   name    system_time
## 1 firefox     1071992

"You're talking an awful lot about the package when you said this was a post on 'standards' and 'consistency'."

True, but we needed that bit above for context. To explain what this post has to do with "standards" and "consistency", I also need to tell you a bit more about how both osquery and the osqueryr package are implemented.
You can read about osquery in-depth starting at the link at the top of this post, but the authors of the tool really wanted a consistent idiom for accessing system metadata with usable, normalized output. They chose (to use a word they didn't, but one that works for an R audience) a "data frame" as the output format and picked the universal language of "data frames", SQL, as the inquiry interface. So, right there are examples of both standards and consistency: using SQL vs coming up with yet-another-query-language, and avoiding the chaos of the myriad outputs from various system commands by making all results conform to a rectangular data structure.
Let's take this one step further with a specific example. All modern operating systems have the concept of a "process" and said processes have (mostly) similar attributes. However, the commands used to get a detailed listing of those processes differ (sometimes wildly) from OS to OS. The authors of osquery came up with a set of schemas to ensure a common, rectangular output and naming conventions (note that some schemas are unique to a particular OS since some elements of operating systems have no useful counterparts on other operating systems).
The osquery authors also took consistency and standards to yet another level by taking advantage of a feature of SQLite called virtual tables. That enables them to have C/C++/Objective-C "glue" that gets called when a query is made, so they can dispatch the intent to the proper functions or shell commands and then send all the results back, or use the SQLite engine capabilities to do joining, filtering, UDF-calling, etc. to produce rich, targeted rectangular output.
By not reinventing the wheel and relying on well-accepted features like data frames, SQL and SQLite, the authors could direct all their focus on solving the problem they posited.
"Um, you're talking alot about everything but R now."We're getting to the good (i.e. "R") part now.
Because the authors didn't try to become experts in writing SQL parsers and instead relied on the standard SQL offerings of SQLite, the queries made are "real" SQL (if you've worked with more than one database engine, you know how they all implement different flavours of SQL).
Because these queries are "real" SQL, we can write an R DBI driver for it. The DBI package aims "[to define] a common interface between R and database management systems (DBMS). The interface defines a small set of classes and methods similar in spirit to Perl's DBI, Java's JDBC, Python's DBAPI, and Microsoft's ODBC. It defines a set of classes and methods defines what operations are possible and how they are performed."
If you look at the osqueryr package source, you'll see a bunch of DBI boilerplate code (which is in the rdbi organization example code) and only a handful of "touch points" for the actual calls to osqueryi (the command that processes SQL). No handling of anything but passing on SQL to the osqueryi engine and getting rectangular results back. By abstracting the system call details, R users can work with a familiar, consistent, standard interface and have full access to the power of osquery without firing up a terminal.
But it gets even better.
As noted above, one design aspect of osquery was to enable remote usage. Rather than come up with yet another daemon and custom protocol, the osquery authors suggest ssh as one way of invoking the command on remote systems and getting the rectangular results back.
Because the osqueryr package used the sys package for making local system calls, only a tiny bit of extra effort was required to switch from sys::exec_internal() to its sibling in the ssh package, ssh::ssh_exec_internal(), when remote connections are specified. (Said effort could have been zero if I had chosen a slightly different function in sys, too.)
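To illustrate the local-vs-remote dispatch idea, here is a hypothetical sketch (the helper name osq_query() and its argument handling are invented for illustration; this is not the package's actual code):

# Hypothetical sketch: run a query locally via sys, or remotely via ssh.
osq_query <- function(sql, host = NULL) {
  if (is.null(host)) {
    out <- sys::exec_internal("osqueryi", args = c("--json", sql))
  } else {
    session <- ssh::ssh_connect(host)
    on.exit(ssh::ssh_disconnect(session), add = TRUE)
    out <- ssh::ssh_exec_internal(session, command = sprintf("osqueryi --json \"%s\"", sql))
  }
  # both calls return stdout as a raw vector of JSON rows
  jsonlite::fromJSON(rawToChar(out$stdout))
}

# e.g. osq_query("SELECT name, pid FROM processes LIMIT 5")
# e.g. osq_query("SELECT name, pid FROM processes LIMIT 5", host = "user@somehost")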
Relying on well-accepted standards made both osqueryi and the R DBI driver work seamlessly without much code at all, and definitely without a rat's nest of nested if/else statements and custom httr functions.
But it gets even more betterer
Some folks like & grok SQL, others don't. (Humans have preferences, go figure.)
A few years ago, Hadley (do I even need to use his last name at this point in time?) came up with the idea to have a more expressive and consistent way to work with data frames. We now know this as the tidyverse, but one core element of the tidyverse is dplyr, which can really level up your data frame game (no comments about data.table or the beauty of base R, please). Not too long after the birth of dplyr came the ability to work with remote, rectangular, SQL-based data sources with (mostly) the same idioms.
And, not too long after that, the remote dplyr interface (now, dbplyr) got up close and personal with DBI. Which ultimately means that if you make a near-fully-compliant DBI interface to a SQL backend, you can now do something like this:
library(DBI)
library(dplyr)
library(osqueryr)

con <- DBI::dbConnect(Osquery())

osqdb <- src_dbi(con)

procs <- tbl(osqdb, "processes")
listen <- tbl(osqdb, "listening_ports")

left_join(procs, listen, by = "pid") %>%
  filter(port != "", protocol == "17") %>% # 17 == UDP
  distinct(name, port, address, pid)
## # Source: lazy query [?? x 4]
## # Database: OsqueryConnection
##    address name              pid   port
##  1 0.0.0.0 BetterTouchTool   46317 57183
##  2 0.0.0.0 Dropbox           1214  17500
##  3 0.0.0.0 SystemUIServer    429   0
##  4 0.0.0.0 SystemUIServer    429   62240
##  5 0.0.0.0 UserEventAgent    336   0
##  6 0.0.0.0 WiFiAgent         493   0
##  7 0.0.0.0 WiFiProxy         725   0
##  8 0.0.0.0 com.docker.vpnkit 732   0
##  9 0.0.0.0 identityservicesd 354   0
## 10 0.0.0.0 loginwindow       111   0
## # ... with more rows

The src_dbi() call wires up everything for us because d[b]plyr can rely on DBI doing its standard & consistent job, and DBI can rely on the SQLite processing crunchy goodness of osqueryi to ultimately get us a list of really dangerous (if not firewalled off) processes that are listening on all network interfaces. (Note to self: find out why the BetterTouchTool and Dropbox authors feel the need to bind to 0.0.0.0…)
FIN
What did standards and consistency get us?
- The osquery authors spent time solving a hard problem vs creating new data formats and protocols
- Rectangular data (i.e. "data frames") provides consistency and structure, which ends up creating more freedom
- "Standard" SQL enables a consistent means to work with rectangular data
- ssh normalizes (secure) access across systems with a consistent protocol
- A robust, well-defined standard mechanism for working with SQL databases enabled nigh-instantaneous wiring up of a whole new backend to R
- ssh and sys sharing common idioms made working with the new backend on remote systems as easy as it is on a local system
- Another robust, well-defined modern mechanism for working with rectangular data got wired up to this new backend with (pretty much) one line of code, because of the defined standard and expectation of consistency (and it works both locally and remotely)
Standards and consistency are pretty darned cool.
Extracting Specific Lines from a Large (Compressed) Text File
(This article was first published on R on Yixuan's Blog, and kindly contributed to R-bloggers)
A few days ago a friend asked me the following question: how do you efficiently extract some specific lines from a large text file, possibly compressed by gzip? He mentioned that he tried some R functions such as read.table(skip = …), but found that reading the data was too slow. Hence he was looking for some alternative ways to extract the data.
This is a common task in preprocessing large data sets, since in data exploration we very often want to peek at a small subset of the whole data to gain some insights.
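As a rough sketch of one such alternative (the helper below and its names are mine, for illustration, not from the original post): stream the file through a gzfile() connection in chunks, keeping only the requested line numbers, so the whole file never has to sit in memory.

# Sketch: pull specific (1-based) line numbers from a possibly gzipped file.
extract_lines <- function(path, wanted, chunk_size = 100000L) {
  con <- gzfile(path, open = "r")   # also reads plain, uncompressed files
  on.exit(close(con))
  wanted <- sort(wanted)
  out <- character(length(wanted))
  offset <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break  # end of file
    hit <- wanted > offset & wanted <= offset + length(chunk)
    out[hit] <- chunk[wanted[hit] - offset]
    offset <- offset + length(chunk)
  }
  out
}

# e.g. extract_lines("big_data.txt.gz", c(1000L, 250000L))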
RStudio:addins part 3 – View objects, files, functions and more with 1 keypress
(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)
Introduction
In this post in the RStudio:addins series we will try to make our work more efficient with an addin for better inspection of objects, functions and files within RStudio. RStudio already has a very useful View function and a Go To Function / File feature with F2 as the default keyboard shortcut. And yes, I know I promised automatic generation of @importFrom roxygen tags in the previous post; unfortunately we will have to wait a bit longer for that one, but I believe this addin more than makes up for it in usefulness.
The addin we will create in this article will let us use RStudio to View and inspect a wide range of objects, functions and files with one keypress.
Contents
- Retrieving objects from sys.frames
- Viewing files, objects, functions and more efficiently
- The addin function, updating the .dcf file and key binding
- The addin in action
- TL;DR – Just give me the package
- References
Retrieving objects from sys.frames
As a first step, we need to be able to retrieve the value of the object we are looking for, based on a character string, from a frame within the currently present sys.frames() of our session. This may get tricky, as it is not sufficient to only look at parent frames, because we may easily have multiple sets of "parallel" call stacks, especially when executing addins.
An example can be seen in the following screenshot, where we have a browser() call executed during the addin execution itself. We can see that our current frame is 18, and browsing through its parents would get us to frames 17 -> 16 -> 15 -> 14 -> 0 (0 being the .GlobalEnv). The object we are looking for is however most likely in one of the other frames (9 in this particular case):
getFromSysframes <- function(x) {
  if (!(is.character(x) && length(x) == 1 && nchar(x) > 0)) {
    warning("Expecting a non-empty character of length 1. Returning NULL.")
    return(invisible(NULL))
  }
  # search all frames except the current one, then fall back to .GlobalEnv
  validframes <- c(sys.frames()[-sys.nframe()], .GlobalEnv)
  res <- NULL
  for (i in validframes) {
    inherits <- identical(i, .GlobalEnv)
    res <- get0(x, i, inherits = inherits)
    if (!is.null(res)) {
      return(res)
    }
  }
  return(invisible(res))
}

Viewing files, objects, functions and more efficiently
As a second step, we write a function to actually view our object in RStudio. We have quite some flexibility here, so as a first shot we can do the following:
- Open a file if the selection (or the selection with quotes added) is a path to an existing file. This is useful for viewing our scripts, data files, etc., even if they are not quoted, such as the links in your Rmd files.
- Attempt to retrieve the object by name and, if found, try to use View to view it.
- If we did not find the object, we can optionally still try to retrieve the value by evaluating the provided character string. This carries some pitfalls, but is very useful, for example, for:
  - viewing elements of lists, vectors, etc., where we need to evaluate [, [[ or $ to do so
  - viewing operation results directly in the viewer, as opposed to writing them out into the console; useful for example for wide matrices that (subjectively) look better in the RStudio viewer, compared to the console output
- If the View fails, we can still show useful information by trying to View the object's structure, enabling us to inspect objects that cannot be coerced to a data.frame and would therefore fail to be viewed (a simplified sketch follows).
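A simplified sketch of what such a viewObject() could look like (this is my illustrative reconstruction of the steps above, not the package's actual code):

# Illustrative sketch of viewObject(); not the actual package implementation.
viewObject <- function(x) {
  unquoted <- gsub("^[\"']|[\"']$", "", x)
  # 1. open existing files directly
  if (file.exists(unquoted)) {
    return(rstudioapi::navigateToFile(unquoted))
  }
  # 2. try to retrieve the object by name from the available frames
  obj <- getFromSysframes(x)
  # 3. optionally, fall back to evaluating the string, e.g. lst[["el"]] or df$col
  if (is.null(obj)) {
    obj <- tryCatch(eval(parse(text = x)), error = function(e) NULL)
  }
  if (is.null(obj)) return(invisible(NULL))
  # 4. View the object; if that fails, View its structure instead
  tryCatch(View(obj, title = x),
           error = function(e) View(utils::capture.output(str(obj)), title = x))
}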
This function can of course be improved and updated in many ways, for example using the summary method instead of str for selected object classes, or showing contents of .csv (or other data) files already read into a data.frame.
The addin function, updating the .dcf file and key binding
If you followed the previous posts in the series, you most likely already know what is coming up next. First, we need a function serving as a binding for the addin that will execute our viewObject function on the active document's selections:
viewSelection <- function() {
  context <- rstudioapi::getActiveDocumentContext()
  lapply(X = context[["selection"]],
         FUN = function(thisSel) {
           viewObject(thisSel[["text"]])
         })
  return(invisible(NULL))
}

Secondly, we update the inst/rstudio/addins.dcf file by adding the binding for the newly created addin:
Name: viewSelection
Description: Tries to use View to View the object defined by a text selected in RStudio
Binding: viewSelection
Interactive: false

Finally, we reinstall the package and assign the keyboard shortcut in the Tools > Addins > Browse Addins... > Keyboard Shortcuts... menu. Personally I assigned a single F4 keystroke for this, as I use it very often:
The addin in action
Now, let's view a few files, a data.frame, a function and a try-error class object just by pressing F4.
TL;DR – Just give me the package
- get the status of the package after this article
- or use git clone from https://gitlab.com/jozefhajnala/jhaddins.git
References
- Environments chapter of Advanced R
- Using RStudio's Data Viewer
Tips for great graphics
(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)
R is a great program for generating top-notch graphics, but to get the best out of it you need to put in a little more work. Here are a few tips for adapting your R graphics to make them look a little better.
1) Don't use the "File/Save as…" menu.
If you set up your graphic properly in the first place, then there's no need to post-process (e.g. crop, scale, etc.) the graphic in other software.
Use the graphics devices (jpeg(), tiff(), postscript(), etc.), set your height and width to whatever you want the finished product to be, and then create the graph.
tiff("~/Manuscript/Figs/Fig1.tiff", width =2, height =2, units ="in", res = 600) plot(dist ~ speed, cars) # cars is a base R dataset  data(cars) dev.off()The first argument to a graphic device such as tiff or jpeg is the filename, to which you can include the path. So that I dont have to worry about the order of arguments, I include the argument name. Width and height specify the width and height of the graphic in the units provided, in this case inches, but pixels, cm or mm can also be used. The resolution tells R how higher quality the graphic should be, the higher the better, but if you go too high, you could find that you have problems running out of space. I find 600 a nice compromise. Nice crisp lines, smooth curves and sharp letters. You can then import that straight into MS Word or whatever other word processor you use, or upload to go with that manuscript youve worked so hard on. Although, you could find yourself with BIG files. A 4×6 inch figure I made recently was 17.1MB! And for the web, 100 or 200 is probably enough.
This technique also provides you with the same output every time, which is not the case if you adjust the size of the default device window produced by plot().
2) Don't be afraid to change the default settings!
Personally, I find that a 1 inch margin at the base of the graphic is a bit generous. I also find the ticks a bit long and the gap between tick and the axis label a bit big. Luckily, these things are easy to change!
jpeg("t1.jpeg", width=3, height=3, units="in", res=100) plot(dist~speed, cars) dev.off()The above code produces this figure.
See what I mean about the margins?
Here's how to change it!
par(mai = c(0.5, 0.5, 0.1, 0.1))
plot(dist ~ speed, cars, tck = 0.01, las = 1, mgp = c(1.4, 0.2, 0))

That call to par changes the "MArgin in Inches" setting. The tck argument deals with TiCK length (negative for outside, positive for inside) while mgp controls on which line certain things are printed (titles are the first argument, labels are the second and the axis itself is the third). The las argument controls the rotation of the labels (1 for horizontal, 2 for perpendicular and 3 for vertical).
This leads me nicely to number 3: Don't be afraid to have separate lines for different parts of your plot.
This allows far more freedom and flexibility than setting par arguments in the plot call. You can have different mgp settings for each axis, for instance.
par(mai = c(0.4, 0.5, 0.1, 0.1))
plot(dist ~ speed, cars, xaxt = "n", mgp = c(1.4, 0.2, 0), las = 1, tck = 0.01)
axis(side = 1, tck = 0.01, las = 1, mgp = c(0.5, 0.2, 0))
mtext("speed", side = 1, line = 1)

This plots the same graph, but allows different distances for the x and y axes, in terms of margins and where the title is situated. The axis function places an axis on the side determined by its side argument, and mtext places Margin TEXT, again at the side given in its argument and in this case on line 1.
How to plot with patchwork
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
INTRODUCTION
The goal of patchwork is to make it simple to combine separate ggplots into the same graphic. As such it tries to solve the same problem as gridExtra::grid.arrange() and cowplot::plot_grid(), but using an API that incites exploration and iteration.
Installation
You can install patchwork from github with:
# install.packages("devtools")
devtools::install_github("thomasp85/patchwork")
The usage of patchwork is simple: just add plots together!
library(ggplot2)
library(patchwork)
p1 <- ggplot(mtcars) + geom_point(aes(mpg, disp))
p2 <- ggplot(mtcars) + geom_boxplot(aes(gear, disp, group = gear))
p1 + p2
You are of course free to also add the plots together as part of the same plotting operation:
ggplot(mtcars) +
geom_point(aes(mpg, disp)) +
ggplot(mtcars) +
geom_boxplot(aes(gear, disp, group = gear))
Layouts can be specified by adding a plot_layout() call to the assemble. This lets you define the dimensions of the grid and how much space to allocate to the different rows and columns:
p1 + p2 + plot_layout(ncol = 1, heights = c(3, 1))
If you need to add a bit of space between your plots you can use plot_spacer() to fill a cell in the grid with nothing
p1 + plot_spacer() + p2
You can make nested plot layouts by wrapping part of the plots in parentheses – in these cases the layout is scoped to the different nesting levels:
p3 <- ggplot(mtcars) + geom_smooth(aes(disp, qsec))
p4 <- ggplot(mtcars) + geom_bar(aes(carb))
p4 + {
p1 + {
p2 +
p3 +
plot_layout(ncol = 1)
}
} +
plot_layout(ncol = 1)
Advanced features
In addition to adding plots and layouts together, patchwork defines some other operators that might be of interest. - will behave like + but put the left and right side in the same nesting level (as opposed to putting the right side into the left side's nesting level). Observe:
p1 + p2 + p3 + plot_layout(ncol = 1)
This is basically the same as without braces (just like standard math arithmetic) – the plots are added sequentially to the same nesting level. Now look:
p1 + p2 - p3 + plot_layout(ncol = 1)
Now p1 + p2 and p3 are on the same level.
Often you are interested in just putting plots beside or on top of each other. patchwork provides both | and / for horizontal and vertical layouts respectively. They can of course be combined for a very readable layout syntax:
(p1 | p2 | p3) /
p4
There are two additional operators that are used for a slightly different purpose, namely to reduce code repetition. Consider the case where you want to change the theme for all plots in an assemble. Instead of modifying all plots individually, you can use & or * to add elements to all subplots. The two differ in that * will only affect the plots on the current nesting level:
(p1 + (p2 + p3) + p4 + plot_layout(ncol = 1)) * theme_bw()
whereas & will recurse into nested levels:
p1 + (p2 + p3) + p4 + plot_layout(ncol = 1) & theme_bw()
Note that parentheses are required in the former case due to the higher precedence of the * operator. The latter case is the most common, so it has been given the easiest use.
Related exercise sets:
- How to Display Multivariate Relationship Graphs With Lattice
- Lattice Exercises
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory
Programmatically creating text output in R – Exercises
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
In the age of R Markdown and Shiny, or when making any custom output from your data, you want your output to look consistent and neat. Often you also want your output to obtain a specific (decorative) format defined by the HTML or LaTeX engine. These exercises are an opportunity to refresh our memory on functions such as paste, sprintf, formatC and others that are convenient tools to achieve these ends. All of the solutions rely partly on the ultra-flexible sprintf(), but there are no doubt many ways to solve the exercises with other functions; feel free to share your solutions in the comment section.
Example solutions are available here.
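Before diving in, here is a quick, illustrative refresher on the main tools (these snippets are warm-ups of my own, not solutions to the exercises below):

sprintf("$%.2f", 14.3409)            # "$14.34"       fixed decimal places
sprintf("file_%03d.txt", 25)         # "file_025.txt" zero-padded width
formatC(25, width = 3, flag = "0")   # "025"          an alternative padding tool
paste0("<h1>", "A title", "</h1>")   # "<h1>A title</h1>"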
Exercise 1
Print out the following vector as prices in dollars (to the nearest cent):
c(14.3409087337707, 13.0648270623048, 3.58504267621646, 18.5077076398145, 16.8279241011882). Example: $14.34
Exercise 2
Using these numbers c(25, 7, 90, 16) make a vector of filenames in the following format: file_025.txt. That is, left pad the numbers so they are all three digits.
Exercise 3
Actually, if we are only dealing with numbers less than a hundred, file_25.txt would have been enough. Change the code from the last exercise so that the padding is programmatically decided by the biggest number in the vector.
Exercise 4
Print out the following haiku on three lines, right aligned, with the help of cat. c("Stay the patient course.", "Of little worth is your ire.", "The network is down.").
Exercise 5
Write a function that converts a number to its hexadecimal representation. This is a useful skill when converting bmp colors from one representation to another. Example output:
tohex(12)
[1] "12 is c in hexadecimal"

Exercise 6
Take a string and programmatically surround it with the HTML header tag h1.
Exercise 7
Back to the poem from exercise 4: let R convert it to an HTML unordered list, so that it would appear like the following in a browser:
 Stay the patient course.
 Of little worth is your ire.
 The network is down.
Exercise 8
Here is a list of the current top-rated movies on imdb.com: c("The Shawshank Redemption", "The Godfather", "The Godfather: Part II", "The Dark Knight", "12 Angry Men", "Schindler's List"). Convert them into a list compatible with written text.
Example output:
[1] "The top ranked films on imdb.com are The Shawshank Redemption, The Godfather, The Godfather: Part II, The Dark Knight, 12 Angry Men and Schindler's List"
Exercise 9
Now you should be able to solve this quickly: write a function that converts a proportion to a percentage and takes as input the number of decimal places. An input of 0.921313 with 2 decimal places should return "92.13%".
Exercise 10
Improve the function from the last exercise so the percentage consistently takes 10 characters by doing some left padding. Raise an error if the percentage already happens to be longer than 10 characters.
(Image by Daniel Friedman).
Related exercise sets:
- Parallel Computing Exercises: Foreach and DoParallel (Part-2)
- Parallel Computing Exercises: Snow and Rmpi (Part-3)
- Building Shiny App exercises part 4
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory
Slopegraphs and R – A pleasant diversion – May 26, 2018
(This article was first published on Chuck Powell, and kindly contributed to R-bloggers)
I try to at least scan the R-bloggers
feed every day. Not every article is of interest to me, but I often have
one of two different reactions to at least one article. Sometimes it is
an “ah ha” moment because the article is right on point for a problem
I have now or have had in the past and the article provides a (better)
solution. Other times my reaction is more of an “oh yeah”, because it
is something I have been meaning to investigate, or something I once
knew, but the article brings a different perspective to it.
The second case happened to me this week. I’ve been aware of slopegraphs
and bumpcharts for quite some time, and I certainly am aware of Tufte’s
work. As an amateur military
historian I’ve always loved, for example, his
poster depicting Napoleon’s
Russian Campaign. So when I saw the article from Murtaza
Haider titled
“Edward Tufte’s Slopegraphs and political fortunes in Ontario” I
just had to take a peek and revisit the topic.
The article does a good job of looking at slopegraphs in both R (via
plotrix) and Stata, even providing the code to do the work. My
challenge was that even though I’m adequate at plotting in base R, I
much prefer using ggplot2 wherever and whenever possible. My memory
was that I had seen another article on the related topic of a
bumpchart on Rbloggers in the not too distant past. A little
sleuthing turned up this earlier
article from Dominik
Koch who wrote some code to
compare national performance at the Winter Olympics, “Bump Chart –
Track performance over time”.
Finally, I wound up at this Github
repository for a project called
“Edward TufteInspired Slopegraphs” from Thomas J.
Leeper who has been building code to make
slopegraphs using both base plotting functions and ggplot2.
My post today will draw a little bit from all their work and hopefully
provide some useful samples for others to draw on if they share some of
my quirks about data layout and a preference for ggplot2 versus base
plot. I’m going to focus almost exclusively on slopegraphs, although
much of the work could be extended to bumpcharts as well.
We’re going to make occasional use of dplyr to manipulate the data,
extensive use of ggplot2 to do the plotting and ggrepel to solve one
specific labeling problem. We’ll load them and I am suppressing the
message from dplyr about namespace overrides.
The original
post
is about plotting the data from some polling results in Ontario. For the
reader’s convenience I’ve made the data available via a structure
command. We have data about two different polling dates, for 5 political
parties, and the measured variable is percent of people supporting
expressed as x.x (i.e. already multiplied by
100).
Let’s just take the data as we have it and feed it to ggplot in a nice
simple fashion and see what we get with very little effort.
The nice thing about ggplot is once you get used to the syntax it
becomes very “readable”. We’ve identified our dataset, the x & y
variables and our grouping variable. Lines too big? An adjustment to
size = 2 does it. Don’t like colors? Pull the color = Party clause.
So we’re already pretty close to what we need. Things are scaled
properly and the basic labeling of titles etc. is accomplished. Our
biggest “problem” is that ggplot has been a little too helpful and
adding some things we’d like to remove to give it a more “Tuftesque”
look. So what we’ll do in the next few steps is add lines of code – but
they are mainly designed to remove unwanted elements. This is in
contrast to a base plot where we have to write the code to add elements.
So let's (each change is sketched in code after this list):
- Move the x axis labels to the top with scale_x_discrete(position = "top")
- Change to a nice clean black and white theme with theme_bw()
- Not display any legend(s) with theme(legend.position = "none")
- Remove the default border from our plot with theme(panel.border = element_blank())
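Putting those four additions together, the plotting code grows roughly as follows (again a sketch, using the same assumed data frame and column names as above):

# Sketch: the same plot with the clean-up additions described above.
library(ggplot2)

ggplot(data, aes(x = Date, y = Pct, group = Party, colour = Party)) +
  geom_line(size = 2) +
  geom_point(size = 4) +
  scale_x_discrete(position = "top") +     # x axis labels on top
  theme_bw() +                             # clean black and white theme
  theme(legend.position = "none",          # no legend
        panel.border = element_blank())    # no default border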
Nice progress! Continuing to remove things that can be considered
“clutter” we add some additional lines that all end in
element_blank() and are invoked to remove default plot items such as
the plot grid, the y axis text, etc.
Very nice! We’re almost there! The “almost” is because now that we
have removed both the legend and all scales and tick marks we no longer
know who is who, and what the numbers are! Plus, I’m a little unhappy
with the way the titles are formatted, so we’ll play with that. Later,
I’ll get fancy but for now let’s just add some simple text labels on
the left and right to show the party name and their percentage. The code
geom_text(aes(label = Party)) will place the party name right on top
of the points that anchor either end of the line. If we make that
geom_text(aes(label = paste0(Party, "  ", Pct, "%"))) then we’ll get
labels that have both the party and the percent all neatly formatted,
but still right on top of the points that anchor the ends of the line.
hjust controls horizontal justification so if we change it to
geom_text(aes(label = paste0(Party, "  ", Pct, "%")), hjust = 1.35)
both sets of labels will slide to the left which is exactly what we want
for the May 11 labels but not the May 18 labels. If we feed hjust a
negative number they’ll go the other way. So what we’ll do is filter the
data using the filter function from dplyr and place the left hand
labels differently than the right hand labels. While we’re at it we’ll
make it bold face font and a little larger…
Eureka! Not perfect yet but definitely looking good.
Adding complexity
I'm feeling pretty good about the solution so far, but there are three
things I'd like to make better.
1. How well will this solution work when we have more than two time periods? Need to make sure it generalizes to a more complex case.
2. As Murtaza Haider notes in his post, we'll have issues if the data points are identical or very close together. Our very neat little labels will overlap each other. In his post I believe he mentions that he manually moved them in some cases. Let's try and fix that.
3. Oh my, that's a lot of code to keep cutting and pasting, can we simplify?
To test #1 and #2 I have "invented" a new dataset called moredata.
It is fictional: it's labelled May 25th, but today is actually May
24th. I created it to add a third polling date and to make sure that
we had a chance to test what happens when we have two identical
datapoints on the same day. Notice that on May 25th the polling numbers
for the Liberals and the NDP are identical at
26.8%.
You’ll notice at the beginning of this post I loaded the ggrepel
library. ggrepel works with ggplot2 to repel things that overlap,
in this case our geom_text labels. The invocation is geom_text_repel
and it is very similar to geom_text but allows us to deconflict the
overlaps. We’ll use hjust = "left" and hjust = "right" to control
justifying the labels. We'll use a fixed nudge left and right (nudge_x =
-.45 and nudge_x = .5) to move the labels left and right off the
plotted data points and we will explicitly tell geom_text_repel to
only move the labels vertically to avoid overlap with direction = "y".
Everything else remains the same.
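The label layers then look roughly like the sketch below (the dataset name moredata comes from the text; the date strings, font settings and text size are assumed for illustration):

# Sketch of the ggrepel label layers; date strings and sizes are assumed.
library(dplyr)
library(ggplot2)
library(ggrepel)

ggplot(moredata, aes(x = Date, y = Pct, group = Party, colour = Party)) +
  geom_line(size = 2) +
  geom_point(size = 4) +
  geom_text_repel(data = filter(moredata, Date == "11-May-18"),
                  aes(label = paste0(Party, " - ", Pct, "%")),
                  hjust = "left", nudge_x = -.45, direction = "y",
                  fontface = "bold", size = 4) +
  geom_text_repel(data = filter(moredata, Date == "25-May-18"),
                  aes(label = paste0(Party, " - ", Pct, "%")),
                  hjust = "right", nudge_x = .5, direction = "y",
                  fontface = "bold", size = 4)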
Very nice! We have confirmed that our solution works for more than two
dates without any additional changes and we have found a solution to the
label overlap issue. In a little while we’ll talk about labeling the
data points in the center (if we want to).
Before we move on let’s make our life a little simpler. While the output
plot is good it’s a lot of code to produce one graph. Let’s see if we
can simplify…
Since ggplot2 objects are just regular R objects, you can put them in a
list. This means you can apply all of R’s great functional programming
tools. For example, if you wanted to add different geoms to the same
base plot, you could put them in a list and use lapply().
But for now let’s at least take all the invariant lines of code and put
them in a list. Then when we go to plot we can just invoke the list and
remain confident we get the right formatting. For now let’s name this
list something quaint and obvious like MySpecial.
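The idea is sketched below; since adding a list of components to a ggplot object just adds each element in turn, the list (its exact contents here are illustrative) can hold all the invariant theme and scale pieces:

# Sketch: collect the invariant plot components in a list (contents illustrative).
library(ggplot2)

MySpecial <- list(
  scale_x_discrete(position = "top"),
  theme_bw(),
  theme(legend.position = "none",
        panel.border = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank())
)

# Later plots then reduce to something like:
# ggplot(moredata, aes(x = Date, y = Pct, group = Party)) + geom_line() + MySpecial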
MySpecial is actually an incredibly complex structure so I used the
summary function. What’s important to us is that in the future all we
need to do is include it in the ggplot command and magic happens.
Perhaps another day I’ll make it a proper function but for now I can
change little things like line size or titles and labels without
worrying about the rest. So here it is with some little things changed.
Feeling good about the solution so far I decided to press on to a much
more complex problem. Thomas J. Leeper has
a nice plot of Tufte’s Cancer survival
slopegraph
N.B. that the original Tufte is not accurate on the vertical scale.
Look at Prostate and Thyroid for example since visually I would argue
they should cross to reflect the data.
Let’s grab the data as laid out by
Tufte.
            Year.5   Year.10   Year.15   Year.20
Prostate        99        95        87        81
Thyroid         96        96        94        95
Testis          95        94        91        88
Melanomas       89        87        84        83
Breast          86        78        71        75
Hodgkin's       85        80        74        67
Uterus          84        83        81        79
Urinary         82        76        70        68
Cervix          71        64        63        60
Larynx          69        57        46        38
There, we have it in a neat data frame but not organized as we need it.
Not unusual, and an opportunity to use some other tools from broom and
reshape2. Let’s do the following:
- Transpose the data with t.
- Use broom::fix_data_frame to get valid column names and convert rownames to a proper column, all in one function. Right now the types of cancer are nothing but rownames.
- Use reshape2::melt to take our transposed dataframe and convert it to long format so we can send it off to ggplot. Along the way we'll name the resulting dataframe newcancer, with columns named Year, Type and Survival (sketched in code right after this list).
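A compact sketch of those steps (the wide data frame above is assumed to be named cancer; the exact calls are illustrative):

# Sketch of the reshaping pipeline; the input object name `cancer` is assumed.
library(reshape2)

transposed <- broom::fix_data_frame(t(cancer), newcol = "Year")
newcancer <- reshape2::melt(transposed, id.vars = "Year",
                            variable.name = "Type",
                            value.name = "Survival")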
Now we have whipped the data into the shape we need it. 96 rows with the
three columns we want to plot, Year, Type, and Survival. If you
look at the data though, you’ll notice two small faults. First, Year
is not a factor. The plot will work but have an annoying limitation.
Since “Year.5” is a character string it will be ordered after all the
other years. We could fix that on the fly within our ggplot call but I
find it cleaner and more understandable if I take care of that first.
I’ll use the factor function from base R to accomplish that and
while I’m at it make the values nicer looking. Second in three cases R
changed cancer type names because they couldn’t be column names in a
dataframe. I’ll use forcats::fct_recode to make them look better.
Now that we have the data the way we want it we can make our slopegraph.
Some of the necessary changes are obvious x = Year, y = Survival and
group = Type for example. Since there are a lot of plotted lines I’ve
reduced the weight or size of the individual lines. We no longer want to
plot the big round points, we’re going to substitute in the actual
numbers, so that line gets commented out. The left and right labels
require no change and geom_text_repel will keep them from overlapping
which is almost inevitable given the data. To put the actual survival
numbers on the plot we’ll turn to geom_label. It’s like geom_text
only it puts a label box around the text. We’ll choose a smallish size,
minimize the amount of padding, and make the border of the box
invisible. The end result is what we want. It overlays on top of the
lines we’ve already plotted and the invisible padding gives us just
enough room.
I hope you’ve found this useful. I am always open to comments,
corrections and suggestions.
Chuck (ibecav at gmail dot
com)
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Safe Disposal of Unexploded WWII Bombs
(This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)
Abstract
Unexploded WWII bombs are ticking threats despite being dropped more than 70 years ago. In this post we explain how statistical methods are used to plan the search for and disposal of unexploded WWII bombs. In particular we consider and exemplify the non-parametric nearest neighbour distance (NND) method implemented in the R package highriskzone. The method analyses the spatial pattern of exploded bombs to determine so-called risk zones, that is, regions with a high likelihood of containing unexploded bombs. The coverage of such risk zones is investigated through both non-parametric and parametric point process simulation.
NCAP aerial photo from 1944 showing the bombing of the V2 rocket facility at Peenemünde, Germany. Image is available under a custom NCAP license – higher resolution images are available from NCAP.
This work is licensed under a Creative Commons AttributionShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.
\[
\newcommand{\bm}[1]{\boldsymbol{\mathbf{#1}}}
\]
During WWII, Germany was pounded with about 1.5 million tons of bombs, of which about 10-15% did not explode. More than 70 years after the end of WWII these unexploded bombs (UXBs) still pose a threat and are the frequent cause of large-scale evacuations to secure their safe disposal when found. Luckily, lethal incidents are rare thanks to a huge effort to localise and safely dismantle UXBs. As part of this effort, aerial photos taken by the allies after the attacks provide valuable information about the possible locations of UXBs. Some UXBs are directly visible in the photos – see for example the green circles in this NCAP image or p. 6 in the following information flyer by one of the companies offering such aerial identification services (featured in this news article). In other cases the photos only provide information about the location of the exploded bombs. This information can be used to identify areas where there is a high likelihood of UXBs. Such areas would then be carefully scrutinized using on-the-ground search methods, for example, electromagnetic and magnetic detectors.
The aim of Mahling, Höhle, and Küchenhoff (2013) was to develop statistical methods for the identification of such risk zones in cooperation with Oberfinanzdirektion Niedersachsen, which supports the removal of unexploded ordnance on federal properties in Germany. In what follows we will discuss one of the methods used in the paper, the so-called nearest neighbour distance method, and illustrate its implementation in the R package highriskzone, originally created by Heidi Seibold and now maintained by Felix Günther.
Mathematical Setup
Casting matters into mathematical notation: let \(X\) be a point process denoting the spatial locations of all bombs dropped in the particular window of interest \(W \subseteq \mathbb{R}^2\). Furthermore, let \(Y\) denote the observed point process of exploded bombs and \(Z = X \backslash Y\) the point process of unexploded bombs. Note that only the process \(Y\) is observed; \(Z\) is not observed and is the target of our inference. We assume that the probability \(q\) of a dropped bomb not exploding is homogeneous over \(W\). Thus if \(X\) is an inhomogeneous Poisson point process with intensity function \(\lambda_X(\bm{s})\), \(\bm{s}\in W\), then
\[
\lambda_Y(\bm{s}) = (1-q) \lambda_X(\bm{s})
\quad \text{and}\quad
\lambda_Z(\bm{s}) = q \lambda_X(\bm{s})
\]
are the intensity functions of \(Y\) and \(Z\), respectively.
The figure below shows \(Y\) for an actual observed WWII bomb point pattern consisting of n = 443 bombs, available from the R package highriskzone (Seibold et al. 2017). The observation window contains a particular area of interest for which a risk assessment needs to be done – often these contain a known WWII military target, e.g., an airport, an arms factory or a military casern. In order not to disclose the exact location of the considered area, coordinates are given relative to an arbitrary origin. In reality one would closely link such digitized aerial images to other terrain features using a GIS system.
library(highriskzone)
library(spatstat)

data(craterA)
summary(craterA)
## Planar point pattern: 443 points
## Average intensity 0.0001082477 points per square unit
##
## Coordinates are given to 4 decimal places
##
## Window: polygonal boundary
## single connected closed polygon with 208 vertices
## enclosing rectangle: [0, 2334.3758] x [0, 2456.4061] units
## Window area = 4092470 square units
## Fraction of frame area: 0.714

The point pattern craterA, corresponding to an instance of the process \(Y\), is provided in R as an object of class ppp from the R package spatstat (Baddeley, Rubak, and Turner 2015). Instead of inferring the locations in \(Z\) directly, we shall be interested in determining a region within \(W\), a so-called high-risk zone, which with a high likelihood contains the points of \(Z\). We shall consider two methods for this job: a non-parametric method based on nearest neighbour distances in \(Y\), and an intensity function based inhomogeneous Poisson process approach incorporating \(q\).
High-Risk Zones
A heuristic way to determine a high-risk zone is the following: determine the distribution function \(D\) of the nearest neighbour distance (NND) distribution based on the 443 points in the point pattern, and use it to determine a desired quantile, say \(0 \leq p \leq 1\), of the NND distribution. Denoting the \(p\)-quantile of the NND distribution by \(Q(p)\), a \(p\)-quantile NND based high-risk zone is then obtained as the union of the discs of radius \(Q(p)\) placed around each observed exploded bomb in \(Y\) – in mathematical terms:
\[
R_p = \left(\bigcup_{\bm{y} \in Y} B(\bm{y}, Q(p))\right) \bigcap W =
\left\{\bm{s} \in W : \min_{\bm{y}\in Y} \|\bm{s} - \bm{y}\| \leq Q(p) \right\},
\]
where \(B(\bm{s}, r)\) denotes a disc of radius \(r\) around the point \(\bm{s}\), and \(\|\bm{s} - \bm{y}\|\) is the distance between the two points \(\bm{s}\) and \(\bm{y}\). The intersection with \(W\) is done in order to guarantee that the risk zone lies entirely within \(W\). As an example, we would determine the 99%-quantile NND zone for craterA using spatstat functionality as follows:
(Qp <- quantile(nndist(craterA), p = 0.99, type = 8))
## 99%
## 142.1391

dmap <- distmap(craterA)
zone_dist <- eval.im(dmap < Qp)

The above can also be done directly using the highriskzone::det_hrz function:
zone_dist <- det_hrz(craterA, type = "dist", criterion = "indirect", cutoff = 0.99)

Either way, the resulting risk zone is as follows:
summary(zone_dist)
## high-risk zone of type dist
## criterion: indirect
## cutoff: 0.99
##
## threshold: 142.1391
## area of the high-risk zone: 2574507

Mahling, Höhle, and Küchenhoff (2013) show that risk zones constructed by the NND method work surprisingly well despite lacking a clear theoretical justification. One theoretical issue is, for example, that the NND distribution function determined by the above method is for the \((1-q)\)-thinned process \(Y\), even though the actual interest is in the process \(X = Y \cup Z\). Because of the thinning one would typically have \(D_Y(r) \leq D_X(r)\) and thus \(Q_Y(p) > Q_X(p)\). Using \(Q_Y(p)\) to make statements about \(X\) (and thus \(Z\)) is therefore slightly wrong. However, this error cancels, because we then use the points in \(Y\) to add a buffer of radius \(Q_Y(p)\). Had we instead used the smaller, but true, \(Q_X(p)\), the risk zone would have become too small, because \(X\) would also have contained more points to form discs around than \(Y\). The method thus implicitly takes \(q\) non-parametrically into account, because its NND is determined based on \(Y\) and subsequently discs of radius \(Q_Y(p)\) are formed around the points of \(Y\).
Technical details you might want to skip: the above feature is most easily illustrated if \(X\) is a homogeneous Poisson process with intensity \(\lambda_X\). In this case the NND distribution function is (p. 68, Illian et al. 2008)
\[
D_X(r) = 1 - \exp(-\lambda_X \pi r^2), \quad r>0.
\]
Also note that \(D_Y(r) = 1 - \exp(-(1-q)\lambda_X \pi r^2)\) and therefore \(D_Y(r) \leq D_X(r)\). Now solving for the \(p\)-quantile of the NND in this homogeneous Poisson case means solving
\[
Q_Y(p) = \min_{r\geq 0} \{ D_Y(r) \geq p \} \Leftrightarrow Q_Y(p) =
\sqrt{ -\frac{\log(1-p)}{\lambda_Y \pi}}.
\]
From this it becomes clear that in the homogeneous Poisson case \(Q_Y(p)\) is a factor \(\sqrt{1/(1-q)}\) larger than \(Q_X(p)\), which is the actual target of interest.
Assessing the coverage of a risk zone
Two criteria appear immediate in order to assess the coverage of a risk zone \(R\):

1. The probability \(p_{\text{out}}\) that there will be at least one bomb outside the risk zone, i.e. \(P(N(Z \backslash R) > 0)\), where \(N(A)\) denotes the number of events in the set \(A \subseteq W\). Note: this probability depends heavily on the number of points in \(Z\): the more points there are, the higher \(p_{\text{out}}\) is. However, it reflects the idea "one miss is all it takes to get in trouble".

2. The proportion of events in \(Z\) not located in \(R\), i.e. \(N(Z \backslash R) / N(Z)\); we shall denote this criterion by \(p_{\text{miss}}\). Note: this probability takes possibly different sizes of \(Z\) into account, but also takes a more relative approach towards how many bombs are not covered by the zone.
Under the assumption of independence between whether \(Z\)-events are within or outside the risk zone, one can convert back and forth between \(p_{\text{miss}}\) and \(p_{\text{out}}\) by
\[
p_{\text{out}} = P( N( Z \backslash R) > 0) = 1 - P(N(Z
\backslash R) = 0) \approx 1 - (1-p_{\text{miss}})^{N(Z)},
\]
where in a simulation setup one would know \(Z\) and thus also \(N(Z)\). Note that for a \(p\)-quantile NND risk zone we expect \(1-p_{\text{miss}}\) to be approximately equal to \(p\). We can investigate the behaviour of risk zones according to the two above criteria through simulation: either by simply \(q\)-thinning the existing point pattern \(Y\) and then using this thinned pattern to determine a risk zone, which is then evaluated; or, alternatively, by estimating the intensity surface from \(Y\), upscaling it to get the intensity of \(X\), simulating \(X\) as an inhomogeneous Poisson point process with this intensity surface, thinning this pattern to get a simulated instance of \(Y\), constructing the risk zone based on this pattern and then evaluating the coverage of the zone (Mahling, Höhle, and Küchenhoff 2013). Note that this latter type of simulation is based on more assumptions than the non-parametric thinning approach.
We generate 1,000 realizations of \(Y\) and \(Z\) through \(q\)-thinning of the original craterA pattern while computing the coverage measures for the NND method as follows:
suppressPackageStartupMessages(library(doParallel))
registerDoParallel(cores = 4)

simulate_method <- "thinning" # "intensity" # "cond_intensity"
sim <- foreach(i = seq_len(getDoParWorkers()), .combine = rbind) %dopar% {
  tryCatch(
    eval_method(craterA, type = c("dist"), criterion = c("indirect"),
                cutoff = 0.99, numit = 250,
                simulate = simulate_method, pbar = FALSE),
    error = function(e) return(NULL))
}
## # A tibble: 1 x 5
##   p_out  p_miss `1-p_miss` p_out_derived    nZ
## 1 0.051 0.00118      0.999        0.0509  44.3

The numbers state the average p_out and p_miss over the 1,000 simulations. Furthermore, nZ denotes the average number of events in \(Z\). We see that the NND method performs even a little better than intended, because \(1-p_{\text{miss}}\) is even higher than the intended \(p\) = 99%. The probability that the risk zone misses at least one bomb is as low as 0.051. This is quite close to the above described approximate conversion from \(p_{\text{miss}}\) (0.051 vs. 0.051). Changing the simulation method for \(X\) to that of an inhomogeneous Poisson process with intensity \(1/(1-q) \cdot \hat{\lambda}_Y(\bm{s})\) yields similar results:
## # A tibble: 1 x 5
##   p_out p_miss `1-p_miss` p_out_derived    nZ
## 1  0.47 0.0143      0.986         0.511  49.6

We note that the probability of missing at least one bomb is much higher under this parametric simulation method. Only a small fraction of this is explained by \(Z\) now consisting of more points. A likely explanation is that the parametric model is only semi-adequate to describe how the point patterns form. Therefore, the new \(X\) might have a somewhat different neighbourhood distribution than anticipated.
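As a quick sanity check of the approximate conversion formula given earlier, we can plug the averages from the thinning simulation (p_miss of about 0.00118 and nZ of about 44.3) into it:

# Approximate p_out from the average p_miss and average number of UXBs
1 - (1 - 0.00118)^44.3  # ~0.051, matching the p_out_derived column above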
To compare more specifically with the intensity function based risk-zone method of Mahling, Höhle, and Küchenhoff (2013), we use a specification where the risk zones derived by the NND method and the intensity method have the same area (250 hectares).
sim_area <- foreach(i = seq_len(getDoParWorkers()), .combine = rbind) %dopar% {
  tryCatch(
    eval_method(craterA, type = c("dist", "intens"), criterion = rep("area", 2),
                cutoff = rep(2500000, 2), numit = 100,
                simulate = simulate_method, pbar = FALSE),
    error = function(e) return(NULL))
}
## # A tibble: 2 x 6
##   Type    p_out  p_miss `1-p_miss` p_out_derived area_zone
## 1 dist    0.123 0.00278      0.997         0.117  2500009.
## 2 intens  0.55  0.0172       0.983         0.539  2499994.

For this particular example we see an advantage of using the NND method, because both p_out and p_miss are much lower than for the intensity based method. Again, this might be because the intensity method rests on assumptions which, for this particular example, do not appear to be so adequate. Results in Mahling (2013) were, however, much better for this example (c.f. Tab. 2), which could be an indication that there is a problem in the highriskzone package implementation of this method?
Discussion
Being a statistician is fascinating, because the job is the entry ticket to so many diverse research fields. The proposed methods and evaluations helped the Oberfinanzdirektion obtain a quantitative framework to decide which methods to use in their routine risk assessment. Further details on the above application can be found in Mahling, Höhle, and Küchenhoff (2013) as well as in Monia's Ph.D. dissertation (Mahling 2013). Note also that the techniques are not limited to UXB detection: infer-unknown-points-from-a-thinned-process problems occur in both 1D and 2D point processes in a range of other fields, e.g., under-reporting of infectious disease locations or the calculation of animal abundance in ecology.
As a personal anecdote: when finishing the work on Mahling, Höhle, and Küchenhoff (2013) I was in transition from university to working at a public health institute. The deal was to finish the UXB work partly in spare time and partly in the new work time. To honour this I added my new workplace as a second affiliation before submitting, but as part of the institution's internal clearing procedure for the publication, I was asked by higher management to remove this affiliation again, because the work 'had nothing to do with public health'. While it's questionable whether exploding bombs really do not have a public health impact, a few months later I ended up using very similar statistical techniques to model occurred-but-not-yet-reported cases during a critical infectious disease outbreak (Höhle and an der Heiden 2014).
Literature
Baddeley, A., E. Rubak, and R. Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. London: Chapman; Hall/CRC Press.
Höhle, M., and M. an der Heiden. 2014. “Bayesian Nowcasting During the STEC O104:H4 Outbreak in Germany, 2011.” Biometrics 70 (4): 993–1002. doi:10.1111/biom.12194.
Illian, J., A. Penttinen, H. Stoyan, and D. Stoyan. 2008. Statistical Analysis and Modelling of Spatial Point Patterns. Wiley.
Mahling, M. 2013. "Determining High-Risk Zones by Using Spatial Point Process Methodology." PhD thesis, Department of Statistics, University of Munich. https://edoc.ub.uni-muenchen.de/15886/1/Mahling_Monia.pdf.
Mahling, M., M. Höhle, and H. Küchenhoff. 2013. "Determining high-risk zones for unexploded World War II bombs by using point process methodology." Journal of the Royal Statistical Society, Series C 62 (2): 181–99. doi:10.1111/j.1467-9876.2012.01055.x.
Seibold, H., M. Mahling, S. Linne, and F. Günther. 2017. highriskzone: Determining and Evaluating High-Risk Zones. https://cran.r-project.org/web/packages/highriskzone/index.html.
2018-05 Selective Raster Graphics
(This article was first published on R – Stat Tech, and kindly contributed to Rbloggers)
This report explores ways to render specific components of an R plot in raster format, when the overall format of the plot is vector. For example, we demonstrate ways to draw raster data symbols within a PDF scatter plot. A general solution is provided by the grid.rasterize function from the R package ‘rasterize’.
Paul Murrell
A Comparative Review of the RKWard GUI for R
(This article was first published on R – r4stats.com, and kindly contributed to Rbloggers)
Introduction

RKWard is a free and open source Graphical User Interface for the R software, one that supports beginners looking to point-and-click their way through analyses, as well as advanced programmers. You can think of it as a blend of the menus and dialog boxes that R Commander offers, combined with the programming support that RStudio provides. RKWard is available on Windows, Mac, and Linux.
This review is one of a series which aims to help non-programmers choose the Graphical User Interface (GUI) that is best for them. However, I do include a cursory overview of how RKWard helps you work with code. In most sections, I'll begin with a brief description of the topic's functionality and how GUIs differ in implementing it. Then I'll cover how RKWard does it.
Terminology
There are various definitions of user interface types, so here’s how I’ll be using these terms:
GUI = Graphical User Interface specifically using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So GUI users are people who prefer using a GUI to perform their analyses. They often don’t have the time required to become good programmers.
IDE = Integrated Development Environment which helps programmers write code. I do not include pointandclick style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.
Installation
The various user interfaces available for R differ quite a lot in how they're installed. Some, such as jamovi or BlueSky Statistics, install in a single step. Others install in multiple steps, such as R Commander and Deducer. Advanced computer users often don't appreciate how lost beginners can become while attempting even a single-step installation. I work at the University of Tennessee, and our HelpDesk is flooded with such calls at the beginning of each semester!
Installing RKWard on Windows is done in a single step, since its installation file contains both R and RKWard. However, Mac and Linux users have a two-step process: installing R first, then downloading RKWard, which links up to the most recent version of R that it finds. Regardless of their operating system, RKWard users never need to learn how to start R, execute the install.packages function, and then load a library. Installers for all three operating systems are available here.
The RKWard installer obtains the appropriate version of R, simplifying the installation and ensuring complete compatibility. However, if you already had a copy of R installed, depending on its version, you could end up with a second copy.
RKWard minimizes the size of its download by waiting to install some R packages until you actually try to use them for the first time. Then it prompts you, offering default settings that will get the package you need.
On Windows, the installation file is 136 megabytes in size.
Plugins
When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling section of this series of articles. Regardless of what comes built-in, it's good to know how active the development community is. They contribute "plugins" which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, BlueSky, Deducer) through moderate (jamovi) to very active (R Commander).
Currently all plugins are included with the initial installation. You can see them using the menu selection Settings > Configure Packages > Manage RKWard Plugins. There are only brief descriptions of what they do, but once installed, you can access the help files with a single click.
RKWard add-on modules are part of standard R packages and are distributed on CRAN. Their package descriptions include a field labeled "enhances: rkward". You can sort packages by that field in RKWard's package installation dialog, where they are displayed with the RKWard icon.
EARL London 2018 – Agenda announced
(This article was first published on Mango Solutions, and kindly contributed to Rbloggers)
The EARL London 2018 agenda is now available. Explore the schedule of keynotes, presentations, and lightning talks that cover a huge range of topics, including the best uses of Shiny, R in healthcare, using R for good, and R in finance. The brilliant lineup of speakers, who represent a wide range of industries, is sure to provide inspiration for R users of all levels.
We have surveyed the Mango team to find out what talks they are most looking forward to:
Lung cancer detection with deep learning in R – David Smith, Microsoft
David will be taking us through an end-to-end example of building a deep learning model to predict lung cancer from image data. Anything that helps to improve healthcare is a fascinating subject.
Using network analysis of colleague relationships to find interesting new investment managers – Robin Penfold, Willis Towers Watson
Using the tidyverse, tidygraph and ggraph, Robin has established a way to save time and money by researching the backgrounds of all institutional investors in one go, as opposed to one at a time.
As right as rain: developing a sales forecasting model for Dyson using R – Dan Erben, Dyson
We love hearing about how organisations have adopted a data-driven approach using R. Dan's talk will summarise how Dyson developed a statistical model for sales forecasting and the path they have taken to adopt it in the business.
Interpretable Machine Learning with LIME: now and tomorrow – Kasia Kulma, Aviva
Explaining models is a really valuable tool in communicating with the business. Kasia will be using breast cancer data as a specific case scenario.
Text mining in child social care – James Lawrence, The Behavioural Insight Team
James will be sharing how The Behavioural Insight Team have been able to use R’s text mining tools on unstructured data to improve child social services actions, significantly reducing the potential for harm without overburdening an already stretched social service provider.
Experience of using R in the productive environment of Boehringer Ingelheim – Matthias Trampisch, Boehringer Ingelheim
Boehringer Ingelheim have been making headway with R and Matthias will be sharing their journey.
Using R to Drive Revenue for your Online Business – Catherine Gamble, Marks and Spencer
Online business is fast-paced and competitive; Catherine will share how Marks and Spencer have been using R to gain the upper hand and make a difference to their bottom line.
Buy your tickets before 31 July to make the most of the early bird ticket prices: a full conference pass, which includes workshops, access to the conference presentations, and a ticket to the evening event at the Imperial War Museum, is only £800. Buy tickets now.
sample_n_of(): a useful helper function
(This article was first published on Higher Order Functions, and kindly contributed to Rbloggers)
Here’s the problem: I have some data with nested time series. Lots of them. It’s
like there’s many, many little datasets inside my data. There are too many
groups to plot all of the time series at once, so I just want to preview a
handful of them.
For a working example, suppose we want to visualize the top 50 American female
baby names over time. I start by adding up the total number of births for each
name, finding the overall top 50 most populous names, and then keeping just the
time series from those top names.
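If you want to follow along, here is a minimal sketch of that preparation step; it assumes the data come from the babynames package (the original post does not show this code in this excerpt):

# A minimal sketch, assuming the babynames package
library(babynames)
library(dplyr)

top50 <- babynames %>%
  filter(sex == "F") %>%
  count(name, wt = n, sort = TRUE) %>%  # total births per name
  slice(1:50)

top_names <- babynames %>%
  filter(sex == "F") %>%
  semi_join(top50, by = "name")         # keep only the top-50 time series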
Hmm, so what does this look like?
ggplot(top_names) +
  aes(x = year, y = n) +
  geom_line() +
  facet_wrap("name")

Aaack, I can't read anything! Can't I just see a few of them?
This is a problem I face frequently, so frequently that I wrote a helper
function to handle this problem: sample_n_of(). This is not a very clever
name, but it works. Below I call the function from my personal R package
and plot just the data from four names.
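Something like the following sketch (the packaged version lives in my personal package; the function itself is defined below):

# Sketch: preview four randomly chosen names
top_names %>%
  sample_n_of(4, name) %>%
  ggplot() +
    aes(x = year, y = n) +
    geom_line() +
    facet_wrap("name")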
In this post, I walk through how this function works. It’s not very
complicated: It relies on some light tidy evaluation plus one obscure dplyr
function.
As usual, let’s start by sketching out the function we want to write:
sample_n_of <- function(data, size, ...) {
  # quote the dots
  dots <- quos(...)

  # ...now make things happen...
}

where size is the number of groups to sample and ... are the column names
that define the groups. We use quos(...) to capture and quote those column
names. (As I wrote before,
quotation is how we bottle up R code so we can deploy it for later.)
For interactive testing, suppose our dataset is the time series from the top 50
names and we want data from a sample of 5 names. In this case, the values for
the arguments would be:
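# A sketch of the argument values (reconstructed from the surrounding text)
data <- top_names
size <- 5
dots <- quos(name)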
A natural way to think about this problem is that we want to sample subgroups of
the dataframe. First, we create a grouped version of the dataframe using
group_by(). The function group_by() also takes a ... argument where the
dots are typically names of columns in the dataframe. We want to take the
names inside of our dots, unquote them and plug them in to where the ...
goes in group_by(). This is what the tidy evaluation world calls
splicing.
Think of splicing as doing this:
# Demo function that counts the number of arguments in the dots
count_args <- function(...) length(quos(...))

example_dots <- quos(var1, var2, var2)

# Splicing turns the first form into the second one
count_args(!!! example_dots)
#> [1] 3
count_args(var1, var2, var2)
#> [1] 3

So, we create a grouped dataframe by splicing our dots into the group_by()
function.
There is a helper function buried in dplyr called group_indices() which
returns the grouping index for each row in a grouped dataframe.
We can randomly sample five of the group indices and keep the rows for just
those groups.
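Put together as a sketch:

# Sketch: group, get an index per row, then sample whole groups
group_ids <- top_names %>%
  group_by(name) %>%
  group_indices()

sampled_groups <- sample(unique(group_ids), 5)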
Putting these steps together, we get:
sample_n_of <- function(data, size, ...) {
  dots <- quos(...)

  group_ids <- data %>%
    group_by(!!! dots) %>%
    group_indices()

  sampled_groups <- sample(unique(group_ids), size)

  data %>%
    filter(group_ids %in% sampled_groups)
}

We can test that the function works as we might expect. Sampling 10 names
returns the data for 10 names.
We can sample based on multiple columns too. Ten combinations of names and years
should return just ten rows.
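A usage sketch (printed output omitted here):

# Sampling 10 names returns the data for 10 names
sample_n_of(top_names, 10, name)

# Ten combinations of name and year should return just ten rows
sample_n_of(top_names, 10, name, year)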
There are a few tweaks we could make to this function. For example, in my
package’s version, I warn the user when the number of groups is too large.
My version also randomly samples n of the rows when there are no grouping
variables provided.
One open question is how to handle data that’s already grouped. The function we
wrote above fails.
Is this a problem?
Here I think failure is okay, because what do we think should happen? It's not
obvious. Presumably it should randomly choose 2 of the years for each name.
Should it be the same two years? Then this should be fine.
Or, should those two years be randomly selected for each name? Then, we should
let do() handle that. do() takes some code that returns a dataframe, applies
it to each group, and returns the combined result.
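For example, a sketch of the do() route:

# Sketch: sample 2 years separately within each name via do()
top_names %>%
  group_by(name) %>%
  do(sample_n_of(., 2, year))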
I think raising an error and forcing the user to clarify their code is better
than choosing one of these options and not doing what the user expects.
Simple Spatial Modelling – Part 3: Exercises
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
So far, we have learned how to account for spatial variability in our model. Please look at the two previous exercises here and here if you haven't tried them yet. However, those only represent a 1-dimensional model. In this exercise, we will expand our spatial consideration to a 2-dimensional model.
Have a look at the plan view below for an illustration of how the 2-dimensional model will work.
The water levels are stored in a 2D array. They are numbered as follows:
The water flows are stored in 2 different 2D arrays:
1. qv: defines water flows between buckets down the screen (in plan view)
2. qh: defines water flows between buckets across the screen
Let’s get into the modelling by cracking the exercises below. Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Set all the required settings for the model (one possible setup is sketched after this list):
a. set the number of timesteps; here we use 1000 (this is how many timesteps the model will run for; you can change it as you want)
b. set the total number of cells; here we have 25 x 25 water tanks
c. set the timestep in seconds; here it is 1
d. set the time at the start of the simulation
e. set k between each water tank; here we set a uniform value of k of 0.01
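# Illustrative settings; the variable names are my own choice, not prescribed
nsteps <- 1000        # a. number of timesteps
nx <- 25; ny <- 25    # b. 25 x 25 water tanks
dt <- 1               # c. timestep in seconds
t  <- 0               # d. time at the start of the simulation
k  <- 0.01            # e. uniform k between each water tank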
Exercise 2
Create matrix H for the initial water level in the water tanks.
Exercise 3
Set the boundary conditions for the model; here we have water flow (qh) into the water tanks from three sides (top, left and right) and water flowing out at the bottom (qv; see the plan view). Water flows to the right and to the bottom are considered positive. Don't forget to declare the matrices for qh and qv (a sketch follows).
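A sketch of the declarations, under the illustrative sizes chosen above (the extra row/column holds the boundary flows):

# Illustrative declarations; the extra row/column stores boundary flows
qh <- matrix(0, nrow = ny, ncol = nx + 1)  # flows across the screen
qv <- matrix(0, nrow = ny + 1, ncol = nx)  # flows down the screen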
Exercise 4
Create model output for every 100 timesteps.
Exercise 5
Run the model by creating a loop for qh, qv, the water storage update, and the model output. (Remember the threshold loop in the previous exercise?) A sketch of one possible shape follows.
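This is only an illustration of the loop's shape under the assumed names above (flows driven by level differences), not the official solution:

# Illustrative main loop; assumes H was created in Exercise 2
for (it in seq_len(nsteps)) {
  # interior flows proportional to water level differences
  qh[, 2:nx] <- k * (H[, 1:(nx - 1)] - H[, 2:nx])
  qv[2:ny, ] <- k * (H[1:(ny - 1), ] - H[2:ny, ])
  # storage update: net inflow minus outflow for each tank
  H <- H + dt * (qh[, 1:nx] - qh[, 2:(nx + 1)] +
                 qv[1:ny, ] - qv[2:(ny + 1), ])
  t <- t + dt
  if (it %% 100 == 0) {
    # store or plot the model output every 100 timesteps
  }
}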
Exercise 6
Plot the model output using a contour plot.
 Simple Spatial Modeling – Part 1: Exercises
 Descriptive AnalyticsPart 6: Interactive dashboard ( 2/2)
 Practice Your ggplot Skills: Exercises
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
A little function to help generate ICCs in simple clustered data
(This article was first published on ouR data generation, and kindly contributed to Rbloggers)
In health services research, experiments are often conducted at the provider or site level rather than the patient level. However, we might still be interested in the outcome at the patient level. For example, we could be interested in understanding the effect of a training program for physicians on their patients. It would be very difficult to randomize patients to be exposed or not to the training if a group of patients all see the same doctor. So the experiment is set up so that only some doctors get the training and others serve as the control; we still compare the outcome at the patient level.
Typically, when conducting an experiment we assume that individual outcomes are not related to each other (other than the common effect of the exposure). With sitelevel randomization, we can’t make that assumption – groups of patients are all being treated by the same doctor. In general, even before the intervention, there might be variation across physicians. At the same time, patients within a practice will vary. So, we have two sources of variation: between practice and within practice variation that explain overall variation.
I touched on this when I discussed issues related to Gamma distributed clustered data. A key concept is the intraclass correlation coefficient, or ICC, which is a measure of how the between variation relates to the overall variation. The ICC ranges from 0 (where there is no between variation – all site averages are the same) to 1 (where there is no variation within a site – all patients within the site have the same outcomes). Take a look at the earlier post for a bit more detail.
My goal here is to highlight a little function recently added to simstudy (v0.1.9, now available on CRAN). In the course of exploring study designs for cluster randomized trials, it is often useful to understand what happens (to sample size requirements, for example) when the ICC changes. When generating the data, it is difficult to control the ICC directly – we do this by controlling the variation. With normally distributed data, the ICC is an obvious function of the variances used to generate the data, so the connection is pretty clear. But, when the outcomes have binary, Poisson, or Gamma distributions (or anything else really), the connection between variation and the ICC is not always so obvious. Figuring out how to specify the data to generate a particular ICC might require quite a bit of trial and error.
The new function, iccRE (short for ICC random effects), allows users to specify target ICCs for a desired distribution (along with relevant parameters). The function returns the corresponding random effect variances that would be specified at the cluster level to generate the desired ICC(s).
Here’s an example for three possible ICCs in the context of the normal distribution:
library(simstudy)

targetICC <- c(0.05, 0.075, 0.10)

setVars <- iccRE(ICC = targetICC, dist = "normal", varWithin = 4)

round(setVars, 4)
## [1] 0.2105 0.3243 0.4444

In the case when the target ICC is 0.075:
\[ ICC = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2} = \frac{0.324}{0.324 + 4} \approx 0.075 \]
Simulating from the normal distribution

If we specify the variance for the site-level random effect to be 0.2105, in conjunction with the individual-level (within) variance of 4, the observed ICC from the simulated data will be approximately 0.05:
set.seed(73632)

# specify between site variation
d <- defData(varname = "a", formula = 0, variance = 0.2105, id = "grp")
d <- defData(d, varname = "size", formula = 1000, dist = "nonrandom")

a <- defDataAdd(varname = "y1", formula = "30 + a", variance = 4, dist = "normal")

dT <- genData(10000, d)

# add patient level data
dCn05 <- genCluster(dtClust = dT, cLevelVar = "grp", numIndsVar = "size", level1ID = "id")
dCn05 <- addColumns(a, dCn05)
dCn05
##            grp         a size       id       y1
##        1:    1 0.3255465 1000        1 32.08492
##        2:    1 0.3255465 1000        2 27.21180
##        3:    1 0.3255465 1000        3 28.37411
##        4:    1 0.3255465 1000        4 27.70485
##        5:    1 0.3255465 1000        5 32.11814
##       ---
##  9999996: 10000 0.3191311 1000  9999996 30.15837
##  9999997: 10000 0.3191311 1000  9999997 32.66302
##  9999998: 10000 0.3191311 1000  9999998 28.34583
##  9999999: 10000 0.3191311 1000  9999999 28.56443
## 10000000: 10000 0.3191311 1000 10000000 30.06957

The between variance can be roughly estimated as the variance of the group means, and the within variance can be estimated as the average of the variances calculated for each group (this works well here, because we have so many clusters and patients per cluster):
between <- dCn05[, mean(y1), keyby = grp][, var(V1)]
within <- dCn05[, var(y1), keyby = grp][, mean(V1)]

total <- dCn05[, var(y1)]

round(c(between, within, total), 3)
## [1] 0.212 3.996 4.203

The ICC is the ratio of the between variance to the total, which is also the sum of the two component variances:
round(between/(total), 3)
## [1] 0.05

round(between/(between + within), 3)
## [1] 0.05
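As a cross-check (my addition, not in the original post), the same ICC can be recovered from a random intercept model; a sketch assuming the lme4 package is installed (and slow on data this size):

# Sketch: estimate the ICC from a random intercept model
library(lme4)
fit <- lmer(y1 ~ 1 + (1 | grp), data = dCn05)
vc <- as.data.frame(VarCorr(fit))
round(vc$vcov[1] / sum(vc$vcov), 3)  # between / (between + within)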
Setting the site-level variance at 0.4444 gives us the ICC of 0.10:

d <- defData(varname = "a", formula = 0, variance = 0.4444, id = "grp")
d <- defData(d, varname = "size", formula = 1000, dist = "nonrandom")

a <- defDataAdd(varname = "y1", formula = "30 + a", variance = 4, dist = "normal")

dT <- genData(10000, d)

dCn10 <- genCluster(dtClust = dT, cLevelVar = "grp", numIndsVar = "size", level1ID = "id")
dCn10 <- addColumns(a, dCn10)

between <- dCn10[, mean(y1), keyby = grp][, var(V1)]
within <- dCn10[, var(y1), keyby = grp][, mean(V1)]

round(between / (between + within), 3)
## [1] 0.102

Other distributions

The ICC is a bit more difficult to interpret using other distributions where the variance is a function of the mean, such as with the binomial, Poisson, or Gamma distributions. However, we can still use the notion of between and within variation, though it may need to be transformed to another scale.
In the case of binary outcomes, we have to imagine an underlying or latent continuous process that takes place on the logistic scale. (I talked a bit about this here.)
### binary

(setVar <- iccRE(ICC = 0.05, dist = "binary"))
## [1] 0.173151

d <- defData(varname = "a", formula = 0, variance = 0.1732, id = "grp")
d <- defData(d, varname = "size", formula = 1000, dist = "nonrandom")

a <- defDataAdd(varname = "y1", formula = "-1 + a", dist = "binary", link = "logit")

dT <- genData(10000, d)

dCb05 <- genCluster(dtClust = dT, cLevelVar = "grp", numIndsVar = "size", level1ID = "id")
dCb05 <- addColumns(a, dCb05)
dCb05
##            grp          a size       id y1
##        1:    1 0.20740274 1000        1  0
##        2:    1 0.20740274 1000        2  0
##        3:    1 0.20740274 1000        3  0
##        4:    1 0.20740274 1000        4  1
##        5:    1 0.20740274 1000        5  0
##       ---
##  9999996: 10000 0.05448775 1000  9999996  0
##  9999997: 10000 0.05448775 1000  9999997  1
##  9999998: 10000 0.05448775 1000  9999998  0
##  9999999: 10000 0.05448775 1000  9999999  0
## 10000000: 10000 0.05448775 1000 10000000  0

The ICC for the binary distribution is on the logistic scale, and the within variance is constant. The between variance is estimated on the log-odds scale:
within <- (pi ^ 2) / 3

means <- dCb05[, mean(y1), keyby = grp]
between <- means[, log(V1/(1-V1)), keyby = grp][abs(V1) != Inf, var(V1)]

round(between / (between + within), 3)
## [1] 0.051

The ICC for the Poisson distribution is interpreted on the scale of the count measurements, even though the random effect variance is on the log scale. If you want to see the details behind the random effect variance derivation, see this paper by Austin et al., which was based on original work by Stryhn et al. that can be found here.
(setVar < iccRE(ICC = 0.05, dist = "poisson", lambda = 30)) ## [1] 0.0017513 d < defData(varname = "a", formula = 0, variance = 0.0018, id = "grp") d < defData(d, varname = "size", formula = 1000, dist = "nonrandom") a < defDataAdd(varname = "y1", formula = "log(30) + a", dist = "poisson", link = "log") dT < genData(10000, d) dCp05 < genCluster(dtClust = dT, cLevelVar = "grp", numIndsVar = "size", level1ID = "id") dCp05 < addColumns(a, dCp05) dCp05 ## grp a size id y1 ## 1: 1 0.035654485 1000 1 26 ## 2: 1 0.035654485 1000 2 36 ## 3: 1 0.035654485 1000 3 31 ## 4: 1 0.035654485 1000 4 34 ## 5: 1 0.035654485 1000 5 21 ##  ## 9999996: 10000 0.002725561 1000 9999996 26 ## 9999997: 10000 0.002725561 1000 9999997 25 ## 9999998: 10000 0.002725561 1000 9999998 27 ## 9999999: 10000 0.002725561 1000 9999999 28 ## 10000000: 10000 0.002725561 1000 10000000 37The variance components and ICC for the Poisson can be estimated using the same approach as the normal distribution:
between <- dCp05[, mean(y1), keyby = grp][, var(V1)]
within <- dCp05[, var(y1), keyby = grp][, mean(V1)]

round(between / (between + within), 3)
## [1] 0.051

Finally, here are the results for the Gamma distribution, which I talked about at great length in an earlier post:
(setVar < iccRE(ICC = 0.05, dist = "gamma", disp = 0.25 )) ## [1] 0.01493805 d < defData(varname = "a", formula = 0, variance = 0.0149, id = "grp") d < defData(d, varname = "size", formula = 1000, dist = "nonrandom") a < defDataAdd(varname = "y1", formula = "log(30) + a", variance = 0.25, dist = "gamma", link = "log") dT < genData(10000, d) dCg05 < genCluster(dtClust = dT, cLevelVar = "grp", numIndsVar = "size", level1ID = "id") dCg05 < addColumns(a, dCg05) dCg05 ## grp a size id y1 ## 1: 1 0.09466305 1000 1 14.31268 ## 2: 1 0.09466305 1000 2 39.08884 ## 3: 1 0.09466305 1000 3 28.08050 ## 4: 1 0.09466305 1000 4 53.27853 ## 5: 1 0.09466305 1000 5 37.93855 ##  ## 9999996: 10000 0.25566417 1000 9999996 14.16145 ## 9999997: 10000 0.25566417 1000 9999997 42.54838 ## 9999998: 10000 0.25566417 1000 9999998 76.33642 ## 9999999: 10000 0.25566417 1000 9999999 34.16727 ## 10000000: 10000 0.25566417 1000 10000000 21.06282The ICC for the Gamma distribution is on the log scale:
between <- dCg05[, mean(log(y1)), keyby = grp][, var(V1)]
within <- dCg05[, var(log(y1)), keyby = grp][, mean(V1)]

round(between / (between + within), 3)
## [1] 0.05

It is possible to think about the ICC in the context of covariates, but interpretation is less straightforward. The ICC itself will likely vary across different levels of the covariates. For this reason, I like to think of the ICC in the marginal context.
I leave you with some visuals of clustered binary data with ICCs ranging from 0 to 0.075, on both the log-odds and probability scales:
Create your Machine Learning library from scratch with R ! (3/5) – KNN
(This article was first published on Enhance Data Science, and kindly contributed to Rbloggers)
This is the third post of the "Create your Machine Learning library from scratch with R!" series. Today, we will see how you can implement K nearest neighbors (KNN) using only the linear algebra available in R. Previously, we managed to implement PCA; next time we will deal with SVM and decision trees.
The K-nearest neighbors (KNN) algorithm is a simple yet efficient classification and regression method. KNN assumes that an observation will be similar to its K closest neighbors. For instance, if most of the neighbors of a given point belong to a given class, it seems reasonable to assume that the point will belong to the same class.
The mathematics of KNN

Now, let's quickly derive the mathematics used for KNN regression (they are similar for classification).

Let \(x_1, \ldots, x_n\) be the observations of our training dataset. The points are in \(\mathbb{R}^p\). We denote by \(y_i\) the variable we seek to estimate. We know its value for the training dataset.

Let \(x_{new}\) be a new point in \(\mathbb{R}^p\). We do not know \(y_{new}\) and will estimate it using our training dataset.

Let \(k\) be a positive and nonzero integer (the number of neighbors used for estimation). We want to select the \(k\) points from the dataset which are the closest to \(x_{new}\). To do so, we compute the Euclidean distances \(\lVert x_i - x_{new} \rVert_2\). From all the distances, we can compute \(r_k\), the smallest radius of a circle centered on \(x_{new}\) which includes exactly \(k\) points from the training sample.

An estimation of \(y_{new}\) is now easy to construct. It is the mean of the \(y_i\) of the \(k\) closest points to \(x_{new}\):

\[ \hat{y}_{new} = \frac{1}{k} \sum_{i \,:\, \lVert x_i - x_{new} \rVert_2 \leq r_k} y_i \]
KNN regression in R
First, we build a “my_knn_regressor” object which stores all the training points, the value of the target variable and the number of neighbors to use.
my_knn_regressor <- function(x, y, k = 5) {
  if (!is.matrix(x)) {
    x <- as.matrix(x)
  }
  if (!is.matrix(y)) {
    y <- as.matrix(y)
  }
  my_knn <- list()
  my_knn[['points']] <- x
  my_knn[['value']] <- y
  my_knn[['k']] <- k
  attr(my_knn, "class") <- "my_knn_regressor"
  return(my_knn)
}

The tricky part of KNN is computing the distances efficiently. We will use the function we created in our previous post on vectorization. The function and mathematical derivations are specified in that post.
gramMatrix <- function(X, Y) {
  tcrossprod(X, Y)
}

compute_pairwise_distance <- function(X, Y) {
  xn <- rowSums(X ** 2)
  yn <- rowSums(Y ** 2)
  outer(xn, yn, '+') - 2 * tcrossprod(X, Y)
}

Now we can build our predictor:
predict.my_knn_regressor <- function(my_knn, x) {
  if (!is.matrix(x)) {
    x <- as.matrix(x)
  }
  ## Compute pairwise distances between the new points and the training points
  dist_pair <- compute_pairwise_distance(x, my_knn[['points']])
  ## apply(dist_pair, 2, order) <= my_knn[['k']] orders the points by distance and selects the k closest points
  ## M[i, j] = 1 if x_j is one of the k closest points to x_i
  t(as.matrix(apply(dist_pair, 2, order) <= my_knn[['k']])) %*% my_knn[['value']] / my_knn[['k']]
}

The last line may seem complicated:
apply(dist_pair, 2, order) orders the points by distance
apply(dist_pair, 2, order) <= my_knn[['k']] selects the k closest points to each point in our new dataset
M = t(as.matrix(apply(dist_pair, 2, order) <= my_knn[['k']])) casts the matrix into a one-hot matrix: M[i, j] is one if x_j is one of the k closest points to x_i, and zero otherwise
M %*% my_knn[['value']] / my_knn[['k']] sums the values of the k closest points and normalises the sum by k
The previous code can be reused as is for binary classification. Your outcome should be encoded as a one-hot variable. If the estimated output is greater (resp. less) than 0.5, you can assume that your point belongs to the class encoded as one (resp. zero). We will use the classical iris dataset and classify the setosa versus the virginica species.
iris_class <- iris[iris[["Species"]] != "versicolor", ]
iris_class[["Species"]] <- as.numeric(iris_class[["Species"]] != "setosa")
knn_class <- my_knn_regressor(iris_class[, 1:2], iris_class[, 5])
predict(knn_class, iris_class[, 1:2])

Since we only used 2 variables, we can easily plot the decision boundaries on a 2D plot.
# Build grid
x_coord <- seq(min(iris_class[, 1]) - 0.2, max(iris_class[, 1]) + 0.2, length.out = 200)
y_coord <- seq(min(iris_class[, 2]) - 0.2, max(iris_class[, 2]) + 0.2, length.out = 200)
coord <- expand.grid(x = x_coord, y = y_coord)

# predict probabilities
coord[['prob']] <- predict(knn_class, coord[, 1:2])

library(ggplot2)
ggplot() +
  ## Add tiles according to probabilities
  geom_tile(data = coord, mapping = aes(x, y, fill = prob)) +
  scale_fill_gradient(low = "lightblue", high = "red") +
  ## add points (shape needs a discrete variable, hence factor())
  geom_point(data = iris_class,
             mapping = aes(Sepal.Length, Sepal.Width, shape = factor(Species)),
             size = 3) +
  # add the labels to the plots
  xlab('Sepal length') +
  ylab('Sepal width') +
  ggtitle('Decision boundaries of KNN') +
  # remove grey border from the tile
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0))

And this gives us this cool plot:
Possible extensions

Our current KNN is basic, but you can improve and test it in several ways:
What is the influence of the number of neighbors? (You should see some overfitting/underfitting)
Can you implement metrics other than the Euclidean distance? Can you create kernel KNNs?
Instead of doing estimations using only the mean, could you use a more complex mapping?
Thanks for reading! To find more posts on Machine Learning, Python and R, you can follow us on Facebook or Twitter.
The post Create your Machine Learning library from scratch with R ! (3/5) – KNN appeared first on Enhance Data Science.
Introducing Datazar Paper
(This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to Rbloggers)
We’re finally here. Datazar Paper is the most exciting tool we’ve created to date. Up until now, I’d like to believe we’ve only made par with the status quo in terms of the tools we have developed for our beloved community of industry researchers, hackers and academics.
We have a vision at Datazar where researchers will eventually be able to create and consume research in one continuous cycle. We introduced tools like Replication (the ability to replicate a research project simply by clicking a button), Discussions (real-time chat where you can exchange code snippets etc.) and Metrics (research tracking and feedback on views, usage etc.). These tools help users create research with their colleagues without juggling different tools.
Datazar Paper is the tool that ties all of these together. A common workflow on Datazar is (1) uploading data, (2) creating analyses based on the data using, for example, an R notebook and (3) exporting the results/visualizations for reporting or for a research paper. That last step was always awkward to deal with because there are so many ways to go about it. At some point we created a versatile JavaScript notebook to try to tie all of it together and make reporting easier. Before that, we created a new language to simplify creating a report or paper while maintaining interactivity. Reports and papers are everywhere; the problem is they are not interactive and/or reproducible. This article* on the death of traditional papers explains it really well.
In short, we believe the traditional, static paper is not good enough anymore in today’s world to convey today’s complex ideas.
Datazar Paper Editor

Medium, yes this Medium, was a giant leap in terms of how easy it made creating articles. Datazar Paper was heavily influenced by Medium, specifically the editor. We took the same approach and created an editor that allows you to create an article/paper/report like a Lego house. No code involved whatsoever. This last bit is extremely important because the people in an organization who usually do the reporting are the people with the least experience in coding. And to be frank, you shouldn't really need code to create a simple paper or report.
We recognize that the overall transition to an interactive and easily reproducible writing process will be long so I’d like to take this opportunity to say that we still support LaTeX and MarkDown and will continue to do so. In case you didn’t know, Datazar gives you the ability to create LaTeX and MarkDown documents in your browser. In fact, LaTeX documents are the most popular files created on Datazar.
With Datazar Paper, we got one step closer to creating an ecosystem where research is created, shared and preserved automatically. Now you can store your data, analyze it using R & Python notebooks and publish everything from code to visualizations using Datazar Paper. All without switching between a dozen programs and without squeezing your CPU for computational power.
Link to Launch: https://www.producthunt.com/posts/datazarpaper
Some papers I picked out:
https://www.datazar.com/focus/f03b8705bc0ba454cafa03a7729a6c96f
https://www.datazar.com/focus/f5abc508d709140cda14bc6f2da005c14
https://www.datazar.com/focus/f649aea35953d4eb28ad4037d6c343225
*https://www.theatlantic.com/science/archive/2018/04/thescientificpaperisobsolete/556676/
Introducing Datazar Paper was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.
taxize: seven years of taxonomy in R
(This article was first published on rOpenSci  open tools for open science, and kindly contributed to Rbloggers)
taxize was seven years old this last Saturday!
What is taxize?

taxize is designed around making working with taxonomic names easier – abstracting away the details of what each of the 20 or so taxonomic data sources requires for a given use case.
A sampling of use cases covered in taxize (all of these across many different data sources; a short sketch follows the list):
 Taxonomic identifier from a taxonomic name and vice versa
 Taxonomic name from a vernacular (common) name and vice versa
 Taxonomic hierarchy/classification from identifier or name
 Taxonomic children of an identifier or name
 All taxa downstream to a certain rank from identifier or name
 Taxonomic name synonyms from identifier or name
 Lowest common taxon and rank for an identifier or name
 Resolve taxonomic names, i.e., fix spelling errors
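For instance, a couple of these calls look roughly like this (a minimal sketch; the example species is chosen arbitrarily and the output depends on the data source):

# Sketch of two common taxize calls
library(taxize)
uid <- get_uid("Chironomus riparius")  # identifier from a name (NCBI)
classification(uid)                    # taxonomic hierarchy from the id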
taxize was one of our first packages. Our first commit was on 20110519, uneventfully adding an empty README:
We’ve come a long way since May 2011. We’ve added a lot of new functionality and many new contributors.
Commit history

Get git commits for taxize using a few tidyverse packages as well as git2r, our R package for working with git repositories:
library(git2r)
library(ggplot2)
library(dplyr)

repo <- git2r::repository("~/github/ropensci/taxize")
res <- commits(repo)

A graph of commit history:
dates <- vapply(res, function(z) {
  as.character(as.POSIXct(z@author@when@time, origin = "1970-01-01"))
}, character(1))

df <- tbl_df(data.frame(date = dates, stringsAsFactors = FALSE)) %>%
  group_by(date) %>%
  summarise(count = n()) %>%
  mutate(cumsum = cumsum(count)) %>%
  ungroup()

ggplot(df, aes(x = as.Date(date), y = cumsum)) +
  geom_line(size = 2) +
  theme_grey(base_size = 16) +
  scale_x_date(labels = scales::date_format("%Y/%m")) +
  labs(x = 'May 2011 to May 2018', y = 'Cumulative Git Commits')

Contributors

A graph of new contributors through time:
date_name <- lapply(res, function(z) {
  data_frame(
    date = as.character(as.POSIXct(z@author@when@time, origin = "1970-01-01")),
    name = z@author@name
  )
})
date_name <- bind_rows(date_name)

firstdates <- date_name %>%
  group_by(name) %>%
  arrange(date) %>%
  filter(rank(date, ties.method = "first") == 1) %>%
  ungroup() %>%
  mutate(count = 1) %>%
  arrange(date) %>%
  mutate(cumsum = cumsum(count))

## plot
ggplot(firstdates, aes(as.Date(date), cumsum)) +
  geom_line(size = 2) +
  theme_grey(base_size = 18) +
  scale_x_date(labels = scales::date_format("%Y/%m")) +
  labs(x = 'May 2011 to May 2018', y = 'Cumulative New Contributors')

taxize contributors, including those that have opened issues (click to go to their GitHub profile):
afkoeppel – ahhurlbert – albnd – Alectoria – andzandz11 – antagomir – arendsee – ashenkin – ashiklom – bomeara – bw4sz – cboettig – cdeterman – ChrKoenig – chuckrp – clarson2191 – claudenozeres – cmzambranat – daattali – DanielGMead – davharris – davidvilanova – diogoprov – dlebauer – dlenz1 – dschlaep – EDiLD – emhart – fdschneider – fgabriel1891 – fmichonneau – gedankenstuecke – gimoya – GISKid – gitog – glaroc – gustavobio – ibartomeus – jangorecki – jarioksa – jebyrnes – johnbaums – jonmcalder – JoStaerk – jsgosnell – kamapu – karthik – KevCaz – kgturner – kmeverson – Koalha – ljvillanueva – Markus2015 – mcsiple – MikkoVihtakari – millerjef – miriamgrace – mpnelsen – MUSEZOOLVERT – natedolson – nmatzke – npch – paternogbc – philippi – pmarchand1 – pssguy – RodgerG – rossmounce – sariya – scelmendorf – sckott – SimonGoring – snsheth – snubian – Squiercg – tdjames1 – tmkurobe – tpaulson1 – tpoisot – vijaybarve – wcornwell – willpearse – wpetry – zacharyfoster
taxize usage

Eduard Szöcs and I wrote a paper describing taxize back in 2013, published in F1000Research.
Scott Chamberlain and Eduard Szöcs (2013). taxize – taxonomic search and retrieval in R. F1000Research 2:191. https://doi.org/10.12688/f1000research.2191.v1
The paper has probably made taxize users more likely to cite the package, though we have no direct proof of that.
The paper above and/or the package have been cited 69 times over the past 7 years.
The way taxize is used in research is often in “cleaning” taxonomic names in one way or another. In addition, many users use taxize to get taxonomic names for certain groups of interest.
One example comes from the paper
Weber, M. G., Porturas, L. D., & Taylor, S. A. (2016). Foliar nectar enhances plant–mite mutualisms: the effect of leaf sugar on the control of powdery mildew by domatiainhabiting mites. Annals of Botany, 118(3), 459–466. doi:10.1093/aob/mcw118
Features coming down the road

 Integration with the taxa package (maintained by one of our rOpenSci fellows, Zachary Foster) in all get_*() functions. This will make the outputs of all get_*() functions more consistent and easier to integrate into your downstream workflows.
 not taxize per se, but taxizedb is starting to really take shape due to help from Zebulun Arendsee. taxizedb will make taxonomic name work much faster for large datasets. It’s worth checking out.
A huge thanks goes to all taxize users and contributors. It's awesome to see how useful taxize has been through the years, and we look forward to making it even better moving forward.
Workshop in Cape Town: Web Scraping with R
(This article was first published on Digital Age Economist on Digital Age Economist, and kindly contributed to Rbloggers)
Join Andrew Collier and Hanjo Odendaal for a workshop on using R for Web Scraping.
Who should attend?

This workshop is aimed at beginner and intermediate R users who want to learn more about using R for data acquisition and management, with a specific focus on web scraping.
What will you learn?

You will learn:
 data manipulation with dplyr, tidyr and purrr;
 tools for accessing the DOM;
 scraping static sites with rvest;
 scraping dynamic sites with RSelenium; and
 setting up an automated scraper in the cloud.
See programme below for further details.
Where – Rise, Floor 5, Woodstock Exchange, 66 Albert Road, Woodstock, Cape Town
When – 14-15 June 2018
Who – Andrew Collier and Hanjo Odendaal
There are just 20 seats available. A 10% discount is available for groups of 4 or more people from a single organisation attending both days.
Email training@exegetic.biz if you have any questions about the workshop.
Programme

Day 1

 Motivating Example
 R and the tidyverse
 Vectors, Lists and Data Frames
 Loading data from a file
 Manipulating Data Frames with dplyr
 Pivoting with tidyr
 Functional programming with purrr
 Introduction to scraping
 Ethics
 DOM
 Developer Tools
 CSS and XPath
 robots.txt and site map
 Scraping a static site with rvest
 What happens under the hood
 What the hell is curl?
 Assisted Assignment: Movie information from IMDB
 Case Study: Investigating drug tests using rvest
 Interacting with APIs
 Using XHR to find an API
 Building wrappers around APIs
 Scraping a dynamic site with RSelenium
 Why RSelenium is needed
 Navigation around webpages
 Combining RSelenium with rvest
 Useful JavaScript tools
 Case Study
 Deploying a Scraper in the Cloud
 Launching and connecting to an EC2 instance
 Headless browsers
 Automation with cron
Why you should regret not going to eRum 2018?
(This article was first published on Appsilon Data Science Blog, and kindly contributed to Rbloggers)
I spent 3 amazing days at the eRum conference in Budapest. The conference was a blast, and the organizers (BIG thanks to them again) did a wonderful job putting together such a high-level event.
My favourite talks from the conference

Better than Deep Learning – Gradient Boosting Machines (GBM) in R – Szilard Pafka
Szilard gave a thorough overview of currently available ML algorithms and showed that these can work better on tabular data than Deep Learning. He gave his advice on which packages to choose depending on your goals, like maximizing speed on GPU/CPU or going to production.
My takeaway from his talk: you should choose an algorithm based on the problem you have and take into account outside constraints like interpretability. Choosing a model is one thing, but a lot of prediction improvement can come from feature engineering, so domain knowledge and problem understanding matter a lot.
Thanks to Erin LeDell's talk we know that the majority of ML tasks can be automated thanks to their awesome AutoML framework.
Harness the R condition system – Lionel Henry
Lionel talked about improving errors in R. Currently R offers error handling solely through the tryCatch function. From the presentation we learned that errors are regular objects. This makes it possible for a user to provide custom classes and error metadata, which makes it much easier to implement handling and reporting. Some of the ideas he shared will be available through a new release of the rlang package.
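To make that concrete, here is a minimal sketch in base R of a condition with a custom class and metadata (my own illustration, not code from the talk):

# Sketch: a custom error class carrying metadata (base R condition system)
my_error <- function(msg, data) {
  structure(
    class = c("my_error", "error", "condition"),
    list(message = msg, call = sys.call(-1), data = data)
  )
}

tryCatch(
  stop(my_error("bad input", data = list(offending_value = -1))),
  my_error = function(e) e$data  # handler dispatches on the custom class
)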
Show me your model 2.0! – Przemysław Biecek
Przemek, together with Mateusz, gave both a workshop and a talk about the Dalex package, which is an impressive toolkit for understanding machine learning models. Dalex is being developed by a talented group of Przemek's students in Poland. Thanks to Dalex you can produce single-variable explanations for more than one model at the same time. It's also easier to understand how a single variable influences the prediction.
You may wonder why you should use Dalex if you are already familiar with Lime, and the answer is: Dalex offers many methods for variable inspection (Lime has one) and lets you compare many models using a selected method.
The cherry on the cake was the R-Ladies Budapest event, where we could hear 6 amazing presentations. Some of the R-Ladies were giving their second talk during those 3 days. One of them was Omayma Said, talking about her Shiny app in "Stringr Explorer: Tweet Driven Development for a Shiny App!". It's a really cool app that helps you navigate the stringr package, and Omayma's story of how it was created was entertaining and admirable.
Other conference perks

It's always a pleasure to meet people from the R community and fellow R-Ladies in person.
A great thing about events like this is that there is always something extra you learn: the recipes package is a neat way to do data preprocessing for your model; thanks to Barbara I now know about 2 useful parameters, ignoreInit and once, in the observeEvent function; and Tobias explained when I would want to choose R vs. Python when doing Deep Learning – if you just need Keras, go for R; everything you can do in Python is available in R!
Making Shiny shine brighter!

Finally, I had the pleasure to give an invited talk about new Shiny packages: "Taking inspirations from proven frontend frameworks to add to Shiny with 4-6 new packages". You can access the slides here and watch the video on YouTube. It was really valuable and motivating to get feedback on our open source packages. I'm proud to be part of such a great team!
If you like the idea of shiny.users and shiny.admin and would like to know when the packages are released, you can visit the packages' landing page or keep following our blog.
What next?

I hope the idea of eRum will continue and others will pick it up, so we can all meet in 2 years' time!
I’m already jealous of all the lucky people going to useR this year. I sadly won’t be there, but Marek will and let me reveal a little secret: he will have shiny.semantic and semantic.dashboard cheat sheets and stickers to give away!
Read the original post at
Appsilon Data Science Blog.
 Follow us on Twitter!
 Check us out on Facebook page!
 Follow us on LinkedIn!
 Star our GitHub packages, including shiny.semantic, shiny.router, shiny.collections, shiny.i18n and semantic.dashboard!
 Sign up for our newsletter and get free ebook!
Finalfit, knitr and R Markdown for quick results
(This article was first published on R – DataSurg, and kindly contributed to Rbloggers)
Thank you for the many requests to provide some extra info on how best to get finalfit results out of RStudio, and particularly into Microsoft Word.
Here is how.
Make sure you are on the most up-to-date version of finalfit.
devtools::install_github("ewenharrison/finalfit")

What follows is for demonstration purposes and is not meant to illustrate model building.
Does a tumour characteristic (differentiation) predict 5-year survival?
Demographics table

First explore the variable of interest (exposure) by making it the dependent variable.
library(finalfit)
library(dplyr)

dependent = "differ.factor"

# Specify explanatory variables of interest
explanatory = c("age", "sex.factor", "extent.factor", "obstruct.factor", "nodes")
colon_s %>%
  select(age, sex.factor, extent.factor, obstruct.factor, nodes) %>%
  names() -> explanatory

Look at associations between our exposure and other explanatory variables. Include missing data.
colon_s %>%
  summary_factorlist(dependent, explanatory, p=TRUE, na_include=TRUE)

            label    levels        Well    Moderate       Poor      p
      Age (years) Mean (SD) 60.2 (12.8) 59.9 (11.7)  59 (12.8)  0.788
              Sex    Female   51 (11.6)  314 (71.7)  73 (16.7)  0.400
                       Male    42 (9.0)  349 (74.6)  77 (16.5)
 Extent of spread Submucosa    5 (25.0)   12 (60.0)   3 (15.0)  0.081
                     Muscle   12 (11.8)   78 (76.5)  12 (11.8)
                     Serosa   76 (10.2)  542 (72.8) 127 (17.0)
        Adjacent structures    0 (0.0)    31 (79.5)   8 (20.5)
      Obstruction        No    69 (9.7)  531 (74.4) 114 (16.0)  0.110
                        Yes   19 (11.0)  122 (70.9)  31 (18.0)
                    Missing    5 (25.0)   10 (50.0)   5 (25.0)
            nodes Mean (SD)   2.7 (2.2)   3.6 (3.4)  4.7 (4.4) <0.001

Warning messages:
1: In chisq.test(tab, correct = FALSE) :
  Chi-squared approximation may be incorrect
2: In chisq.test(tab, correct = FALSE) :
  Chi-squared approximation may be incorrect

Note missing data in obstruct.factor. We will drop this variable for now (again, this is for demonstration only). Also note that nodes has not been labelled.
There are small numbers in some variables generating chisq.test warnings (expected count less than 5 in a cell). Generate the final table.
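The table1 object saved below can be generated with something like this (a sketch; obstruct.factor dropped as discussed):

# Sketch: generate and store the final demographics table as table1
explanatory = c("age", "sex.factor", "extent.factor", "nodes")
colon_s %>%
  summary_factorlist(dependent, explanatory,
                     p=TRUE, na_include=TRUE) -> table1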
Now examine explanatory variables against outcome. Check plot runs ok.
explanatory = c("age", "sex.factor", "extent.factor", "nodes", "differ.factor") dependent = "mort_5yr" colon_s %>% finalfit(dependent, explanatory, dependent_label_prefix = "") > table2 Mortality 5 year Alive Died OR (univariable) OR (multivariable) Age (years) Mean (SD) 59.8 (11.4) 59.9 (12.5) 1.00 (0.991.01, p=0.986) 1.01 (1.001.02, p=0.195) Sex Female 243 (47.6) 194 (48.0)   Male 268 (52.4) 210 (52.0) 0.98 (0.761.27, p=0.889) 0.98 (0.741.30, p=0.885) Extent of spread Submucosa 16 (3.1) 4 (1.0)   Muscle 78 (15.3) 25 (6.2) 1.28 (0.424.79, p=0.681) 1.28 (0.375.92, p=0.722) Serosa 401 (78.5) 349 (86.4) 3.48 (1.2612.24, p=0.027) 3.13 (1.0113.76, p=0.076) Adjacent structures 16 (3.1) 26 (6.4) 6.50 (1.9825.93, p=0.004) 6.04 (1.5830.41, p=0.015) Lymph nodes involved Mean (SD) 2.7 (2.4) 4.9 (4.4) 1.24 (1.181.30, p<0.001) 1.23 (1.171.30, p<0.001) Differentiation Well 52 (10.5) 40 (10.1)   Moderate 382 (76.9) 269 (68.1) 0.92 (0.591.43, p=0.694) 0.70 (0.441.12, p=0.132) Poor 63 (12.7) 86 (21.8) 1.77 (1.053.01, p=0.032) 1.08 (0.611.90, p=0.796) Odds ratio plot colon_s %>% or_plot(dependent, explanatory, breaks = c(0.5, 1, 5, 10, 20, 30)) To MS Word via knitr/R MarkdownImportant. In most R Markdown setups, environment objects require to be saved and loaded to R Markdown document.
# Save objects for knitr/markdown
save(table1, table2, dependent, explanatory, file = "out.rda")

We use an RStudio Server Pro setup on Ubuntu. But these instructions should work fine for most/all RStudio/Markdown default setups.
In RStudio, select File > New File > R Markdown.
A useful template file is produced by default. Try hitting Knit to Word via the Knit button at the top of the .Rmd script window.
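If you prefer working from the console, the Knit button's Word output can be reproduced with rmarkdown::render(). A one-line sketch, where "report.Rmd" is a hypothetical filename standing in for your own:

# "report.Rmd" is a placeholder for your .Rmd file
rmarkdown::render("report.Rmd", output_format = "word_document")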
Now paste this into the file:
 title: "Example knitr/R Markdown document" author: "Ewen Harrison" date: "22/5/2018" output: word_document: default  ```{r setup, include=FALSE} # Load data into global environment. library(finalfit) library(dplyr) library(knitr) load("out.rda") ``` ## Table 1  Demographics ```{r table1, echo = FALSE, results='asis'} kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r")) ``` ## Table 2  Association between tumour factors and 5 year mortality ```{r table2, echo = FALSE, results='asis'} kable(table2, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r")) ``` ## Figure 1  Association between tumour factors and 5 year mortality ```{r figure1, echo = FALSE} colon_s %>% or_plot(dependent, explanatory) ``` Create Word template fileNow, edit the Word template. Click on a table. The style should be compact. Right click > Modify... > font size = 9. Alter heading and text styles in the same way as desired. Save this as template.docx. Upload to your project folder. Add this reference to the .Rmd YAML heading, as below. Make sure you get the space correct.
The plot also doesn’t look quite right and it prints with warning messages. Experiment with fig.width to get it looking right.
Now paste this into your .Rmd file and run:
 title: "Example knitr/R Markdown document" author: "Ewen Harrison" date: "21/5/2018" output: word_document: reference_docx: template.docx  ```{r setup, include=FALSE} # Load data into global environment. library(finalfit) library(dplyr) library(knitr) load("out.rda") ``` ## Table 1  Demographics ```{r table1, echo = FALSE, results='asis'} kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r")) ``` ## Table 2  Association between tumour factors and 5 year mortality ```{r table2, echo = FALSE, results='asis'} kable(table2, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r")) ``` ## Figure 1  Association between tumour factors and 5 year mortality ```{r figure1, echo = FALSE, warning=FALSE, message=FALSE, fig.width=10} colon_s %>% or_plot(dependent, explanatory) ```This is now looking good for me, and further tweaks can be made.
To PDF via knitr/R Markdown

Default settings for PDF:
 title: "Example knitr/R Markdown document" author: "Ewen Harrison" date: "21/5/2018" output: pdf_document: default  ```{r setup, include=FALSE} # Load data into global environment. library(finalfit) library(dplyr) library(knitr) load("out.rda") ``` ## Table 1  Demographics ```{r table1, echo = FALSE, results='asis'} kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r")) ``` ## Table 2  Association between tumour factors and 5 year mortality ```{r table2, echo = FALSE, results='asis'} kable(table2, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r")) ``` ## Figure 1  Association between tumour factors and 5 year mortality ```{r figure1, echo = FALSE} colon_s %>% or_plot(dependent, explanatory) ```Again, ok but not great.
We can fix the plot in exactly the same way, but the table runs off the side of the page. For this we use the kableExtra package; install it in the normal manner. You may also want to alter the margins of your page using geometry in the YAML preamble.
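The usual installation, for completeness:

install.packages("kableExtra")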
 title: "Example knitr/R Markdown document" author: "Ewen Harrison" date: "21/5/2018" output: pdf_document: default geometry: margin=0.75in  ```{r setup, include=FALSE} # Load data into global environment. library(finalfit) library(dplyr) library(knitr) library(kableExtra) load("out.rda") ``` ## Table 1  Demographics ```{r table1, echo = FALSE, results='asis'} kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"), booktabs=TRUE) ``` ## Table 2  Association between tumour factors and 5 year mortality ```{r table2, echo = FALSE, results='asis'} kable(table2, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"), booktabs=TRUE) %>% kable_styling(font_size=8) ``` ## Figure 1  Association between tumour factors and 5 year mortality ```{r figure1, echo = FALSE, warning=FALSE, message=FALSE, fig.width=10} colon_s %>% or_plot(dependent, explanatory) ```This is now looking pretty good for me as well.
There you have it. A pretty quick workflow to get final results into Word and a PDF.