simmer 3.6.3
(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)
The third update of the 3.6.x release of simmer, the Discrete-Event Simulator for R, is on CRAN. First of all, and once again, I must thank Duncan Garmonsway (@nacnudus) for writing a new vignette: “The Bank Tutorial: Part II”.
Among various fixes and performance improvements, this release provides a way of tracking the progress of a simulation. This feature is embedded in the run() function as a parameter called progress, which accepts a function of a single argument that will be called at each step (10 by default) with the current progress ratio (a number between 0 and 1). A very naive example using a simple print():
library(simmer)

wait_and_go <- trajectory() %>%
  timeout(1)

simmer() %>%
  add_generator("dummy", wait_and_go, function() 1) %>%
  run(until=10, progress=print, steps=5)
## [1] 0
## [1] 0.2
## [1] 0.4
## [1] 0.6
## [1] 0.8
## [1] 1
## simmer environment: anonymous | now: 10 | next: 10
## { Generator: dummy | monitored: 1 | n_generated: 10 }

Or we can get a nice progress bar with the assistance of the progress package:
simmer() %>%
  add_generator("dummy", wait_and_go, function() 1) %>%
  run(until=1e5, progress=progress::progress_bar$new()$update)
#> [==============] 20%

But more importantly, this release implements a new way of retrieving attributes (thus deprecating the old approach, which will still be available throughout the 3.6.x series and will be removed in version 3.7). Since v3.1.x, arrival attributes were retrieved by providing a function with one argument. A very simple example:
trajectory() %>%
  set_attribute("delay", 3) %>%
  timeout(function(attr) attr["delay"])
## Warning: Attribute retrieval through function arguments is deprecated.
## Use 'get_attribute' instead.
## trajectory: anonymous, 2 activities
## { Activity: SetAttribute | key: delay, value: 3, global: 0 }
## { Activity: Timeout | delay: 0x5569098b9228 }

Later on, v3.5.1 added support for global attributes, making it necessary to add a second argument to retrieve this new set of attributes:
trajectory() %>%
  set_attribute("delay", 3, global=TRUE) %>%
  timeout(function(attr, glb) glb["delay"])
## Warning: Attribute retrieval through function arguments is deprecated.
## Use 'get_attribute' instead.
## trajectory: anonymous, 2 activities
## { Activity: SetAttribute | key: delay, value: 3, global: 1 }
## { Activity: Timeout | delay: 0x556908730320 }

This method is something of a rarity in simmer. It is clunky, as it is not easy to document (and therefore to discover and learn), and non-scalable, because new features would require more and more additional arguments. Thus, it is now deprecated, and the get_attribute() function becomes the new method for retrieving attributes. It works in the same way as now() does for the simulation time:
env <- simmer()

trajectory() %>%
  set_attribute("delay_1", 3) %>%
  # shortcut equivalent to set_attribute(..., global=TRUE)
  set_global("delay_2", 2) %>%
  timeout(function() get_attribute(env, "delay_1")) %>%
  # shortcut equivalent to get_attribute(..., global=TRUE)
  timeout(function() get_global(env, "delay_2"))
## trajectory: anonymous, 4 activities
## { Activity: SetAttribute | key: delay_1, value: 3, global: 0 }
## { Activity: SetAttribute | key: delay_2, value: 2, global: 1 }
## { Activity: Timeout | delay: 0x55690829f550 }
## { Activity: Timeout | delay: 0x55690830c310 }

This is a little bit more verbose, but I believe it is more consistent and intuitive. Moreover, it allows us to easily implement new features for extracting arrival information. In fact, get_attribute() will come hand in hand with two more verbs: get_name() and get_prioritization(), to retrieve the arrival name and prioritization values respectively.
New features:

- Show simulation progress via an optional progress callback in run() (#103).
- New “The Bank Tutorial: Part II” vignette, by Duncan Garmonsway @nacnudus (#106).
- New getters for running arrivals (#109), meant to be used inside trajectories:
  - get_name() retrieves the arrival name.
  - get_attribute() retrieves an attribute by name. The old method of retrieving attributes by providing a function with one argument is deprecated in favour of get_attribute(), and will be removed in version 3.7.x.
  - get_prioritization() retrieves the three prioritization values (priority, preemptible, restart) of the active arrival.
- New shortcuts for global attributes (#110): set_global() and get_global(), equivalent to set_attribute(global=TRUE) and get_attribute(global=TRUE) respectively.

Minor changes and fixes:

- Some code refactoring and performance improvements (2f4b484, ffafe1e, f16912a, fb7941b, 2783cd8).
- Use Rcpp::DataFrame instead of Rcpp::List (#104).
- Improve argument parsing and error messages (#107).
- Improve the internal function make_resetable() (c596f73).
Article originally published in Enchufa2.es: simmer 3.6.3.
To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Joy Division, Population Surfaces and Pioneering Electronic Cartography
(This article was first published on R – Spatial.ly, and kindly contributed to R-bloggers)
There has been a resurgence of interest in data visualizations inspired by Joy Division’s Unknown Pleasures album cover. These so-called “Joy Plots” are easier to create thanks to the development of the “ggjoy” R package and also some nice code posted using D3. I produced a global population map (details here) using a similar technique in 2013 and since then I’ve been on a quest to find some earlier examples. This is because a number of people have said it reminds them of maps created by Harvard’s digital mapping pioneers in the 1970s and I got the feeling I was coming late to the party…
The pulsar-style plots are already well covered by Jen Christiansen, who wrote an excellent blog post about the Joy Division album cover, which includes many examples in a range of scientific publications and also features this interview with the designer Peter Saville.
Most interestingly, Jen’s post also includes a glimpse of the kind of map (top left) that I’d heard about but had been unable to get my hands on, shown in the book Graphis Diagrams: The Graphic Visualization of Abstract Data.
My luck changed this week thanks to a kind email from John Hessler (Specialist in Mathematical Cartography and GIS at the Library of Congress) alerting me to the huge archive of work they have associated with GIS pioneer Roger Tomlinson (whose PhD thesis I feature here). I enquired whether he knew anything about the population-lines maps, to which he casually replied “we have hundreds of those”, and that he’d already been blogging extensively on the early GIS / spatial analysis pioneers (I’ve posted links below).
John sent me the movie below, entitled “American Graph Fleeting, United States Population Growth 1790–1970”, created in 1978. It’s extraordinary, both for how contemporary it looks and because it was created as a hologram!
http://dataviz.spatial.ly/American_Graph_Fleeting_2.mp4
He writes:
In 1978 Geoffrey Dutton, who was at the Harvard Lab for Computer Graphics and Spatial Analysis, decided to make a spatial-temporal holographic projection based on the work of William Warntz (see my article How to Map a Sandwich: Surface, Topological Existence Theorems and the Changing Nature of Thematic Cartography for an image of the original Warntz population surface). Dutton’s surface projections were made with the ASPEX software, with population data smoothed onto a grid of 82 x 127 cells (a lot to handle computationally at the time)…
The first two images below show the hologram in action and the third shows how it was created.
If you want to find out more about this pioneering work (and remind yourself how far we have/haven’t come), John Hessler’s “Computing Space” series of blog posts is a great place to start:
From Hypersurfaces to Algorithms: Saving Early Computer Cartography at the Library of Congress
Ernesto and Kathy Split a Sandwich
Taking Waldo Tobler’s Geography 482
Papers of the “Father of GIS” Come to the Library of Congress
Computing Space IV: William Bunge and The Philosophy of Maps
The Many Languages of Space or How to Read Marble and Dacey
Mapping the Web or Pinging your Way to Infinity
Searching for Magpie and Possum: Contemplating the Algorithmic Nature of Cartographic Space
Games Cartographers Play: Alphago, Neural Networks and Tobler’s First Law
To leave a comment for the author, please follow the link and comment on their blog: R – Spatial.ly.
Visualising Similarity: Maps vs. Graphs
(This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)
The visualization of complex data sets is of essential importance in communicating your data products. Beyond pie charts, histograms, line graphs and other common forms of visual communication begins the reign of data sets that encompass too much information to be easily captured by these simple displays. A typical context that abounds with complexity is found in the areas of text mining, natural language processing, and cognitive computing in general; such complex data presentation is pervasive in any attempt to build a visual interface for products like semantic search engines or recommendation engines. For example, statistical models like LDA (Latent Dirichlet Allocation) enable a thorough insight into the similarity structure across textual documents, or across the vocabulary terms used to describe them. But as the number of pairwise similarities between terms or documents to be presented to the end users increases, the problem of effective data visualization becomes harder. In this and the following blog posts, I wish to present and discuss some of the well-known approaches to visualizing difficult data sets from the viewpoint of pure ergonomics, as well as to suggest some improvements and new ideas here and there.
A simple similarity map for iris
In this first blog on the topic, the idea is to focus on the basics of ergonomics in the visualization of similarity information. We will compare two ideas: the usage of similarity maps following a dimensionality reduction of the dataset – which can be taken as a sort of baseline in the visualization of similarity data – and the usage of directed graphs built from the higher-dimensional space directly, which is less common. I would like to put forward an argument in support of the latter approach, which I started using some time ago.
But first, let us have some data. I will use a well-known, rather simple data set, merely for the purposes of illustration: namely, the famous iris:
# – the iris dataset
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
The iris dataset, with its four numerical and one categorical variable, is, as you will see, complicated enough to illustrate the main problem that I wish to discuss. Now, let’s illustrate the problem. The classical approach to assessing the similarity between the different observations (rows) in this dataset would be to derive some metric from the numerical variables, perform dimensionality reduction, and then plot the data points in a lower-dimensional space. I will work with Euclidean distance and a nonlinear dimensionality reduction approach, performing ordinal multidimensional scaling (MDS) with {smacof} in R:
# – distances in iris:
dIris <- dist(iris[, 1:4], method = "euclidean")
Following the computation of Euclidean distances, each observation from the dataset can be treated as if it were represented by a set of features, each feature standing for the distance from the observation under discussion to some other observation in the dataset (including itself, where the distance is zero). The source data were four-dimensional (four continuous variables were used to represent each observation); now they are N-dimensional, with N representing the number of observations. But, being distances, the data now convey information about the similarities between the observations. Now, the dimensionality reduction step:
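As a small aside, the distance-matrix view described above can be sketched in a few lines of base R. This is a toy three-point example of my own (not the iris computation itself), just to make the "each row is an N-element distance profile" idea concrete:

```r
# Three observations with two numeric features each (hypothetical toy data)
m <- matrix(c(0, 0,
              3, 4,
              6, 8), ncol = 2, byrow = TRUE)

# Full symmetric distance matrix: row i holds the distances from
# observation i to every observation, including itself
d <- as.matrix(dist(m, method = "euclidean"))

d[1, 2]   # sqrt(3^2 + 4^2) = 5
diag(d)   # all zeros: each observation is at distance 0 from itself
```
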
# – multidimensional scaling w. {smacof}
library(smacof)
mdsIris <- smacofSym(dIris, type = "ordinal")
mdsIris$stress
confIris <- as.data.frame(mdsIris$conf)
# – add category to confIris:
confIris$Species <- iris$Species
head(confIris)
D1 D2 Species
1 0.8962210 0.09385943 setosa
2 0.9060827 0.05265667 setosa
3 0.9608955 0.03096467 setosa
4 0.9167038 0.08550457 setosa
5 0.9101914 0.10073609 setosa
6 0.7702262 0.22385668 setosa
In ordinal multidimensional scaling, only the ordering of the initial distances is preserved; the obtained mapping is thus invariant under rotation and translation – like any MDS solution – and under any monotonic scale transformation that preserves the rank order of similarity. The stress of the final MDS configuration is .025 – not bad. Now we have a 2D representation of the data. Let’s plot it with {ggplot2}:
# – visualize w. {ggplot2}
library(ggplot2)
library(ggrepel)
library(RColorBrewer)
library(dplyr)  # for recode()
confIris$Color <- recode(confIris$Species,
                         'setosa' = 'aquamarine3',
                         'versicolor' = 'coral3',
                         'virginica' = 'deepskyblue3')
confIris <- confIris[order(confIris$Species), ]
confIris$Label[which(confIris$Species == 'setosa')] <-
  paste('s', 1:length(which(confIris$Species == 'setosa')), sep = "")
confIris$Label[which(confIris$Species == 'versicolor')] <-
  paste('ve', 1:length(which(confIris$Species == 'versicolor')), sep = "")
confIris$Label[which(confIris$Species == 'virginica')] <-
  paste('vi', 1:length(which(confIris$Species == 'virginica')), sep = "")
ggplot(confIris, aes(x = D1, y = D2, label = Label)) +
  geom_text_repel(aes(x = D1, y = D2, label = Label),
                  size = 2, segment.size = .1) +
  geom_point(size = 1.5, color = confIris$Color) +
  geom_point(size = 1, color = "white") +
  ggtitle('Iris: semantic map.') +
  theme_bw() +
  theme(plot.title = element_text(size = 10))
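The rank-preservation claim above can also be sanity-checked directly. Here is a rough base-R analogue of my own, using classical cmdscale() as a stand-in for the ordinal {smacof} solution (an assumption for illustration, not the author's original code): the Spearman correlation between original and configuration distances measures how well the rank order of the pairwise distances survives the reduction.

```r
# Classical (metric) MDS in base R, as a quick proxy for the ordinal solution
d2 <- dist(iris[, 1:4], method = "euclidean")
conf2 <- cmdscale(d2, k = 2)

# Rank-order agreement between original and 2D configuration distances
rankAgreement <- cor(as.vector(d2), as.vector(dist(conf2)),
                     method = "spearman")
rankAgreement  # close to 1: the ordering of distances is largely preserved
```
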
With 150 objects to represent, it is already difficult to read the similarity map. It can be readily seen that the different types of observations (“s” for “setosa”, “vi” for “virginica”, and “ve” for “versicolor”, the famous iris$Species, each additionally indexed by the observation number in the respective class) tend to cluster naturally. However, we know one very important thing about the iris dataset: it is not linearly separable. In the simplest possible explanation of what non-linear separability means: look for the border between the blue (“virginica”) and red (“versicolor”) points, and try to imagine a straight line through the plane that perfectly separates the two classes (and then generalize if you wish: no plane in 3D, no hyperplane in a higher-dimensional space… can separate the classes perfectly). Ask yourself: which observations are most responsible for this feature of the iris dataset? For example, is “vi7” at the bottom of the map a problematic case in categorization? What about the following: “vi50”, “vi47”, “ve34”, “ve23”? The observation “vi7” is found far away from the other members of its category; however, it is still somewhat lonesome in this 2D similarity map. Observations “vi50”, “vi47”, “ve34”, and “ve23”, found in the lower right segment of the map but closer to its center, on the other hand, lie in a very crowded part of the similarity space: could they be a part of the explanation for the non-linear separability in iris?
The problem of ergonomics is obviously related to the overcrowding of the points in the map. The conceptual problem with dimensionality reduction to simple, visually comprehensible similarity spaces is that there is no dimensionality reduction without a loss of information from the initial, higher-dimensional space (except in some really rare, theoretically possible cases). As we will see, a consequence of this is that there are situations where seemingly unimportant observations manage to hide a part of their nature in a lower-dimensional representation. But we need to see that first.
Similarity in iris as a directed graph
Before performing dimensionality reduction with ordinal MDS, we were in a 150-dimensional similarity – distance, actually – space. In spite of being a powerful technique for nonlinear dimensionality reduction, MDS has “eaten” some of the information from the initial hyperspace. Now, let’s see if we can go back into the hyperspace and visualize it somehow, to enable us to discuss the importance of particular observations in this dataset. Each data point is described as a vector of 150 elements, each element representing the distance between the data point under observation and some other data point. Let’s search through these vectors and find the nearest neighbor – the most similar observation – for each data point:
# – iris as a directed graph:
distIris <- as.matrix(dIris)  # back to the original 150-dimensional distance space
nnIris <- apply(distIris, 1, function(x) {
  minDist <- sort(x, decreasing = T)
  minDist <- minDist[length(minDist) - 1]
  which(x == minDist)[1]
})
# – graphFrame:
graphFrame <- data.frame(from = 1:150,
                         to = nnIris,
                         Species = iris$Species)
We have first found the nearest neighbors, and then introduced a new variable, Degree, that describes for how many other observations (excluding itself) a particular data point is the nearest neighbor. In the following similarity map, the size of the marker represents the degree of each observation:
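Both steps – the nearest neighbor lookup and the degree count – can be expressed compactly in base R. A toy four-point sketch of my own (hypothetical data; setting the diagonal to Inf excludes each point's zero self-distance, which is simpler than the sort-based lookup above):

```r
# Four points on a line (toy data), full distance matrix
d <- as.matrix(dist(c(0, 1, 3, 7)))
diag(d) <- Inf                        # never pick yourself as nearest neighbor

nn  <- apply(d, 1, which.min)         # index of each point's nearest neighbor
deg <- tabulate(nn, nbins = nrow(d))  # Degree: how often each point is chosen

nn   # 2 1 2 3: e.g. nn[3] == 2, point 2 is the nearest neighbor of point 3
deg  # 1 2 1 0: point 2 is the nearest neighbor of two other points
```
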
Observations “vi50” and “ve34” stand close to each other while belonging to different classes; moreover, they are marked as having a higher degree in the nearest neighbor network (we haven’t visualized the network yet) than many other observations; they are candidates to be each other’s nearest neighbors. Still, “vi7” at the bottom of the map is certainly not a hub. In the next step, we change the visualization strategy. Again, get ready for a loss of information. While MDS attempts a representation that preserves as much information as possible about all pairwise similarities from the initial dataset, thus necessarily compromising with some error in the lower-dimensional representation, we choose to sacrifice the context and represent information on similarity neighborhoods only. We will represent the similarity data from iris by instantiating a directed graph, with each node representing an observation, and each node having exactly one outgoing link, pointing towards the nearest neighbor of the respective observation. We use {igraph} to construct and represent the graph:
# – represent high-dimensional iris neighbourhoods as a directed graph:
graphFrame$from <- confIris$Label[graphFrame$from]
graphFrame$to <- confIris$Label[graphFrame$to]
library(igraph)
irisNet <- graph.data.frame(graphFrame, directed = T)
V(irisNet)$size <- degree(irisNet) * 1.25
V(irisNet)$color <- as.character(confIris$Color)
V(irisNet)$label <- confIris$Label
# – plot w. {igraph}
par(mai = c(rep(0, 4)))
plot(irisNet,
     vertex.shape = "circle",
     vertex.frame.color = NA,
     edge.width = .75,
     edge.color = "grey",
     edge.arrow.size = 0.15,
     vertex.label.font = 1,
     vertex.label.color = "black",
     vertex.label.family = "sans",
     vertex.label.cex = .65,
     vertex.label.dist = .35,
     edge.curved = 0.5,
     margin = c(rep(0, 4)))
As expected, “vi50” and “ve34” have mutual links, standing as each other’s nearest neighbor while being members of different categories; but so do “vi39” and “ve21”, a fact that was less obvious in the similarity map. Also, we now discover that the nearest neighbor to “vi7” is “ve13”, from another class. The clustering in the nearest neighbor graph occurs naturally, and thus the graph conveys some information about the similarity context for groups of points, in spite of the fact that it was not intentionally built to represent such information.
Maps vs. Graphs
The nearest neighbor directed graph approach is not very common among attempts to visualize similarity data. However, following years of work in the field of distributional semantics, I have almost completely abandoned similarity maps as a means of data visualization, in favor of graphs. While I haven’t performed any methodologically rigorous comparison between the two approaches, I have used both in building interfaces for semantic search engines, and the users always turned out to be more satisfied with directed nearest neighbor graphs than with similarity maps. Intuitively, I would say the main reason is the absence of too much crowding of data points. Also, the aesthetics of graphs seem more pleasing. Graphs are maybe a bit more difficult to visually search through: the links and the occurrence of clusters tend to guide the eye to search for a particular data point in the immediate neighborhood only. Then again, I don’t find the ergonomics of similarity maps much more efficient in that respect. The drawback of directed graphs: while some important clusters of nearest neighbors emerge naturally in them, the graph layout will not necessarily enable the user to recognize the way the observations tend to group in general. I have experimented with directed graphs in which two nearest neighbors are selected for each observation, and that works too, but the graph quickly becomes unreadable for anything significantly larger than ten to fifteen observations.
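The two-neighbor variant mentioned here only changes the lookup: instead of the single minimum, take the first two positions of the ordered distance vector. A toy base-R sketch of my own (hypothetical four-point data, not the original code):

```r
# Toy four-point configuration on a line, full distance matrix
d <- as.matrix(dist(c(0, 1, 3, 7)))

# order(x)[1] is the point itself (distance 0), so positions 2:3 give
# the first and second nearest neighbors of each observation
nn2 <- t(apply(d, 1, function(x) order(x)[2:3]))
colnames(nn2) <- c("nn1", "nn2")
nn2  # each row: the two outgoing links of one node
```
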
Comparing the loss of information in similarity maps with that in nearest neighbor graphs, it seems to me that the latter have a clear advantage in the presentation of complex datasets, and I would clearly prefer the graphs in any situation – or at least suggest always using them as an additional method of interpretation alongside the respective maps. To illustrate, I have inspected ordinal MDS solutions for iris distances in Euclidean space, varying the solution dimensionality from two to five, and computed the accuracy of correctly determining the Nth neighbor of each observation for each solution. The following line chart is quite telling.

On the x-axis we find the order of the neighbor (1st, 2nd, etc.), while the accuracy of correctly determining the neighbor of the respective order across the observations is represented on the y-axis (% scale). One can now gain quite a direct insight into the behavior of information loss in ordinal MDS. The 2D solution seems especially problematic, with far less than 50% recognition of the 1st neighbor across the observations. The local information from higher-dimensional spaces represented by directed graphs perfectly complements the compromises made in the lower-dimensional similarity maps.
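The accuracy measure behind that chart can be reconstructed along these lines. This is my own sketch, under the assumption that accuracy means the share of observations whose kth neighbor in the reduced space equals their kth neighbor in the original space (the original code was not published here), and it uses classical cmdscale() rather than the ordinal {smacof} solution:

```r
# k-th nearest neighbor of every observation, given a full distance matrix;
# order(x)[k + 1] skips the observation itself at position 1
kth_nn <- function(d, k) apply(d, 1, function(x) order(x)[k + 1])

# Share of observations whose k-th neighbor is identified correctly
# in the reduced space
nn_accuracy <- function(d_orig, d_red, k) {
  mean(kth_nn(d_orig, k) == kth_nn(d_red, k))
}

# Example: original iris distances vs. a 2D classical-MDS proxy
dOrig <- as.matrix(dist(iris[, 1:4]))
dRed  <- as.matrix(dist(cmdscale(dist(iris[, 1:4]), k = 2)))
acc1  <- nn_accuracy(dOrig, dRed, k = 1)  # recognition rate of the 1st neighbor
```
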
Goran S. Milovanović, PhD
Data Science Consultant, SmartCat
To leave a comment for the author, please follow the link and comment on their blog: The Exactness of Mind.
Joy Plot of Length Frequencies
(This article was first published on fishR Blog, and kindly contributed to R-bloggers)
There has been a bit of a buzz recently about so-called “joyplots.” Wilke described joyplots as “partially overlapping line plots that create the impression of a mountain range.” I would describe them as partially overlapping density plots (akin to a smoothed histogram). Examples of joyplots are here, here, and here (from among many).
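The "partially overlapping density plots" idea is easy to sketch in base R. This is a minimal sketch with simulated data of my own, just to show the construction (the real plots below use ggjoy):

```r
# Three groups of simulated values with shifted means
set.seed(42)
g <- rep(1:3, each = 200)
x <- rnorm(600, mean = c(0, 1.5, 3)[g])

# Draw each group's density curve, vertically offset so the curves
# partially overlap, joyplot-style
plot(NULL, xlim = range(x), ylim = c(0, 3.2),
     xlab = "value", ylab = "", yaxt = "n")
for (i in 1:3) {
  dens <- density(x[g == i])
  polygon(dens$x, (3 - i) + dens$y * 2,   # offset + scale each curve
          col = "gray85", border = "gray30")
}
axis(2, at = 0:2, labels = 3:1, las = 1)  # group labels on the y-axis
```
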
I thought that joyplots might provide a nice visualization of length frequencies over time. For example, we have a paper forthcoming in the North American Journal of Fisheries Management that examines (among other things) the lengths of Kiyi (Coregonus kiyi) captured in trawl tows in Lake Superior from 2003 to 2014. Figure 1 is a modified version (stripped of code that increased fonts, changed colors, etc.) of the length frequency histograms we included in the paper. Figure 2 is a near default joyplot of the same data.
Figure 1: Histograms of the total lengths of Lake Superior Kiyi from 2003 to 2014.
Figure 2: Joyplot of the total lengths of Lake Superior Kiyi from 2003 to 2014.
In my opinion, the joyplot makes it easier to follow through time the strong year-classes that first appear in 2004 and 2010 and to see how fish in the strong year-classes grow and eventually merge in size with older fish. Joyplots look useful for displaying length (or age) data across many groups (years, locations, etc.). Some, however, like the specificity (less smoothing) of the histograms (see the last example in Wilke for overlapping histograms).
R Code

The data file used in this example has the following structure.
str(lf)
## 'data.frame': 9151 obs. of 3 variables:
##  $ year: Factor w/ 12 levels "2003","2004",..: 1 1 2 2 2 2 5 5 5 5 ...
##  $ mon : Factor w/ 3 levels "Jul","Jun","May": 3 3 3 3 3 3 3 3 3 3 ...
##  $ tl  : int 183 259 200 249 215 234 119 188 157 189 ...

The histograms were constructed with the following code (requires the ggplot2 package).
ggplot(lf, aes(x=tl)) +
  scale_x_continuous(expand=c(0.02,0), limits=c(40,310),
                     breaks=seq(0,350,50),
                     labels=c("","",100,"",200,"",300,"")) +
  geom_histogram(aes(y=..ncount..), binwidth=5, fill="gray40") +
  facet_wrap(~year, nrow=4, dir="v") +
  labs(x="Total Length (mm)", y="Relative Frequency") +
  theme_bw() +
  theme(axis.text.y=element_blank())

The joyplot was constructed with the following code (requires the ggplot2 and ggjoy packages). See the introductory vignette by Wilke for how to further modify joyplots.
ggplot(lf, aes(x=tl, y=year)) +
  geom_joy(rel_min_height = 0.01) +       # removes tails
  scale_y_discrete(expand = c(0.01, 0)) + # removes cutoff top
  labs(x="Total Length (mm)", y="Year") +
  theme_minimal()

To leave a comment for the author, please follow the link and comment on their blog: fishR Blog.
Looking for R at JSM
(This article was first published on R Views, and kindly contributed to R-bloggers)
I am very much looking forward to attending JSM, which begins this Sunday. And once again, I will be spending a good bit of my time hunting for new and interesting applications of R. In years gone by, this was a difficult game at JSM because R, R package, Shiny, tidyverse and the like did not often turn up in a keyword search. This year, however, there is quite a bit of low-hanging fruit. Twenty-five sessions are listed below by their JSM Activity Number.
If you are going to the conference, try to work some of these into your program, and please feel free to write to me (joseph.rickert@rstudio.com) if you see an R related talk that was particularly interesting.
Sunday, 07/30/2017

24 – Acc: An R Package to Process, Visualize, and Analyze Accelerometer Data Jae Joon Song, University of Texas Health Science Center; Michael Swartz, The University of Texas Health Science Center at Houston, School of Public Health; Matthew G. Cox, University of Texas MD Anderson Cancer Center; Karen Basen-Engquist, University of Texas MD Anderson Cancer Center

83 – R Package TDA for Statistical Inference on Topological Data Analysis Jisu Kim, Carnegie Mellon University

Monday, 07/31/2017

118 – Identification of Potentially Problematic Prescribing to a Vulnerable Population Ryan Jarrett; Richard Epstein, Chapin Hall at the University of Chicago; Michael Cull, Vanderbilt University; David Schlueter, Vanderbilt University; Kathy Gracey, Vanderbilt University; Molly Butler, Vanderbilt University; Rameela Chandrasekhar, Vanderbilt University

159 – No Punt Intended Nicholas Eisemann, California State Univ Chico; Camille Pensabene, St. John Fisher College; Dylan Gouthro, Chico State; Demond Handley, Kansas State University; MyDoris Soto, California State University-Fullerton

159 – A Shiny App for Investigation of NHL Draft Pick Career Trajectories James White, St. Lawrence University; Michael Schuckers, St. Lawrence University

162 – Forced Migration: Visualizing Patterns of Refugee Movements Jacquelyn Neal, Vanderbilt University; Hannah E Weeks, Vanderbilt University

171 – Developing a Web-Based Tool to Evaluate Survey Performance Using R Shiny Scott Fricker, U.S. Bureau of Labor Statistics; Jeffrey Gonzalez, Bureau of Labor Statistics; Randall Powers, US Bureau of Labor Statistics

182 – Identification of Potentially Problematic Prescribing to a Vulnerable Population Ryan Jarrett; Richard Epstein, Chapin Hall at the University of Chicago; Michael Cull, Vanderbilt University; David Schlueter, Vanderbilt University; Kathy Gracey, Vanderbilt University; Molly Butler, Vanderbilt University; Rameela Chandrasekhar, Vanderbilt University

243 – An R Shiny Foundation for Standardized Clinical Review Tools Jimmy Wong, FDA

250 – A Shiny App for Investigation of NHL Draft Pick Career Trajectories James White, St. Lawrence University; Michael Schuckers, St. Lawrence University

253 – Forced Migration: Visualizing Patterns of Refugee Movements Jacquelyn Neal, Vanderbilt University; Hannah E Weeks, Vanderbilt University
Tuesday, 08/01/2017

279 – Wilcoxon Rank-Based Tests for Clustered Data with R Package Clusrank Jun Yan, University of Connecticut; Yujing Jiang, University of Connecticut; Mei-Ling Ting Lee, University of Maryland; Xin He, University of Maryland, College Park; Bernard Rosner, Harvard Medical School

358 – Contextualizing Sensitivity Analysis in Observational Studies: Calculating Bias Factors for Known Covariates (Note this is also listed as Activity Number 252) Lucy D’Agostino McGowan; Robert A Greevy, Jr, Vanderbilt University

389 – Applications of Interactive Exploration of Large Multi-Panel Faceted Displays Using Trelliscope — Ryan Hafen, Purdue University

389 – Extracting Insights from Genomic Data via Complex Interactive Heatmaps — Alicia Schep, Stanford School of Medicine

389 – Visualizing the Changing Landscape of Software Developers — David Robinson, StackOverflow.com

Wednesday, 08/02/2017

454 – An R Package for Spatio-Temporal Change of Support Andrew Raim, U.S. Census Bureau; Scott H. Holan, University of Missouri; Jonathan R Bradley, Florida State University; Christopher Wikle, University of Missouri
504 – Discriminative Power and Variable Selection in Logistic Regression Models under Outcome Misspecification Jia Bao, NYU Langone Medical Center; Yongzhao Shao, New York UniversitySchool of Medicine
518 – EyeTrVis: An R Package for the Extraction and Visualization of Eye Tracking Data from People Looking at Posters — Chunyang Li ; Jurgen Symanzik, Utah State University
518 – In with the New: a Shiny App for Exploring Summary Data Tables from the Medical Expenditure Panel Survey — Emily Mitchell, Agency for Healthcare Research and Quality
518 – RCAP Designer: Creating Analytical Dashboards on the RCloud Platform — Robert Archibald, Macrosoft Inc ; Ganesh Subramaniam, AT&T ; Simon Urbanek, AT&T Labs Research
518 – Standardizing Assessment of Time-Series Forecasting Models — Nicholas G. Reich, University of Massachusetts Amherst; Joshua Kaminsky, Johns Hopkins Bloomberg School of Public Health; Lindsay Keegan, Johns Hopkins Bloomberg School of Public Health; Stephen Lauer, University of Massachusetts – Amherst; Krzysztof Sakrejda, University of Massachusetts – Amherst; Justin Lessler, Johns Hopkins Bloomberg School of Public Health
543 – Discussing the Uses and Creation of R Shiny Applications Justin Post, North Carolina State University
548 – A girl geek’s guide to new research on interactive data visualization for statistics with lots of data — Dianne Cook, Monash University
587 – Ggmapr – Tools for Working with and Visualizing Maps Heike Hofmann, Iowa State University
Thursday, 08/03/2017
599 – Guiding Principles for Interactive Graphics Based on LIBD Data Science Projects Leonardo Collado-Torres, Lieber Institute for Brain Development; Bill Ulrich, Lieber Institute for Brain Development; Stephen Semick, Lieber Institute for Brain Development; Andrew E Jaffe, Lieber Institute for Brain Development
599 – Interactive Data Visualization on the Web Using R — Carson Sievert
To leave a comment for the author, please follow the link and comment on their blog: R Views.
Social Network Analysis and Topic Modeling of codecentric’s Twitter friends and followers
(This article was first published on Shirin's playgRound, and kindly contributed to Rbloggers)
I have written the following post about Social Network Analysis and Topic Modeling of codecentric’s Twitter friends and followers for codecentric’s blog:
Recently, Matthias Radtke has written a very nice blog post on Topic Modeling of the codecentric Blog Articles, where he gives a comprehensive introduction to Topic Modeling. In this article I am showing a real-world example of how we can use Data Science to gain insights from text data and social network analysis.
I am using publicly available Twitter data to characterize codecentric’s friends and followers: identifying the most “influential” followers, using text analysis tools like sentiment analysis to characterize their interests from their user descriptions, performing social network analysis on friends, followers, and a subset of second-degree connections to identify key players who will be able to pass on information to a wide reach of other users, and combining this network analysis with topic modeling to identify meta-groups with similar interests.
Knowing the interests and social network positions of our followers allows us to identify key users who are likely to retweet posts that fall within their range of interests and who will reach a wide audience.
…
Continue reading at https://blog.codecentric.de/en/2017/07/combiningsocialnetworkanalysistopicmodelingcharacterizecodecentricstwitterfriendsfollowers/…
The entire analysis has been done in R 3.4.0 and you can find my code on Github.
Reading PCAP Files with Apache Drill and the sergeant R Package
(This article was first published on R – rud.is, and kindly contributed to Rbloggers)
It’s no secret that I’m a fan of Apache Drill. One big strength of the platform is that it normalizes access to diverse data sources down to ANSI SQL calls, which means that I can pull data from Parquet, Hive, HBase, Kudu, CSV, JSON, MongoDB and MariaDB with the same SQL syntax. This also means that I get access to all those platforms in R centrally through the sergeant package that rests atop d[b]plyr. It further means that when support for a new file type is added, I get that same functionality without any extra effort.
Why am I calling this out?
Well, the intrepid Drill developers are in the process of finalizing the release candidate for version 1.11.0 and one feature they’ve added is the ability to query individual and entire directories full of PCAP files from within Drill. While I provided a link to the Wikipedia article on PCAP files, the TL;DR on them is that it’s an optimized binary file format for recording network activity. If you’re on macOS or a linuxish system go do something like this:
sudo tcpdump -ni en0 -s0 -w capture01.pcap
And, wait a bit.
NOTE: Some of you may have to change the en0 to your main network interface name (a quick google for that for your platform should get you to the right one to use).
That command will passively record all network activity on your system until you Ctrl-C it. The longer it goes, the larger it gets.
When you’ve recorded a minute or two of packets, Ctrl-C the program and then try to look at the PCAP file. It’s a binary mess. You can read it back with tcpdump or Wireshark, and there are many C[++] libraries and other utilities that can read them. You can even convert them to CSV or XML, but the PCAP itself requires custom tools to work with it effectively. I had started creating crafter to work with these files, but my use case/project dried up and I haven’t gone back to it.
Adding the capability into Drill means I don’t really have to work any further on that specialized package as I can do this:
library(sergeant)
library(iptools)
library(tidyverse)
library(cymruservices)

db <- src_drill("localhost")

my_pcaps <- tbl(db, "dfs.caps.`/capture02.pcap`")

glimpse(my_pcaps)
## Observations: 25
## Variables: 12
## $ src_ip          "192.168.10.100", "54.159.166.81", "192.168.10...
## $ src_port        60025, 443, 60025, 443, 60025, 58976, 443, 535...
## $ tcp_session     2.082796e+17, 2.082796e+17, 2.082796e+17, ...
## $ packet_length   129, 129, 66, 703, 66, 65, 75, 364, 65, 65, 75...
## $ data            "...g9B..c.<..O..@=,0R.`........K..EzYd=.........
## $ src_mac_address "78:4F:43:77:02:00", "D4:8C:B5:C9:6C:1B", "78:...
## $ dst_port        443, 60025, 443, 60025, 443, 443, 58976, 5353,...
## $ type            "TCP", "TCP", "TCP", "TCP", "TCP", "UDP", "UDP...
## $ dst_ip          "54.159.166.81", "192.168.10.100", "54.159.166...
## $ dst_mac_address "D4:8C:B5:C9:6C:1B", "78:4F:43:77:02:00", "D4:...
## $ network         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ timestamp       2017-07-27 23:54:58, 2017-07-27 23:54:59, 201...

summarise(my_pcaps, max = max(timestamp), min = min(timestamp)) %>%
  collect() %>%
  summarise(max - min)
## # A tibble: 1 x 1
##   `max - min`
## 1 1.924583 mins

count(my_pcaps, type)
## # Source: lazy query [?? x 2]
## # Database: DrillConnection
##   type     n
## 1  TCP  4974
## 2  UDP   774

filter(my_pcaps, type == "TCP") %>%
  count(dst_port, sort = TRUE)
## # Source: lazy query [?? x 2]
## # Database: DrillConnection
## # Ordered by: desc(n)
##    dst_port     n
##  1      443  2580
##  2    56202   476
##  3    56229   226
##  4    56147   169
##  5    56215   103
##  6    56143    94
##  7    56085    89
##  8    56203    56
##  9    56205    39
## 10    56209    39
## # ... with more rows

filter(my_pcaps, type == "TCP") %>%
  count(dst_ip, sort = TRUE) %>%
  collect() -> dst_ips

filter(dst_ips, !is.na(dst_ip)) %>%
  left_join(ips_in_cidrs(.$dst_ip, c("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")),
            by = c("dst_ip" = "ips")) %>%
  filter(!in_cidr) %>%
  left_join(distinct(bulk_origin(.$dst_ip), ip, .keep_all = TRUE), c("dst_ip" = "ip")) %>%
  select(dst_ip, n, as_name)
## # A tibble: 37 x 3
##            dst_ip     n                              as_name
##  1   104.244.42.2   862           TWITTER - Twitter Inc., US
##  2 104.244.46.103   556           TWITTER - Twitter Inc., US
##  3  104.20.60.241   183 CLOUDFLARENET - CloudFlare, Inc., US
##  4     31.13.80.8   160        FACEBOOK - Facebook, Inc., US
##  5  52.218.160.76   100     AMAZON-02 - Amazon.com, Inc., US
##  6  104.20.59.241    79 CLOUDFLARENET - CloudFlare, Inc., US
##  7  52.218.160.92    66     AMAZON-02 - Amazon.com, Inc., US
##  8  199.16.156.81    58           TWITTER - Twitter Inc., US
##  9 104.244.42.193    47           TWITTER - Twitter Inc., US
## 10  52.86.113.212    42    AMAZON-AES - Amazon.com, Inc., US
## # ... with 27 more rows

No custom R code. No modification to the sergeant package. Just query it like any other data source.
One really cool part of this is that — while similar functionality has been available in various Hadoop contexts for a few years — we’re doing this query from a local file system outside of a Hadoop context.
I had to add "pcap": { "type": "pcap" } to the formats section of the dfs storage configuration (#ty to the Drill community for helping me figure that out), and I set up a directory that defaults to the pcap type. But after that, it just works.
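For reference, that addition sits inside the formats block of the dfs storage plugin configuration, which ends up looking roughly like this (a sketch; your configuration will contain other format entries alongside it):

```
"formats": {
  "pcap": {
    "type": "pcap"
  }
}
```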
Well, kinda.
The Java code that the plugin is based on doesn’t like busted PCAP files (which we get quite a bit of in infosec & honeypot-land), and it seems to bork on IPv6 packets a bit. And my sergeant package (for now) can’t do much with the data component (neither can Drill proper). But it’s a great start, and I can use it to do bulk parquet file creation of basic protocols & connection information, or take a quick look at some honeypot captures whenever I need to, right from R, without converting them first.
Drill 1.11.0 is only at RC0 right now, so some of these issues may be gone by the time the full release is baked. Some fixes may have to wait for 1.12.0. And much work needs to be done on the UDF side and the sergeant side to help make the data element more useful.
Even with the issues and limitations, this is an amazing new feature that’s been added to an incredibly useful tool and much thanks goes out to the Drill dev team for sneaking this in to 1.11.0.
If you have cause to work with PCAP files, give this a go and see if it helps speed up parts of your workflow.
Le Monde puzzle [#1707]
(This article was first published on R – Xi'an's Og, and kindly contributed to Rbloggers)
A geometric Le Monde mathematical puzzle:
 Given a pizza of diameter 20cm, what is the way to cut it by two perpendicular lines through a point distant 5cm from the centre towards maximising the surface of two opposite slices?
 Using the same point as the tip of the four slices, what is the way to make four slices with equal arcs in four cuts from the tip again towards maximising the surface of two opposite slices?
For both questions, I did not bother with the maths but went instead for a discretisation of the disk, counting the proportion of points within two opposite slices and letting the inclination of these slices move from zero to π/2. Unsurprisingly, for the first question, the answer is π/4, given that there is no difference between both surfaces at angles 0 and π/2. My R code is as follows, using (5,0) as the tip:
M=100
surfaz=function(alpha){
  surfz=0
  cosal=cos(alpha); sinal=sin(alpha)
  X=Y=seq(-10,10,le=M)
  Xcosal=(X-5)*cosal
  Xsinal=(X-5)*sinal
  for (i in 1:M){
    norm=sqrt(X[i]^2+Y^2)
    scal1=-Xsinal[i]+Y*cosal
    scal2=Xcosal[i]+Y*sinal
    surfz=surfz+sum((norm<=10)*(scal1*scal2>0))}
  return(4*surfz/M/M/pi)}

The second puzzle can be solved by a similar code, except that the slice area between two lines has to be determined by a cross product:
surfoz=function(alpha,ploz=FALSE){
  sinal=sin(alpha); cosal=cos(alpha)
  X=Y=seq(-10,10,le=M)
  frsterm=cosal*(10*cosal-5)+sinal*(10*sinal-5)
  trdterm=cosal*(10*cosal+5)+sinal*(10*sinal+5)
  surfz=0
  for (i in 1:M){
    norm=sqrt(X[i]^2+Y^2)
    scal1=(10*(Y[i]-5)*cosal-(10*sinal-5)*X)*frsterm
    scal2=(10*(Y[i]-5)*sinal-(10*cosal-5)*X)*frsterm
    scal3=(10*(Y[i]-5)*cosal+(10*sinal+5)*X)*trdterm
    scal4=(10*(Y[i]-5)*sinal+(10*cosal+5)*X)*trdterm
    surfz=surfz+sum((norm<=10)*
      ((scal1>0)*(scal2>0)+
       (scal3>0)*(scal4>0)))}
  return(4*surfz/M/M/pi)}

a code that shows that all cuts lead to identical surfaces for both sets of slices. A fairly surprising result!
Filed under: Books, Kids, R Tagged: competition, geometry, Le Monde, mathematical puzzle, optimisation, R
The R6 Class System
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
R is an object-oriented language with several object-orientation systems. There's the original (and still widely-used) S3 class system based on the "class" attribute. There's the somewhat stricter, signature-based S4 class system. There are reference classes (also called R5), which provide R objects with multiple references without duplicating data in memory. And now there's the R6 class system, implemented as an R package by Winston Chang.
So why yet another class system for R? As Winston explains in his useR!2017 talk embedded below, the S3 and S4 class systems are functional object-oriented systems, where class methods are separate from objects and objects are not mutable (that is, they can't be modified directly by methods; you have to assign the modified value back to the object). The R6 system, by contrast, is an encapsulated object-oriented system akin to those in Java or C++, where objects contain methods in addition to data, and those methods can modify objects directly.
The big advantage of R6 is that it makes it much easier to implement some common data structures in a userfriendly manner. For example, to implement a stack "pop" operation in S3 or S4 you have to do something like this:
x <- topval(mystack)
mystack <- remove_top(mystack)

In R6, the implementation is much simpler to use:
x <- mystack$pop()
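To make the contrast concrete, here is a minimal R6 stack sketch (illustrative, not from the original post): pop() both returns the top value and mutates the object in place, which is exactly the encapsulated behavior described above.

```r
library(R6)

# A minimal stack sketch using R6 (illustrative only)
Stack <- R6Class("Stack",
  public = list(
    items = NULL,
    initialize = function() {
      self$items <- list()
    },
    push = function(x) {
      self$items[[length(self$items) + 1]] <- x
      invisible(self)          # returning self enables method chaining
    },
    pop = function() {
      n <- length(self$items)
      top <- self$items[[n]]
      self$items[[n]] <- NULL  # the object mutates itself in place
      top
    }
  )
)

s <- Stack$new()
s$push(1)$push(2)
s$pop()  # returns 2; s now holds only 1
```

Note that push() returns the object invisibly, so calls can be chained, and no reassignment of s is ever needed.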
Despite being first released over 3 years ago, R6 isn't widely known. It is widely used, however: it's used to manage session state in Shiny and mrsdeploy, to manage database connections in dplyr, and to represent system processes in the processx package. In fact, the R6 package is the most downloaded CRAN package, according to Rdocumentation Trends.
The R6 package is available on CRAN, and it also comes preinstalled in Microsoft R. If you'd like to start using R6, the vignette Introduction to R6 Classes is a great place to start. You can also check out the development of R6 at the R6 GitHub repository.
Options for teaching R to beginners: a false dichotomy?
(This article was first published on SAS and R, and kindly contributed to Rbloggers)
I’ve been reading David Robinson’s excellent blog entry “Teach the tidyverse to beginners” (http://varianceexplained.org/r/teachtidyverse), which argues that a tidyverse approach is the best way to teach beginners. He summarizes two competing curricula:
1) “Base R first”: teach syntax such as $ and [[]], built-in functions like ave() and tapply(), and use base graphics
2) “Tidyverse first”: start from scratch with pipes (%>%) and leverage dplyr and use ggplot2 for graphics
If I had to choose one of these approaches, I’d also go with 2) (“Tidyverse first”), since it helps to move us closer to helping our students “think with data” using more powerful tools (see here for my sermon on this topic).
A third way
Of course, there’s a third option that addresses David’s imperative to “get students doing powerful things quickly”. The mosaic package was written to make R easier to use in introductory statistics courses. The package is part of Project MOSAIC (http://mosaicweb.org), an NSFfunded initiative to integrate statistics, modeling, and computing. A paper outlining the mosaic package’s “Less Volume, More Creativity” approach was recently published in the R Journal (https://journal.rproject.org/archive/2017/RJ2017024). To his credit, David mentions the mosaic package in a response to one of the comments on his blog.
Less Volume, More Creativity

One of the big ideas in the mosaic package is that students build on the existing formula interface in R as a mechanism to calculate summary statistics, generate graphical displays, and fit regression models. Randy Pruim has dubbed this approach “Less Volume, More Creativity”.
While teaching this formula interface involves adding a new learning outcome (what is “Y ~ X“?), the mosaic approach simplifies calculation of summary statistics by groups and the generation of two or three dimensional displays on day one of an introductory statistics course (see for example Wang et al., “Data Viz on Day One: bringing big ideas into intro stats early and often” (2017), TISE).
The formula interface also prepares students for more complicated models in R (e.g., logistic regression, classification).
Here’s a simple example using the diamonds data from the ggplot2 package. We model the relationships between two colors (D and J), number of carats, and price.
I’ll begin with a bit of data wrangling to generate an analytic dataset with just those two colors. (Early in a course I would either hide the next code chunk or make the recoded dataframe accessible to the students to avoid cognitive overload.) Note that an R Markdown file with the following commands is available for download at https://nhorton.people.amherst.edu/mosaicblog.Rmd.
recoded <- diamonds %>%
  filter(color == "D" | color == "J") %>%
  mutate(col = as.character(color))
We first calculate the mean price (in US$) for each of the two colors.
mean(price ~ col, data = recoded)
D J
3170 5324
This call is an example of how the formula interface facilitates calculation of a variable’s mean for each of the levels of another variable. We see that D color diamonds tend to cost less than J color diamonds.
A useful function in mosaic is favstats() which provides a useful set of summary statistics (including sample size and missing values) by group.
favstats(price ~ col, data = recoded)

  col min   Q1 median   Q3   max mean   sd    n missing
1   D 357  911   1838 4214 18693 3170 3357 6775       0
2   J 335 1860   4234 7695 18710 5324 4438 2808       0
A similar command can be used to generate side by side boxplots. Here we illustrate the use of lattice graphics. (An alternative formula based graphics system (ggformula) will be the focus of a future post.)
bwplot(col ~ price, data = recoded)
The distributions are skewed to the right (not surprisingly, since they are prices). If we wanted to formally compare these sample means, we could do so with a two-sample t-test (or, in a similar fashion, by fitting a linear model).
t.test(price ~ col, data = recoded)

	Welch Two Sample t-test

data:  price by col
t = -20, df = 4000, p-value < 2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2336 -1971
sample estimates:
mean in group D mean in group J
           3170            5324
msummary(lm(price ~ col, data = recoded))

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3170.0       45.0    70.4   <2e-16 ***
colJ          2153.9       83.2    25.9   <2e-16 ***

Residual standard error: 3710 on 9581 degrees of freedom
Multiple R-squared: 0.0654, Adjusted R-squared: 0.0653
F-statistic: 670 on 1 and 9581 DF, p-value: <2e-16
The results from the two approaches are consistent: the group differences are highly statistically significant. We could conclude that J diamonds tend to cost more than D diamonds, back in the population of all diamonds.
Let’s do a quick review of the mosaic modeling syntax to date:
mean(price ~ col)
bwplot(price ~ col)
t.test(price ~ col)
lm(price ~ col)

See the pattern?

On a statistical note, it’s important to remember that the diamonds were not randomized into colors: this is a found (observational) dataset, so there may be other factors at play. The revised GAISE College report reiterates the importance of multivariate thinking in intro stats.

Moving to three dimensions

Let’s continue with the “Less Volume, More Creativity” approach to bring in a third variable: the number of carats in each diamond.

xyplot(price ~ carat, groups = col, auto.key = TRUE, type = c("p", "r"), data = recoded)
We see that controlling for the number of carats, the D color diamonds tend to sell for more than the J color diamonds. We can confirm this by fitting a regression model that controls for both variables (and then display the resulting predicted values from this parallel slopes model using plotModel()).
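A minimal sketch of that parallel slopes fit (the original post displays it with plotModel(); here plain lm() and coef() are used, and the recoded data frame is rebuilt so the chunk stands alone):

```r
library(ggplot2)  # for the diamonds data
library(dplyr)

recoded <- diamonds %>%
  filter(color == "D" | color == "J") %>%
  mutate(col = as.character(color))

# Parallel slopes model: price as a function of color, adjusting for carat
fm <- lm(price ~ col + carat, data = recoded)
coef(fm)[["colJ"]]  # negative: J diamonds sell for less at a fixed carat size
```

The sign of the colJ coefficient flips relative to the unadjusted comparison, which is the point of the example.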
This is a great example of Simpson’s paradox: accounting for the number of carats has yielded opposite results from a model that didn’t include carats. If we were to move forward with such an analysis, we’d need to be sure to undertake an assessment of our model and verify conditions and assumptions (but for the purpose of this blog entry I’ll defer that).
Moving beyond mosaic
The revised GAISE College report enunciated the importance of technology when teaching statistics. Many courses still use calculators or webbased applets to incorporate technology into their classes. R is an excellent environment for teaching statistics, but many instructors feel uncomfortable using it (particularly if they feel compelled to teach the $ and [[]] syntax, which many find offputting). The mosaic approach helps make the use of R feasible for many audiences by keeping things simple. It’s unfortunately true that many introductory statistics courses don’t move beyond bivariate relationships (so students may feel paralyzed about what to do about other factors). The mosaic approach has the advantage that it can bring multivariate thinking, modeling, and exploratory data tools together with a single interface (and modest degree of difficulty in terms of syntax). I’ve been teaching multiple regression as a descriptive method early in an intro stat course for the past ten years (and it helps to get students excited about material that they haven’t seen before). The mosaic approach also scales well: it’s straightforward to teach students dplyr/tidyverse data wrangling by adding in the pipe operator and some key data idioms. (So perhaps the third option should be labeled “mosaic and tidyverse”.)
See the following for an example of how favstats() can be replaced by dplyr idioms.
recoded %>%
  group_by(col) %>%
  summarize(meanval = mean(price, na.rm = TRUE))

  col meanval
1   D    3170
2   J    5324
That being said, I suspect that many students (and instructors) will still use favstats() for simple tasks (e.g., to check sample sizes, check for missing data, etc). I know that I do. But the important thing is that unlike training wheels, mosaic doesn’t hold them back when they want to learn new things. I’m a big fan of ggplot2, but even Hadley agrees that the existing syntax is not what he wants it to be. While it’s not hard to learn to use + to glue together multiple graphics commands and to get your head around aesthetics, teaching ggplot2 adds several additional learning outcomes to a course that’s already overly pregnant with them.
Side note
I would argue that a lot of what is in mosaic should have been in base R (e.g., formula interface to mean(), data= option for mean()). Other parts are more focused on teaching (e.g., plotModel(), xpnorm(), and resampling with the do() function).
Closing thoughts

In summary, I argue that the mosaic approach is consistent with the tidyverse. It dovetails nicely with David’s “Teach the tidyverse” proposal as an intermediate step that may be more accessible for undergraduate audiences without a strong computing background. I’d encourage people to check it out (and let Randy, Danny, and me know if there are ways to improve the package).
Want to learn more about mosaic? In addition to the R Journal paper referenced above, you can see how we get students using R quickly in the package’s “Less Volume, More Creativity” and “Minimal R” vignettes. We also provide curated examples from commonly used textbooks in the “mosaic resources” vignette and a series of freely downloadable and remixable monographs including The Student’s Guide to R and Start Teaching with R.
Divided Differences Method of Polynomial Interpolation
(This article was first published on R – Aaron Schlegel, and kindly contributed to Rbloggers)
Part 6 of the series Numerical Analysis

The divided differences method is a numerical procedure for interpolating a polynomial given a set of points. Unlike Neville’s method, which is used to approximate the value of an interpolating polynomial at a given point, the divided differences method constructs the interpolating polynomial in Newton form.
Consider a table of values of $x_0, x_1, \ldots, x_n$ and corresponding values of a function $f$ at those points. Also assume that $P_n$ is the $n$th Lagrange polynomial that corresponds with the function $f$ at these points. The polynomial $P_n$ can be expressed using the divided differences of the function $f$ with respect to the $x$ values:

$$P_n(x) = f[x_0] + f[x_0, x_1](x - x_0) + \cdots + f[x_0, x_1, \ldots, x_n](x - x_0)(x - x_1) \cdots (x - x_{n-1})$$

Therefore the constants $f[x_0], f[x_0, x_1], \ldots, f[x_0, x_1, \ldots, x_n]$ must be found to construct the polynomial. To find these constants, the divided differences are recursively generated until $n$ iterations have been completed. We start with the zeroth divided difference of the function $f$ with respect to $x_i$, which is the value of $f$ at that point. Bracket notation is introduced to distinguish the divided differences:

$$f[x_i] = f(x_i)$$

Now begins the recursive generation of the divided differences. The first divided difference is then the function $f$ with respect to the values $x_i$ and $x_{i+1}$:

$$f[x_i, x_{i+1}] = \frac{f[x_{i+1}] - f[x_i]}{x_{i+1} - x_i}$$

The second divided difference follows:

$$f[x_i, x_{i+1}, x_{i+2}] = \frac{f[x_{i+1}, x_{i+2}] - f[x_i, x_{i+1}]}{x_{i+2} - x_i}$$

This iteration continues until the $n$th divided difference:

$$f[x_0, x_1, \ldots, x_n] = \frac{f[x_1, x_2, \ldots, x_n] - f[x_0, x_1, \ldots, x_{n-1}]}{x_n - x_0}$$

Thus the interpolating polynomial resulting from the divided differences method takes the form:

$$P_n(x) = f[x_0] + \sum_{k=1}^{n} f[x_0, x_1, \ldots, x_k](x - x_0)(x - x_1) \cdots (x - x_{k-1})$$
Divided Differences Example

Consider the following set of points and the corresponding values of a function $f$ at those points. This data set is the same as in the previous post on Neville’s method. We will perform the divided differences method to find the interpolating polynomial of these points and approximate the polynomial at $x = 8.4$.

  x   f(x)
8.1   16.9446
8.3   17.56492
8.6   18.50515
8.7   18.82091

First, $f[x_0, x_1]$ is found:

$$f[x_0, x_1] = \frac{17.56492 - 16.9446}{8.3 - 8.1} = 3.1016$$

Then, $f[x_1, x_2]$ is computed:

$$f[x_1, x_2] = \frac{18.50515 - 17.56492}{8.6 - 8.3} = 3.1341$$

With $f[x_0, x_1]$ and $f[x_1, x_2]$, we find $f[x_0, x_1, x_2]$:

$$f[x_0, x_1, x_2] = \frac{3.1341 - 3.1016}{8.6 - 8.1} = 0.065$$

$f[x_2, x_3]$ is then found:

$$f[x_2, x_3] = \frac{18.82091 - 18.50515}{8.7 - 8.6} = 3.1576$$

We can then compute $f[x_1, x_2, x_3]$:

$$f[x_1, x_2, x_3] = \frac{3.1576 - 3.1341}{8.7 - 8.3} = 0.05875$$

Lastly, we calculate $f[x_0, x_1, x_2, x_3]$:

$$f[x_0, x_1, x_2, x_3] = \frac{0.05875 - 0.065}{8.7 - 8.1} = -0.010417$$

The interpolating polynomial, $P_3(x)$, is then constructed:

$$P_3(x) = 16.9446 + 3.1016(x - 8.1) + 0.065(x - 8.1)(x - 8.3) - 0.010417(x - 8.1)(x - 8.3)(x - 8.6)$$
We can see the interpolated polynomial passes through the points with the following:
p3x <- function(x) {
  f <- 16.9446 + 3.1016 * (x - 8.1) + 0.065 * (x - 8.1) * (x - 8.3) -
    0.010417 * (x - 8.1) * (x - 8.3) * (x - 8.6)
  return(f)
}

x <- c(8.1, 8.3, 8.6, 8.7)
fx <- c(16.9446, 17.56492, 18.50515, 18.82091)
dat <- data.frame(cbind(x, fx))

ggplot(dat, aes(x = x, y = fx)) +
  geom_point(size = 5, col = 'blue') +
  stat_function(fun = p3x, size = 1.25, alpha = 0.4)

Using the function above, we can also see that the interpolated polynomial resulting from the divided differences method returns the same approximated value of the function, $f(8.4) \approx 17.87709$, as Neville’s method.
p3x(8.4)
## [1] 17.87709

Divided Differences in R

The following is an implementation of the divided differences method of polynomial interpolation in R. The function utilizes the rSymPy library to build the interpolating polynomial and approximate the value of the function for a given value of $x$.
divided.differences <- function(x, y, x0) {
  require(rSymPy)
  n <- length(x)
  q <- matrix(data = 0, n, n)
  q[,1] <- y
  f <- as.character(round(q[1,1], 5))
  fi <- ''
  for (i in 2:n) {
    for (j in i:n) {
      q[j,i] <- (q[j,i-1] - q[j-1,i-1]) / (x[j] - x[j-i+1])
    }
    fi <- paste(fi, '*(x - ', x[i-1], ')', sep = '', collapse = '')
    f <- paste(f, ' + ', round(q[i,i], 5), fi, sep = '', collapse = '')
  }
  x <- Var('x')
  sympy(paste('e = ', f, collapse = '', sep = ''))
  approx <- sympy(paste('e.subs(x, ', as.character(x0), ')', sep = '', collapse = ''))
  return(list('Approximation from Interpolation' = as.numeric(approx),
              'Interpolated Function' = f,
              'Divided Differences Table' = q))
}

Let’s use the function to interpolate a function given the values of $x$ and the corresponding values of a function $f$, and approximate the function when $x = 8.4$.
divided.differences(x, fx, 8.4)
## Loading required package: rSymPy
## Loading required package: rJython
## Loading required package: rJava
## Loading required package: rjson
## $`Approximation from Interpolation`
## [1] 17.87709
##
## $`Interpolated Function`
## [1] "16.9446 + 3.1016*(x - 8.1) + 0.065*(x - 8.1)*(x - 8.3) + -0.01042*(x - 8.1)*(x - 8.3)*(x - 8.6)"
##
## $`Divided Differences Table`
##          [,1]   [,2]    [,3]        [,4]
## [1,] 16.94460 0.0000 0.00000  0.00000000
## [2,] 17.56492 3.1016 0.00000  0.00000000
## [3,] 18.50515 3.1341 0.06500  0.00000000
## [4,] 18.82091 3.1576 0.05875 -0.01041667

The function returns the approximated value, the interpolated polynomial used to approximate the function, and the divided differences table which contains the intermediate calculations we saw earlier. I wasn’t able to find a function in R or a package that performs the divided differences method, so, unfortunately, I have nothing against which to compare our function. However, the results agree with our manual calculations, which should be satisfactory for this introductory purpose.
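As a dependency-free cross-check (a sketch, not from the original post), the coefficients can be built numerically and the Newton form evaluated by nested multiplication, avoiding rSymPy entirely:

```r
# Build divided-difference coefficients and evaluate the Newton form at x0.
# Illustrative sketch; newton_interp is a hypothetical helper name.
newton_interp <- function(x, y, x0) {
  n <- length(x)
  q <- y
  coefs <- numeric(n)
  coefs[1] <- q[1]
  for (i in 2:n) {
    # update column i of the divided-difference table in place
    q[i:n] <- (q[i:n] - q[(i-1):(n-1)]) / (x[i:n] - x[1:(n-i+1)])
    coefs[i] <- q[i]
  }
  # nested (Horner-like) evaluation of the Newton form
  p <- coefs[n]
  for (i in (n-1):1) p <- coefs[i] + (x0 - x[i]) * p
  p
}

x <- c(8.1, 8.3, 8.6, 8.7)
fx <- c(16.9446, 17.56492, 18.50515, 18.82091)
newton_interp(x, fx, 8.4)  # ~17.87709, matching the rSymPy version
```

The in-place column update reuses a single vector for the divided-difference table, and the nested evaluation avoids constructing the polynomial symbolically.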
ReferencesBurden, R. L., & Faires, J. D. (2011). Numerical analysis (9th ed.). Boston, MA: Brooks/Cole, Cengage Learning.
The post Divided Differences Method of Polynomial Interpolation appeared first on Aaron Schlegel.
To leave a comment for the author, please follow the link and comment on their blog: R – Aaron Schlegel.
RStudio meets MilanoR – Presentations, photos and video!
(This article was first published on MilanoR, and kindly contributed to Rbloggers)
Hello R-users,
On June 29th we had the great pleasure of hosting RStudio in Milano at Impact Hub. It went absolutely well, with great participation; thank you all!
This post is just to give the materials to those of you who could not make it or just want to go through it again and again!
We had two very exceptional guests: Nathan Stephens, director of solutions engineering at RStudio, and Joseph Rickert, member of the R Consortium board of directors, accompanied by Pete Knast, enterprise advocate at RStudio. They came all the way from the U.S. just to meet the MilanoR Community and the Milano Data Science Community!
The event started with a short introduction of the two hosting communities: the MilanoR community and the Milano Data Science community. Mariachiara spoke about the events that we have been organising this year: please take a look at the slides for past, present and future events of the MilanoR community. Gianmario introduced the Data Science Milan community, an independent group of professionals with the goal of promoting and pioneering knowledge and innovation of the data-driven revolution in the Italian peninsula and beyond.
MilanoR – Past, present and future
by Mariachiara Fortuna
by Gianmario Spacagna
Then came our special guests: first, Nathan Stephens went through the latest news from RStudio. Which direction is RStudio taking? What's next? Through an interactive and entertaining demonstration, Nathan showed us the newest RStudio products: the tidyverse, R Notebooks and RStudio Connect. These tools enable better data science with R and will make working with R in teams simpler as well as more efficient. He gave us insights into how RStudio is evolving towards a more efficient way to communicate and share results. Read the slides below to learn more about Nathan's presentation.
Publishing and self-service analytics with R
by Nathan Stephens
Lastly, Joseph Rickert introduced us to the financial support that the R Consortium gives to R communities and R-related initiatives. He gave us a quick overview of the many R communities spread all around the world. It was fascinating and inspiring to hear how we all share this same enthusiasm for R. It certainly was for a group of young women who, the day after the meeting, gave birth to the MilanoR Ladies! Thank you Joe for contributing to enlarging the R community!
Community support: R Consortium & RStudio
by Joseph Rickert
At the end of the meeting we had the chance to network while enjoying a nice refreshment kindly offered by RStudio.
The whole meeting was streamed on both Facebook and YouTube, so the full recording is available on our Facebook channel as well as on YouTube. Thanks @Akiro and @Fabio for dealing with it!
We will never be able to thank enough our sponsors, Quantide and RStudio, that made all of this possible, Nathan and Joseph that gave two inspiring talks, and of course all of you for supporting us with such enthusiasm.
Thank you so much, and enjoy your summer holidays!
The post RStudio meets MilanoR – Presentations, photos and video! appeared first on MilanoR.
Quick Way of Installing all your old R libraries on a New Device
(This article was first published on R – Discovering Python & R, and kindly contributed to Rbloggers)
I recently bought a new laptop and began installing essential software all over again, including R of course! And I wanted all the libraries that I had installed in my previous laptop. Instead of installing libraries one by one all over again, I did the following:
Step 1: Save a list of packages installed in your old computing device (from your old device).
installed <- as.data.frame(installed.packages())
write.csv(installed, 'installed_previously.csv')
This saves information on installed packages in a csv file named installed_previously.csv. Now copy or email this file to your new device and access it from your working directory in R.
Step 2: Create a list of libraries from your old list that were not already installed when you freshly download R (from your new device).
installedPreviously <- read.csv('installed_previously.csv')
baseR <- as.data.frame(installed.packages())
toInstall <- setdiff(installedPreviously$Package, baseR$Package)
We now have a list of libraries that were installed in your previous computer in addition to the R packages already installed when you download R. So you now go ahead and install these libraries.
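To make the set logic concrete, here is a toy illustration of what setdiff() returns when applied to the two vectors of package names (the names below are made up for the example):

```r
old_pkgs  <- c("ggplot2", "dplyr", "stats", "utils")  # recorded on the old machine
base_pkgs <- c("stats", "utils")                      # shipped with a fresh install

# Packages present before but missing from the fresh install
setdiff(old_pkgs, base_pkgs)
#> [1] "ggplot2" "dplyr"
```

Only the names absent from the fresh install survive, which is exactly the list that still needs installing.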
Step 3: Install this list of libraries.
install.packages(toInstall)
That’s it. Save yourself the trouble of installing packages one-by-one all over again.
How Virtual Tags have transformed SCADA data analysis
(This article was first published on The Devil is in the Data, and kindly contributed to Rbloggers)
This article describes how to use virtual tags to analyse SCADA data. Virtual tags provide context to SCADA or Historian data by combining information from various tags with metadata about these tags. Continue reading →
The post How Virtual Tags have transformed SCADA data analysis appeared first on The Devil is in the Data.
UseR Brussels 2017
(This article was first published on Clean Code, and kindly contributed to Rbloggers)
I went to useR!2017 in Brussels. It was my first time at a useR! conference (I have been to the first satRday), AND I LOVED IT!
There were many interesting talks (I am so going to use fast-and-frugal trees in the future, for instance) and I saw a lot of Shiny applications and professional R use. But best of all, I talked to a lot of people, people I had only spoken to online, thanked some people for their help with my packages, and generally had a lot of fun!
One of the fun things I did was asking people about packages that should be created but are not yet there; I put them here: https://github.com/RMHogervorst/shouldbeapackage. Contribute if you want. One of the packages that should exist is one that gives you suggestions for karaoke, preferably power ballads, so of course I used one of the breaks to create a GitHub page and a package. However, a suggestion app needs songs, so Kevin O'Brian and I tricked people into adding to the list of power ballads. What we still need, though, is a Shiny app that you can access from your phone when you are IN the karaoke bar. We put the package under a new GitHub group: Reoke. And yes, submit your songs!
See you around.
UseR Brussels 2017 was originally published at Clean Code on July 27, 2017.
Span Dates and Times without Overhead
(This article was first published on That’s so Random, and kindly contributed to Rbloggers)
I am working on v0.4.0 of the padr package this summer. Two new features that will be added are wrappers around seq.Date and seq.POSIXt. Since it is going to take a while before the new release is on CRAN, I'll go ahead and give an early presentation of these functions. Date and datetime parsing in base R is powerful and comprehensive, but also tedious. It can slow you down in your programming or analysis. Luckily, good wrappers and alternatives exist, at least the ymd{_h}{m}{s} suite from lubridate and Dirk Eddelbuettel's anytime. These functions remove much of the overhead of date and datetime parsing, allowing for quick formatting of vectors in all kinds of formats. They also alleviate the pain of using seq.Date() and seq.POSIXt a little, because the from and to arguments should be parsed dates or datetimes. Take the following example.
seq(as.POSIXct("2017-07-25 00:00:00"), as.POSIXct("2017-07-25 03:00:00"), by = "hour") ## [1] "2017-07-25 00:00:00 CEST" "2017-07-25 01:00:00 CEST" ## [3] "2017-07-25 02:00:00 CEST" "2017-07-25 03:00:00 CEST" library(lubridate) seq(ymd_h("2017-07-25 00"), ymd_h("2017-07-25 03"), by = "hour") ## [1] "2017-07-25 00:00:00 UTC" "2017-07-25 01:00:00 UTC" ## [3] "2017-07-25 02:00:00 UTC" "2017-07-25 03:00:00 UTC"
I think, however, that there is still some overhead in the second specification. By overhead I mean specifying things that feel redundant, things that could be set to some kind of default. Since the whole idea behind padr is automating away redundant and tedious actions in preparing datetime data, providing alternative functions that ask for as little as possible are a natural addition. This resulted in span_date and span_time. They remove overhead by:

allowing for specification of from and to directly as an integer or character in a lubridate-ish format.

setting the unspecified datetime parts to a default of 1 for month and day, and 0 for hour, minute, and second.

assuming the desired interval (the by statement in seq.Date and seq.POSIXt) to be the lowest of the datetime parts specified in either from or to.
If this is still a little abstract, let me give some examples. The example above becomes:
devtools::install_github("EdwinTh/padr") # download the dev version library(padr) span_time("2017-07-25 00", "2017-07-25 03") ## [1] "2017-07-25 00:00:00 UTC" "2017-07-25 01:00:00 UTC" ## [3] "2017-07-25 02:00:00 UTC" "2017-07-25 03:00:00 UTC"
We can simplify this even further: specifying the 00 for the hour in from is not strictly necessary. Since the hour is already specified in to, the interval will remain hour if we omit it.
span_time("2017-07-25", "2017-07-25 03") ## [1] "2017-07-25 00:00:00 UTC" "2017-07-25 01:00:00 UTC" ## [3] "2017-07-25 02:00:00 UTC" "2017-07-25 03:00:00 UTC"
We can even use an integer instead of a character for from. When there are no time parts involved, a character is not necessary. Since we use it in span_time, it will be parsed to POSIXct, not to Date.
span_time(20170725, "2017-07-25 03") ## [1] "2017-07-25 00:00:00 UTC" "2017-07-25 01:00:00 UTC" ## [3] "2017-07-25 02:00:00 UTC" "2017-07-25 03:00:00 UTC"
to does not have to be specified; we can use len_out instead. The interval is then derived from from only. To get Jan 1st, from 2010 to 2014, we can do both
span_date(2010, 2014) ## [1] "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01" "2014-01-01"
and
span_date(2010, len_out = 5) ## [1] "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01" "2014-01-01"
If you want the interval to be different from the default, you can specify it.
span_date(2016, 2017, interval = "month") ## [1] "2016-01-01" "2016-02-01" "2016-03-01" "2016-04-01" "2016-05-01" ## [6] "2016-06-01" "2016-07-01" "2016-08-01" "2016-09-01" "2016-10-01" ## [11] "2016-11-01" "2016-12-01" "2017-01-01"
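For comparison, producing the same monthly sequence with base R alone requires both endpoints as fully parsed Date objects:

```r
# Base-R equivalent of span_date(2016, 2017, interval = "month")
seq(as.Date("2016-01-01"), as.Date("2017-01-01"), by = "month")
```

This yields the same 13 month starts, but you must spell out the full dates yourself, which is exactly the overhead span_date removes.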
span_date(201601, 2017) ## [1] "2016-01-01" "2016-02-01" "2016-03-01" "2016-04-01" "2016-05-01" ## [6] "2016-06-01" "2016-07-01" "2016-08-01" "2016-09-01" "2016-10-01" ## [11] "2016-11-01" "2016-12-01" "2017-01-01"
I hope you find these little wrappers around seq.Date and seq.POSIXt useful and that they will enable you to conquer dates and datetimes a little quicker. You can obtain the functions by downloading the dev version of padr, as I did above. If you can think of improvements to the functions before they hit CRAN, please tell me. Issues filed, pull requests, emails, and tweets are much appreciated.
ggplot2 – Easy way to mix multiple graphs on the same page
(This article was first published on Easy Guides, and kindly contributed to Rbloggers)
To arrange multiple ggplot2 graphs on the same page, the standard R functions – par() and layout() – cannot be used.
The basic solution is to use the gridExtra R package, which comes with the following functions:
 grid.arrange() and arrangeGrob() to arrange multiple ggplots on one page
 marrangeGrob() for arranging multiple ggplots over multiple pages.
However, these functions make no attempt at aligning the plot panels; instead, the plots are simply placed into the grid as they are, so the axes are not aligned.
If axis alignment is required, you can switch to the cowplot package, which includes the function plot_grid() with the argument align. However, the cowplot package doesn't contain any solution for multi-page layouts. Therefore, we provide the function ggarrange() [in ggpubr], a wrapper around the plot_grid() function, to arrange multiple ggplots over multiple pages. It can also create a common unique legend for multiple plots.
This article will show you, step by step, how to combine multiple ggplots on the same page, as well as over multiple pages, using helper functions available in the following R packages: ggpubr, cowplot and gridExtra. We'll also describe how to export the arranged plots to a file.
Related articles:
 Bar Plots and Modern Alternatives
 Facilitating Exploratory Data Visualization: Application to TCGA Genomic Data
 Add Pvalues and Significance Levels to ggplots
Contents:
 Prerequisites
 Create some plots
 Arrange on one page
 Annotate the arranged figure
 Align plot panels
 Change column/row span of a plot
 Use common legend for combined ggplots
 Scatter plot with marginal density plots
 Mix table, text and ggplot2 graphs
 Insert a graphical element inside a ggplot
 Arrange over multiple pages
 Nested layout with ggarrange()
 Export plots
 Acknowledgement
 Infos
You need to install the R package ggpubr (version >= 0.1.3) to easily create ggplot2-based, publication-ready plots.
We recommend installing the latest development version from GitHub as follows:
if(!require(devtools)) install.packages("devtools") devtools::install_github("kassambara/ggpubr")
If installation from GitHub fails, then try to install from CRAN as follows:
install.packages("ggpubr")
Note that the installation of ggpubr will install the gridExtra and cowplot packages, so you don't need to reinstall them.
Load ggpubr:
library(ggpubr)
Demo data sets
Data: ToothGrowth and mtcars data sets.
# ToothGrowth data("ToothGrowth") head(ToothGrowth) len supp dose 1 4.2 VC 0.5 2 11.5 VC 0.5 3 7.3 VC 0.5 4 5.8 VC 0.5 5 6.4 VC 0.5 6 10.0 VC 0.5 # mtcars data("mtcars") mtcars$name <- rownames(mtcars) mtcars$cyl <- as.factor(mtcars$cyl) head(mtcars[, c("name", "wt", "mpg", "cyl")]) name wt mpg cyl Mazda RX4 Mazda RX4 2.620 21.0 6 Mazda RX4 Wag Mazda RX4 Wag 2.875 21.0 6 Datsun 710 Datsun 710 2.320 22.8 4 Hornet 4 Drive Hornet 4 Drive 3.215 21.4 6 Hornet Sportabout Hornet Sportabout 3.440 18.7 8 Valiant Valiant 3.460 18.1 6
Create some plots
Here, we'll use ggplot2-based plotting functions available in ggpubr. You can use any ggplot2 functions to create the plots that you want to arrange later.
We’ll start by creating 4 different plots:
 Box plots and dot plots using the ToothGrowth data set
 Bar plots and scatter plots using the mtcars data set
You’ll learn how to combine these plots in the next sections using specific functions.
 Create a box plot and a dot plot:
 Create an ordered bar plot and a scatter plot:
Create ordered bar plots. Change the fill color by the grouping variable “cyl”. Sorting will be done globally, but not by groups.
# Bar plot (bp) bp <- ggbarplot(mtcars, x = "name", y = "mpg", fill = "cyl", # change fill color by cyl color = "white", # Set bar border colors to white palette = "jco", # jco journal color palette. see ?ggpar sort.val = "asc", # Sort the value in ascending order sort.by.groups = TRUE, # Sort inside each group x.text.angle = 90 # Rotate x axis texts vertically ) bp + font("x.text", size = 8) # Scatter plots (sp) sp <- ggscatter(mtcars, x = "wt", y = "mpg", add = "reg.line", # Add regression line conf.int = TRUE, # Add confidence interval color = "cyl", palette = "jco", # Color by groups "cyl" shape = "cyl" # Change point shape by groups "cyl" )+ stat_cor(aes(color = cyl), label.x = 3) # Add correlation coefficient sp
Arrange on one page
To arrange multiple ggplots on one single page, we'll use the function ggarrange() [in ggpubr], which is a wrapper around the function plot_grid() [in the cowplot package]. Compared to the standard function plot_grid(), ggarrange() can arrange multiple ggplots over multiple pages.
ggarrange(bxp, dp, bp + rremove("x.text"), labels = c("A", "B", "C"), ncol = 2, nrow = 2)
Alternatively, you can also use the function plot_grid() [in cowplot]:
library("cowplot") plot_grid(bxp, dp, bp + rremove("x.text"), labels = c("A", "B", "C"), ncol = 2, nrow = 2)
or the function grid.arrange() [in gridExtra]:
library("gridExtra") grid.arrange(bxp, dp, bp + rremove("x.text"), ncol = 2, nrow = 2)
Annotate the arranged figure
R function: annotate_figure() [in ggpubr].
figure <- ggarrange(sp, bp + font("x.text", size = 10), ncol = 1, nrow = 2) annotate_figure(figure, top = text_grob("Visualizing mpg", color = "red", face = "bold", size = 14), bottom = text_grob("Data source: \n mtcars data set", color = "blue", hjust = 1, x = 1, face = "italic", size = 10), left = text_grob("Figure arranged using ggpubr", color = "green", rot = 90), right = "I'm done, thanks :)!", fig.lab = "Figure 1", fig.lab.face = "bold" )
Note that the function annotate_figure() supports any ggplots.
Align plot panels
A real use case is, for example, when plotting survival curves with the risk table placed under the main plot.
To illustrate this case, we’ll use the survminer package. First, install it using install.packages(“survminer”), then type this:
# Fit survival curves library(survival) fit <- survfit( Surv(time, status) ~ adhere, data = colon ) # Plot survival curves library(survminer) ggsurv <- ggsurvplot(fit, data = colon, palette = "jco", # jco palette pval = TRUE, pval.coord = c(500, 0.4), # Add p-value risk.table = TRUE # Add risk table ) names(ggsurv) [1] "plot" "table" "data.survplot" "data.survtable"
ggsurv is a list including the components:
 plot: survival curves
 table: the risk table plot
You can arrange the survival plot and the risk table as follow:
ggarrange(ggsurv$plot, ggsurv$table, heights = c(2, 0.7), ncol = 1, nrow = 2)
It can be seen that the axes of the survival plot and the risk table are not aligned vertically. To align them, specify the argument align as follows.
ggarrange(ggsurv$plot, ggsurv$table, heights = c(2, 0.7), ncol = 1, nrow = 2, align = "v")
Change column/row span of a plot
Use ggpubr R package
We'll use nested ggarrange() functions to change the column/row span of plots.
For example, using the R code below:
 the scatter plot (sp) will live in the first row and spans over two columns
 the box plot (bxp) and the dot plot (dp) will be first arranged and will live in the second row with two different columns
Use cowplot R package
The combination of the functions ggdraw() + draw_plot() + draw_plot_label() [in cowplot] can be used to place graphs at particular locations with a particular size.
ggdraw(). Initialize an empty drawing canvas:
ggdraw()Note that, by default, coordinates run from 0 to 1, and the point (0, 0) is in the lower left corner of the canvas (see the figure below).
draw_plot(). Places a plot somewhere onto the drawing canvas:
draw_plot(plot, x = 0, y = 0, width = 1, height = 1) plot: the plot to place (ggplot2 or a gtable)
 x, y: The x/y location of the lower left corner of the plot.
 width, height: the width and the height of the plot
draw_plot_label(). Adds a plot label to the upper left corner of a graph. It can handle vectors of labels with associated coordinates.
draw_plot_label(label, x = 0, y = 1, size = 16, ...) label: a vector of labels to be drawn
 x, y: Vector containing the x and y position of the labels, respectively.
 size: Font size of the label to be drawn
For example, you can combine multiple plots, with particular locations and different sizes, as follow:
library("cowplot") ggdraw() + draw_plot(bxp, x = 0, y = .5, width = .5, height = .5) + draw_plot(dp, x = .5, y = .5, width = .5, height = .5) + draw_plot(bp, x = 0, y = 0, width = 1, height = 0.5) + draw_plot_label(label = c("A", "B", "C"), size = 15, x = c(0, 0.5, 0), y = c(1, 1, 0.5))
Use gridExtra R package
The function arrangeGrob() [in gridExtra] helps to change the row/column span of a plot.
For example, using the R code below:
 the scatter plot (sp) will live in the first row and spans over two columns
 the box plot (bxp) and the dot plot (dp) will live in the second row with two plots in two different columns
It’s also possible to use the argument layout_matrix in the grid.arrange() function, to create a complex layout.
In the R code below layout_matrix is a 2x2 matrix (2 columns and 2 rows). The first row is all 1s, that’s where the first plot lives, spanning the two columns; the second row contains plots 2 and 3 each occupying one column.
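Since layout_matrix is an ordinary integer matrix, it can be built and inspected on its own before being handed to grid.arrange() (lay is just an illustrative name):

```r
# Plot 1 spans the whole first row; plots 2 and 3 share the second row
lay <- rbind(c(1, 1),
             c(2, 3))
lay
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    3
```

Each cell names the plot (by position in the argument list) that occupies that region of the grid, so repeating a number makes that plot span several cells.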
grid.arrange(bp, # bar plot spanning two columns bxp, sp, # box plot and scatter plot ncol = 2, nrow = 2, layout_matrix = rbind(c(1,1), c(2,3)))
Note that it's also possible to annotate the output of the grid.arrange() function using the helper function draw_plot_label() [in cowplot].
To easily annotate the grid.arrange() / arrangeGrob() output (a gtable), you should first transform it to a ggplot using the function as_ggplot() [in ggpubr ]. Next you can annotate it using the function draw_plot_label() [in cowplot].
library("gridExtra") library("cowplot") # Arrange plots using arrangeGrob # returns a gtable (gt) gt <- arrangeGrob(bp, # bar plot spanning two columns bxp, sp, # box plot and scatter plot ncol = 2, nrow = 2, layout_matrix = rbind(c(1,1), c(2,3))) # Add labels to the arranged plots p <- as_ggplot(gt) + # transform to a ggplot draw_plot_label(label = c("A", "B", "C"), size = 15, x = c(0, 0, 0.5), y = c(1, 0.5, 0.5)) # Add labels p
In the above R code, we used arrangeGrob() instead of grid.arrange().
Note that the main difference between these two functions is that grid.arrange() automatically draws the arranged plots. As we want to annotate the arranged plots before drawing them, the function arrangeGrob() is preferred in this case.
Use grid R package
The grid R package can be used to create a complex layout with the help of the function grid.layout(). It also provides the helper function viewport() to define a region, or viewport, on the layout. The function print() is used to place plots in a specified region.
The different steps can be summarized as follows:
 Create plots : p1, p2, p3, ….
 Move to a new page on a grid device using the function grid.newpage()
 Create a layout 2X2  number of columns = 2; number of rows = 2
 Define a grid viewport : a rectangular region on a graphics device
 Print a plot into the viewport
To place a common unique legend in the margin of the arranged plots, the function ggarrange() [in ggpubr] can be used with the following arguments:
 common.legend = TRUE: place a common legend in a margin
 legend: specify the legend position. Allowed values include one of c(“top”, “bottom”, “left”, “right”)
In this section, we’ll show how to plot a table and text alongside a chart. The iris data set will be used.
We start by creating the following plots:
 a density plot of the variable “Sepal.Length”. R function: ggdensity() [in ggpubr]
 a plot of the summary table containing the descriptive statistics (mean, sd, … ) of Sepal.Length.
 R function for computing descriptive statistics: desc_statby() [in ggpubr].
 R function to draw a textual table: ggtexttable() [in ggpubr].
 a plot of a text paragraph. R function: ggparagraph() [in ggpubr].
We finish by arranging/combining the three plots using the function ggarrange() [in ggpubr]
# Density plot of "Sepal.Length" #:::::::::::::::::::::::::::::::::::::: density.p <- ggdensity(iris, x = "Sepal.Length", fill = "Species", palette = "jco") # Draw the summary table of Sepal.Length #:::::::::::::::::::::::::::::::::::::: # Compute descriptive statistics by groups stable <- desc_statby(iris, measure.var = "Sepal.Length", grps = "Species") stable <- stable[, c("Species", "length", "mean", "sd")] # Summary table plot, medium orange theme stable.p <- ggtexttable(stable, rows = NULL, theme = ttheme("mOrange")) # Draw text #:::::::::::::::::::::::::::::::::::::: text <- paste("iris data set gives the measurements in cm", "of the variables sepal length and width", "and petal length and width, respectively,", "for 50 flowers from each of 3 species of iris.", "The species are Iris setosa, versicolor, and virginica.", sep = " ") text.p <- ggparagraph(text = text, face = "italic", size = 11, color = "black") # Arrange the plots on the same page ggarrange(density.p, stable.p, text.p, ncol = 1, nrow = 3, heights = c(1, 0.5, 0.3))
Insert a graphical element inside a ggplot
The function annotation_custom() [in ggplot2] can be used for adding tables, plots or other grid-based elements within the plotting area of a ggplot. The simplified format is:
annotation_custom(grob, xmin, xmax, ymin, ymax) grob: the external graphical element to display
 xmin, xmax : x location in data coordinates (horizontal location)
 ymin, ymax : y location in data coordinates (vertical location)
We'll use the plots density.p and stable.p, created in the previous section (Mix table, text and ggplot2 graphs).
density.p + annotation_custom(ggplotGrob(stable.p), xmin = 5.5, ymin = 0.7, xmax = 8) Place a box plot within a ggplot Create a scatter plot of y = “Sepal.Width” by x = “Sepal.Length” using the iris data set. R function ggscatter() [ggpubr]
 Create separately the box plot of x and y variables with transparent background. R function: ggboxplot() [ggpubr].
 Transform the box plots into graphical objects called "grobs" in Grid terminology. R function ggplotGrob() [ggplot2].
 Place the box plot grobs inside the scatter plot. R function: annotation_custom() [ggplot2].
As the inset box plot overlaps with some points, a transparent background is used for the box plots.
# Scatter plot colored by groups ("Species") #:::::::::::::::::::::::::::::::::::::::::::::::::::::::::: sp <- ggscatter(iris, x = "Sepal.Length", y = "Sepal.Width", color = "Species", palette = "jco", size = 3, alpha = 0.6) # Create box plots of x/y variables #:::::::::::::::::::::::::::::::::::::::::::::::::::::::::: # Box plot of the x variable xbp <- ggboxplot(iris$Sepal.Length, width = 0.3, fill = "lightgray") + rotate() + theme_transparent() # Box plot of the y variable ybp <- ggboxplot(iris$Sepal.Width, width = 0.3, fill = "lightgray") + theme_transparent() # Create the external graphical objects # called a "grob" in Grid terminology xbp_grob <- ggplotGrob(xbp) ybp_grob <- ggplotGrob(ybp) # Place box plots inside the scatter plot #:::::::::::::::::::::::::::::::::::::::::::::::::::::::::: xmin <- min(iris$Sepal.Length); xmax <- max(iris$Sepal.Length) ymin <- min(iris$Sepal.Width); ymax <- max(iris$Sepal.Width) yoffset <- (1/15)*ymax; xoffset <- (1/15)*xmax # Insert xbp_grob inside the scatter plot sp + annotation_custom(grob = xbp_grob, xmin = xmin, xmax = xmax, ymin = ymin-yoffset, ymax = ymin+yoffset) + # Insert ybp_grob inside the scatter plot annotation_custom(grob = ybp_grob, xmin = xmin-xoffset, xmax = xmin+xoffset, ymin = ymin, ymax = ymax)
Add background image to ggplot2 graphs
Import the background image. Use either the function readJPEG() [in the jpeg package] or the function readPNG() [in the png package], depending on the format of the background image.
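The inset positions above are plain arithmetic on the data range. Computing the 1/15 offsets on their own (base R only, using the built-in iris data) shows how much room is reserved for each marginal box plot:

```r
xmax <- max(iris$Sepal.Length)
ymax <- max(iris$Sepal.Width)
# One fifteenth of each maximum, as used for the inset box plots
c(xoffset = (1/15) * xmax, yoffset = (1/15) * ymax)
```

Each inset is then placed in a band of twice the offset, centered on the corresponding axis minimum.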
To test the example below, make sure that the png package is installed. You can install it using install.packages(“png”) R command.
# Import the image img.file <- system.file(file.path("images", "background-image.png"), package = "ggpubr") img <- png::readPNG(img.file)
Combine a ggplot with the background image. R function: background_image() [in ggpubr].
library(ggplot2) library(ggpubr) ggplot(iris, aes(Species, Sepal.Length))+ background_image(img)+ geom_boxplot(aes(fill = Species), color = "white")+ fill_palette("jco")Change box plot fill color transparency by specifying the argument alpha. Value should be in [0, 1], where 0 is full transparency and 1 is no transparency.
library(ggplot2) library(ggpubr) ggplot(iris, aes(Species, Sepal.Length))+ background_image(img)+ geom_boxplot(aes(fill = Species), color = "white", alpha = 0.5)+ fill_palette("jco")Another example, overlaying the France map and a ggplot2:
mypngfile <- download.file("https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/France_Flag_Map.svg/612px-France_Flag_Map.svg.png",
                           destfile = "france.png", mode = 'wb')
img <- png::readPNG('france.png')
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  background_image(img) +
  geom_point(aes(color = Species), alpha = 0.6, size = 5) +
  color_palette("jco") +
  theme(legend.position = "top")

Arrange over multiple pages
If you have a long list of ggplots, say n = 20 plots, you may want to arrange the plots and place them on multiple pages. With 4 plots per page, you need 5 pages to hold the 20 plots.
The function ggarrange() [in ggpubr] provides a convenient solution to arrange multiple ggplots over multiple pages. After the arguments nrow and ncol are specified, ggarrange() automatically computes the number of pages required to hold the list of plots. It returns a list of arranged ggplots.
For example, the following R code,
multi.page <- ggarrange(bxp, dp, bp, sp,
                        nrow = 1, ncol = 2)

returns a list of two pages with two plots per page. You can visualize each page as follows:
multi.page[[1]] # Visualize page 1
multi.page[[2]] # Visualize page 2

You can also export the arranged plots to a pdf file using the function ggexport() [in ggpubr]:
ggexport(multi.page, filename = "multi.page.ggplot2.pdf")

PDF file: Multi.page.ggplot2
Note that it’s also possible to use the function marrangeGrob() [in gridExtra] to create a multi-page output.
library(gridExtra)
res <- marrangeGrob(list(bxp, dp, bp, sp), nrow = 1, ncol = 2)
# Export to a pdf file
ggexport(res, filename = "multi.page.ggplot2.pdf")
# Visualize interactively
res

Nested layout with ggarrange()
We'll arrange the plots created in the previous sections.
p1 <- ggarrange(sp, bp + font("x.text", size = 9),
                ncol = 1, nrow = 2)
p2 <- ggarrange(density.p, stable.p, text.p,
                ncol = 1, nrow = 3,
                heights = c(1, 0.5, 0.3))
ggarrange(p1, p2, ncol = 2, nrow = 1)

Export plots
R function: ggexport() [in ggpubr].
First, create a list of 4 ggplots corresponding to the variables Sepal.Length, Sepal.Width, Petal.Length and Petal.Width in the iris data set.
plots <- ggboxplot(iris, x = "Species",
                   y = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                   color = "Species", palette = "jco")
plots[[1]] # Print the first plot
plots[[2]] # Print the second plot, and so on...

Next, you can export individual plots to a file (pdf, eps or png), one plot per page. It's also possible to arrange the plots (2 plots per page) when exporting them.
Export individual plots to a pdf file (one plot per page):
ggexport(plotlist = plots, filename = "test.pdf")

Arrange and export. Specify nrow and ncol to display multiple plots on the same page:
ggexport(plotlist = plots, filename = "test.pdf",
         nrow = 2, ncol = 1)

Acknowledgment
We sincerely thank all developers for their efforts behind the packages that ggpubr depends on, namely:
 Baptiste Auguie (2016). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.2.1. https://CRAN.R-project.org/package=gridExtra
 Claus O. Wilke (2016). cowplot: Streamlined Plot Theme and Plot Annotations for ‘ggplot2’. R package version 0.7.0. https://CRAN.R-project.org/package=cowplot
 H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009.
This analysis has been performed using R software (ver. 3.3.2) and ggpubr (ver. 0.1.4.999).
To leave a comment for the author, please follow the link and comment on their blog: Easy Guides.
thinking with data with "Modern Data Science with R"
One of the biggest challenges educators face is how to teach statistical thinking integrated with data and computing skills to allow our students to fluidly think with data. Contemporary data science requires a tight integration of knowledge from statistics, computer science, mathematics, and a domain of application. For example, how can one model high earnings as a function of other features that might be available for a customer? How do the results of a decision tree compare to a logistic regression model? How does one assess whether the underlying assumptions of a chosen model are appropriate? How are the results interpreted and communicated?
While there are a lot of other useful textbooks and references out there (e.g., R for Data Science, Practical Data Science with R, Intro to Data Science with Python), we saw a need for a book that incorporates statistical and computational thinking to solve real-world problems with data. The result was Modern Data Science with R, a comprehensive data science textbook for undergraduates that features meaty, real-world case studies integrated with modern data science methods. (Figure 8.2 above was taken from a case study in the supervised learning chapter.)
Part I (introduction to data science) motivates the book and provides an introduction to data visualization, data wrangling, and ethics. Part II (statistics and modeling) begins with fundamental concepts in statistics, supervised learning, unsupervised learning, and simulation. Part III (topics in data science) reviews dynamic visualization, SQL, spatial data, text as data, network statistics, and moving towards big data. A series of appendices cover the mdsr package, an introduction to R, algorithmic thinking, reproducible analysis, multiple regression, and database creation.
We believe that several features of the book are distinctive:
 minimal prerequisites: while some background in statistics and computing is ideal, appendices provide an introduction to R, how to write a function, and key statistical topics such as multiple regression
 ethical considerations are raised early, to motivate later examples
 recent developments in the R ecosystem (e.g., RStudio and the tidyverse) are featured
Rather than focus exclusively on case studies or programming syntax, this book illustrates how statistical programming in R/RStudio can be leveraged to extract meaningful information from a variety of data in the service of addressing compelling statistical questions.
This book is intended to help readers with some background in statistics and modest prior experience with coding develop and practice the appropriate skills to tackle complex data science projects. We’ve taught a variety of courses using it, ranging from an introduction to data science and a sophomore-level data science course to components of a senior capstone class.
We’ve made three chapters freely available for download: data wrangling I, data ethics, and an introduction to multiple regression. An instructor’s solution manual is available, and we’re working to create a series of lab activities (e.g., text as data). (The code to generate the above figure can be found in the supervised learning materials at http://mdsr-book.github.io/instructor.html.)
Modern Data Science with R
Oh, the places you’ll go – mapping packages
(This article was first published on Blog, and kindly contributed to Rbloggers)
by Aimée Gott, Senior Consultant
If you follow us on Twitter you might have noticed that Mango has been doing a bit of travelling this summer. He’s been to San Francisco for the EARL conference, Brussels for UseR, as well as a few other places to teach some training courses. He even managed to sneak in a little holiday in LA where he was disappointed to find there are no cats honoured on the Hollywood walk of fame.
It was when we landed in LA that we got talking about the largest airports in the world (me and our marketing manager Karis – don’t worry, I am not talking to stuffed cats just yet!). After a little guessing Google came to the rescue, and I was quite surprised by the airports that topped the list. That conversation stuck with me through the many airports I have since passed through and it came back to me again after seeing James Cheshire give another inspiring talk at LondonR recently.
James gave a great talk full of inspiring visualisations, many of them map-based, which made me think it was about time to dust off my notes on mapping tools in R and have a play. Along the way I got slightly distracted by the data itself, but it was a great chance to refresh myself on the pros and cons of a couple of my favourite mapping packages: ggmap and leaflet.
The data
I didn’t want to spend long collecting and cleaning data, so I used data available on Wikipedia showing the top 50 airports by passenger numbers from 2000 to 2016. Frustratingly, for 2000 to 2004 data were available for only 30 airports. For this reason, I used only the top 30 for all years.
ggmap package
I started out looking at ggmap. In general, my mapping requirements involve not worldwide data but local data, at the country or region level, so this is a great tool. Because it’s built on top of ggplot2, I didn’t have to learn another tool but simply needed to extend the capability I already have. Unfortunately, it’s not the best tool for worldwide data, as you can’t zoom out to show the whole world map. Because of this, I decided to focus on Europe.
With very little effort I was able to use ggmap to obtain the latitude and longitude of the airports in the data, generate a Google map of Europe and overlay the airports, sized based on the number of passengers that passed through them in the given year. The geocode function was in fact what made all of this task much simpler, as it can find the locations of the airports automatically rather than having to search for additional data.
I didn’t like the Google map, though, and decided to play with some of the other options. There are quite a few choices for the base map in ggmap: some from Google, such as satellite, and some completely different ones like ‘watercolor’. The watercolor style gives a completely different look that I personally like, because it takes away a lot of the additional information that we don’t really need in this particular plot, i.e. borders, although I would have liked to see some of the major cities.
leaflet package
Moving on to leaflet, I was reminded that I need to use it more often. It’s (almost too) rich in features, and I had to stop myself from making my dots appear as planes.
In terms of the code, it uses magrittr to add layers to visualisations and it wasn’t a lot of work to get the data appearing on a world map with popups to identify the airport, year and passenger numbers.
The biggest challenge was setting the relative sizes of the points. Unlike ggmap —which uses ggplot2 and handles the sizing for us— we have to be more specific with leaflet. It took a few runs of the plot to get to a size I was happy with. But once I was at this point I could easily repeat the layer to view other years on the same graphic. This is more clunky as you have to manually filter the data and add the layers but the tradeoff is that you can add switches to turn layers on and off.
This was where I got distracted by the data. By being able to see the global data for multiple years in one graphic it was clear that over the last 16 years there has been a shift in the location of the airports that carry the most passengers from North America to Asia.
So, one last graphic took me back to ggplot2. The data has been scaled to account for the fact that air passenger numbers have continued to increase over the last 20 years. Interestingly, there has been a very clear increase in the number of passengers travelling through Asian airports; in fact, in 2016 half of the top 30 airports were located in Asia.
Is there a better package?
To return to the two packages, both have their strengths and weaknesses, and which one I would choose for a particular visualisation really depends on my needs. For interactivity, I would without a doubt go straight to leaflet; for something more suited to local data or intended for static reporting, ggmap would be my preference.
Cleaner Generic Functions with RCPP_RETURN Macros
 C++ templates and function overloading are incompatible with R’s C API, so polymorphism must be achieved via runtime dispatch, handled explicitly by the programmer.
 The traditional technique for operating on SEXP objects in a generic manner entails a great deal of boilerplate code, which can be unsightly, unmaintainable, and error-prone.
 The desire to provide polymorphic functions which operate on vectors and matrices is common enough that Rcpp provides the utility macros RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX to simplify the process.
 Subsequently, these macros were extended to handle an (essentially) arbitrary number of arguments, provided that a C++11 compiler is used.
To motivate a discussion of polymorphic functions, imagine that we desire a
function (ends) which, given an input vector x and an integer n, returns
a vector containing the first and last n elements of x concatenated.
Furthermore, we require ends to be a single interface which is capable of
handling multiple types of input vectors (integers, floating point values,
strings, etc.), rather than having a separate function for each case. How can
this be achieved?
A naïve implementation in R might look something like this:
ends <- function(x, n = 6L) {
    n <- min(n, length(x) %/% 2)
    c(head(x, n), tail(x, n))
}

ends(1:9)
[1] 1 2 3 4 6 7 8 9

ends(letters, 3)
[1] "a" "b" "c" "x" "y" "z"

ends(rnorm(20), 2)
[1] -0.560476 -0.230177  0.701356 -0.472791

The simple function above demonstrates a key feature of many dynamically-typed
programming languages, one which has undoubtedly been a significant factor in their
rise to popularity: the ability to write generic code with little to no
additional effort on the part of the developer. Without getting into a discussion
of the pros and cons of static vs. dynamic typing, it is evident that being able
to dispatch a single function generically on multiple object types, as opposed to,
e.g., having to manage separate implementations of ends for each vector type,
helps us to write more concise, expressive code. This being an article about Rcpp,
however, the story does not end here: we consider how this problem might
be approached in C++, which has a much stricter type system than R.
For simplicity, we begin by considering solutions in the context of a “pure”
(i.e., not called from R) C++ program. Eschewing more complicated tactics
involving runtime dispatch (virtual functions, etc.), the C++ language
provides us with two straightforward methods of achieving this at compile time:
 Function Overloading (ad hoc polymorphism)
 Templates (parametric polymorphism)
The first case can be demonstrated as follows:
#include <vector>
#include <string>
#include <iostream>
#include <algorithm>

typedef std::vector<int> ivec;

ivec ends(const ivec& x, std::size_t n = 6)
{
    n = std::min(n, x.size() / 2);
    ivec res(2 * n);
    std::copy(x.begin(), x.begin() + n, res.begin());
    std::copy(x.end() - n, x.end(), res.begin() + n);
    return res;
}

typedef std::vector<double> dvec;

dvec ends(const dvec& x, std::size_t n = 6)
{
    n = std::min(n, x.size() / 2);
    dvec res(2 * n);
    std::copy(x.begin(), x.begin() + n, res.begin());
    std::copy(x.end() - n, x.end(), res.begin() + n);
    return res;
}

typedef std::vector<std::string> svec;
// and so on...

int main()
{
    ivec x, xres;
    dvec y, yres;

    for (int i = 0; i < 20; i++) {
        x.push_back(i);
        y.push_back(i + 0.5);
    }

    xres = ends(x, 4);
    yres = ends(y);

    for (std::size_t i = 0; i < xres.size(); i++) {
        std::cout << xres[i] << "\n";
    }
    for (std::size_t i = 0; i < yres.size(); i++) {
        std::cout << yres[i] << "\n";
    }
}

Although the above program meets our criteria, the code duplication is profound.
Being seasoned C++ programmers, we recognize this
as a textbook use case for templates and refactor accordingly:
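A sketch of the templated refactor, assuming the same std::vector setup as above:

```cpp
#include <vector>
#include <string>
#include <algorithm>
#include <cstddef>

// One template replaces the per-typedef overloads: the element type T
// is deduced at each call site, and the body is written exactly once.
template <typename T>
std::vector<T> ends(const std::vector<T>& x, std::size_t n = 6)
{
    n = std::min(n, x.size() / 2);
    std::vector<T> res(2 * n);
    std::copy(x.begin(), x.begin() + n, res.begin());
    std::copy(x.end() - n, x.end(), res.begin() + n);
    return res;
}
```

Calls such as ends(x, 4) on a std::vector<int> or ends(s, 3) on a std::vector<std::string> now instantiate the same definition.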
This approach is much more maintainable as we have a single implementation
of ends rather than one implementation per typedef. With this in hand, we
now look to make our C++ version of ends callable from R via Rcpp.
Many people, myself included, have at one point or another attempted some
variation of the obvious shortcut: marking the template function itself with the
// [[Rcpp::export]] attribute.
Sadly this does not work: magical as Rcpp attributes may be, there are limits
to what they can do, and at least for the time being, translating C++ template
functions into something compatible with R’s C API is out of the question. Similarly,
the first C++ approach from earlier is also not viable, as the C programming
language does not support function overloading. In fact, C does not
support any flavor of type-safe static polymorphism, meaning that our generic
function must be implemented through runtime polymorphism, as touched on in
Kevin Ushey’s Gallery article Dynamic Wrapping and Recursion with Rcpp.
Armed with the almighty TYPEOF macro and a SEXPTYPE cheatsheet, we
modify the template code like so:
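The exported Rcpp version needs an R session to run, but its shape can be sketched in self-contained C++. Everything below is an illustrative stand-in: the MY_* constants play the role of the SEXPTYPE values, TaggedVec plays the role of the opaque SEXP, the storage trait plays the role of Rcpp::Vector<RTYPE>, and the switch mirrors the explicit runtime dispatch in the exported function.

```cpp
#include <vector>
#include <algorithm>
#include <stdexcept>
#include <cstddef>

// Stand-ins for SEXPTYPE constants and an opaque SEXP-like value.
enum { MY_INTSXP, MY_REALSXP };

struct TaggedVec {
    int type;                 // what TYPEOF() would report
    std::vector<int> iv;      // payload used when type == MY_INTSXP
    std::vector<double> dv;   // payload used when type == MY_REALSXP
};

// Maps the integer parameter to a concrete vector type, the way
// Rcpp::Vector<RTYPE> does.
template <int RTYPE> struct storage;
template <> struct storage<MY_INTSXP>  { typedef std::vector<int> type; };
template <> struct storage<MY_REALSXP> { typedef std::vector<double> type; };

namespace impl {

// The generic implementation, now templated on an integer parameter.
template <int RTYPE>
typename storage<RTYPE>::type
ends(const typename storage<RTYPE>::type& x, std::size_t n)
{
    n = std::min(n, x.size() / 2);
    typename storage<RTYPE>::type res(2 * n);
    std::copy(x.begin(), x.begin() + n, res.begin());
    std::copy(x.end() - n, x.end(), res.begin() + n);
    return res;
}

} // impl

// Single entry point: switch on the runtime type tag and handle each
// concrete type in its own case label -- exactly the boilerplate the
// RCPP_RETURN macros exist to eliminate.
TaggedVec ends(const TaggedVec& x, std::size_t n = 6)
{
    TaggedVec out;
    out.type = x.type;
    switch (x.type) {
        case MY_INTSXP:
            out.iv = impl::ends<MY_INTSXP>(x.iv, n);
            break;
        case MY_REALSXP:
            out.dv = impl::ends<MY_REALSXP>(x.dv, n);
            break;
        default:
            throw std::invalid_argument("incompatible vector type");
    }
    return out;
}
```

In the real Rcpp code the case labels cast the SEXP to IntegerVector, NumericVector, and so on, but the structure is the same: one case label per supported type.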
Some key remarks:
 Following the ubiquitous Rcpp idiom, we have converted our ends template to use an integer parameter instead of a type parameter. This is a crucial point, and later on, we will exploit it to our benefit.
 The template implementation is wrapped in a namespace in order to avoid a naming conflict; this is a personal preference but not strictly necessary. Alternatively, we could get rid of the namespace and rename either the template function or the exported function (or both).
 We use the opaque type SEXP for our input / output vector since we need a single input / output type. In this particular situation, replacing SEXP with the Rcpp type RObject would also be suitable, as it is a generic class capable of representing any SEXP type.
 Since we have used an opaque type for our input vector, we must cast it to the appropriate Rcpp::Vector type within each case label. (For further reference, the list of vector aliases can be found here.)
 Finally, we could dress each return value in Rcpp::wrap to convert the Rcpp::Vector to a SEXP, but it isn’t necessary because Rcpp attributes will do this automatically (if possible).
At this point we have a polymorphic function, written in C++, and callable from
R. But that switch statement sure is an eyesore, and it will need to be
implemented every time we wish to export a generic function to R. Aesthetics
aside, a more pressing concern is that boilerplate such as this increases the
likelihood of introducing bugs into our codebase – and since we are leveraging
runtime dispatch, these bugs will not be caught by the compiler. For example,
there is nothing to prevent a mismatched case label, say a case INTSXP whose
body casts the object to a NumericVector, from compiling.
In our particular case, such mistakes likely would not be too disastrous, but
it should not be difficult to see how situations like this can put you (or a
user of your library!) on the fast track to segfault.
The C preprocessor is undeniably one of the more controversial aspects of the
C++ programming language, as its utility as a metaprogramming tool is rivaled
only by its potential for abuse. A proper discussion of the various pitfalls
associated with Cstyle macros is well beyond the scope of this article, so
the reader is encouraged to explore this topic on their own. On the bright side,
the particular macros that we will be discussing are sufficiently complex
and limited in scope that misuse is much more likely to result in a compiler
error than a silent bug, so practically speaking, one can expect a fair bit of
return for relatively little risk.
At a high level, we summarize the RCPP_RETURN macros as follows:
 There are two separate macros for dealing with vectors and matrices, RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX, respectively.
 In either case, code is generated for the following SEXPTYPEs:
 INTSXP (integers)
 REALSXP (numerics)
 RAWSXP (raw bits)
 LGLSXP (logicals)
 CPLXSXP (complex numbers)
 STRSXP (characters / strings)
 VECSXP (lists)
 EXPRSXP (expressions)
 In C++98 mode, each macro accepts two arguments:
 A template function
 A SEXP object
 In C++11 mode (or higher), each macro additionally accepts zero or more
arguments which are forwarded to the template function.
Finally, the template function must meet the following criteria:
 It is templated on a single, integer parameter.
 In the C++98 case, it accepts a single SEXP (or something convertible to SEXP) argument.
 In the C++11 case, it may accept more than one argument, but the first argument is subject to the previous constraint.
Examining our templated impl::ends function from the previous section, we see
that it meets the first requirement, but fails the second, due to its second
parameter n. Before exploring how ends might be adapted to meet the (C++98)
template requirements, it will be helpful to demonstrate correct usage with a few
simple examples.
We consider two situations where our input type is generic, but our output
type is fixed:
 Determining the length (number of elements) of a vector, in which an int is always returned.
 Determining the dimensions (number of rows and number of columns) of a matrix, in which a length-two IntegerVector is always returned.
First, our len function:
#include <Rcpp.h>
using namespace Rcpp;

namespace impl {

template <int RTYPE>
int len(const Vector<RTYPE>& x)
{
    return static_cast<int>(x.size());
}

} // impl

// [[Rcpp::export]]
int len(RObject x)
{
    RCPP_RETURN_VECTOR(impl::len, x);
}

(Note that we omit the return keyword, as it is part of the macro definition.)
Testing this out in an R session shows the single exported len handling each of the supported vector types.
Similarly, creating a generic dims function that determines the dimensions of an
input matrix is trivial: the template body simply returns
IntegerVector::create(x.nrow(), x.ncol()), and the exported wrapper dispatches
with RCPP_RETURN_MATRIX. Checking this against base::dim,
classes <- c(
    "integer", "numeric", "raw", "logical",
    "complex", "character", "list", "expression"
)
sapply(seq_along(classes), function(i) {
    x <- matrix(
        vector(mode = classes[i], length = i ^ 2),
        nrow = i
    )
    all.equal(dims(x), dim(x))
})
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

everything seems to be in order.
It’s worth pointing out that, for various reasons, it is possible to pass a
matrix object to an Rcpp function which calls RCPP_RETURN_VECTOR, since a
matrix is simply a vector with a dim attribute.
Although this is sensible in the case of len – and even saves us from
implementing a matrix-specific version – there may be situations where
this behavior is undesirable. To distinguish between the two object types we
can rely on the API function Rf_isMatrix to detect matrix input and reject it explicitly.
We don’t have to worry about the opposite scenario (passing a plain vector where
a matrix is expected), as this is already handled within Rcpp library code.
In many cases our return type will correspond to our input type. For example,
exposing the Rcpp sugar function rev is trivial: the template body is just a one-line call to rev(x).
As a slightly more complex example, suppose we would like to write a function
to sort matrices which preserves the dimensions of the input, since
base::sort falls short of the latter stipulation, returning a plain vector with the dimensions dropped.
There are two obstacles we need to overcome:
 The Matrix class does not implement its own sort method. However, since Matrix inherits from Vector, we can sort the matrix as a Vector and construct the result from this sorted data with the appropriate dimensions.
 As noted previously, the RCPP_RETURN macros will generate code to handle exactly 8 SEXPTYPEs; no less, no more. Some functions, like Vector::sort, are not implemented for all eight of these types, so in order to avoid a compilation error, we need to add template specializations.
With this in mind, we have the following implementation of msort:
#include <Rcpp.h>
using namespace Rcpp;

// primary template
template <int RTYPE>
Matrix<RTYPE> Msort(const Matrix<RTYPE>& x)
{
    return Matrix<RTYPE>(
        x.nrow(), x.ncol(),
        clone(x).sort().begin()
    );
}

// template specializations for raw vectors,
// lists, and expression vectors
//
// we can just throw an exception, as base::sort
// does the same
template <>
Matrix<RAWSXP> Msort(const Matrix<RAWSXP>& x)
{
    stop("sort not allowed for raw vectors.");
}

template <>
Matrix<VECSXP> Msort(const Matrix<VECSXP>& x)
{
    stop("sort not allowed for lists.");
}

template <>
Matrix<EXPRSXP> Msort(const Matrix<EXPRSXP>& x)
{
    stop("sort not allowed for expression vectors.");
}

// [[Rcpp::export]]
RObject msort(RObject x)
{
    RCPP_RETURN_MATRIX(Msort, x);
}

Note that elements will be sorted in column-major order since we filled our
result using this constructor. We can verify that msort works as intended by checking a few test cases against base::sort.
Having familiarized ourselves with basic usage of the RCPP_RETURN macros, we
can return to the problem of implementing our ends function with
RCPP_RETURN_VECTOR. Just to recap the situation, the template function
passed to the macro must meet the following two criteria in C++98 mode:
 It is templated on a single, integer parameter (representing the Vector type).
 It accepts a single SEXP (or convertible to SEXP) argument.
Currently ends has the signature
template <int RTYPE>
Vector<RTYPE> ends(const Vector<RTYPE>&, int);

meaning that the first criterion is met, but the second is not. In order to
preserve the functionality provided by the int parameter, we effectively
need to generate a new template function which has access to the user-provided
value at runtime, but without passing it as a function parameter.
The technique we are looking for is called partial function application, and it can be implemented
using one of my favorite C++ tools: the functor. Contrary to typical functor
usage, however, our implementation features a slight twist: rather than
using a template class with a nontemplate function call operator, as is the
case with std::greater, etc., we are
going to make operator() a template itself:
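Stripped of the Rcpp specifics, the idea can be sketched in plain C++ (std::vector and a type parameter stand in here for Rcpp::Vector and its integer parameter; both substitutions are illustrative assumptions):

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Partial function application via a functor: n is captured as a private
// data member at construction time, and operator() is itself a template,
// so a single Ends object can be applied to a vector of any element type
// while taking only one call argument.
class Ends {
private:
    std::size_t n;

public:
    explicit Ends(std::size_t n_) : n(n_) {}

    template <typename T>
    std::vector<T> operator()(const std::vector<T>& x) const
    {
        std::size_t m = std::min(n, x.size() / 2);
        std::vector<T> res(2 * m);
        std::copy(x.begin(), x.begin() + m, res.begin());
        std::copy(x.end() - m, x.end(), res.begin() + m);
        return res;
    }
};
```

Ends(2)(x) behaves like ends(x, 2), but the call operator itself now takes a single argument, which is the shape the C++98 macro expects.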
Not bad, right? All in all, the changes are fairly minor:
 The function body of Ends::operator() is identical to that of impl::ends.
 n is now a private data member rather than a function parameter, which gets initialized in the constructor.
 Instead of passing a freestanding template function to RCPP_RETURN_VECTOR, we pass the expression Ends(n), where n is supplied at runtime from the R session. In turn, the macro will invoke Ends::operator() on the SEXP (RObject, in our case), using the specified n value.
We can demonstrate this on various test cases:
ends(1:9)
[1] 1 2 3 4 6 7 8 9

ends(letters, 3)
[1] "a" "b" "c" "x" "y" "z"

ends(rnorm(20), 2)
[1] 0.694707 0.207917 0.123854 0.215942

A Modern Alternative
As alluded to earlier, a more modern compiler (supporting C++11 or later)
will free us from the “single SEXP argument” restriction, which means
that we no longer have to move additional parameters into a function
object. Here is ends reimplemented using the C++11 version of
RCPP_RETURN_VECTOR (note the // [[Rcpp::plugins(cpp11)]]
attribute declaration, which enables C++11); the extra argument n is now simply
passed to the macro after the template and the object, with no functor required.
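Setting aside the R-specific machinery, the essential C++11 ingredient, a macro whose trailing __VA_ARGS__ are forwarded straight to the template function, can be sketched in self-contained C++ (the MY_RETURN_VECTOR macro, TAG_* constants, and TaggedVec type are all invented for illustration):

```cpp
#include <vector>
#include <algorithm>
#include <stdexcept>
#include <cstddef>

enum { TAG_INT, TAG_DBL };    // stand-ins for SEXPTYPE constants

struct TaggedVec {            // stand-in for an opaque SEXP
    int type;
    std::vector<int> iv;
    std::vector<double> dv;
};

// The trailing __VA_ARGS__ are forwarded unchanged to the template
// function, so implementations may take any number of extra parameters.
#define MY_RETURN_VECTOR(FUN, OBJ, ...)                  \
    do {                                                 \
        TaggedVec out_;                                  \
        out_.type = (OBJ).type;                          \
        switch ((OBJ).type) {                            \
            case TAG_INT:                                \
                out_.iv = FUN((OBJ).iv, __VA_ARGS__);    \
                break;                                   \
            case TAG_DBL:                                \
                out_.dv = FUN((OBJ).dv, __VA_ARGS__);    \
                break;                                   \
            default:                                     \
                throw std::invalid_argument("bad type"); \
        }                                                \
        return out_;                                     \
    } while (0)

// A freestanding template again: the extra parameter n is allowed here.
template <typename T>
std::vector<T> ends_impl(const std::vector<T>& x, std::size_t n)
{
    n = std::min(n, x.size() / 2);
    std::vector<T> res(2 * n);
    std::copy(x.begin(), x.begin() + n, res.begin());
    std::copy(x.end() - n, x.end(), res.begin() + n);
    return res;
}

// The dispatcher passes n through the macro unchanged.
TaggedVec ends(const TaggedVec& x, std::size_t n = 6)
{
    MY_RETURN_VECTOR(ends_impl, x, n);
}
```

Only the argument forwarding is the point here; in real Rcpp code, conversion of the result back to a SEXP is handled by the macro together with Rcpp::wrap.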
The current definition of RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX allows for up
to 24 arguments to be passed; although in principle, the true upper bound
depends on your compiler’s implementation of the __VA_ARGS__ macro, which
is likely greater than 24. Having said this, if you find yourself trying
to pass around more than 3 or 4 parameters at once, it’s probably time
to do some refactoring.