R news and tutorials contributed by hundreds of R bloggers
Updated: 1 hour 31 min ago

### The Cost of True Love (a.k.a. The Tidy — and expensive! — Twelve Days of Christmas)

Tue, 12/05/2017 - 19:13

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I’m in the market for Christmas presents for my true love, @mrshrbrmstr, and thought I’d look to an age-old shopping list for inspiration. Just what would it set me back if I decided to mimic the 12 Days of Christmas in this modern day and age?

Let’s try to do the whole thing in R (of course!).

We’ll need to:

• Grab the lyrics
• Parse the lyrics
• Get pricing data
• Compute some statistics
• Make some (hopefully) pretty charts

This one was complex enough formatting-wise that I needed to iframe it below. Feel free to bust out of the iframe at any time.

Some good follow-ups to this (if you’re so inclined) would be to predict prices next year and/or clean up the charts a bit.

Grab the code up on GitHub.

(Note: ColourLovers API times out occasionally so just try that snippet again if you get an error).

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Computing wind average in an area using rWind

Tue, 12/05/2017 - 17:54

(This article was first published on long time ago..., and kindly contributed to R-bloggers)

p { margin-bottom: 0.25cm; line-height: 120%; }

Hi all! A researcher asked me last week about how to compute wind average for an area using rWind. I wrote a simple function to do this using dplyr library (following the advice of my friend Javier Fajardo). The function will be added to rWind package as soon as possible. Meanwhile, you can test the results… enjoy!

# First, charge the new function
library(dplyr)
wind.region <- function (X){
X[,3] <- X[,3] %% 360
X[X[,3]>=180,3] <- X[X[,3]>=180,3] - 360
avg<-summarise_all(X[,-1], .funs = mean)
wind_region <- cbind(X[1,1],avg)
return(wind_region)
}

Once you have charged the function, let’s do some example

# Get some wind data and convert it into a raster to be plotted later
library(rWind)
library(raster)
wind_data<-wind.dl(2015,2,12,0,-10,5,35,45)
wind_fitted_data <- wind.fit(wind_data)
r_speed <- wind2raster(wind_fitted_data, type="speed")

Now, you can use the new function to obtain wind average in the study area:

myMean <- wind.region(wind_data)
myMean

# Now, you can use wind.fit to get wind speed and direction.
myMean_fitted <- wind.fit(myMean)
myMean_fitted

# Finally, let's plot the results!
library(rworldmap)
library(shape)
plot(r_speed)
lines(getMap(resolution = "low"), lwd=4)
alpha <- arrowDir(myMean_fitted)
Arrowhead(myMean_fitted$lon, myMean_fitted$lat, angle=alpha,
arr.length = 2, arr.type="curved")
text(myMean_fitted$lon+1, myMean_fitted$lat+1,
paste(round(myMean_fitted$speed,2), "m/s"), cex = 2) var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: long time ago.... R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### On the biases in data Tue, 12/05/2017 - 16:15 (This article was first published on Revolutions, and kindly contributed to R-bloggers) Whether we're developing statistical models, training machine learning recognizers, or developing AI systems, we start with data. And while the suitability of that data set is, lamentably, sometimes measured by its size, it's always important to reflect on where those data come from. Data are not neutral: the data we choose to use has profound impacts on the resulting systems we develop. A recent article in Microsoft's AI Blog discusses the inherent biases found in many data sets: “The people who are collecting the datasets decide that, ‘Oh this represents what men and women do, or this represents all human actions or human faces.’ These are types of decisions that are made when we create what are called datasets,” she said. “What is interesting about training datasets is that they will always bear the marks of history, that history will be human, and it will always have the same kind of frailties and biases that humans have.” Kate Crawford, Principal Researcher at Microsoft Research and co-founder of AI Now Institute. “When you are constructing or choosing a dataset, you have to ask, ‘Is this dataset representative of the population that I am trying to model?’” Hanna Wallach, Senior Researcher at Microsoft Research NYC. The article discusses the consequences of the data sets that aren't representative of the populations they are set to analyze, and also the consequences of the lack of diversity in the fields of AI research and implementation. Read the complete article at the link below. Microsoft AI Blog: Debugging data: Microsoft researchers look at ways to train AI systems to reflect the real world var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Building a Telecom Dictionary scraping web using rvest in R Tue, 12/05/2017 - 15:00 (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers) One of the biggest problems in Business to carry out any analysis is the availability of Data. That is where in many cases, Web Scraping comes very handy in creating that data that’s required. Consider the following case: To perform text analysis on Textual Data collected in a Telecom Company as part of Customer Feedback or Reviews, primarily requires a dictionary of Telecom Keywords. But such a dictionary is hard to find out-of-box. Hence as an Analyst, the most obvious thing to do when such dictionary doesn’t exist is to build one. Hence this article aims to help beginners get started with web scraping with rvest in R and at the same time, building a Telecom Dictionary by the end of this exercise. Disclaimer Web Scraping is not allowed by some websites. Kindly check the site’s Terms of Service before Scraping. This post is only for Educational-purpose. A simple Google search for Telecom Glossary results in this URL: Atis.org of which the required Telecom Dictionary could be built. Let us start by loading the required libraries: #load rvest and stringr library library(rvest) library(stringr) library(dplyr) rvest is a web scraping library in R that makes it easier to write common scraping tasks (to scrape useful information from web pages) without getting our head into xml parsing. rvest can be downloaded from CRAN and the development version is also available on Github. It could be seen in the above-mentioned URL that the Glossary words are listed alphabetically from A to Z with a separate link / URL for every (starting) letter. Clicking ‘A’ takes us to this link: atis.org that lists all the keywords with starting letter ‘A’. If you closely look at the URL, the code that’d be written for one letter (here in our case, ‘A’) could be easily replicated for other letters since ‘A’ is part of the URL which will be the only change for the link of other Alphabets. Let us assign this URL to an R object url which could be passed on as the paramater to rvest’s read_html() function. #url whose content to be scraped url <- 'http://www.atis.org/glossary/definitionsList.aspx?find=A&kw=0' #extracting the content of the given url #url_content <- read_html('http://www.atis.org/glossary/definitionsList.aspx?find=A&kw=0') url_content <- read_html(url) read_html() parses the html page of the given url (as its parameter) and saves the result as an xml object. To reiterate the objective, we are trying get the list of Telecom Keywords and as per the screenshot above, You could see that the Keywords are listed as Hyperlinks in the given url. Hyperlinks in HTML is written in the following syntax: Google Google is the Link Text Label that a browser would render and when clicked would take us to www.google.com. Hence it is evident that anchor tags in the page is what we are supposed to scrape/extract. But the issue with the current page is that, It’s not just the Keywords that are represented as Anchor Text (links) in the page but also there are a lot of other links (anchor tags) in the page. Hence to extract only the required information and filter out the irrelevant information, we need to find a pattern that helps us extract only the keywords links. Have a look at this screenshot: This screenshot shows that while the keywords are also represented as hyperlinks (Anchor Text), the differentiator is this ‘id’ element in the url. Only the links of the keywords have got this ‘id’ in the url and hence we can try extracting two information from the current page to get only the relevant information which in our case is – The Keywords: 1. Href value / Actual URL 2. Link Text. #extracting all the links from the page links <- url_content %>% html_nodes('a') %>% html_attr('href') #extracting all thhe link text from the page text <- url_content %>% html_nodes('a') %>% html_text() With the example Hyperlink discussed earlier, the above code gives two informations. www.google.com is saved in links and Google is saved in text. With links and text as columns, Let us build a rough dictionary (that’s not yet cleaned/filtered). #creating a new dictonary of links and text extracted above rough_dict <- data.frame(links,text, stringsAsFactors = F) head(rough_dict) links text 1 http://www.atis.org 2 definitionsList.aspx?find=A&kw=0 A 3 definitionsList.aspx?find=B&kw=0 B 4 definitionsList.aspx?find=C&kw=0 C 5 definitionsList.aspx?find=D&kw=0 D 6 definitionsList.aspx?find=E&kw=0 E As displayed above, rough_dict contains both signal (keywords) and noise (irrelevant links) and we have to filter the irrelevant out with the ‘id‘ pattern that we learnt earlier. #filtering glossary terms leaving out irrelevant information fair_dict <- rough_dict %>% filter(str_detect(links, 'id')) %>% select(text) tail(fair_dict) text 657 AN 658 axial propagation constant 659 analog component 660 axial ratio 661 analog computer 662 axial ray And that’s how using str_detect() we can keep only links with ‘id’ in it and filter out the rest and building our fair_dict. As displayed in the above output, We have got 662 Keywords just for the letter ‘A’ and the same exercise could be repeated for the other letters available on the site. The only change that’s required is the url object. For example, Like this: url <- 'http://www.atis.org/glossary/definitionsList.aspx?find=B&kw=0' Note the ‘B’ in it and the same could be done for other available letters. This process could be further improved by making this scraping part a function and looping the function call over a character vector with all the available letters (which ideally is beyond the scope of this article’s objective and hence left out). The complete code used in this article is available on my Github. Related Post var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### visualizing reassortment history using seqcombo Tue, 12/05/2017 - 04:09 (This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers) Reassortment is an important strategy for influenza A viruses to introduce a HA subtype that is new to human populations, which creates the possibilities of pandemic. A diagram showed above (Figure 2 of doi:10.1038/srep25549) is widely used to illustrate the reassortment events. While such diagrams are mostly manually draw and edit without software tool to automatically generate. Here, I implemented the hybrid_plot function for producing publication quality figure of reassortment events. library(tibble) library(ggplot2) n <- 8 virus_info <- tibble( id = 1:7, x = c(rep(1990, 4), rep(2000, 2), 2009), y = c(1,2,3,5, 1.5, 3, 4), segment_color = list( rep('purple', n), rep('red', n), rep('darkgreen', n), rep('lightgreen', n), c('darkgreen', 'darkgreen', 'red', 'darkgreen', 'red', 'purple', 'red', 'purple'), c('darkgreen', 'darkgreen', 'red', 'darkgreen', 'darkgreen', 'purple', 'red', 'purple'), c('darkgreen', 'lightgreen', 'lightgreen', 'darkgreen', 'darkgreen', 'purple', 'red', 'purple')) ) flow_info <- tibble(from = c(1,2,3,3,4,5,6), to = c(5,5,5,6,7,6,7)) hybrid_plot(virus_info, flow_info) The hybrid_plot requires two tibble data frame of virus information and genetic flow information. Users need to provide x and y positions to plot the virus, this make sense for geographically and temporally information are usually available in such phylodynamic study and can be employed to set x or y to provide more information and help interpretation of the reassortment events. We use hexagon to represent virus. Users can set the virus outer boundary color by v_color and fill the virus by v_fill. Color of line segments that indicate the genetic flow relationship can be specify via l_color parameter. hybrid_plot(virus_info, flow_info, v_color='firebrick', v_fill='darkgreen', l_color='steelblue') We usually have more information to present, for example host information and HA subtype etc. and these information can be used to color the virus either by v_color or v_fill virus_info$Host = c("Avian", "Human", rep("Swine", 4), "Human") hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host)

The relative virus size can also be specify if a virus_size column is
available in the input virus_info data.

virus_info$virus_size <- c(rep(1, 3), 2, 1, 1, 1.5) hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host) If label and label_position coloumns are available, the virus labels (virus name or other information) will be added automatically. virus_info$label <- c("Avian", "Human\nH3N2", "Classic\nswine\nH1N1", "Eurasian swine", "North American swine\n triple reassrotant H3N2", "North American swine\n triple reassortant H1N2", "2009 Human H1N1") virus_info$label_position <- c('left', 'left', 'left', 'below', 'below', 'upper', 'below') hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host) User can use asp to set the aspect ratio of hexagons, asp < 1 for thick/short and asp > 1 for thin/tall. hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host, asp=2) The output of hybrid_plot is a ggplot object and users can use ggplot2 to modify the details. title <- "Reassortment events in evolution of the 2009 influenza A (H1N1) virus" caption <- 'Gene segments: PB2, PB1, PA, HA, NP, NA, M, NS' color <- c(Avian="purple", Human="red", Swine="darkgreen") hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host) + labs(caption=caption, title=title) + scale_color_manual(values=color) + scale_fill_manual(values=color) + scale_x_continuous(breaks=c(1990, 2000, 2009)) + xlab(NULL) + ylab(NULL) + theme_minimal() + theme(axis.line.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), panel.grid.minor=element_blank(), panel.grid.major.y=element_blank(), legend.position = c(.95, .1) ) Top-down or bottom-up style is also supported. x <- virus_info$x virus_info$x <- virus_info$y virus_info$y <- x virus_info$label_position <- c(rep("right", 3), "left", "left", "right", "right") hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host) + scale_y_reverse() + scale_x_continuous(limits=c(0, 5.5))

User can also use Emoji to label the virus (host information in this
example):

virus_info$label <- c("chicken", "woman", "pig", "pig", "pig", "pig", "woman") hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host, parse='emoji', t_size=8, t_color='firebrick') + scale_y_reverse() In case you don’t have xy-coordination information, you can use set_layout function to auto setting the xy position using selected layout function. virus_info <- set_layout(virus_info, flow_info, layout="layout.kamada.kawai") hybrid_plot(virus_info, flow_info, parse='emoji', t_size=8, t_color='firebrick') virus_info <- set_layout(virus_info, flow_info, layout="layout.fruchterman.reingold") hybrid_plot(virus_info, flow_info, parse='emoji', t_size=8, t_color='firebrick') Please let me know if you know any published reassortment data that contain spatial information, I will demonstrate how to visualize reassortment history on a map. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang YU. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### My book ‘Practical Machine Learning with R and Python’ on Amazon Tue, 12/05/2017 - 03:26 (This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers) My book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($9.99) and kindle ($6.97/Rs449) versions. In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code. This is almost like listening to parallel channels of music in stereo! 1. Practical machine with R and Python – Machine Learning in Stereo (Paperback) 2. Practical machine with R and Python – Machine Learning in Stereo (Kindle) This book is ideal both for beginners and the experts in R and/or Python. Those starting their journey into datascience and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally,those who are proficient in both languages, can use the R and Python implementations to internalize the ML algorithms better. Here is a look at the topics covered Table of Contents Essential R …………………………………….. 7 Essential Python for Datascience ……………….. 54 R vs Python ……………………………………. 77 Regression of a continuous variable ………………. 96 Classification and Cross Validation ……………….113 Regression techniques and regularization …………. 134 SVMs, Decision Trees and Validation curves …………175 Splines, GAMs, Random Forests and Boosting …………202 PCA, K-Means and Hierarchical Clustering …………. 234 Pick up your copy today!! Hope you have a great time learning as I did while implementing these algorithms! var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Version 0.6-8 of NIMBLE released Tue, 12/05/2017 - 02:19 (This article was first published on R – NIMBLE, and kindly contributed to R-bloggers) We’ve released the newest version of NIMBLE on CRAN and on our website a week ago. Version 0.6-8 has a few new features, and more are on the way in the next few months. New features include: • the proper Gaussian CAR (conditional autoregressive) model can now be used in BUGS code as dcar_proper, which behaves similarly to BUGS’ car.proper distribution; • a new nimbleMCMC function that provides one-line invocation of NIMBLE’s MCMC engine, akin to usage of JAGS and WinBUGS through R; • a new runCrossValidate function that will conduct k-fold cross-validation of NIMBLE models fit by MCMC; • dynamic indexing in BUGS code is now allowed by default; • and a variety of bug fixes and efficiency improvements. Please see the NEWS file in the installed package for more details. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – NIMBLE. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### #R – Deprecate functions with roxygen2 Tue, 12/05/2017 - 01:00 (This article was first published on R-feed, and kindly contributed to R-bloggers) We show how to use roxygen2 tags and templates for deprecating existing documented functions. Introduction Package roxygen2 provides a powerful way of documenting packages and objects inside them. In particular, great flexibility is available for organizing the documentation content in source files and templates, and render it as desired in the corresponding help pages. We can leverage on this flexibility to adapt an existing package documentation upon making a function deprecated (and similarly defunct). As stated in the base-R ‘Marking Objects as Deprecated’ documentation (help(Deprecated)): The original help page for these functions is often available at help(“oldName-deprecated”) (note the quotes). Functions should be listed in help(“pkg-deprecated”) for an appropriate pkg, including base. Deprecate existing functions We show how an existing and documented function myFun in package ourPkg can be deprecated, and the documentation adjusted accordingly by re-arranging the roxygen2 documentation content. ## myFun.r #' @title My function. #' @description My function. #' @param arg1 My first argument. #' @param arg2 My second argument. #' @return My return value. #' @export myFun <- function(arg1, arg2) { "My return value" } Let us assume a new function myNewFunction exists as a replacement for the old myFun, which becomes deprecated. We want to achieve the following • list myFun in help("ourPkg-deprecated""), along with its replacement; • make its original documentation available via help("myFun-deprecated"). This can be obtained by • creating generic content for help("ourPkg-deprecated""); • adjusting the existing myFun documentation content. Package-level documentation Generic content for help("ourPkg-deprecated") is defined in its own source file: ## ourPkg-deprecated.r #' @title Deprecated functions in package \pkg{ourPkg}. #' @description The functions listed below are deprecated and will be defunct in #' the near future. When possible, alternative functions with similar #' functionality are also mentioned. Help pages for deprecated functions are #' available at \code{help("-deprecated")}. #' @name ourPkg-deprecated #' @keywords internal NULL Function-specific documentation Function-specific content is added to the ourPkg-deprecated help page from the corresponding source file, where we also want to make the original help page available via help("myFun-deprecated"). The source file of myFun is then modified as follows: ## myFun.r #' @title My function. #' @description My function. #' @param arg1 My first argument. #' @param arg2 My second argument. #' @return My return value. #' #' @name myFun-deprecated #' @usage myFun(arg1, arg2) #' @seealso \code{\link{ourPkg-deprecated}} #' @keywords internal NULL #' @rdname ourPkg-deprecated #' @section \code{myFun}: #' For \code{myFun}, use \code{\link{myNewFun}}. #' #' @export myFun <- function(arg1, arg2) { .Deprecated("myNewFun") "My return value" } Two blocks of roxygen2 tags have been added to the existing documentation content. • The first block is used to create the myFun-deprecated help page, detached from the myFun object. For this reason we need to explicitly add the original function signature as ‘Usage’ section. We also add a ‘See Also’ link to ourPkg-deprecated. • The second block adds myFun-specific content to the ourPkg-deprecated help page. The standard ‘Usage’ section is added automatically, and we create a function-specific section where using myNewFun is suggested. Such content will be collected and shown in help("ourPkg-deprecated") for all deprecated functions with similar tags structure. Note also that help(myFun) will point to the ourPkg-deprecated help page, and that @keywords internal prevents the corresponding topics from appearing in the package documentation index. Using roxygen2 templates Deprecating functions as described above can be automated using templates. One can define a template for each additional block. ## template-depr_pkg.r #' @name ourPkg-deprecated #' @section \code{<%= old %>}: #' For \code{<%= old %>}, use \code{\link{<%= new %>}}. ## template-depr_fun.r #' @name <%= fun %>-deprecated #' @usage <%= gsub("\n", "\n#' ", roxygen2:::wrapString(roxygen2:::function_usage(fun, formals(fun)))) %> #' @seealso \code{\link{ourPkg-deprecated}} #' @keywords internal Note the presence of template variables and of some roxygen2 internals for automating the construction of the ‘Usage’ section (handling multiple lines in case of many arguments). Deprecating herFun in favor of herNewFun then results in the simple inclusion of the templates in the function source. ## herFun.r #' @title Her function. #' @description Her function. #' @param arg1 Her first argument. #' @param arg2 Her second argument. #' @return Her return value. #' #' @templateVar fun herFun #' @template template-depr_fun NULL #' @templateVar old herFun #' @templateVar new herNewFun #' @template template-depr_pkg #' #' @export herFun <- function(arg1, arg2) { .Deprecated("herNewFun") "Her return value" } Example package manual and sources A complete example of ourPkg is avalailable for download. The package contains functions myFun, yourFun, herFun, hisFun, where all but yourFun are deprecated in favor of *NewFun, using roxygen2 templates for herFun and hisFun. The resulting documentation content can be assessed in the corresponding PDF reference manual. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R-feed. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Exploratory Data Analysis of Ancient Texts with rperseus Tue, 12/05/2017 - 01:00 (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers) Introduction When I was in grad school at Emory, I had a favorite desk in the library. The desk wasn’t particularly cozy or private, but what it lacked in comfort it made up for in real estate. My books and I needed room to operate. Students of the ancient world require many tools, and when jumping between commentaries, lexicons, and interlinears, additional clutter is additional “friction”, i.e., lapses in thought due to frustration. Technical solutions to this clutter exist, but the best ones are proprietary and expensive. Furthermore, they are somewhat inflexible, and you may have to shoehorn your thoughts into their framework. More friction. Interfacing with the Perseus Digital Library was a popular online alternative. The library includes a catalog of classical texts, a Greek and Latin lexicon, and a word study tool for appearances and references in other literature. If the university library’s reference copies of BDAG1 and Synopsis Quattuor Evangeliorum2 were unavailable, Perseus was our next best thing. Fast forward several years, and I’ve abandoned my quest to become a biblical scholar. Much to my father’s dismay, I’ve learned writing code is more fun than writing exegesis papers. Still, I enjoy dabbling with dead languages, and it was the desire to wed my two loves, biblical studies and R, that birthed my latest package, rperseus. The goal of this package is to furnish classicists with texts of the ancient world and a toolkit to unpack them. Exploratory Data Analysis in Biblical Studies Working with the Perseus Digital Library was already a trip down memory lane, but here’s an example of how I would have leveraged rperseus many years ago. My best papers often sprung from the outer margins of my Nestle-Aland Novum Testamentum Graece. Here the editors inserted cross references to parallel vocabulary, themes, and even grammatical constructions. Given the intertextuality of biblical literature, the margins are a rich source of questions: Where else does the author use similar vocabulary? How is the source material used differently? Does the literary context affect our interpretation of a particular word? This is exploratory data analysis in biblical studies. Unfortunately the excitement of your questions is incommensurate with the tedium of the process–EDA continues by flipping back and forth between books, dog-earring pages, and avoiding paper cuts. rperseus aims to streamline this process with two functions: get_perseus_text and perseus_parallel. The former returns a data frame containing the text from any work in the Perseus Digital Library, and the latter renders a parallel in ggplot2. Suppose I am writing a paper on different expressions of love in Paul’s letters. Naturally, I start in 1 Corinthians 13, the famed “Love Chapter” often heard at weddings and seen on bumper stickers. I finish the chapter and turn to the margins. In the image below, I see references to Colossians 1:4, 1 Thessalonians 1:3, 5:8, Hebrews 10:22-24, and Romans 8:35-39. 1 Corinithians 13 in Nestle-Aland Novum Testamentum Graece Ignoring that some scholars exclude Colossians from the “authentic” letters, let’s see the references alongside each other: library(rperseus) #devtools::install_github(“ropensci/rperseus”) library(tidyverse) tribble( ~label, ~excerpt, "Colossians", "1.4", "1 Thessalonians", "1.3", "1 Thessalonians", "5.8", "Romans", "8.35-8.39" ) %>% left_join(perseus_catalog) %>% filter(language == "grc") %>% select(urn, excerpt) %>% pmap_df(get_perseus_text) %>% perseus_parallel(words_per_row = 4) A brief explanation: First, I specify the labels and excerpts within a tibble. Second, I join the lazily loaded perseus_catalog onto the data frame. Third, I filter for the Greek3 and select the columns containing the arguments required for get_perseus_text. Fourth, I map over each urn and excerpt, returning another data frame. Finally, I pipe the output into perseus_parallel. The key word shared by each passage is agape (“love”). Without going into detail, it might be fruitful to consider the references alongside each other, pondering how the semantic range of agape expands or contracts within the Pauline corpus. Paul had a penchant for appropriating and recasting old ideas–often in slippery and unexpected ways–and your Greek lexicon provides a mere approximation. In other words, how can we move from the dictionary definition of agape towards Paul’s unique vision? If your Greek is rusty, you can parse each word with parse_excerpt by locating the text’s urn within the perseus_catalog object. parse_excerpt(urn = "urn:cts:greekLit:tlg0031.tlg012.perseus-grc2", excerpt = "1.4") word form verse part_of_speech person number tense mood voice gender case degree ἀκούω ἀκούσαντες 1.4 verb NA plural aorist participle active masculine nominative NA ὁ τὴν 1.4 article NA singular NA NA NA feminine accusative NA πίστις πίστιν 1.4 noun NA singular NA NA NA feminine accusative NA ὑμός ὑμῶν 1.4 pronoun NA plural NA NA NA masculine genative NA If your Greek is really rusty, you can also flip the language filter to “eng” to view an older English translation.4 And if the margin references a text from the Old Testament, you can call the Septuagint as well as the original Hebrew.5 tribble( ~label, ~excerpt, "Genesis", "32.31", "Genesis, pointed", "32.31", "Numeri", "12.8", "Numbers, pointed", "12.8" ) %>% left_join(perseus_catalog) %>% filter(language %in% c("grc", "hpt")) %>% select(urn, excerpt) %>% pmap_df(get_perseus_text) %>% perseus_parallel() Admittedly, there is some “friction” here in joining the perseus_catalog onto the initial tibble. There is a learning curve with getting acquainted with the idiosyncrasies of the catalog object. A later release will aim to streamline this workflow. Future Work Check the vignette for a more general overview of rperseus. In the meantime, I look forward to getting more intimately acquainted with the Perseus Digital Library. Tentative plans to extend rperseus a Shiny interface to further reduce “friction” and a method of creating a “book” of custom parallels with bookdown. Acknowledgements I want to thank my two rOpenSci reviewers, Ildikó Czeller and François Michonneau, for coaching me through the review process. They were the first two individuals to ever scrutinize my code, and I was lucky to hear their feedback. rOpenSci onboarding is truly a wonderful process. 1. Bauer, Walter. A Greek-English Lexicon of the New Testament and Other Early Christian Literature. Edited by Frederick W. Danker. 3rd ed. Chicago: University of Chicago Press, 2000. 2. Aland, Kurt. Synopsis Quattuor Evangeliorum. Deutsche Bibelgesellschaft, 1997. 3. The Greek text from the Perseus Digital Library is from 1885 standards. The advancement of textual criticism in the 20th century led to a more stable text you would find in current editions of the Greek New Testament. 4. The English translation is from Rainbow Missions, Inc. World English Bible. Rainbow Missions, Inc.; revision of the American Standard Version of 1901. I’ve toyed with the idea of incorporating more modern translations, but that would require require resources beyond the Perseus Digital Library. 5. “hpt” is the pointed Hebrew text from Codex Leningradensis. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Usage of ruler package Tue, 12/05/2017 - 01:00 (This article was first published on QuestionFlow , and kindly contributed to R-bloggers) Usage examples of ruler package: dplyr-style exploration and validation of data frame like objects. Prologue My previous post tells a story about design of my ruler package, which presents tools for “… creating data validation pipelines and tidy reports”. This package offers a framework for exploring and validating data frame like objects using dplyr grammar of data manipulation. This post is intended to show some close to reality ruler usage examples. Described methods and approaches reflect package design. Along the way you will learn why Yoda and Jabba the Hutt are “outliers” among core “Star Wars” characters. For more information see README (for relatively brief comprehensive introduction) or vignettes (for more thorough description of package capabilities). Beware of a lot of code. Overview suppressMessages(library(dplyr)) suppressMessages(library(purrr)) library(ruler) The general way of performing validation with ruler can be described with following steps: • Formulate a validation task. It is usually stated in the form of a yes-no question or true-false statement about some part (data unit) of an input data frame. Data unit can be one of: data [as a whole], group of rows [as a whole], column [as a whole], row [as a whole], cell. For example, does every column contain elements with sum more than 100?. • Create a dplyr-style validation function (rule pack) which checks desired data unit for obedience to [possibly] several rules: mtcars %>% summarise_all(funs(enough_sum = sum(.) > 100)) • Use ruler’s function rules() instead of explicit or implicit usage of funs(): mtcars %>% summarise_all(rules(enough_sum = sum(.) > 100)) . %>% summarise_all(rules(enough_sum = sum(.) > 100)) • Wrap with rule specification function to explicitly identify validated data unit and to name rule pack. In this case it is col_packs() for column data unit with “is_enough_sum” as rule pack name: col_packs( is_enough_sum = . %>% summarise_all(rules(is_enough = sum(.) > 100)) ) • Expose data to rules to obtain validation result (exposure). Use ruler’s expose() function for that. It doesn’t modify contents of input data frame but creates/updates exposure attribute. Exposure is a list with information about used rule packs (packs_info) and tidy data validation report (report). • Act after exposure. It can be: • Observing validation results with get_exposure(), get_packs_info() or get_report(). • Making assertions if specific rules are not followed in desired way. • Imputing input data frame based on report. In examples we will use starwars data from dplyr package (to celebrate an upcoming new episode). It is a tibble with every row describing one “Star Wars” character. Every example starts with a validation task stated in italic and performs validation from beginning to end. Create rule packs Data Does starwars have 1) number of rows 1a) more than 50; 1b) less than 60; 2) number of columns 2a) more than 10; 2b) less than 15? check_data_dims <- data_packs( check_dims = . %>% summarise( nrow_low = nrow(.) >= 50, nrow_up = nrow(.) <= 60, ncol_low = ncol(.) >= 10, ncol_up = ncol(.) <= 15 ) ) starwars %>% expose(check_data_dims) %>% get_exposure() ## Exposure ## ## Packs info: ## # A tibble: 1 x 4 ## name type fun remove_obeyers ## ## 1 check_dims data_pack TRUE ## ## Tidy data validation report: ## # A tibble: 1 x 5 ## pack rule var id value ## ## 1 check_dims nrow_up .all 0 FALSE The result is interpreted as follows: • Data was exposed to one rule pack for data as a whole (data rule pack) named “check_dims”. For it all obeyers (data units which follow specified rule) were removed from validation report. • Combination of var equals .all and id equals 0 means that data as a whole is validated. • Input data frame doesn’t obey (because value is equal to FALSE) rule nrow_up from rule pack check_dims. Does starwars have enough rows for characters 1) with blond hair; 2) humans; 3) humans with blond hair? check_enough_rows <- data_packs( enough_blond = . %>% filter(hair_color == "blond") %>% summarise(is_enough = n() > 10), enough_humans = . %>% summarise( is_enough = sum(species == "Human", na.rm = TRUE) > 30 ), ehough_blond_humans = . %>% filter( hair_color == "blond", species == "Human" ) %>% summarise(is_enough = n() > 5) ) starwars %>% expose(check_enough_rows) %>% get_exposure() ## Exposure ## ## Packs info: ## # A tibble: 3 x 4 ## name type fun remove_obeyers ## ## 1 enough_blond data_pack TRUE ## 2 enough_humans data_pack TRUE ## 3 ehough_blond_humans data_pack TRUE ## ## Tidy data validation report: ## # A tibble: 2 x 5 ## pack rule var id value ## ## 1 enough_blond is_enough .all 0 FALSE ## 2 ehough_blond_humans is_enough .all 0 FALSE New information gained from example: • Rule specification functions can be supplied with multiple rule packs all of which will be independently used during exposing. Does starwars have enough numeric columns? check_enough_num_cols <- data_packs( enough_num_cols = . %>% summarise( is_enough = sum(map_lgl(., is.numeric)) > 1 ) ) starwars %>% expose(check_enough_num_cols) %>% get_report() ## Tidy data validation report: ## # A tibble: 0 x 5 ## # ... with 5 variables: pack , rule , var , id , ## # value • If no breaker is found get_report() returns tibble with zero rows and usual columns. Group Does group defined by hair color and gender have a member from Tatooine? has_hair_gender_tatooine <- group_packs( hair_gender_tatooine = . %>% group_by(hair_color, gender) %>% summarise(has_tatooine = any(homeworld == "Tatooine")), .group_vars = c("hair_color", "gender"), .group_sep = "__" ) starwars %>% expose(has_hair_gender_tatooine) %>% get_report() ## Tidy data validation report: ## # A tibble: 12 x 5 ## pack rule var id value ## ## 1 hair_gender_tatooine has_tatooine auburn__female 0 FALSE ## 2 hair_gender_tatooine has_tatooine auburn, grey__male 0 FALSE ## 3 hair_gender_tatooine has_tatooine auburn, white__male 0 FALSE ## 4 hair_gender_tatooine has_tatooine blonde__female 0 FALSE ## 5 hair_gender_tatooine has_tatooine grey__male 0 FALSE ## # ... with 7 more rows • group_packs() needs grouping columns supplied via .group_vars. • Column var of validation report contains levels of grouping columns to identify group. By default their are pasted together with .. To change that supply .group_sep argument. • 12 combinations of hair_color and gender don’t have a character from Tatooine. They are “auburn”-“female”, “auburn, grey”-“male” and so on. Column Does every list-column have 1) enough average length; 2) enough unique elements? check_list_cols <- col_packs( check_list_cols = . %>% summarise_if( is.list, rules( is_enough_mean = mean(map_int(., length)) >= 1, length(unique(unlist(.))) >= 10 ) ) ) starwars %>% expose(check_list_cols) %>% get_report() ## Tidy data validation report: ## # A tibble: 3 x 5 ## pack rule var id value ## ## 1 check_list_cols is_enough_mean vehicles 0 FALSE ## 2 check_list_cols is_enough_mean starships 0 FALSE ## 3 check_list_cols rule..2 films 0 FALSE • To specify rule functions inside dplyr’s scoped verbs use ruler::rules(). It powers correct output interpretation during exposing process and imputes missing rule names based on the present rules in current rule pack. • Columns vehicles and starships don’t have enough average length and column films doesn’t have enough unique elements. Are all values of column birth_year non-NA? starwars %>% expose( col_packs( . %>% summarise_at( vars(birth_year = "birth_year"), rules(all_present = all(!is.na(.))) ) ) ) %>% get_report() ## Tidy data validation report: ## # A tibble: 1 x 5 ## pack rule var id value ## ## 1 col_pack..1 all_present birth_year 0 FALSE • To correctly validate one column with scoped dplyr verb it should be a named argument inside vars. It is needed for correct interpretation of rule pack output. Row Has character appeared in enough films? As character is defined by row, this is a row pack. has_enough_films <- row_packs( enough_films = . %>% transmute(is_enough = map_int(films, length) >= 3) ) starwars %>% expose(has_enough_films) %>% get_report() %>% left_join(y = starwars %>% transmute(id = 1:n(), name), by = "id") %>% print(.validate = FALSE) ## Tidy data validation report: ## # A tibble: 64 x 6 ## pack rule var id value name ## ## 1 enough_films is_enough .all 8 FALSE R5-D4 ## 2 enough_films is_enough .all 9 FALSE Biggs Darklighter ## 3 enough_films is_enough .all 12 FALSE Wilhuff Tarkin ## 4 enough_films is_enough .all 15 FALSE Greedo ## 5 enough_films is_enough .all 18 FALSE Jek Tono Porkins ## # ... with 59 more rows • 64 characters haven’t appeared in 3 films or more. Those are characters described in starwars in rows 8, 9, etc. (counting based on input data). Does character with height less than 100 is a droid? is_short_droid <- row_packs( is_short_droid = . %>% filter(height < 100) %>% transmute(is_droid = species == "Droid") ) starwars %>% expose(is_short_droid) %>% get_report() %>% left_join(y = starwars %>% transmute(id = 1:n(), name, height), by = "id") %>% print(.validate = FALSE) ## Tidy data validation report: ## # A tibble: 5 x 7 ## pack rule var id value name height ## ## 1 is_short_droid is_droid .all 19 FALSE Yoda 66 ## 2 is_short_droid is_droid .all 29 FALSE Wicket Systri Warrick 88 ## 3 is_short_droid is_droid .all 45 FALSE Dud Bolt 94 ## 4 is_short_droid is_droid .all 72 FALSE Ratts Tyerell 79 ## 5 is_short_droid is_droid .all 73 NA R4-P17 96 • One can expose only subset of rows by using filter or slice. The value of id column in result will reflect row number in the original input data frame. This feature is powered by keyholder package. In order to use it, rule pack should be created using its supported functions. • value equal to NA is treated as rule breaker. • 5 “not tall” characters are not droids. Cell Is non-NA numeric cell not an outlier based on z-score? This is a bit tricky. To present outliers as rule breakers one should ask whether cell is not outlier. z_score <- function(x, ...) {abs(x - mean(x, ...)) / sd(x, ...)} cell_isnt_outlier <- cell_packs( dbl_not_outlier = . %>% transmute_if( is.numeric, rules(isnt_out = z_score(., na.rm = TRUE) < 3 | is.na(.)) ) ) starwars %>% expose(cell_isnt_outlier) %>% get_report() %>% left_join(y = starwars %>% transmute(id = 1:n(), name), by = "id") %>% print(.validate = FALSE) ## Tidy data validation report: ## # A tibble: 4 x 6 ## pack rule var id value name ## ## 1 dbl_not_outlier isnt_out height 19 FALSE Yoda ## 2 dbl_not_outlier isnt_out mass 16 FALSE Jabba Desilijic Tiure ## 3 dbl_not_outlier isnt_out birth_year 16 FALSE Jabba Desilijic Tiure ## 4 dbl_not_outlier isnt_out birth_year 19 FALSE Yoda • 4 non-NA numeric cells appear to be an outlier within their column. Expose data to rules Do groups defined by species, gender and eye_color (3 different checks) have appropriate size? starwars %>% expose( group_packs(. %>% group_by(species) %>% summarise(isnt_many = n() <= 5), .group_vars = "species") ) %>% expose( group_packs(. %>% group_by(gender) %>% summarise(isnt_many = n() <= 60), .group_vars = "gender"), .remove_obeyers = FALSE ) %>% expose(is_enough_eye_color = . %>% group_by(eye_color) %>% summarise(isnt_many = n() <= 20)) %>% get_exposure() %>% print(n_report = Inf) ## Exposure ## ## Packs info: ## # A tibble: 3 x 4 ## name type fun remove_obeyers ## ## 1 group_pack..1 group_pack TRUE ## 2 group_pack..2 group_pack FALSE ## 3 is_enough_eye_color group_pack TRUE ## ## Tidy data validation report: ## # A tibble: 7 x 5 ## pack rule var id value ## ## 1 group_pack..1 isnt_many Human 0 FALSE ## 2 group_pack..2 isnt_many female 0 TRUE ## 3 group_pack..2 isnt_many hermaphrodite 0 TRUE ## 4 group_pack..2 isnt_many male 0 FALSE ## 5 group_pack..2 isnt_many none 0 TRUE ## 6 group_pack..2 isnt_many NA 0 TRUE ## 7 is_enough_eye_color isnt_many brown 0 FALSE • expose() can be applied sequentially which results into updating existing exposure with new information. • expose() imputes names of supplied unnamed rule packs based on the present rule packs for the same data unit type. • expose() by default removes obeyers (rows with data units that obey respective rules) from validation report. To stop doing that use .remove_obeyers = FALSE during expose() call. • expose() by default guesses the type of the supplied rule pack based only on its output. This has some annoying edge cases but is suitable for interactive usage. To turn this feature off use .guess = FALSE as an argument for expose(). Also, to avoid edge cases create rule packs with appropriate wrappers. Perform some previous checks with one expose(). my_packs <- list(check_data_dims, is_short_droid, cell_isnt_outlier) str(my_packs) ## List of 3 ##$ :List of 1 ## ..$check_dims:function (value) ## .. ..- attr(*, "class")= chr [1:4] "data_pack" "rule_pack" "fseq" "function" ##$ :List of 1 ## ..$is_short_droid:function (value) ## .. ..- attr(*, "class")= chr [1:4] "row_pack" "rule_pack" "fseq" "function" ##$ :List of 1 ## ..$dbl_not_outlier:function (value) ## .. ..- attr(*, "class")= chr [1:4] "cell_pack" "rule_pack" "fseq" "function" starwars_exposed_list <- starwars %>% expose(my_packs) starwars_exposed_arguments <- starwars %>% expose(check_data_dims, is_short_droid, cell_isnt_outlier) identical(starwars_exposed_list, starwars_exposed_arguments) ## [1] TRUE • expose() can have for rule pack argument a list of lists [of lists, of lists, …] with functions at any depth. This enables creating a list of rule packs wrapped with *_packs() functions (which all return a list of functions). • expose() can have multiple rule packs as separate arguments. Act after exposure Throw an error if any non-NA value of mass is more than 1000. starwars %>% expose( col_packs( low_mass = . %>% summarise_at( vars(mass = "mass"), rules(is_small_mass = all(. <= 1000, na.rm = TRUE)) ) ) ) %>% assert_any_breaker() ## Breakers report ## Tidy data validation report: ## # A tibble: 1 x 5 ## pack rule var id value ## ## 1 low_mass is_small_mass mass 0 FALSE ## Error: assert_any_breaker: Some breakers found in exposure. • assert_any_breaker() is used to assert presence of at least one breaker in validation report. However, offered solution via column pack doesn’t show rows which break the rule. To do that one can use cell pack: starwars %>% expose( cell_packs( low_mass = . %>% transmute_at( vars(mass = "mass"), rules(is_small_mass = (. <= 1000) | is.na(.)) ) ) ) %>% assert_any_breaker() ## Breakers report ## Tidy data validation report: ## # A tibble: 1 x 5 ## pack rule var id value ## ## 1 low_mass is_small_mass mass 16 FALSE ## Error: assert_any_breaker: Some breakers found in exposure. Remove numeric columns with mean value below certain threshold. To achieve that one should formulate rule as “column mean should be above threshold”, identify breakers and act upon this information. remove_bad_cols <- function(.tbl) { bad_cols <- .tbl %>% get_report() %>% pull(var) %>% unique() .tbl[, setdiff(colnames(.tbl), bad_cols)] } starwars %>% expose( col_packs( . %>% summarise_if(is.numeric, rules(mean(., na.rm = TRUE) >= 100)) ) ) %>% act_after_exposure( .trigger = any_breaker, .actor = remove_bad_cols ) %>% remove_exposure() ## # A tibble: 87 x 11 ## name height hair_color skin_color eye_color gender homeworld ## ## 1 Luke Skywalker 172 blond fair blue male Tatooine ## 2 C-3PO 167 gold yellow Tatooine ## 3 R2-D2 96 white, blue red Naboo ## 4 Darth Vader 202 none white yellow male Tatooine ## 5 Leia Organa 150 brown light brown female Alderaan ## # ... with 82 more rows, and 4 more variables: species , ## # films , vehicles , starships • act_after_exposure is a wrapper for performing actions after exposing. It takes .trigger function to trigger action and .actor function to perform action and return its result. • any_breaker is a function which return TRUE if tidy validation report attached to it has any breaker and FALSE otherwise. Conclusions • Yoda and Jabba the Hutt are outliers among other “Star Wars” characters: Yoda is by height and birth year, Jabba is by mass and also birth year. • There are less than 10 “Star Wars” films yet. • ruler offers flexible and extendable functionality for common validation tasks. Validation can be done for data [as a whole], group of rows [as a whole], column [as a whole], row [as a whole] and cell. After exposing data frame of interest to rules and obtaining tidy validation report, one can perform any action based on this information: explore report, throw error, impute input data frame, etc. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: QuestionFlow . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### AI School: Microsoft R and SQL Server ML Services Mon, 12/04/2017 - 21:40 (This article was first published on Revolutions, and kindly contributed to R-bloggers) If you'd like to learn how you use R to develop AI applications, the Microsoft AI School now features a learning path focused on Microsoft R and SQL Server ML Services. This learning path includes eight modules, each comprising detailed tutorials and examples: All of the Microsoft AI School learning paths are free to access, and the content is hosted on Github (where feedback is welcome!). You can access this course and many others at the link below. Microsoft AI School: Microsoft R and SQL Server ML Services var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### 5 ways a SAS Health Check can simplify a transition to R Mon, 12/04/2017 - 18:26 (This article was first published on Mango Solutions, and kindly contributed to R-bloggers) Nic Crane, Data Scientist At Mango, we’re seeing more and more clients making the decision to modernise their analytics process; moving away from SAS and on to R, Python, and other technologies. There are a variety of reasons for this, including SAS license costs, the increase of recent graduates with R and Python skills, SAS becoming increasingly uncommon, or the need for flexible technologies which have the capability for advanced analytics and quality graphics output. While such transitions are typically about much more than just technology migration, the code accounts for a significant degree of the complexity. So, in order to support our clients, we have developed a software suite to analyse the existing SAS code and simplify this process. So how can a SAS Code Health Check help you decide on how to tackle this kind of transformation? 1. Analyse procedure calls to inform technology choice Using the right technology for the right job is important if we want to create code which is easy to maintain for years, saving us time and resources. But how can we determine the best tool for the job? A key part of any SAS code analysis involves looking at the procedure calls in the SAS codebase to get a quick view of the key functionality. For example, we can see from the analysis above that this codebase mainly consists of calls to PROC SORT and PROC SQL –SAS procedures which reorder data and execute SQL commands used for interacting with databases or tables of data. As there are no statistics related procs, we may decide —if we migrate this application away from SAS— to move this functionality directly into the database. The second graph shows an application which has a high degree of statistical functionality, using the FORECAST, TIMESERIES, and ARIMA procedures to fit complex predictive time series models. As R has sophisticated time series modelling packages, we might decide to move this application to R. 2. Use macro analysis to find the most and least important components of an application Looking at the raw source code doesn’t give us any context about what the most important components of our codebase are. How do we know which code is most important and needs to be prioritised? And how can we avoid spending time redeveloping code which has been written, but is never actually used? We can answer these questions by taking a look at the analysis of the macros and how often they’re used in the code. Macros are like user-defined functions which can combine multiple data steps, proc steps, and logic, and are useful for grouping commands we want to call more than once. Looking at the plot above, we can see that the transfer_data macro is called 17 times, so we know it’s important to our codebase. When redeveloping the code, we might want to pay extra attention to this macro as it’s crucial to the application’s functionality. On the other hand, looking at load_other, we can see that it’s never called – this is known as ‘orphaned code’ and is common in large legacy codebases. With this knowledge, we can automatically exclude this to avoid wasting time and resource examining it. 3. Looking at the interrelated components to understand process flow When redeveloping individual applications, planning the project and allocating resources requires an understanding of how the different components fit together and which parts are more complex than others. How do we gain this understanding without spending hours reading every line of code? The process flow diagram above allows us to see which scripts are linked to other scripts. Each node represents a script in the codebase, and is scaled by size. The nodes are coloured by complexity. Looking at the diagram above, we can instantly see that the create_metadata script is both large and complex, and so we might choose to assign this to a more experienced developer, or look to restructure it first. 4. Examine code complexity to assess what needs redeveloping and redesigning Even with organisational best practice guidelines, there can still be discrepancies in the quality and style of code produced when it was first created. How do we know which code is fit for purpose, and which code needs restructuring so we can allocate resources more effectively? Thankfully, we can use ‘cyclomatic complexity’ which will assess how complex the code is. The results of this analysis will determine: whether it needs to be broken down into smaller chunks, how much testing is needed, and which code needs to be assigned to more experienced developers. 5. Use the high level overview to get an informed perspective which ties into your strategic objectives Analytics modernisation programs can be large and complex projects, and the focus of a SAS Code Health Check is to allow people to make well-informed decisions by reducing the number of unknowns. So, how do we prioritise our applications in a way that ties into our strategic objectives? The overall summary can be used to answer questions around the relative size and complexity of multiple applications; making it possible to estimate more accurately on the time and effort required for redevelopment. Custom comparison metrics can be created on the basis of strategic decisions. For example, if your key priority is to consolidate your ETL process and you might first focus on the apps which have a high number of calls to proc SQL. Or you might have a goal of improving the quality of your graphics and so you’ll focus on the applications which produce a large number of plots. Either way, a high level summary like the one below collects all the information you need in one place and simplifies the decision-making process. SAS conversion projects tend to be large and complicated, and require deep expertise to ensure their success. A SAS Health Check can help reduce uncertainty, guide your decisions and save you time, resources and, ultimately, money. If you’re thinking of reducing or completely redeveloping your SAS estate, and want to know more about how Mango Solutions can help, get in touch with with our team today via sales@mango-solutions.com or +44 (0)1249 705 450. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Announcing the Plumber v0.4.4 Release Mon, 12/04/2017 - 17:11 (This article was first published on Trestle Technology, LLC - R, and kindly contributed to R-bloggers) Plumber is a package which allows you to create web APIs from your R code. If you’re new to Plumber, you can find out more at www.rplumber.io. We’re excited to announce the v0.4.4 release of Plumber! This release adds a handful of oft-requested features and cleans up a few issues that came out of the major refactor that took place in the 0.4.2 release. We’ve also continued to expand the official Plumber documentation. We’ll mention the highlights below, but you can see the full release notes for v0.4.4 here. The release is on CRAN now, so you can update using: install.packages("plumber") Plumber v0.4.4 Highlights 1. Support customized image sizes on the @png and @jpeg annotations. More details here. 2. Support expiration, HTTPOnly, and Secure flags on cookies as discussed here (but see the “Known Bugs” section below for an important note about cookie expiration in v0.4.4). 3. Restore functionality of PlumberStatic routers (#156). 4. Support arguments sent from clients to nested subrouters. 5. For APIs deployed using DigitalOcean, set the working directory to the root of the API before starting. 6. Restore functionality of setErrorHandler. 7. Ignore capitalization when searching for plumber.r and entrypoint.r files when plumb()ing a directory. 8. Support query strings with keys that appear more than once (#165) 9. Fix the validation error warning at the bottom of deployed Swagger files which would have appeared any time your swagger.json file was hosted in such a way that a hosted validator service would not have been able to access it. For now we just suppress validation of swagger.json files. (#149) 10. Support for floating IPs in DNS check that occurs in do_configure_https() 11. Make adding swap file idempotent in do_provision() so you can now call that function on a single droplet multiple times. 12. Support an exit hook which can define a function that will be evaluated when the API is interrupted. e.g. pr <- plumb("plumber.R"); pr$registerHook("exit", function(){ print("Bye bye!") })
13. Fixed bug in which a single function couldn’t support multiple paths for a
single verb (#203).
Known Bugs

At the time of writing, one bug has already been fixed since the v0.4.4 release. Cookie expiration times were not properly being sent to clients which caused most clients to ignore these times altogether and revert back to using session cookies. If you wish to set an expiration time on a cookie, you will need to use the development release of Plumber which you can install using:

devtools::install_github("trestletech/plumber") var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Trestle Technology, LLC - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Outliers Detection and Intervention Analysis

Mon, 12/04/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

In my previous tutorial Arima Models and Intervention Analysis we took advantage of the strucchange package to identify and date time series level shifts structural changes. Based on that, we were able to define ARIMA models with improved AIC metrics. Furthermore, the attentive analysis of the ACF/PACF plots highlighted the presence of seasonal patterns. In this tutorial we will investigate another flavour of intervention variable, the transient change. We will take advantage of two fundamental packages for the purpose:

* tsoutliers * TSA

Specifically, we will compare results obtained by modeling the transient change.

Outliers Analysis

Outliers detection relates with intervention analysis as the latter can be argued as a special case of the former one. A basic list of intervention variable is:

* step response intervention * pulse response intervention

A basic list of outliers is:

* Additive Outliers * Level Shifts * Transient Change * Innovation Outliers * Seasonal Level Shifts

An Additive Outlier (AO) represents an isolated spike.
A Level Shift (LS) represents an abrupt change in the mean level and it may be seasonal (Seasonal Level Shift, SLS) or not.
A Transient Change (TC) represents a spike that takes a few periods to disappear.
An Intervention Outlier (IO) represents a shock in the innovations of the model.

Pre-specified outliers are able to satisfactorily describe the dynamic pattern of untypical effects and can be captured by means of intervention variables.

Intervention Analysis – Common Models

A short introduction of the very basic common models of functions useful for intervention analysis follows.

Step function

The step function is useful to represent level shift outliers.

\begin{aligned} S_{T}(t) &=\left\{ \begin{array}{@{}ll@{}} 0 & \text{if}\ t < T \\ 1 & \text{otherwise}\ \end{array} \right. \end{aligned}

It can be thought as a special case of the transient change intervention model with delta = 1 (see ahead the transient change model). We can represent it with the help of the filter() function as follows. .

tc <- rep(0, 50) tc[20] <- 1 ls <- filter(tc, filter = 1, method = "recursive") plot(ls, main = "Level Shift - TC delta = 1", type = "s")

By adding up two step functions at different lags, it is possible to represent additive outliers or transitory level shifts, as we will see soon.

Pulse function

The pulse function is useful to represent additive outliers.

\begin{aligned} P_{T}(t) = S_{T}(t)\ -\ S_{T}(t-1) \end{aligned}

It can be thought as a special case of the transient change intervention model with delta = 0 (see ahead the transient change model). We can graphically represent it with the help of the filter() function as herein shown.

ao <- filter(tc, filter = 0, method = "recursive") plot(ao, main = "Additive Outlier - TC delta = 0", type = "s")

Level Shift function

The level shift function is useful to represent level shift outliers. It can be modeled in terms of step function with magnitude equal to the omega parameter.

\begin{aligned} m(t) = \omega S_{T}(t) \end{aligned}

The graphical representation is the same of the step function with magnitude equal to the omega parameter of the formula above.

Transient change function

The transient change function is useful to represent transient change outliers.

\begin{aligned} \ C(t) = \dfrac{\omega B}{1 – \delta B} P_{T}(t) \end{aligned}

We can graphically represent it by the help of the filter() function as herein shown. Two delta values are considered to show how the transient change varies accordingly.

tc_0_4 <- filter(tc, filter = 0.4, method = "recursive") tc_0_8 <- filter(tc, filter = 0.8, method = "recursive") plot(tc_0_4, main = "TC delta = 0.4") plot(tc_0_8, main = "TC delta = 0.8")

Packages suppressPackageStartupMessages(library(tsoutliers)) suppressPackageStartupMessages(library(TSA)) suppressPackageStartupMessages(library(lmtest)) suppressPackageStartupMessages(library(astsa)) Analysis

In the following, I will analyse the sex ratio at birth as based on the Arbuthnot dataset which provides information of male and female births in London from year 1639 to 1710. As done in ref. [1], a metric representing the fractional excess of boys births versus girls is defined as:

\begin{aligned} \dfrac{(Boys – Girls)}{Girls} \end{aligned}

url <- "https://www.openintro.org/stat/data/arbuthnot.csv" abhutondot <- read.csv(url, header=TRUE) boys_ts <- ts(abhutondot$boys, frequency=1, start = abhutondot$year[1]) girls_ts <- ts(abhutondot$girls, frequency=1, start = abhutondot$year[1]) delta_ts <- boys_ts - girls_ts excess_ts <- delta_ts/girls_ts plot(excess_ts)

Gives this plot.

With the help of the tso() function within tsoutliers package, we identify if any outliers are present in our excess_ts time series.

outliers_excess_ts <- tso(excess_ts, types = c("TC", "AO", "LS", "IO", "SLS")) outliers_excess_ts Series: excess_ts Regression with ARIMA(0,0,0) errors Coefficients: intercept TC31 0.0665 0.1049 s.e. 0.0031 0.0199 sigma^2 estimated as 0.0007378: log likelihood=180.34 AIC=-354.69 AICc=-354.38 BIC=-347.47 Outliers: type ind time coefhat tstat 1 TC 31 1659 0.1049 5.283

A transient change outlier occurring on year 1659 was identified. We can inspect graphically the results too.

plot(outliers_excess_ts)

Gives this plot.

We found an outlier of Transient Change flavour occurring on year 1659. Specific details are herein shown.

outliers_excess_ts$outliers type ind time coefhat tstat 1 TC 31 1659 0.1049228 5.28339 # time index where the outliers have been detected (outliers_idx <- outliers_excess_ts$outliers$ind) [1] 31 # calendar years where the outliers have been detected outliers_excess_ts$outliers$time [1] 1659 We now want to evaluate the effect of such transient change, comparing the time series under analysis pre and post intervention. #length of our time series n <- length(excess_ts) # transient change outlier at the same time index as found for our time series mo_tc <- outliers("TC", outliers_idx) # transient change effect data is stored into a one-column matrix, tc tc <- outliers.effects(mo_tc, n) TC31 [1,] 0.000000e+00 [2,] 0.000000e+00 [3,] 0.000000e+00 [4,] 0.000000e+00 [5,] 0.000000e+00 [6,] 0.000000e+00 [7,] 0.000000e+00 [8,] 0.000000e+00 [9,] 0.000000e+00 [10,] 0.000000e+00 ... The “coefhat” named data frame stores the coefficient used as multiplier for our transient change tc data matrix. # converting to a number coefhat <- as.numeric(outliers_excess_ts$outliers["coefhat"]) # obtaining the transient change data with same magnitude as determined by the tso() function tc_effect <- coefhat*tc # definining a time series for the transient change data tc_effect_ts <- ts(tc_effect, frequency = frequency(excess_ts), start = start(excess_ts)) # subtracting the transient change intervention to the original time series, obtaining the pre-intervention time series excess_wo_ts <- excess_ts - tc_effect_ts # plot of the original, the pre-intervention and transient change time series plot(cbind(excess_ts, excess_wo_ts, tc_effect_ts))

Gives this plot.

We can further highlight the difference between the original time series and the pre-intervention one.

plot(excess_ts, type ='b', ylab = "excess birth ratio") lines(excess_wo_ts, col = 'red', lty = 3, type ='b')

Gives this plot.

A quick check on the residuals of the pre-intervention time series confirms validity of the ARIMA(0,0,0) model for.

sarima(excess_wo_ts, p=0, d=0, q=0)

Gives this plot.

Now, we implement a similar representation of the transient change outlier by taking advantage of the arimax() function within the TSA package. The arimax() function requires to specify some ARMA parameters, and that is done by capturing the seasonality as discussed in ref. [1]. Further, the transient change is specified by means of xtransf and transfer input parameters. The xtransf parameter is a matrix with each column containing a covariate that affects the time series response in terms of an ARMA filter of order (p,q). For our scenario, it provides a value equal to 1 at the outliers time index and zero at others. The transfer parameter is a list consisting of the ARMA orders for each transfer covariate. For our scenario, we specify an AR order equal to 1.

arimax_model <- arimax(excess_ts, order = c(0,0,0), seasonal = list(order = c(1,0,0), period = 10), xtransf = data.frame(I1 = (1*(seq(excess_ts) == outliers_idx))), transfer = list(c(1,0)), method='ML') summary(arimax_model) Call: arimax(x = excess_ts, order = c(0, 0, 0), seasonal = list(order = c(1, 0, 0), period = 10), method = "ML", xtransf = data.frame(I1 = (1 * (seq(excess_ts) == outliers_idx))), transfer = list(c(1, 0))) Coefficients: sar1 intercept I1-AR1 I1-MA0 0.2373 0.0668 0.7601 0.0794 s.e. 0.1199 0.0039 0.0896 0.0220 sigma^2 estimated as 0.0006825: log likelihood = 182.24, aic = -356.48 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set -0.0001754497 0.0261243 0.02163487 -20.98443 42.09192 0.7459687 0.1429339

The significance of the coefficients is then verified.

coeftest(arimax_model) z test of coefficients: Estimate Std. Error z value Pr(>|z|) sar1 0.2372520 0.1199420 1.9781 0.0479224 * intercept 0.0667816 0.0038564 17.3173 < 2.2e-16 *** I1-AR1 0.7600662 0.0895745 8.4853 < 2.2e-16 *** I1-MA0 0.0794284 0.0219683 3.6156 0.0002997 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The model coefficients are all statistically significative. We remark that the resulting model shows a slight improved AIC result with respect the previous model. Further, both models show improved AIC values with respect previous tutorial discussed ARIMA models.

A quick check on the residuals of the pre-intervention time series confirms validity of the ARIMA(0,0,0)(1,0,0)[10] model for.

sarima(excess_wo_arimax_ts, p=0, d=0, q=0, P=1, D=0, Q=0, S=10)

Gives this plot.

Let us plot the original time series against the fitted one.

plot(excess_ts) lines(fitted(arimax_model), col = 'blue')

Gives this plot.

Considering the transient change model formula, we can elaborate a linear filter with the AR parameter as coefficient and the MA parameter to multiply the filter() function result.

# pulse intervention variable int_var <- 1*(seq(excess_ts) == outliers_idx) # transient change intervention variable obtained by filtering the pulse according to the # definition of transient change and parameters obtained by the ARIMAX model tc_effect_arimax <- filter(int_var, filter = coef(arimax_model)["I1-AR1"], method = "rec", sides = 1) * coef(arimax_model)["I1-MA0"] # defining the time series for the intervention effect tc_effect_arimax_ts <- ts(tc_effect_arimax, frequency = frequency(excess_ts), start = start(excess_ts))

It is interesting to compare two transient change effects, as obtained by the arimax() and the tso() functions.

# comparing transient change effect resulting by ARIMAX (red) with the tso() one (blue) plot(tc_effect_arimax_ts, col ='red', type='l', ylab = "transient change") lines(tc_effect_ts, col ='blue', lty = 3)

By eyeballing the plot above, they appear pretty close.

If you have any questions, please feel free to comment below.

References

Related Post

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### RTutor: Water Pollution and Cancer

Mon, 12/04/2017 - 10:00

(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

One very important benefit of stronger environmental protection is to reduce the damaging effects of pollution on human health.

In his very interesting article “The Consequences of Industrialization: Evidence from Water Pollution and Digestive Cancers in China” (Review of Economics and Statistics, 2012) Avraham Ebenstein studies the impact of water pollution on the rate of digestive cancers for several Chinese river systems. He convincingly argues that there is a causal effect of substantial size and a cost-benefit analysis shows how stricter environmental regulations would allow to statistically save a human life at relatively low cost.

As part of her Master Thesis at Ulm University, Brigitte Peter has generated a very nice RTutor problem set that allows you to replicate the main insights of the article in an interactive fashion. You learn about R, econometrics and the identification of causal effects from field data, as well as the relationship between water pollution and digestive cancer.

Like in previous RTutor problem sets, you can enter free R code in a web based shiny app. The code will be automatically checked and you can get hints how to proceed. In addition you are challenged by many multiple choice quizzes.

To install the problem set the problem set locally, first install RTutor as explained here:

https://github.com/skranz/RTutor

and then install the problem set package:

https://github.com/brigittepeter/RTutorWaterPollutionChina

There is also an online version hosted by shinyapps.io that allows you explore the problem set without any local installation. (The online version is capped at 25 hours total usage time per month. So it may be greyed out when you click at it.)

https://brigittepeter.shinyapps.io/RTutorWaterPollutionChina/

If you want to learn more about RTutor, to try out other problem sets, or to create a problem set yourself, take a look at the RTutor Github page

https://github.com/skranz/RTutor

You can also install RTutor as a docker container:
https://hub.docker.com/r/skranz/rtutor/

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### How to Show R Inline Code Blocks in R Markdown

Mon, 12/04/2017 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Inline code with R Markdown

R Markdown is a well-known tool for reproducible science in R. In this article, I will focus on a few tricks with R inline code.

Some time ago, I was writing a vignette for my package WordR. I was using R Markdown. At one point I wanted to show r expression in the output, exactly as it is shown here, as an inline code block.

In both R Markdown and Markdown, we can write abc to show abc (I will use the blue colour for code blocks showing code as it appears in Rmd file, whereas the default colour will be showing the rendered output). What is not obvious is that you can use double backticks to escape single backticks in the code block. So code like this:  abc  (mind the spaces!) produces this abc.

Now as an exercise, you can guess how I produced the  abc  block above. Yes, indeed, I have   abc   in the Rmd source file. And we can go on like this ad infinitum (can we?).

OK, but I wanted to produce r expression. Learning the lesson above, we can try  r expression . But trying this, I was getting an error:

processing file: codeBlocks.Rmd Quitting from lines 12-22 (codeBlocks.Rmd) Error in vapply(x, format_sci_one, character(1L), ..., USE.NAMES = FALSE) : values must be length 1, but FUN(X[[1]]) result is length 0 Calls: ... paste -> hook -> .inline.hook -> format_sci -> vapply Execution halted

Obviously, the R Markdown renderer is trying to evaluate the expression. So it seems that R Markdown renderer does not know that it should (should it?) skip R inline code blocks which are enclosed by double backticks.

Solution

Making a long (and yes, I spent some time to find a solution) story short. The correct code block to produce r expression is  r "\u0060r expression\u0060" .

Short explanation how it works: \\u0060 is an Unicode representation of the backtick (). So first, the R Markdown renderer finds the R expression within the double backticks and it evaluates it. Important here is the usage of the Unicode for backtick, since using backtick within the expression would result in an error. (We are lucky, that the R Markdown renderer is not running recursively, finding again the R code block and evaluating it again.) So once the R Markdown is done, the Markdown is just seeing  r expression  in the temporary .md file, and it evaluates it correctly to r expression in the HTML output.

If you want to see (much) more, just look at the source R Markdown file for this article here. Do you know a better, more elegant solution? If you do, please use the discussion below.

Epilogue

Some time after I sent the draft of this blog to the RViews admin, I got a reply (thank you!) which pointed to the knitr FAQ page, specifically question number 7 (and a new post from author of knitr package explaining it a little further). It suggests probably more elegant solution of using

Some text before inline code  r expression  and some code after

(mind the newline!) that will produce Some text before inline code r expression and some text after or use  r knitr::inline_expr("expression")  which produces similarly r expression`.

But, I believe this post (especially its source) might still help someone to understand how the R inline code is evaluated.

_____='https://rviews.rstudio.com/2017/12/04/how-to-show-r-inline-code-blocks-in-r-markdown/';

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### When you use inverse probability weighting for estimation, what are the weights actually doing?

Mon, 12/04/2017 - 01:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

Towards the end of Part 1 of this short series on confounding, IPW, and (hopefully) marginal structural models, I talked a little bit about the fact that inverse probability weighting (IPW) can provide unbiased estimates of marginal causal effects in the context of confounding just as more traditional regression models like OLS can. I used an example based on a normally distributed outcome. Now, that example wasn’t super interesting, because in the case of a linear model with homogeneous treatment effects (i.e. no interaction), the marginal causal effect is the same as the conditional effect (that is, conditional on the confounders.) There was no real reason to use IPW in that example – I just wanted to illustrate that the estimates looked reasonable.

But in many cases, the conditional effect is different from the marginal effect. (And in other cases, there may not even be an obvious way to estimate the conditional effect – that will be the topic for the last post in this series). When the outcome is binary, the notion that conditional effects are equal to marginal effects is no longer the case. (I’ve touched on this before.) What this means, is that we can recover the true conditional effects using logistic regression, but we cannot estimate the marginal effect. This is directly related to the fact that logistic regression is linear on the logit (or log-odds) scale, not on the probability scale. The issue here is collapsibility, or rather, non-collapsibility.

A simulation

Because binary outcomes are less amenable to visual illustration, I am going to stick with model estimation to see how this plays out:

library(simstudy) # define the data defB <- defData(varname = "L", formula =0.27, dist = "binary") defB <- defData(defB, varname = "Y0", formula = "-2.5 + 1.75*L", dist = "binary", link = "logit") defB <- defData(defB, varname = "Y1", formula = "-1.5 + 1.75*L", dist = "binary", link = "logit") defB <- defData(defB, varname = "A", formula = "0.315 + 0.352 * L", dist = "binary") defB <- defData(defB, varname = "Y", formula = "Y0 + A * (Y1 - Y0)", dist = "nonrandom") # generate the data set.seed(2002) dtB <- genData(200000, defB) dtB[1:6] ## id L Y0 Y1 A Y ## 1: 1 0 0 0 0 0 ## 2: 2 0 0 0 0 0 ## 3: 3 1 0 1 1 1 ## 4: 4 0 1 1 1 1 ## 5: 5 1 0 0 1 0 ## 6: 6 1 0 0 0 0

We can look directly at the potential outcomes to see the true causal effect, measured as a log odds ratio (LOR):

odds <- function (p) { return((p/(1 - p))) } # log odds ratio for entire sample (marginal LOR) dtB[, log( odds( mean(Y1) ) / odds( mean(Y0) ) )] ## [1] 0.8651611

In the linear regression context, the conditional effect measured using observed data from the exposed and unexposed subgroups was in fact a good estimate of the marginal effect in the population. Not the case here, as the conditional causal effect (LOR) of A is 1.0, which is greater than the true marginal effect of 0.86:

library(broom) tidy(glm(Y ~ A + L , data = dtB, family="binomial")) ## term estimate std.error statistic p.value ## 1 (Intercept) -2.4895846 0.01053398 -236.33836 0 ## 2 A 0.9947154 0.01268904 78.39167 0 ## 3 L 1.7411358 0.01249180 139.38225 0

This regression estimate for the coefficient of $$A$$ is a good estimate of the conditional effect in the population (based on the potential outcomes at each level of $$L$$):

dtB[, .(LOR = log( odds( mean(Y1) ) / odds( mean(Y0) ) ) ), keyby = L] ## L LOR ## 1: 0 0.9842565 ## 2: 1 0.9865561

Of course, ignoring the confounder $$L$$ is not very useful if we are interested in recovering the marginal effect. The estimate of 1.4 is biased for both the conditional effect and the marginal effect – it is not really useful for anything:

tidy(glm(Y ~ A , data = dtB, family="binomial")) ## term estimate std.error statistic p.value ## 1 (Intercept) -2.049994 0.009164085 -223.6987 0 ## 2 A 1.433094 0.011723767 122.2384 0 How weighting reshapes the data …

Here is a simple tree graph that shows the potential outcomes for 1000 individuals (based on the same distributions we’ve been using in our simulation). For 27% of the individuals, $$L=1$$, for 73% $$L=0$$. Each individual has a potential outcome under each level of treatment $$A$$. So, that is why there are 730 individuals with $$L=0$$ who are both with and without treatment. Likewise each treatment arm for those with $$L=0$$ has 270 individuals. We are not double counting.

Both the marginal and conditional estimates that we estimated before using the simulated data can be calculated directly using information from this tree. The conditional effects on the log-odds scale can be calculated as …

$LOR_{A=1 \textbf{ vs } A=0|L = 0} = log \left (\frac{0.182/0.818}{0.076/0.924} \right)=log(2.705) = 0.995$

and

$LOR_{A=1 \textbf{ vs } A=0|L = 1} = log \left (\frac{0.562/0.438}{0.324/0.676} \right)=log(2.677) = 0.984$

The marginal effect on the log odds scale is estimated marginal probabilities: $$P(Y=1|A=0)$$ and $$P(Y=1|A=1)$$. Again, we can take this right from the tree …

$P(Y=1|A=0) = 0.73\times0.076 + 0.27\times0.324 = 0.143$ and

$P(Y=1|A=1) = 0.73\times0.182 + 0.27\times0.562 = 0.285$

Based on these average outcomes (probabilities) by exposure, the marginal log-odds for the sample is:

$LOR_{A=1 \textbf{ vs } A=0} = log \left (\frac{0.285/0.715}{0.143/0.857} \right)=log(2.389) = 0.871$

Back in the real world of observed data, this is what the tree diagram looks like:

This tree tells us that the probability of exposure $$A=1$$ is different depending upon that value of $$L$$. For $$L=1$$, $$P(A=1) = 230/730 = 0.315$$ and for $$L=0$$, $$P(A=1) = 180/270 = 0.667$$. Because of this disparity, the crude estimate of the effect (ignoring $$L$$) is biased for the marginal causal effect:

$P(Y=1|A=0) = \frac{500\times0.076 + 90\times0.324}{500+90}=0.114$

and

$P(Y=1|A=1) = \frac{230\times0.182 + 180\times0.562}{230+180}=0.349$

The crude log odds ratio is

$LOR_{A=1 \textbf{ vs } A=0} = log \left (\frac{0.349/0.651}{0.114/0.886} \right)=log(4.170) = 1.420$

As mentioned in the prior post, the IPW is based on the probability of the actual exposure at each level of $$L$$: $$P(A=a | L)$$, where $$a\in(0,1)$$ (and not on $$P(A=1|L)$$, the propensity score). Here are the simple weights for each group:

If we apply the weights to each of the respective groups, you can see that the number of individuals in each treatment arm is adjusted to the total number of individuals in the sub-group defined the level of $$L$$. For example, if we apply the weight of 3.17 (730/230) to each person observed with $$L=0$$ and $$A=1$$, we end up with $$230\times3.17=730$$. Applying each of the respective weights to the subgroups of $$L$$ and $$A$$ results in a new sample of individuals that looks exactly like the one we started out with in the potential outcomes world:

This all works only if we make these two assumptions: $P(Y=1|A=0, L=l) = P(Y_0=1 | A=1, L=l)$ and $P(Y=1|A=1, L=l) = P(Y_1=1 | A=0, L=l)$

That is, we can make this claim only under the assumption of no unmeasured confounding. (This was discussed in the Part 1 post.)

Applying IPW to our data

We need to estimate the weights using logistic regression (though other, more flexible methods, can also be used). First, we estimate $$P(A=1|L)$$ …

exposureModel <- glm(A ~ L, data = dtB, family = "binomial") dtB[, pA := predict(exposureModel, type = "response")]

Now we can derive an estimate for $$P(A=a|L=l)$$ and get the weight itself…

# Define two new columns defB2 <- defDataAdd(varname = "pA_actual", formula = "(A * pA) + ((1 - A) * (1 - pA))", dist = "nonrandom") defB2 <- defDataAdd(defB2, varname = "IPW", formula = "1/pA_actual", dist = "nonrandom") # Add weights dtB <- addColumns(defB2, dtB) dtB[1:6] ## id L Y0 Y1 A Y pA pA_actual IPW ## 1: 1 0 0 0 0 0 0.3146009 0.6853991 1.459004 ## 2: 2 0 0 0 0 0 0.3146009 0.6853991 1.459004 ## 3: 3 1 0 1 1 1 0.6682351 0.6682351 1.496479 ## 4: 4 0 1 1 1 1 0.3146009 0.3146009 3.178631 ## 5: 5 1 0 0 1 0 0.6682351 0.6682351 1.496479 ## 6: 6 1 0 0 0 0 0.6682351 0.3317649 3.014183

To estimate the marginal effect on the log-odds scale, we use the function glm with weights specified by IPW. The true value of marginal effect (based on the population-wide potential outcomes) was 0.87 (as we estimated from the potential outcomes directly and from the tree graph). Our estimate here is spot on (but with such a large sample size, this is not so surprising):

tidy(glm(Y ~ A , data = dtB, family="binomial", weights = IPW)) ## term estimate std.error statistic p.value ## 1 (Intercept) -1.7879512 0.006381189 -280.1909 0 ## 2 A 0.8743154 0.008074115 108.2862 0

It may not seem like a big deal to be able to estimate the marginal effect – we may actually not be interested in it. However, in the next post, I will touch on the issue of estimating causal effects when there are repeated exposures (for example, administering a drug over time) and time dependent confounders that are both affected by prior exposures and affect future exposures and affect the outcome. Under this scenario, it is very difficult if not impossible to control for these confounders – the best we might be able to do is estimate a marginal, population-wide causal effect. That is where weighting will be really useful.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Naming things is hard

Sun, 12/03/2017 - 23:00

Prefixing R function names –

‘There are only two hard things in Computer Science: cache invalidation and naming things.’

The above quip by Phil Karlton is fairly well known and often quoted, sometimes with amusing extensions:

There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.

— Jeff Atwood (@codinghorror) August 31, 2014

There are only 2 hard things in computer science:
0. Cache invalidation
1. Naming things
7. Asynchronous callbacks
2. Off-by-one errors

— Paweł Zajączkowski (@gvaireth) September 18, 2017

These are funny, but they do also convey some truth: in the midst of all the technicalities and abstractions we can find ourselves caught up with in the world of programming, it’s surprising how often seemingly ‘simple’ things like naming things trip us up.

I was recently reminded of this when a difference of opinions about function names sparked some healthy debate in a pull request I was reviewing for one of my personal projects.

So the question I want to raise is this:

“When (if ever) is it a good idea to adopt a prefixing convention for the names of exported functions in an R package?”

Disclaimers

Before I dive into the details I feel it is important to state a few things upfront.

Firstly, I want to thank my friends and collaborators Katrin and Lorenz who are strong proponents of open source software and for whom I have a lot of respect. On this occasion they both seem to disagree with me, but that is not a bad thing – discussion and debate is valuable, and that’s not gonna happen when everyone agrees with each other all the time. I did also ask for their permission before publishing this post.

Secondly, my purpose in writing about this is less about trying to determine who is right, and more about attempting to convert this experience into insight. Often we learn more from our failures than from our successes, but it’s harder to share our mistakes than it is to share our triumphs. So this post is my way of being vulnerable about something that is a work in progress and the process of trying to improve on it. As an aside, if you haven’t encountered survivorship bias before, I highly recommend you read this.

Thirdly, I was wrong. This is especially important in light of the previous point. I was wrong to raise the issue of function names (for the package as a whole) in a pull request which was focused on something else. This is a valuable lesson. One should always aim to keep pull requests (and issues) narrow in scope because attempting to tackle multiple problems in one place (unless they are inherently linked or dependent on each other) is messy and complicates matters unnecessarily.

Lastly, what I share will be my own opinion, but it is just an opinion and I’m always open to learning from others with different views. My hope is that collectively we can share some worthwhile perspectives from both sides and possibly encourage more thinking and conversation around this or related issues.

Background Context

Ok, so having made those upfront disclaimers, I’ll begin by summarizing the back-story and context in which the discussion arose. If you’d like to refer to the pull request itself – it can be found here.

excercism.io is a learning platform that aims to provide code practice and mentorship for everyone. I got involved in developing the R track on Exercism and wrote about it earlier this year. Unlike most online learning platforms, with Exercism, all the coding happens on your machine in an environment you’re familiar with. So Exercism provides a command line tool which leverages an API in order to facilitate the process of fetching and submitting exercises.

A few months ago, I had an idea to write an R package which wraps this API. The thinking was that the user experience (for R users) might be improved upon by facilitating interaction with exercism.io directly from R itself. This removes the need for switching repeatedly between R and a shell when fetching, iterating on and submitting exercises – although now the addition of terminal tabs in RStudio 1.1 has already reduced this friction to a degree. In any case, there are additional opportunities for Exercism helper functions in the package which can be context aware and integrate with the RStudio if it is being used. An example this could be functions (or addins) which make use of the rstudioapi to detect which problem was last worked on when submitting so that it doesn’t need to be specified manually.

Katrin, who is a co-maintainer for the R track on exercism.io, has also been collaborating on this R package with me and has had some great ideas like leveraging testthat’s auto_test() to facilitate and encourage test driven development, as this is one of the implicit goals of Exercism. In the PR introducing this feature, the potential for function name confusion was soon evident when this new Exercism specific version of testthat::auto_test() was (initially) given the name autotest(). This reminded me that I’d in fact been thinking for a while about renaming all the exported functions to adopt the prefixing convention ex_* (for a few different reasons which I’ll get to later). So I figured this “name clash” was as good a catalyst as any, and made the suggestion to start adopting the new naming convention in the PR. Once again it’s worth noting that this was a mistake – I should have instead opened a separate issue to discuss my proposed change in naming conventions.

Discussion & follow-up

The suggestion was met with some resistance, and after some further discussion it became clear to me that it was a thoughtfully considered resistance. So I asked my friend Lorenz to weigh in on the discussion too, given that he knows Katrin and I but is not involved in the project and thus has the benefit of a more neutral perspective. To my surprise, he did not agree with me either!

But I did still seem to have Jenny Bryan on my side (thanks Jenny!), and I figured Hadley (or Sir Lord General Professor Wickham as Mara generally likes to refer to him) had to have thought it was a good idea at some point given the str_ prefix for stringr and fct_ prefix for forcats among others. So after thinking on the problem for a while, out of curiousity I eventually tweeted out a poll to see if I could get any sense of where the #rstats community falls on this issue.

What is your take on prefixing conventions for #rstats function names? (e.g. stringr/stringi, forcats, googlesheets)

— Jon Calder (@jonmcalder) October 20, 2017

At a glance it looks like a reasonable proportion are actually in favour of prefixing conventions for function names (or at least not against the idea), but of course there are a number of disclaimers to make here:

• Character limits (at the time of the poll) made it hard to communicate the question clearly or to include any additional context for the question, so that probably leaves a lot of room for interpretation
• I don’t have much reach on Twitter, so there weren’t many responses (81 votes is not much to go on)
• Even if there had been a good number of responses, Twitter polls need to be looked at skeptically given the potential for sampling bias
• Speaking of sampling bias, most of the votes came in after Hadley tweeted a reply to the poll so it makes sense that the results would be skewed towards his legions of followers (I’m one of them and the degree of influence is clear because his packages are what got me considering prefixing conventions in the first place, among others like googlesheets)

Maëlle had two helpful follow-ups for me. Firstly, she encouraged me to blog about this (and I don’t think I would have done so otherwise so thanks Maëlle!). Secondly, she directed me to the ROpenSci review process for her package ropenaq, which provides access to air quality data via the OpenAQ API. In his review of the package, Andrew MacDonald suggested the following:

“I was thinking that the names of functions might be a bit similar to functions in other packages that also use geography. What do you think of prefixing every function with a package-specific string? Perhaps something like aq_ before all the user-facing functions (i.e. countries() becomes aq_countries()). This is similar to another rOpenSci package, geonames, which uses GN (as in GNcities()). This has the added benefit of playing very nicely with Rstudio, which will show all the package functions as completions when users type aq_ and hit tab.”

Interestingly, this suggestion (although the original inspiration may have come from elsewhere) was later incorporated into ROpenSci’s packaging guide:

Consider an object_verb() naming scheme for functions in your package that take a common data type or interact with a common API. object refers to the data/API and verb the primary action. This scheme helps avoid namespace conflicts with packages that may have similar verbs, and makes code readable and easy to auto-complete. For instance, in stringi, functions starting with stri_ manipulate strings (stri_join(), stri_sort(), and in googlesheets functions starting with gs_ are calls to the Google Sheets API (gs_auth(), gs_user(), gs_download()).

Though I hadn’t seen this recommendation from ROpenSci at the time, it aligns very strongly with my initial reasoning for wanting to change the function names in the exercism package. It is primarily an API package, and all functions either interact with the exercism.io API or act on some (local) Exercism data/code (exercises). A potential objection could be that in some cases the ex_* prefix may be interpreted either as exercism_* or as exercise_*, but I don’t think that’s a problem since either way the context is common and shared implicitly.

Having said that, I’m also aware that a prefixing convention is not suitable in the majority of cases and there are reasons to avoid it, otherwise it would already be far more common. I’ve not tried to summarize the arguments for and against it here since this post is already quite lengthy, but I believe Katrin and Lorenz both raised a number of good points over in the original PR thread, so I would encourage you to read through that to get some more insight into the potential pros and cons.

Below is an overview of the currently exported functions for exercism, along with a brief
description of what they do and potential new names for each should we adopt a prefixing convention:

Current Function Description New Name? set_api_key() Set an environment variable for the provided exercism.io API key, and store in .Renviron so that it can persist for future sessions. ex_set_key() set_exercism_path() Set an environment variable for the provided exercism path, and store in .Renviron so that it can persist for future sessions. ex_set_path() track_status() Fetches current track status from exercism.io ex_status() check_next_problem() Returns the next problem for a language track ex_check() fetch_problem() Fetches the files for a problem via the Exercism API and writes them into a new problem folder in the Exercism directory ex_fetch() fetch_next() Checks for the next problem via the Exercism API, and writes the files into the folder in the Exercism directory *special case of ex_fetch() open_exercise() Open files for an exercism.io problem ex_open() start_testing() Exercism- and R-specific wrapper for testthat::auto_test() that starts testing your solution against the problem’s test cases. ex_auto_test() submit() Submits the specified solution to exercism.io ex_submit() skip_problem() Marks a problem as ‘skipped’ via the Exercism API ex_skip() browse_exercise() Navigate to an exercise description on exercism.io ex_browse() browse_solution() Navigate to an exercise solution on exercism.io *special case of ex_browse()

So looking at the above, do you think this a good use case for an object_verb() naming convention? How should one determine this? Please feel free to comment with your thoughts and suggestions below or ping me on Twitter.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Explain! Explain! Explain!

Sun, 12/03/2017 - 08:00

(This article was first published on SmarterPoland.pl » English, and kindly contributed to R-bloggers)

Predictive modeling is fun. With random forest, xgboost, lightgbm and other elastic models…
Problems start when someone is asking how predictions are calculated.
Well, some black boxes are hard to explain.
And this is why we need good explainers.

In the June Aleksandra Paluszynska defended her master thesis Structure mining and knowledge extraction from random forest. Find the corresponding package randomForestExplainer and its vignette here.

In the September David Foster published a very interesting package xgboostExplainer. Try it to extract useful information from a xgboost model and create waterfall plots that explain variable contributions in predictions. Read more about this package here.

In the October Albert Cheng published lightgbmExplainer. Package with waterfall plots implemented for lightGBM models. Its usage is very similar to the xgboostExplainer package.

Waterfall plots that explain single predictions are great. They are useful also for linear models. So if you are working with lm() or glm() try the brand new breakDown package (hmm, maybe it should be named glmExplainer). It creates graphical explanations for predictions and has such a nice cheatsheet:

Install the package from https://pbiecek.github.io/breakDown/.

Thanks to RStudio for the cheatsheet’s template.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: SmarterPoland.pl » English. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### A Tidytext Analysis of the Weinstein Effect

Sun, 12/03/2017 - 01:00

(This article was first published on Computational Social Science, and kindly contributed to R-bloggers)

Quantifying He-Said, She-Said: Newspaper Reporting

I have been meaning to get into quantitative text analysis for a while. I initially planned this post to feature a different package (that I wanted to showcase), however I ran into some problems with their .json parsing methods and currently waiting for the issue to be solved on their end. The great thing about doing data science with R is that there are multiple avenues leading you to the same destination, so let’s take advantage of that.

My initial inspiration came from David Robinson’s post on gendered verbs. I remember sending it around and thinking it was quite cool. Turns out he was building on Julia Silge’s earlier post on gender pronouns. I see that post and I go, ‘what a gorgeous looking ggplot theme!’. So. Neat. Praise be the open source gods, the code is on GitHub. Let’s take advantage of that too.

I still needed a topic, and even though both the Wikipedia plots and the Jane Austen datasets sound interesting to look at, I felt that there is another, obvious choice. It has a Wikipedia page and its own subreddit. Also, the title might have given it away. Let’s get to work.

Getting Full Text News Articles

My first instinct was to check out the NYT APIs—it made sense, given that they broke the news (along with the New Yorker). Everything seemed to be working out just fine, until I realised you cannot get the full text—only the lead. Staying true to my strict data scientist training, I googled ‘full text newspaper api r’ and there it was: GuardianR. Sorry NYC mates, we reckon we will have to cross the pond for this one.

Note that any one source will always be biased. If you are not familiar with the Guardian, it’s British and has a left-centre bias. It might be prudent to pair it with a right-centre newspaper, however not all newspapers provide APIs (which in itself is another selection bias). Alas, we will move on just with the Guardian—insert idiom regarding salt. Finally, you will need to get a free API key from their open source platform. You still have to register, but you are only in danger if you vote Tory and stand on the left side of the escalator. Once you got it, install/load the package via CRAN:

library(GuardianR) ls(pos = "package:GuardianR") ## [1] "get_guardian" "get_json" "parse_json_to_df"

As you can see, the GuardianR package is a simple one: it contains only three (but powerful) functions. We only need the first one to get a hold of the full text articles, and the syntax is super simple:

#not evaluated articles <- get_guardian(keywords = "sexual+harassment", section = "world", from.date = "2012-11-30", to.date = "2017-11-30", api.key = "your-key-here")

Running the above chunk with your own key will get you all articles published in the Guardian in the last five years tagged under the news section ‘world’ and containing the keywords ‘sexual harassment’ in the Guardian API. The keywords can be as simple or complicated as you want; just add more terms using the plus sign.

You might think the time frame is a bit skewed towards the ‘pre’ era—the news broke out on October 5, 2017. Going all the way back five full years, we are comparing 58 months worth of ‘pre’ to only 2 months of ‘post’ Weinstein. However, luckily for you blog posts are not written in real-time, meaning you get to see a (somewhat working) final result so just bear with me. And no, this is not at all like scientists running 514 regressions and failing to mention this tidbit in their publication. Relevant xkcd.

No, the reason is pure pragmatism. It’s not that running the code ‘live’ and getting the articles ‘real-time’ would not slow down this page—it’s not how it works. However, it is good practice to extract a tad bigger chunk than you think you will need, as you can always slice it up later to suit your needs better.

In any case, I am working with downloaded data so I will just load it up. Feel free to subset the data to see whether the results change if you use a different cut-off point. Also, if you go back the same amount of time (i.e. two months before October 5), that would lead to 183 articles for pre and 121 articles for the post time period—it is a reckoning, alright. Going back five years gets us 1224 articles in total; so we actually have 1103-pre and 121-post articles (89% to 11%). That’s more or less cross-validation ratio (well, a bit on the less side maybe), and we will roll with that for now.

articles <- read.csv("articles.csv", stringsAsFactors = FALSE) dim(articles) ## [1] 1224 27 sum(articles$wordcount) ## [1] 1352717 colnames(articles) ## [1] "id" "sectionId" "sectionName" ## [4] "webPublicationDate" "webTitle" "webUrl" ## [7] "apiUrl" "newspaperPageNumber" "trailText" ## [10] "headline" "showInRelatedContent" "lastModified" ## [13] "hasStoryPackage" "score" "standfirst" ## [16] "shortUrl" "wordcount" "commentable" ## [19] "allowUgc" "isPremoderated" "byline" ## [22] "publication" "newspaperEditionDate" "shouldHideAdverts" ## [25] "liveBloggingNow" "commentCloseDate" "body" We get a bunch of variables (27) with that call, but we won’t be needing most of them for our analysis: #laziest subset for only two variables want.var <- c("webPublicationDate", "body") want <- which(colnames(articles) %in% want.var) articles <- articles[, want] articles$webPublicationDate <- as.Date.factor(articles$webPublicationDate) The body contains the full-text, however it’s in HTML: dplyr::glimpse(articles$body[1]) ## chr "

Numerous women have accused Don Burke of indecent assault, sexual harassment and bullying during the 1980s a"| __truncated__

At this point, I must admit I resorted to hacking a bit. I’m sure there is a more elegant solution here. I’ll go with this SO answer to extract text from HTML. Basically, the cleaning function removes the HTML using regex. Unfortunately, this does not clear up various apostrophes found in the text. For that, we switch the encoding from ASCII to byte:

articles$body <- iconv(articles$body, "", "ASCII", "byte") cleanFun <- function(htmlString) { return(gsub("<.*?>", "", htmlString)) } articles$body <- cleanFun(articles$body) dplyr::glimpse(articles$body[1]) ## chr "Numerous women have accused Don Burke of indecent assault, sexual harassment and bullying during the 1980s and "| __truncated__ This will end up cutting some legitimate apostrophes (e.g. “hasn’t”, “didn’t” to “hasn”, “didn”) in some cases, but we will fix that later on. Let’s split the data on date October 5, 2017 and get rid of the date column afterwards: #You can also use negative index for subsetting articles.before <- articles[articles$webPublicationDate < "2017-10-05", ] articles.after <- articles[articles\$webPublicationDate >= "2017-10-05", ] full.text.before <- articles.before[, 2] full.text.before <- as.data.frame(full.text.before) full.text.after <- articles.after[, 2] full.text.after <- as.data.frame(full.text.after) N-Grams and Combinatorics

To me, n-grams are what prisoner’s dilemma to college freshman—that ‘wow, so simple but so cool’ moment. As in, simple after the fact when someone has already figured it out and explained it to you. N-grams are essentially combinations of n words. For example, a bigram (2-gram). Using the tidytext package developed by David and Julia, we can create them in a flash with unnest_tokens. After that, we will separate the bigrams into two distinct words. Next, we will subset the bigrams so that the first word is either he or she. Finally, we will transform the words into frequency counts. I’m heavily recycling their code—no need to reinvent the wheel:

library(tidytext) library(tidyverse) #or just dplyr and tidyr if you are allergic #Create bigrams bigrams.before <- full.text.before %>% unnest_tokens(bigram, full.text.before, token = "ngrams", n = 2) nrow(bigrams.before) ## [1] 1311051 head(bigrams.before) ## bigram ## 1 the walk ## 1.1 walk from ## 1.2 from the ## 1.3 the gare ## 1.4 gare du ## 1.5 du nord #Separate bigrams into two words bigrams.separated.before <- bigrams.before %>% separate(bigram, c("word1", "word2"), sep = " ") head(bigrams.separated.before) ## word1 word2 ## 1 the walk ## 1.1 walk from ## 1.2 from the ## 1.3 the gare ## 1.4 gare du ## 1.5 du nord #Subset he and she in word1 he.she.words.before <- bigrams.separated.before %>% filter(word1 %in% c("he", "she")) #Fix the missing t's after apostrophe fix.apos <- c("hasn", "hadn", "doesn", "didn", "isn", "wasn", "couldn", "wouldn") he.she.words.before <- he.she.words.before %>% mutate(word2 = ifelse(word2 %in% fix.apos, paste0(word2, "t"), word2)) #10 random samples; the numbers are row numbers not counts set.seed(1895) dplyr::sample_n(he.she.words.before, 10) ## word1 word2 ## 4403 she doesnt ## 3732 he was ## 5222 she wasnt ## 11862 she said ## 3972 she wrote ## 3189 he says ## 3952 she sees ## 4878 he was ## 9314 he went ## 9408 she noted #Transform words into counts, add +1 for log transformation he.she.counts.before <- he.she.words.before %>% count(word1, word2) %>% spread(word1, n, fill = 0) %>% mutate(total = he + she, he = (he + 1) / sum(he + 1), she = (she + 1) / sum(she + 1), log.ratio = log2(she / he), abs.ratio = abs(log.ratio)) %>% arrange(desc(log.ratio)) #Top 5 words after she head(he.she.counts.before) ## # A tibble: 6 x 6 ## word2 he she total log.ratio abs.ratio ## ## 1 testified 0.0002194908 0.0027206771 18 3.631734 3.631734 ## 2 awoke 0.0001097454 0.0010580411 6 3.269163 3.269163 ## 3 filed 0.0002194908 0.0021160822 14 3.269163 3.269163 ## 4 woke 0.0002194908 0.0019649335 13 3.162248 3.162248 ## 5 misses 0.0001097454 0.0007557437 4 2.783737 2.783737 ## 6 quickly 0.0001097454 0.0007557437 4 2.783737 2.783737

A couple of observations. First, n-grams overlap, resulting in 1.6M observations (and this is only the pre-period). However, we will only use the gendered subset, which is much more smaller in size. Second, as we define the log ratio as (she / he), the sign of the log ratio determines the direction (positive for she, negative for he), while the absolute value of the log ratio is just the effect size (without direction).

Good stuff, no? Wait until you see the visualisations.

Let There Be GGraphs

Both David and Julia utilise neat data visualisations to drive home their point. I especially like the roboto theme/font, so I will just go ahead and use it. You need to install the fonts separately, so if you are missing them you will get an error message.

We are also loading several other libraries. In addition to the usual suspects, ggrepel will make sure we can plot overlapping labels in a bit nicer way. Let’s start by looking at the most gendered verbs in articles on sexual harassment. In other words, we are identifying which verbs are most skewed towards one gender. I maintain the original logarithmic scale, so the effect sizes are in magnitudes and easy to interpret. If you read the blog posts, you will notice that Julia reports a unidirectional magnitude (relative to she/he), so her scales go from

.25x .5x x 2x 4x

whereas David uses directions, i.e.

'more he' 4x 2x x 2x 4x 'more she'

In both cases, x denotes the same frequency (equally likely) of usage. I don’t think one approach is necessarily better than the other, but I went with David’s approach. Finally, I filter out non-verbs plus ‘have’ and only plot verbs that occur at least five times. If you are serious about filtering out (or the opposite, filtering on) classes of words—say a certain sentiment or a set of adjectives—you should locate a dictionary from an NLP package and extract the relevant words from there. Here, I am doing it quite ad-hoc (and manually):

he.she.counts.before %>% filter(!word2 %in% c("himself", "herself", "ever", "quickly", "actually", "sexually", "allegedly", "have"), total >= 5) %>% group_by(direction = ifelse(log.ratio > 0, 'More "she"', "More 'he'")) %>% top_n(15, abs.ratio) %>% ungroup() %>% mutate(word2 = reorder(word2, log.ratio)) %>% ggplot(aes(word2, log.ratio, fill = direction)) + geom_col() + coord_flip() + labs(x = "", y = 'Relative appearance after "she" compared to "he"', fill = "", title = "Pre Weinstein: 2012-17 The Guardian Articles on Sexual Harassment", subtitle = "Top 15 Most Gendered (Skewed) Verbs after he/she; at least 5 occurrences.") + scale_y_continuous(labels = c("8X", "6X", "4X", "2X", "Same", "2X", "4X", "6X", "8X"), breaks = seq(-4, 4)) + guides(fill = guide_legend(reverse = TRUE)) + expand_limits(y = c(-4, 4))

Several immediate and depressing trends emerge from the data. The top active verbs for women cluster on bringing charges: ‘testified’, ‘filed’; whereas male verbs seem to react to those with ‘argued’, ‘faces’, ‘acknowledges’, and ‘apologized’. Women ‘awoke’ and ‘woke’, matching the more violent male verbs such as ‘drugged’, ‘assaulted’, ‘punched’, and ‘raped’. ‘Alleged’ occurs four times more after she relative to he, and there is no mention of denial (e.g. ‘denied’, ‘denies’) after he. A note on word variations: in some cases, it might be better to combine conjugations into a single category using a wildcard (such as expect* in the graph above). However, I thought the tense can also contribute to a quantitative story, so I left them as they are.

Another way of visualising the gendered differences is to plot their magnitude in addition to their frequency. This time, we are not limited to just verbs; however we still filter out some uninteresting words. There are additional ggplot and ggrepel arguments in this plot. First, I added two reference lines: a red y-intercept with geom_hline to act as a baseline and an invisible x-intercept using geom_vline to give the labels more space on the left-hand side. Do you not love tidy grammar? Last but not least, I insert geom_text_repel to give us more readability: segment.alpha controls the line transparency, while the force argument governs the aggressiveness of the jittering algorithm. We could supply it with a fill argument that corresponds to a factor variable to highlight a certain characteristic (say, total occurrence), however there is not much meaningful variation there in our case.

he.she.counts.before %>% filter(!word2 %in% c("himself", "herself", "she", "too", "later", "apos", "just", "says"), total >= 10) %>% top_n(100, abs.ratio) %>% ggplot(aes(total, log.ratio)) + geom_point() + geom_vline(xintercept = 5, color = "NA") + geom_hline(yintercept = 0, color = "red") + scale_x_log10(breaks = c(10, 100, 1000)) + geom_text_repel(aes(label = word2), segment.alpha = .1, force = 2) + scale_y_continuous(breaks = seq(-4, 4), labels = c('8X "he"', '6X "he"', '4X "he"', '2X "he"', "Same", '2X "she"', '4X "she"', '6X "she"', '8X "she"')) + labs(x = 'Total uses after "he" or "she" (Logarithmic scale)', y = 'Relative uses after "she" to after "he"', title = "Gendered Reporting: Pre Weinstein, The Guardian", subtitle = "Words occurring at least 10 times after he/she: 160 unique words (100 displayed) | 11,013 occurrences in total") + expand_limits(y = c(4, -4))

Plotting frequencies complement the first plot quite nicely. We can infer reported characteristics more easily when there is a tangible baseline. Words around the red line occur after she or he more or less equally: the y-axis determines the relational effect size (with regards to gender), and the x-axis displays the total amount of occurrences. Some additional insights: we see that ‘sexually’ and ‘allegedly’ popping up after he quite frequently. There is also the verb ‘admitted’, as well as ‘denies’ (even though visually it is located above the red line, if you follow the grey segment, it’s located around 1X ‘he’). For women, more morbid words like ‘suffered’, ‘died’ are added to the mix. There are also nuances regarding the tense; ‘claims’ follows she twice more than he, while ‘claimed’ is twice likely to come after he.

Moving on to the post-Weinstein period (‘the effect’), I quietly run the same code, and plot the equivalent graphics below. Some caveats: with the smaller sample size, I lowered the inclusion threshold from 5 to 2. Additionally, although it is top 15 most skewed verbs per gender, because of frequent ties, it ends up having more than that at the end.

After the scandal broke, we see that women are reported to have ‘complained’, ‘hoped’, and ‘became’. On the other hand, men are vehemently denying the accusations, with ‘denies’ and ‘denied’ being the most skewed verbs following he. Random point: in the pre-period, it’s ‘apologized’, in the post-period, it’s ‘apologised’. Maybe Brexit can manifest in mysterious ways.

Again we turn to the frequency plot to infer more. In addition to denial, men are also reported to use words such as ‘categorically’ and ‘utterly’. Both ‘claims’ and ‘claimed’ occur more after she, not repeating the earlier dynamic regarding the tense. In addition, we don’t see ‘alleged’ or ‘allegedly’ featured in the plot at all. How much of this change can we attribute to the effect? At a glance, we definitely see a difference. For example, verbs display a good variation for both genders. The post-frequency plot features less violent words than the pre-frequency plot. There is a lot more ‘denying’ and not much ‘alleging’ in the post-Weinstein period.

Some are definitely data artefacts. The post-frequency plot is ‘cleaner’—in addition to (and directly caused by) the smaller n—because the cut-off point is set to ‘more than once’: if we remove the filter, all the violence and harassment terms are back in. Some are probably reporter/reporting biases plus the prevalent gendered thinking (that occurs both on a conscious level and below). And perhaps some are genuine effects—true signal. It is still too early to pass judgement on whether the Weinstein effect will result in tangible, positive change. Currently, all we can get is a short, limited glimpse at the available data.

Hopefully you managed to enjoy this rather depressing data undertaking using the tidytext package. As usual, the underlying code is available on GitHub. N-grams are powerful. Imagine the possibilities: assume you have access to a rich dataset (say, minimum 100K very long articles/essays). You can construct n-grams sequentially; i.e. 2-grams; 3-grams, 4-grams etc., separate the words, and subset for gendered pronouns. This would give you access to structures like “he” + “word” + “her” (direct action) and “she” + “word” + “word” + “him” (allowing for adjective + verb). Then it would be possible to look for all kinds of interactions, revealing their underlying structure. I will be reporting more on this front, until I move onto image processing (looking at you, keras).

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Computational Social Science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...