Error message

  • Deprecated function: ini_set(): Use of mbstring.http_input is deprecated in include_once() (line 654 of /home/spatiala/public_html/geostat-course.org/sites/default/settings.php).
  • Deprecated function: ini_set(): Use of mbstring.http_output is deprecated in include_once() (line 655 of /home/spatiala/public_html/geostat-course.org/sites/default/settings.php).
Subscribe to R bloggers feed R bloggers
R news and tutorials contributed by hundreds of R bloggers
Updated: 6 hours 35 min ago

logic: Interpretation of Logic Gates (Using TikZ)

Sun, 05/13/2018 - 00:00

(This article was first published on R/exams, and kindly contributed to R-bloggers)

Exercise template for matching logic gate diagrams (drawn with TikZ) to the corresponding truth table.

Name: logic Type: schoice Related: automaton

Description: Gate diagrams for three logical operators (sampled from: and, or, xor, nand, nor) are drawn with TikZ and have to be matched to a truth table for another randomly drawn logical operator. Depending on the exams2xyz() interface the TikZ graphic can be rendered in PNG, SVG, or directly by LaTeX. Solution feedback: Yes Randomization: Shuffling, text blocks, and graphics Mathematical notation: No Verbatim R input/output: No Images: Yes Other supplements: No Template: logic.Rnw logic.Rmd Raw: (1 random version) logic.tex logic.md PDF: HTML:

Demo code:

library("exams") set.seed(1090) exams2html("logic.Rnw") set.seed(1090) exams2pdf("logic.Rnw") set.seed(1090) exams2html("logic.Rmd") set.seed(1090) exams2pdf("logic.Rmd") var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R/exams. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Do know Shake Shack’s locations outside of the US? You’d be surprised

Sat, 05/12/2018 - 20:35

(This article was first published on R – nandeshwar.info, and kindly contributed to R-bloggers)

Madison Shake

I had heard that the lines to get some food at Shake Shack are long. So when I saw a new location opening in downtown LA, I wondered how many locations does it have and how fast are they spreading across the US. The answers surprised me. Using R and previous code, I created a few maps:

Read on to learn how I got the data and plotted them.

Load Libraries

First, let’s load our favourite libraries.

1 2 3 4 5 library(rvest) library(readr) library(tidyverse) library(scales) library(ggmap) Figure out locations

On its site, Shake Shack fortunately has all the locations and opening dates, going back to April 23, 2012. The archive pages run from 1 to 20 with this URL structure:

https://www.shakeshack.com/location/page/

Using SelectorGadget, I figured out the XPath and CSS code to find the opening date, location name, and location page link. Then, I wrote a function to retrieve these values from a given archive page.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 get_locations <- function(url) { page_html <- read_html(url) nodes <- page_html %>% html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "span4", " " ))]')   data.frame(opdate = html_nodes(x = nodes, xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "date", " " ))]') %>% html_text(trim = TRUE), store_loc_name = html_nodes(x = nodes, css = 'h2') %>% html_text(trim = TRUE), store_loc_link = html_nodes(x = nodes, css = 'h2 a') %>% html_attr("href"), stringsAsFactors = FALSE) }

I applied this function to retrieve all location opening dates, names, and individual location urls:

1 2 3 all_loc_pages <- paste0("https://www.shakeshack.com/location/page/", 1:20, "/")   all_locations <- do.call(rbind, lapply(all_loc_pages, get_locations)) Find addresses of all locations

If you visit an individual location’s page, such as this Tokyo Dome page, you will see that often the exact address is not listed, or if it is, you can’t directly geocode it. But, luckily, there’s a Google Map right below the location. I thought, they must be passing some parameters to Google Maps API. I spend a good amount of time, but couldn’t figure out how they were getting the map. And. Then. I found out that the text “CLICK MAP FOR DIRECTIONS” block had a valid address as part of the hyperlink!!

I wrote another simple function to get the addresses from the given URL:

1 2 3 4 5 6 7 8 9 10 get_loc_cords <- function(loc_url) { location_html <- read_html(loc_url) data.frame(loc_url = loc_url, goog_map_url = location_html %>% html_nodes(xpath = '//a[text()="Click here for directions"]') %>% html_attr("href"), stringsAsFactors = FALSE) }   location_google_maps_address <- do.call(rbind, lapply(all_locations$store_loc_link, get_loc_cords))

Then I joined the location name with the address data frame:

1 all_locations <- left_join(all_locations, location_google_maps_address, by = c("store_loc_link" = "loc_url")) Geocoding the addresses

Using the fantastic ggmap library and mutate_geocode function, I geocoded all the addresses:

1 2 3 all_locations <- all_locations %>% mutate(google_addr_string = str_sub(goog_map_url, start = 36)) %>% mutate_geocode(google_addr_string, output = "latlon")

Here’s what the data frame looks like now:

Tip

You may want to create a Google developer key for mass geocoding. Since the mutate_geocode function is used by many people, sometimes you may not get all the addresses geocoded. Use register_google(key = , account_type = 'premium', day_limit = 100000) function to register your key with ggmap functions. Data manipulation

Now that we have all the geographical coordinates, we just need to do some clean-up to get the data ready for plotting.

First, get the date field in order and add opening month and year columns:

1 2 3 4 all_locations <- all_locations %>% mutate(open_date = as.Date(opdate, "%B %d, %Y"), open_month = lubridate::month(open_date), open_year = lubridate::year(open_date))

Second, get the cumulative count of store openings:

1 2 3 4 5 ss_op_data_smry <- all_locations %>% count(open_date) %>% ungroup() %>% arrange(open_date) %>% mutate(cumm_n = cumsum(n))

Third, join the summary back to the locations data frame:

1 2 all_locations_smry <- inner_join(all_locations, ss_op_data_smry, by = c("open_date" = "open_date")) Get the maps ready

Using the ggmap library, I got the US map and a world map:

1 2 us_map <- get_stamenmap(c(left = -125, bottom = 24, right = -67, top = 49), zoom = 5, maptype = "toner-lite") ggmap(us_map)

1 2 world_map <- get_stamenmap(bbox = c(left = -180, bottom = -60, right = 179.9999, top = 70), zoom = 3, maptype = "toner-lite") ggmap(world_map)

Create functions to plot each location

Repurposing my code from the Walmart spread across the US, I wrote a similar function to plot locations with two different sizes: big, if the locations opened during the mapped month, and small, if the locations opened before the mapped month. I did so that we could notice the new locations.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 my_us_plot <- function(df, plotdate, mapid){ g <- ggmap(us_map, darken = c("0.8", "black"), extent = "device") old_df <- filter(df, open_date < plotdate) new_df <- filter(df, open_date == plotdate) # old locations g <- g + geom_point(data = old_df, aes(x = lon, y = lat), size = 5, color = "dodgerblue", alpha = 0.4) # new locations g <- g + geom_point(data = new_df, aes(x = lon, y = lat), size = 8, color = "dodgerblue", alpha = 0.4) g <- g + theme(axis.ticks = element_blank(), axis.title = element_blank(), axis.text = element_blank(), plot.title = element_blank(), panel.background = element_rect(fill = "grey20"), plot.background = element_rect(fill = "grey20")) g <- g + annotate("text", x = -77, y = 33, label = "MONTH/YEAR:", color = "white", size = rel(5), hjust = 0) g <- g + annotate("text", x = -77, y = 32, label = paste0(toupper(month.name[unique(new_df$open_month)]), "/", unique(new_df$open_year)), color = "white", size = rel(6), fontface = 2, hjust = 0) g <- g + annotate("text", x = -77, y = 31, label = "STORE COUNT:", color = "white", size = rel(5), hjust = 0) g <- g + annotate("text", x = -77, y = 30, label = comma(unique(new_df$cumm_n)), color = "white", size = rel(6), fontface = 2, hjust = 0) filename <- paste0("maps/img_" , str_pad(mapid, 7, pad = "0"), ".png") ggsave(filename = filename, plot = g, width = 13, height = 7, dpi = 120, type = "cairo-png") }

I modified this function to map the world:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 my_world_plot <- function(df, plotdate, mapid){ g <- ggmap(world_map, darken = c("0.8", "black"), extent = "device") old_df <- filter(df, open_date < plotdate) new_df <- filter(df, open_date == plotdate) g <- g + geom_point(data = old_df, aes(x = lon, y = lat), size = 5, color = "dodgerblue", alpha = 0.4) g <- g + geom_point(data = new_df, aes(x = lon, y = lat), size = 8, color = "dodgerblue", alpha = 0.4) g <- g + theme(axis.ticks = element_blank(), axis.title = element_blank(), axis.text = element_blank(), plot.title = element_blank(), panel.background = element_rect(fill = "grey20")) g <- g + annotate("text", x = -130, y = 0, label = "MONTH/YEAR:", color = "white", size = rel(5), hjust = 0) g <- g + annotate("text", x = -130, y = -10, label = paste0(toupper(month.name[unique(new_df$open_month)]), "/", unique(new_df$open_year)), color = "white", size = rel(6), fontface = 2, hjust = 0) g <- g + annotate("text", x = -130, y = -20, label = "STORE COUNT:", color = "white", size = rel(5), hjust = 0) g <- g + annotate("text", x = -130, y = -30, label = comma(unique(new_df$cumm_n)), color = "white", size = rel(6), fontface = 2, hjust = 0) filename <- paste0("maps/img_" , str_pad(mapid, 7, pad = "0"), ".png") ggsave(filename = filename, plot = g, width = 12, height = 6, dpi = 150, type = "cairo-png") } Create maps

Now, the exciting part: create month-by-month maps.

US maps:

1 2 3 4 all_locations_smry %>% mutate(mapid = group_indices_(all_locations_smry, .dots = 'open_date')) %>% group_by(open_date) %>% do(pl = my_us_plot(all_locations_smry, unique(.$open_date), unique(.$mapid)))

World maps:

1 2 3 4 all_locations_smry %>% mutate(mapid = group_indices_(all_locations_smry, .dots = 'open_date')) %>% group_by(open_date) %>% do(pl = my_world_plot(all_locations_smry, unique(.$open_date), unique(.$mapid))) Create a movie

Using ffmpeg, we can put all the images together to create a movie:

1 2 3 # works on a mac makemovie_cmd <- paste0("ffmpeg -framerate 8 -y -pattern_type glob -i '", paste0(getwd(), "/maps/"), "*.png'", " -c:v libx264 -pix_fmt yuv420p '", paste0(getwd(), "/maps/"), "movie.mp4'") system(makemovie_cmd)

We can use the convert function to create a gif:

1 2 3 # https://askubuntu.com/a/43767 makegif_cmd <- paste0("convert -delay 8 -loop 0 ", paste0(getwd(), "/maps/"), "*.png ", "animated.gif") # loop 0 for forever looping system(makegif_cmd)

That’s it! We get nice looking videos showing location openings by each month. I was surprised to see how fast the company is opening the locations as well as how many locations it has in Asia!

Post hoc

Using the ggimage library, I tried creating the maps using Shake Shack’s burger icon, but they didn’t turn out as good:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 my_us_icon_plot <- function(df, plotdate, mapid){ g <- ggmap(us_map, darken = c("0.8", "black")) old_df <- filter(df, open_date < plotdate) new_df <- filter(df, open_date == plotdate) g <- g + geom_image(data = old_df, aes(x = lon, y = lat), image = "ss-app-logo.png", by = "height", size = 0.03, alpha = 0.4) g <- g + geom_image(data = new_df, aes(x = lon, y = lat), image = "ss-app-logo.png", by = "height", size = 0.07, alpha = 0.4) g <- g + theme(axis.ticks = element_blank(), axis.title = element_blank(), axis.text = element_blank(), plot.title = element_blank()) g <- g + annotate("text", x = -77, y = 33, label = "MONTH/YEAR:", color = "white", size = rel(5), hjust = 0) g <- g + annotate("text", x = -77, y = 32, label = paste0(toupper(month.name[unique(new_df$open_month)]), "/", unique(new_df$open_year)), color = "white", size = rel(6), fontface = 2, hjust = 0) g <- g + annotate("text", x = -77, y = 31, label = "STORE COUNT:", color = "white", size = rel(5), hjust = 0) g <- g + annotate("text", x = -77, y = 30, label = comma(unique(new_df$cumm_n)), color = "white", size = rel(6), fontface = 2, hjust = 0) filename <- paste0("maps/img_" , str_pad(mapid, 7, pad = "0"), ".png") ggsave(filename = filename, plot = g, width = 13, height = 7, dpi = 150, type = "cairo-png") } Fun maps

What do you think? How else would you visualize these data points?

The post Do know Shake Shack’s locations outside of the US? You’d be surprised appeared first on nandeshwar.info.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – nandeshwar.info. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

RStudio:addins part 2 – roxygen documentation formatting made easy

Sat, 05/12/2018 - 16:00

(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)

Introduction

Code documentation is extremely important if you want to share the code with anyone else, future you included. In this second post in the RStudio:addins series we will pay a part of our technical debt from the previous article and document our R functions conveniently using a new addin we will build for this purpose.

The addin we will create in this article will let us create well formatted roxygen documentation easily by using keyboard shortcuts to add useful tags such as \code{} or \link{} around selected text in RStudio. Contents
  1. Quick intro to documentation with roxygen2
    1. Documenting your first function
    2. Generating and viewing the documentation
    3. Real-life example
  2. Our addins to make documenting a breeze
    1. Add a selected tag around a character string
    2. Apply the tag on a selection in an active document in RStudio
    3. Wrappers around addRoxytag to be used as addin for some useful tags
    4. Add the addin bindings into addins.dcf and assign keyboard shortcuts
  3. The addins in action
  4. What is next – even more automated documentation
  5. TL;DR – Just give me the package
  6. References
Quick intro to documentation with roxygen2 1. Documenting your first function

To help us generate documentation easily we will be using the roxygen2 package. You can install it using install.packages("roxygen2"). Roxygen2 works with in-code tags and will generate R’s documentation format .Rd files, create a NAMESPACE, and manage the Collate field in DESCRIPTION (not relevant to us at this point) automatically for our package.

Documenting a function works in 2 simple steps:

Documenting a function

  1. Inserting a skeleton – Do this by placing your cursor anywhere in the function you want to document and click Code Tools -> Insert Roxygen Skeleton (default keyboard shortcut Ctrl+Shift+Alt+R).
  2. Populating the skeleton with relevant information. A few important tags are:
  • #' @params – describing the arguments of the function
  • #' @return – describing what the function returns
  • #' @importFrom package function – in case your function uses a function from a different package Roxygen will automatically add it to the NAMESPACE
  • #' @export – if case you want the function to be exported (mainly for use by other packages)
  • #' @examples – showing how to use the function in practice
2. Generating and viewing the documentation

Generating and viewing the documentation

  1. We generate the documentation files using roxygen2::roxygenise() or devtools::document() (default keyboard shortcut Ctrl+Shift+D)
  2. Re-installing the package (default keyboard shortcut Ctrl+Shift+B)
  3. Viewing the documentation for a function using ?functioname e.g. ?mean, or placing cursor on a function name and pressing F1 in RStudio – this will open the Viewer pane with the help for that function
3. A real-life example

Let us now document runCurrentRscript a little bit:

#' runCurrentRscript #' @description Wrapper around executeCmd with default arguments for easy use as an RStudio addin #' @param path character(1) string, specifying the path of the file to be used as Rscript argument (ideally a path to an R script) #' @param outputFile character(1) string, specifying the name of the file, into which the output produced by running the Rscript will be written #' @param suffix character(1) string, specifying additional suffix to pass to the command #' @importFrom rstudioapi getActiveDocumentContext #' @importFrom rstudioapi navigateToFile #' @seealso executeCmd #' @return side-effects runCurrentRscript <- function( path = replaceTilde(rstudioapi::getActiveDocumentContext()[["path"]]) , outputFile = "output.txt" , suffix = "2>&1") { cmd <- makeCmd(path, outputFile = outputFile, suffix = suffix) executeCmd(cmd) if (!is.null(outputFile) && file.exists(outputFile)) { rstudioapi::navigateToFile(outputFile) } }

As we can see by looking at ?runCurrentRscript versus ?mean, our documentation does not quite look up to par with documentation for other functions:

What is missing if we abstract from the richness of the content is the usage of markup commands (tags) for formatting and linking our documentation. Some of the very useful such tags are for example:

  • \code{}, \strong{}, \emph{} for font style
  • \link{}, \href{}, \url{} for linking to other parts of the documentation or external resources
  • \enumerate{}, \itemize{}, \tabular{} for using lists and tables
  • \eqn{}, \deqn{} for mathematical expressions such as equations etc.

For the full list of options regarding text formatting, linking and more see Writing R Extensions’ Rd format chapter

Our addins to make documenting a breeze

As you can imagine, typing the markup commands in full all the time is quite tedious. The goal of our new addin will therefore be to make this process efficient using keyboard shortcuts – just select a text and our addin will place the desired tags around it. For this time, we will be satisfied with simple 1 line tags.

1. Add a selected tag around a character string roxyfy <- function(str, tag = NULL, splitLines = TRUE) { if (is.null(tag)) { return(str) } if (!isTRUE(splitLines)) { return(paste0("\\", tag, "{", str, "}")) } str <- unlist(strsplit(str, "\n")) str <- paste0("\\", tag, "{", str, "}") paste(str, collapse = "\n") } 2. Apply the tag on a selection in an active document in RStudio

We will make the functionality available for multi-selections as well by lapply-ing over the selection elements retrieved from the active document in RStudio.

addRoxytag <- function(tag = NULL) { context <- rstudioapi::getActiveDocumentContext() lapply(X = context[["selection"]] , FUN = function(thisSel, contextid) { rstudioapi::modifyRange(location = thisSel[["range"]] , roxyfy(thisSel[["text"]], tag) , id = contextid) } , contextid = context[["id"]] ) return(invisible(NULL)) } 3. Wrappers around addRoxytag to be used as addin for some useful tags addRoxytagCode <- function() { addRoxytag(tag = "code") } addRoxytagLink <- function() { addRoxytag(tag = "link") } addRoxytagEqn <- function() { addRoxytag(tag = "eqn") } 4. Add the addin bindings into addins.dcf and assign keyboard shortcuts

As the final step, we need to add the bindings for our new addins to the inst/rstudio/addins.dcf file and re-install the package.

Name: addRoxytagCode Description: Adds roxgen tag code to current selections in the active RStudio document Binding: addRoxytagCode Interactive: false Name: addRoxytagLink Description: Adds roxgen tag link to current selections in the active RStudio document Binding: addRoxytagLink Interactive: false Name: addRoxytagEqn Description: Adds roxgen tag eqn to current selections in the active RStudio document Binding: addRoxytagEqn Interactive: false

assigning keyboard shortcuts to addins

The addins in action

And now, let’s just select the text we want to format and watch our addins do the work for us! Then document the package, re-install it and view the improved help for our functions:

The addins in action

What is next – even more automated documentation

Next time we will try to enrich our addins for generating documentation by adding the following functionalities

  • automatic generation of @importFrom tags by inspecting the function code
  • allowing for more complex tags such as itemize
TL;DR – Just give me the package References var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Fifteen New Zealand government Shiny web apps by @ellis2013nz

Sat, 05/12/2018 - 14:00

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

table, td, th { border: 3px solid white; border-collapse:separate; border-spacing:0 5px; }

img { width: 100%; } At a glance:

I had a brief look around New Zealand government agency websites and found 15 high quality web apps written in the Shiny platform.

13 May 2018

Here’s fifteen nice web applications built with RStudio’s Shiny framework. All of these are owned and maintained by New Zealand government departments and have a main purpose as making public data more available and accessible for non-specialist users. I think it’s fair to say New Zealand has been a leader in using Shiny for this.

As a commissioner of several of those below and author of one, I can say the appeal of Shiny for this sort of public dissemination of data is the combination of cheap, quick and pretty good. It’s rare to have so little trade-off between those three items of the project management dilemma. Usually it’s “cheap, quick, good – pick any two”.

The modest trade-off in my view is in some small aspects of the eventual quality and performance. The server-client transfers can seem clunky (certainly compared to something written in pure JavaScript), particularly when you’re at the bottom of the world using a server in the USA. They have an annoying habit of freezing and going grey on your users if something goes wrong with the connection. And as they get more complicated, you start using more JavaScript, HTML and CSS anyway. But the results are 80% or 90% good enough for many use cases, and development and deployment costs are materially less than other options. I find Shiny more flexible, powerful and controllable (eg fonts, polish, etc which can all be done with CSS) than Tableau or Power BI, and cheaper and quicker than writing your own app from the ground up.

So, go Shiny!

Ministry of Business, Innovation and Employment I’m pretty sure the New Zealand Tourism Forecasts was the first use of Shiny by a government agency in New Zealand. My old team produced this tool in 2015. It’s nice and simple, not particularly ambitious, but it does the job of letting the user play around with the forecasts much more effectively and better presented than previous dissemination methods (ie Excel pivot tables, if you were lucky). The New Zealand Tourism Dashboard was our first ambitious big Shiny project. A brilliant job by Jimmy Oh, first of his sequence of high quality boundary-pushing apps for MBIE. It combines data from MBIE itself, Stats NZ, and direct from the web. The source code is on GitHub. Building on the style of the tourism dashboard came the New Zealand Sectors Dashboard. It aims to be a one-stop shop for all information about New Zealand’s economy by sectors. It brings together a range of economic datasets produced by MBIE and Statistics New Zealand into one easy-to-use tool. And the New Zealand Labour Market Dashboard. It displays labour market information from many different sources in one place. The Urban Development Capacity Dashboard, jointly branded by MBIE and the Ministry for the Environment, provides charts, maps, tables and underlying data on local markets for housing and business space. The Modelled Territorial Authority GDP is the only Shiny app on this page I can claim to have authored personally. It was the tail end (dissemination) of a big project producing new granular estimates of value add by industry, district and city. There is a paper and presentation on this on the presentations part of my website and source code for the app, and the much bigger job of creating the data, is on GitHub. Stats NZ The Living Cost Explorer presents data from Household Living-costs Price Indexes. It shows how price changes vary depending upon the average basket of goods of different types of people, such as beneficiaries, Māori and superannuitants. Irrigated land in New Zealand uses maps and graphs to present spatial information on irrigation of New Zealand land. This Landcover tool shows composition and changes in land cover. The third of these spatial / environmentally themed tools, Livestock numbers has graphs and maps showing the distribution of cattle (different types thereof), sheep and deer. The Iwi cultural well-being from Te Kupenga 2013 app may be the first Shiny app with an option to swap the user interface into Te reo Māori. This series of Experimental estimates of income has been derived from the tax data available in the Integrated Data Infrastructure as part of ongoing work at Stats NZ to increase the use of administrative data in the production of statistics. Ministry of Health An interactive tool for exploring New Zealand Health Survey data.It presents the latest results by sex, age, ethnic group and neighbourhood deprivation, as well as changes over time. A tool to allow summary data about prescriptions and dispensings funded by the New Zealand Government. Treasury Treasury’s Insights tool provides information drawn from a range of public sector agencies including extensive use of the Integrated Data Infrastructure.

Nice collection. Anyone know of any others?

Disclaimer: I was part of the commissioning team for several of the MBIE Shiny apps, and (as noted above) the author of one. I haven’t been involved in development of any of the others listed above.

Making animated GIFs of websites

To make the animated GIFs used in this website and keep them to a reasonable size (under 2MB each), here’s what I did.

  • I did the original screen captures using the open source CamStudio application and saved them as .avi files. Even though only a part of my screen was captured, with the original screen resolution and about 30 – 45 seconds of content, these files were large; typically 700MB or larger.
  • I used a Python program that GitHub user michaelosthege had published as a Gist to convert from .avi to .gif format. These were about a quarter or fifth the size, but still too large (150MB – 200MB) to use on the web
  • I found another Python program by PaulineLc on another Gist that shrank and sped up animated GIFs.

Is it ironic that a blog post celebrating R Shiny used Python for playing around with the animated images? I don’t think so at all; it’s just a matter of using the convenient and easy tool for the job at hand. Python is awesome with everything to do with images; There are ways in R to do this too (or they could certainly be developed) but it was easier to find out how to do it in Python. I am not as good with Python as with R, but I know how to copy and paste a program someone else has written when it does exactly what I need!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Investing for the Long Run

Sat, 05/12/2018 - 09:00

(This article was first published on Marcelo S. Perlin, and kindly contributed to R-bloggers)

I often get asked about how to invest in the stock market. Not
surprisingly, this has been a common topic in my classes. Brazil is
experiencing a big change in its financial scenario. Historically, fixed
income instruments paid a large premium over the stock market and that
is no longer the case. Interest rates are low, without the pressure from
inflation. This means a more sustainable scenario for low-interest rates
in the future. Without the premium in the fixed income market, people
turn to the stock market.

We can separate investors according to their horizon. Traders try to
profit in the short term, usually within a day, and long-term investors
buy a stock without the intent to sell it in the near future. This type
of investment strategy is called BH (buy and hold). At the extreme,
you buy a stock and hold it forever. The most famous spokesperson of BH
is Warren Buffet, among many others.

Investing in the long run works for me because it doesn’t require much
of my time. You just need to keep up with the quarterly and yearly
financial reports of companies. You can easily do it as a side activity,
parallel to your main job. You don’t need a lot of brain power to do it
either, but it does require knowledge of accounting practices to
understand all the material that is released by the company.

I read many books before starting to invest and one of the most
interesting tables I’ve found portrays the relationship between
investment horizon and profitability. The idea is that the more time you
hold a stock (or index), higher the chance of a profit. The table,
originally from Taleb’s Fooled by Randomness, is as follows.

My problem with the table is that it seems pretty off. My experience
tells me that a 67% chance of positive return every month seems
exaggerated. If that was the case, making money in the stock market
would be easy. Digging deeper, I found out that the data behind the
table is simulated and, therefore, doesn’t really give good an estimate
about the improvement in the probability of profits as a function of the
investment horizon.

As you probably suspect, I decided to tackle the problem using real data
and R. I wrote a simple function
that will grab data, simulate investments of different horizons many
times and plot the results. Let’s try it for the SP500 index:

source('fct_invest_horizon.R') my.ticker <- '^GSPC' # ticker from yahoo finance max.horizon = 255*50 # 50 years first.date <- '1950-01-01' last.date <- Sys.Date() n.points <- 50 # number of points in figure rf.year <- 0 # risk free return (or inflation) l.out <- get.figs.invest.horizon(ticker.in = my.ticker, first.date = first.date, last.date = last.date, max.horizon = max.horizon, n.points = n.points, rf.year = rf.year) ## ## Running BatchGetSymbols for: ## tickers = ^GSPC ## Downloading data for benchmark ticker | Found cache file ## ^GSPC | yahoo (1|1) | Found cache file - Looking good! print(l.out$p1)

print(l.out$p2)

As we can see, the data doesn’t lie. As the investment horizon
increases, the chances of a positive return increases. This result
suggests that, if you invest for more than 13 years, it is very unlikely
that you’ll see a negative return. When looking at the distribution of
total returns by the horizon, we find that it increases significantly
with time. Someone that invested for 50 years is likely to receive a
2500% return on the investment.

With input input rf.year we can also set a desired rate of return.
Let’s try it with 5% return per year, with is pretty standard for
financial markets.

my.ticker <- '^GSPC' # ticker from yahoo finance max.horizon = 255*50 # 50 years first.date <- '1950-01-01' last.date <- Sys.Date() n.points <- 50 # number of points in figure rf.year <- 0.05 # risk free return (or inflation) - yearly l.out <- get.figs.invest.horizon(ticker.in = my.ticker, first.date = first.date, last.date = last.date, max.horizon = max.horizon, n.points = n.points, rf.year = rf.year) ## ## Running BatchGetSymbols for: ## tickers = ^GSPC ## Downloading data for benchmark ticker | Found cache file ## ^GSPC | yahoo (1|1) | Found cache file - Got it! print(l.out$p1)

As expected, the curve of probabilities has a lower slope, meaning that
you need more time investing in the SP500 index to guarantee a return of
more than 5% a year.

Now, let’s try the same setup for Berkshire stock (BRK-A). This is
Buffet’s company and looking at its share price we can have a good
understanding of how successful Buffet has been as a BH (buy and hold)
investor.

my.ticker <- 'BRK-A' # ticker from yahoo finance max.horizon = 255*25 # 50 years first.date <- '1980-01-01' last.date <- Sys.Date() n.points <- 50 # number of points in figure rf.year <- 0.05 # risk free return (or inflation) - yearly l.out <- get.figs.invest.horizon(ticker.in = my.ticker, first.date = first.date, last.date = last.date, max.horizon = max.horizon, n.points = n.points, rf.year = rf.year) ## ## Running BatchGetSymbols for: ## tickers = BRK-A ## Downloading data for benchmark ticker | Found cache file ## BRK-A | yahoo (1|1) | Found cache file - OK! print(l.out$p1)

print(l.out$p2)

Well, needless to say that, historically, Buffet has done very well in
his investments! If you bought the stock and kept it for more 1 year,
there is a 70% chance that you got a profit on your investment.

I hope this post convinced you to start investing. The message from the results are straighforward: the earliest you start, the better.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Marcelo S. Perlin. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Visualizing graphs with overlapping node groups

Fri, 05/11/2018 - 18:36

(This article was first published on r-bloggers – WZB Data Science Blog, and kindly contributed to R-bloggers)

I recently came across some data about multilateral agreements, which needed to be visualized as network plots. This data had some peculiarities that made it more difficult to create a plot that was easy to understand. First, the nodes in the graph were organized in groups but each node could belong to multiple groups or to no group at all. Second, there was one “super node” that was connected to all other nodes (while “normal” nodes were only connected within their group). This made it difficult to find the right layout that showed the connections between the nodes as well as the group memberships. However, digging a little deeper into the R packages igraph and ggraph it is possible to get satisfying results in such a scenario.

Example data, nodes & edges

We will use the following packages that we need to load at first:

library(dplyr) library(purrr) library(igraph) library(ggplot2) library(ggraph) library(RColorBrewer)

Let’s create some exemplary data. Let’s say we have 4 groups a, b, c, d and 40 nodes with the node IDs 1 to 40. Each node can belong to several groups but it must not belong to any group. An example would be the following data:

group_a <- 1:5 # nodes 1 to 5 in group a group_b <- 1:10 # nodes 1 to 10 in group b group_c <- c(1:3, 7:18) # nodes 1 to 3 and 7 to 18 in c group_d <- c(1:4, 15:25) # nodes 1 to 4 and 15 to 25 in d members <- data_frame(id = c(group_a, group_b, group_c, group_d, 26:40), group = c(rep('a', length(group_a)), rep('b', length(group_b)), rep('c', length(group_c)), rep('d', length(group_d)), rep(NA, 15))) # nodes 26 to 40 do not # belong to any group

An excerpt of the data:

> members id group 1 a 2 a [...] 5 a 1 b 2 b [...] 38 NA 39 NA 40 NA

Now we can create the edges of the graph, i.e. the connections between the nodes. All nodes within a group are connected to each other. Additionally, all nodes are connected with one “super node” (as mentioned in the introduction). In our example data, we pick node ID 1 to be this special node. Let’s start to create our edges by connecting all nodes to node 1:

edges <- data_frame(from = 1, to = 2:max(members$id), group = NA)

We also denote here, that these edges are not part of any group memberships. We’ll handle these group memberships now:

within_group_edges <- members %>% split(.$group) %>% map_dfr(function (grp) { id2id <- combn(grp$id, 2) data_frame(from = id2id[1,], to = id2id[2,], group = unique(grp$group)) }) edges <- bind_rows(edges, within_group_edges)

At first, we split the members data by their group which produces a list of data frames. We then use map_dfr from the purrr package to handle each of these data frames that are passed as grp argument. grp$id contains the node IDs of the members of this group and we use combn to create the pair-wise combinations of these IDs. This will create a matrix id2id, where the columns represent the node ID pairs. We return a data frame with the from-to ID pairs and a group column that denotes the group to which these edges belong. These “within-group edges” are appended to the already created edges using bind_rows.

> edges from to group 1 2 NA 1 3 NA 1 4 NA [...] 23 24 d 23 25 d 24 25 d Plotting with ggraph

We have our edges, so now we can create the graph with igraph and plot it using the ggraph package:

g <- graph_from_data_frame(edges, directed = FALSE) ggraph(g) + geom_edge_link(aes(color = group), alpha = 0.5) + # different edge color per group geom_node_point(size = 7, shape = 21, stroke = 1, fill = 'white', color = 'black') + geom_node_text(aes(label = name)) + # "name" is automatically generated from the node IDs in the edges theme_void()

Not bad for the first try, but the layout is a bit unfortunate, giving too much space to nodes that don’t belong to any group.

We can tell igraph’s layout algorithm to tighten the non-group connections (the gray lines in the above figure) by giving them a higher weight than the within-group edges:

# give weight 10 to non-group edges edges <- data_frame(from = 1, to = 2:40, weight = 10, group = NA) within_group_edges <- members %>% split(.$group) %>% map_dfr(function (grp) { id2id <- combn(grp$id, 2) # weight 1 for within-group edges data_frame(from = id2id[1,], to = id2id[2,], weight = 1, group = unique(grp$group)) })

We reconstruct the graph g and plot it using the same commands as before and get the following:

The nodes within groups are now much less cluttered and the layout is more balanced.

Plotting with igraph

A problem with this type of plot is that connections within smaller groups are sometimes hardly visible (for example group a in the above figure). The plotting functions of igraph allow an additional method of highlighting groups in graphs: Using the parameter mark.groups will construct convex hulls around nodes that belong to a group. These hulls can then be highlighted with respective colors.

At first, we need to create a list that maps each group to a vector of the node IDs that belong to that group:

group_ids <- lapply(members %>% split(.$group), function(grp) { grp$id }) > group_ids $a [1] 1 2 3 4 5 $b [1] 1 2 3 4 5 6 7 8 9 10 [...]

Now we can create a color for each group using RColorBrewer:

group_color <- brewer.pal(length(group_ids), 'Set1') # the fill gets an additional alpha value for transparency: group_color_fill <- paste0(group_color, '20')

We plot it by using the graph object g that was generated before with graph_from_data_frame:

par(mar = rep(0.1, 4)) # reduce margins plot(g, vertex.color = 'white', vertex.size = 9, edge.color = rgb(0.5, 0.5, 0.5, 0.2), mark.groups = group_ids, mark.col = group_color_fill, mark.border = group_color) legend('topright', legend = names(group_ids), col = group_color, pch = 15, bty = "n", pt.cex = 1.5, cex = 0.8, text.col = "black", horiz = FALSE)

This option usually works well when you have groups that are more or less well separated, i.e. do not overlap too much. However, in our case there is quite some overlap and we can see that the shapes that encompass the groups also sometimes include nodes that do not actually belong to that group (for example node 8 in the above figure that is encompassed by group a although it does not belong to that group).

We can use a trick that leads the layout algorithm to bundle the groups more closely in a different manner: For each group, we introduce a “virtual node” (which will not be drawn during plotting) to which all the normal nodes in the group are tied with more weight than to each other. Nodes that only belong to a single group will be placed farther away from the center than those that belong to several groups, which will reduce clutter and wrongly overlapping group hulls. Furthermore, a virtual group node for nodes that do not belong to any group will make sure that these nodes will be placed more closely to each other.

We start by generating IDs for the virtual nodes:

# 4 groups plus one "NA-group" virt_group_nodes <- max(members$id) + 1:5 names(virt_group_nodes) <- c(letters[1:4], NA)

This will give us the following IDs:

> virt_group_nodes a b c d NA 41 42 43 44 45

We start to create the edges again by connecting all nodes to the “super node” with ID 1:

edges_virt <- data_frame(from = 1, to = 2:40, weight = 5, group = NA)

Then, the edges within the groups will be generated again, but this time we add additional edges to each group’s virtual node:

within_virt %>% split(.$group) %>% map_dfr(function (grp) { group_name <- unique(grp$group) virt_from <- rep(virt_group_nodes[group_name], length(grp$id)) id2id <- combn(grp$id, 2) data_frame( from = c(id2id[1,], virt_from), to = c(id2id[2,], grp$id), # also connects from virtual_from node to each group node weight = c(rep(0.1, ncol(id2id)), # weight between group nodes rep(50, length(grp$id))), # weight that 'ties together' the group (via the virtual group node) group = group_name ) }) edges_virt <- bind_rows(edges_virt, within_virt)

We add edges from all nodes that don’t belong to a group to another virtual node:

virt_group_na <- virt_group_nodes[is.na(names(virt_group_nodes))] non_group_nodes <- (members %>% filter(is.na(group)))$id edges_na_group_virt <- data_frame(from = non_group_nodes, to = rep(virt_group_na, length(non_group_nodes)), weight = 10, group = NA) edges_virt <- bind_rows(edges_virt, edges_na_group_virt)

This time, we also create a data frame for the nodes, because we want to add an additional property is_virt to each node that denotes if that node is virtual:

nodes_virt <- data_frame(id = 1:max(virt_group_nodes), is_virt = c(rep(FALSE, max(members$id)), rep(TRUE, length(virt_group_nodes))))

We’re ready to create the graph now:

g_virt <- graph_from_data_frame(edges_virt, directed = FALSE, vertices = nodes_virt)

To illustrate the effect of the virtual nodes, we can plot the graph directly and get a figure like this (virtual nodes highlighted in turquois):

We now want to plot the graph without the virtual nodes, but the layout should nevertheless be calculated with the virtual nodes. We can achieve that by running the layout algorithm first and then removing the virtual nodes from both the graph and the generated layout matrix:

# use "auto layout" lay <- layout_nicely(g_virt) # remove virtual group nodes from graph g_virt <- g_virt - vertices(virt_group_nodes) # remove virtual group nodes' positions from the layout matrix lay <- lay[-virt_group_nodes, ]

It’s important to pass the layout matrix now with the layout parameter to produce the final figure:

plot(g_virt, layout = lay, vertex.color = 'white', vertex.size = 9, edge.color = rgb(0.5, 0.5, 0.5, 0.2), mark.groups = group_ids, mark.col = group_color_fill, mark.border = group_color) legend('topright', legend = names(group_ids), col = group_color, pch = 15, bty = "n", pt.cex = 1.5, cex = 0.8, text.col = "black", horiz = FALSE)

We can see that output is less cluttered and nodes that belong to the same groups are bundled nicely while nodes that do not share the same groups are well separated. Note that the respective edge weights were found empirically and you will probably need to adjust them to achieve a good graph layout for your data.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – WZB Data Science Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Monte Carlo-based prediction intervals for nonlinear regression

Fri, 05/11/2018 - 18:22

(This article was first published on Rmazing, and kindly contributed to R-bloggers)

Calculation of the propagated uncertainty using (1), where is the gradient and the covariance matrix of the coefficients , is called the “Delta Method” and is widely applied in nonlinear least-squares (NLS) fitting. However, this method is based on first-order Taylor expansion and thus assummes linearity around . The second-order approach can partially correct for this restriction by using a second-order polynomial around , which is (2), where is the matrix trace and is the Hessian.
Confidence and prediction intervals for NLS models are calculated using (3) or (4), respectively, where the residual variance (5).
Now, how can we employ the matrix notation of error propagation for creating Taylor expansion- and Monte Carlo-based prediction intervals?
The inclusion of in the prediction interval can be implemented as an extended gradient and “augmented” covariance matrix. So instead of using (6), we take (7) as the expression and augment the covariance matrix to an covariance matrix, where . Partial differentiation and matrix multiplication will then yield, for example with two coefficients and and their corresponding covariance matrix :
(8)
(9)
, which then goes into (4).
The advantage of the augmented covariance matrix is that it can be exploited for creating Monte Carlo-based prediction intervals. This is new from propagate version 1.0-6 and is based on the paradigm that we add another dimension by employing the augmented covariance matrix of (8) in the multivariate t-distribution random number generator (in our case rtmvt), with . All simulations are then evaluated with (7) and the usual quantiles calculated for the prediction interval. Using the original covariance matrix with (6) will deliver the MC-based confidence interval.
Application of second-order Taylor expansion or the MC-based approach demonstrates nicely that for the majority of nonlinear models, the confidence/prediction intervals around are quite asymmetric, which the classical Delta Method does not capture:

library(propagate)

DNase1 <- subset(DNase, Run == 1)
fm3DNase1 <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
data = DNase1, start = list(Asym = 3, xmid = 0, scal = 1))
## first-order prediction interval
set.seed(123)
PROP1 <- predictNLS(fm3DNase1, newdata = data.frame(conc = 2), nsim = 1000000,
second.order = FALSE, interval = "prediction")
t(PROP1$summary)


## second-order prediction interval and MC
set.seed(123)
PROP2 <- predictNLS(fm3DNase1, newdata = data.frame(conc = 2), nsim = 1000000,
second.order = TRUE, interval = "prediction")
t(PROP2$summary)

What we see here is that
i) the first-order prediction interval [0.70308712; 0.79300731] is symmetric and slightly down-biased compared to the second-order one [0.70317629; 0.79309874],
and
ii) the second-order prediction interval tallies nicely up to the 4th decimal with the new MC-based interval (0.70318286 and 0.70317629; 0.79311987 and 0.79309874).
I believe this clearly demonstrates the usefulness of the MC-based approach for NLS prediction interval estimation…

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Rmazing. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Custom R charts coming to Excel

Fri, 05/11/2018 - 17:49

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

This week at the BUILD conference, Microsoft announced that Power BI custom visuals will soon be available as charts with Excel. You'll be able to choose a range of data within an Excel workbook, and pass those data to one of the built-in Power BI custom visuals, or one you've created yourself using the API.

Since you can create Power BI custom visuals using R, that means you'll be able to design a custom R-based chart, and make it available to people using Excel — even if they don't know how to use R themselves. There also many pre-defined custom visuals available, including some familiar R charts like decision trees, calendar heatmaps, and hexbin scatterplots.

For more details on how you'll be able to use custom R visuals in Excel, check out the blog post linked below.

PowerBI Blog: Excel announces new data visualization capabilities with Power BI custom visuals

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

goodpractice 1.0.2 on CRAN

Fri, 05/11/2018 - 16:32

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Hannah Frick, Data Scientist

We are excited to annouce that the goodpractice package is now
available on CRAN. The package gives advice about good practices when building R packages. Advice
includes functions and syntax to avoid, package structure, code
complexity, code formatting, etc.

You can install the CRAN version via

install.packages("goodpractice") Building R packages

Building an R package is a great way of encapsulating code,
documentation and data in a single testable and easily distributable
unit.

For a package to be distributed via CRAN, it needs to pass a set of
checks implemented in R CMD check, such as: Is there minimal
documentation, e.g., are all arguments of exported functions documented?
Are all dependencies declared?

These checks are helpful in developing a solid R package but they don’t
check for several other good practices. For example, a package does not
need to contain any tests but is it good practice to include some.
Following a coding standard helps readability. Avoiding overly complex
functions reduces the risk of bugs. Including an URL for bug reports
lets people more easily report bugs if they find any.

What the goodpractice package does

Tools for automatically checking several of the above mentioned aspects
already exist and the goodpractice package bundles the checks from
rcmdcheck with code
coverage through the covr
package, source code linting via the
lintr package and
cyclompatic complexity via the
cyclocomp package
and augments it with some further checks on good practice for R package
development such as avoiding T and F in favour of TRUE and
FALSE. It provides advice on which practices to follow and which to
avoid.

You can use goodpractice checks as a reminder for you and your
collegues – and if you have custom checks to run, you can make
goodpractice run those as well!

How to use goodpractice

The main fuction goodpractice() (and its alias gp()) takes the path
to the source code of a package as its first argument. The
goodpractice package contains the source for a simple package which
violates some good practices. We’ll use this for the examples.

library(goodpractice) # get path to example package pkg_path <- system.file("bad1", package = "goodpractice") # run gp() on it g <- gp(pkg_path) #> Preparing: covr #> Warning in MYPREPS[[prep]](state, quiet = quiet): Prep step for test #> coverage failed. #> Preparing: cyclocomp #> Skipping 2 packages ahead of CRAN: callr, remotes #> Installing 1 packages: stringr #> #> There is a binary version available but the source version is #> later: #> binary source needs_compilation #> stringr 1.3.0 1.3.1 FALSE #> installing the source package 'stringr' #> Preparing: description #> Preparing: lintr #> Preparing: namespace #> Preparing: rcmdcheck # show the result g #> ── GP badpackage ────────────────────────────────────────────────────────── #> #> It is good practice to #> #> ✖ not use "Depends" in DESCRIPTION, as it can cause name #> clashes, and poor interaction with other packages. Use #> "Imports" instead. #> ✖ omit "Date" in DESCRIPTION. It is not required and it gets #> invalid quite often. A build date will be added to the package #> when you perform `R CMD build` on it. #> ✖ add a "URL" field to DESCRIPTION. It helps users find #> information about your package online. If your package does #> not have a homepage, add an URL to GitHub, or the CRAN package #> package page. #> ✖ add a "BugReports" field to DESCRIPTION, and point it to a bug #> tracker. Many online code hosting services provide bug #> trackers for free, https://github.com, https://gitlab.com, #> etc. #> ✖ omit trailing semicolons from code lines. They are not needed #> and most R coding standards forbid them #> #> R/semicolons.R:4:30 #> R/semicolons.R:5:29 #> R/semicolons.R:9:38 #> #> ✖ not import packages as a whole, as this can cause name clashes #> between the imported packages. Instead, import only the #> specific functions you need. #> ✖ fix this R CMD check ERROR: VignetteBuilder package not #> declared: ‘knitr’ See section ‘The DESCRIPTION file’ in the #> ‘Writing R Extensions’ manual. #> ✖ avoid 'T' and 'F', as they are just variables which are set to #> the logicals 'TRUE' and 'FALSE' by default, but are not #> reserved words and hence can be overwritten by the user. #> Hence, one should always use 'TRUE' and 'FALSE' for the #> logicals. #> #> R/tf.R:NA:NA #> R/tf.R:NA:NA #> R/tf.R:NA:NA #> R/tf.R:NA:NA #> R/tf.R:NA:NA #> ... and 4 more lines #> #> ───────────────────────────────────────────────────────────────────────────

So with this package, we’ve done a few things in the DESCRIPTION file
for which there are reasons not to do them, have unnecessary trailing
semicolons in the code and used T and F for TRUE and FALSE. The
output of gp() will tell you what isn’t considered good practice out of what you have already written. If that is in the R code itself, it will also point you to the location of your faux-pas. In general, the messages are supposed to not only point out to
you what you might want to avoid but also why.

Custom checks

The above example tries to run all 230 checks available, to see the full
list use all_checks(). You can customise the set of checks run by
selecting only those default checks you are intersted in and by adding
your own checks.

If you only want to run a subset of the checks, e.g., just the check on
the URL field in the DESCRIPTION, you can specify the checks by name:

# what is the name of the check? grep("url", all_checks(), value = TRUE) #> [1] "description_url" # run only this check gp(pkg_path, checks = "description_url") #> Preparing: description #> ── GP badpackage ────────────────────────────────────────────────────────── #> #> It is good practice to #> #> ✖ add a "URL" field to DESCRIPTION. It helps users find #> information about your package online. If your package does #> not have a homepage, add an URL to GitHub, or the CRAN package #> package page. #> ───────────────────────────────────────────────────────────────────────────

Additional checks can be used in gp() via the extra_checks argument.
This should be a named list of check objects as returned by the
make_check() function.

# make a simple version of the T/F check check_simple_tf <- make_check( description = "TRUE and FALSE is used, not T and F", gp = "avoid 'T' and 'F', use 'TRUE' and 'FALSE' instead.", check = function(state) { length(tools::checkTnF(dir = state$path)) == 0 } ) gp(pkg_path, checks = c("description_url", "simple_tf"), extra_checks = list(simple_tf = check_simple_tf)) #> Preparing: description #> ── GP badpackage ────────────────────────────────────────────────────────── #> #> It is good practice to #> #> ✖ add a "URL" field to DESCRIPTION. It helps users find #> information about your package online. If your package does #> not have a homepage, add an URL to GitHub, or the CRAN package #> package page. #> ✖ avoid 'T' and 'F', use 'TRUE' and 'FALSE' instead. #> ───────────────────────────────────────────────────────────────────────────

For more details on creating custom checks, please see the vignette
Custom
Checks
.

Acknowledgements

This package was written by Gábor
Csárdi
with contributions by Noam
Ross
, Neal
Fultz
, Douglas
Ashton
, Marcel
Ramos
, Joseph
Stachelek
, and
myself. Special thanks for the input and
feedback to the rOpenSci leadership team and
community as well as everybody who opened
issues!

Feedback

If you have any feedback, please consider opening an issue on
GitHub.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Basic R Automation

Fri, 05/11/2018 - 14:48

(This article was first published on bnosac :: open analytical helpers, and kindly contributed to R-bloggers)

Last Wednesday, a small presentation was given at the RBelgium meetup in Brussels on Basic R Automation. For those of you who could not attend, here are the slides of that presentation which showed the use of the cronR and taskscheduleR R packages for automating basic R scripts.

If you are interested in setting up a project for more advanced ways on how to automate your R processes for your specific environment, get in touch.

{aridoc engine=”pdfjs” width=”100%” height=”550″}images/bnosac/blog/Basic_R_Automation.pdf{/aridoc}

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: bnosac :: open analytical helpers. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

2018 R Conferences

Fri, 05/11/2018 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

rstudio::conf 2018 and the New York R Conference are both behind us, but we are rushing headlong into the season for conferences focused on the R Language and its applications.

The European R Users Meeting (eRum) begins this coming Monday, May 14th, in Budapest with three days of workshops and talks. Headlined by R Core member Martin Mächler and fellow keynote speakers Achim Zeileis, Nathalie Villa-Vialaneix, Stefano Maria Iacus, and Roger Bivand, the program features an outstanding array of accomplished speakers including RStudio’s own Barbara Borges Ribeiro, Andrie de Vries, and Lionel Henry.

Second only to useR! in longevity, the tenth consecutive R / Finance will be held in Chicago on June 1st and 2nd. Keynote speakers Norm Matloff, J.J. Allaire, and Li Deng head a strong program. Produced by the same committed crew of Chicago quants with the unwavering support of UIC, R / Finance is the epitome of a small, tightly focused, single-track R conference. If you are interested in the quantitative side of Finance, there is no better place to network.

The relatively new CascadiaRConf will feature keynote speakers Alison Hill and Kara Woo in a one-day event on June 2nd in Portland, OR that promises to be good time with several hands-on workshops.

A SatRday mini-conference will be held in Cardiff on June 23rd. Stephanie Locke, Heather Turner, and Maelle Salmon will be leading the event. The recent conference in Capetown appears to have been a great day for working with R, and a lot of fun. I expect that Cardiff will also be a blast.

Why R? July 2nd through 5th in Wrocław, Poland is an ambitious undertaking with five keynote speakers (Bernd Bischl, Thomas Petzoldt, Leon Eyrich Jessen, Tomasz Niedzielski, and Maciej Eder), a hackathon, and several “pre-meetings” spread across Poland, Germany, and Denmark. I expect this to be a top-tier series of events.

R in Montreal will be held from July 4th through 6th. Pleanary speakers Julie Josse, Arun Srinivasan, and Daniel Stubbs will headline the program.

The 14th useR! conference, the first to happen in the Southern Hemisphere, will be held in Brisbane, Australia from July 10th through the 13th. The mother of all R conferences, useR! attracts R afficianados from around the globe and provides a window to what is au courant in the R universe. Keynote speakers Jenny Bryan, Steph De Silva, Heike Hofmann, Thomas Lin Pedersen, Roger Peng, and Bill Venables head the program. The tutorials, always a major attraction at useR! conferences, will take place over two days.

Insurance Data Science, the direct successor to the series of R in Insurance conference will be held in London on July 16th. Although renamed, and presumably refocused, the program for the conference still indicates quite a bit of R content. Garth Peters and Eric Novic will deliver the keynotes.

BioC 2018, the flagship conference for the BioConductor project and a major event in the computational genomics world, will be held from July 25 through the 27th at Victoria University, Toronto. The program is still coming together, but confirmed speakers include Brenda Andrews, Benjamine Haibe-Kains, Elana Fertig, Charlotte Sonneson, Michael Hoffman, and Tim Hughes.

The Latin American R/BioConductor Developers Workshop will be held between July 30th and August 3rd at the center for Genomic Sciences in Cuernavaca, Mexico. Invited speakers include Martin Morgan and Heather Turner. The workshop is aimed at students and researchers, with a goal of teaching participants the principles of reproducible data science through the development of R/Bioconductor packages.

Two brand-new conferences directly modeled on the R / Finance experience will make their debuts this year. R / Pharma, a conference devoted to the use of R for reproducible research, regulatory compliance and validation, safety monitoring, clinical trials, drug discovery, R&D, PK/PD/pharmacometrics, genomics, and diagnostics in the pharmaceutical industry will be held on August 15th and 16th at Harvard University. This will be a small, collegial gathering limited to 150 attendees; it will undoubtedly sell out soon after registration opens.

R / Medicine, which will focus on the use of R in medical research and clinical practice, with talks addressing Phase I clinical trial design; the analysis and visualization of clinical trial data, patient records, and genetic data; personalized medicine; and reproducible research, will take place in New Haven, CT on September 7th and 8th. This will also be a small gathering that is likely to sell out soon after registration opens.

LatinR, which will focus on the use of R in R&D, will be held at th University of Palmero in Buenos Aires on September 4th and 5th. Keynote speakers Jenny Bryan and Walter Sosa Escudero will head the program.

The last R conference on my radar for the 2018 season, the enterprise-focused EARL (Enterprise Applications of the R Language) Conference, will take place in London from September 11th through the 13th. Edwin Dunn and Garrett Grolemund will deliver the keynotes, and the list of speakers comprises an impressive roster of industrial-strength R users. This is clearly the event for data scientists looking to put R into production.

_____='https://rviews.rstudio.com/2018/05/11/2018-r-conferences/';

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

the riddle of the stands

Fri, 05/11/2018 - 00:18

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

The simple riddle of last week on The Riddler, about the minimum number of urinals needed for n men to pee if the occupation rule is to stay as far as possible from anyone there and never to stand next to another man,  is quickly solved by an R code:

ocupee=function(M){ ok=rep(0,M) ok[1]=ok[M]=1 ok[trunc((1+M/2))]=1 while (max(diff((1:M)[ok!=0])>2)){ i=order(-diff((1:M)[ok!=0]))[1] ok[(1:M)[ok!=0][i]+trunc((diff((1:M)[ok!=0])[i]/2))]=1 } return(sum(ok>0)) }

with maximal occupation illustrated by the graph below:

Meaning that the efficiency of the positioning scheme is not optimal when following the sequential positioning, requiring $latexN+2^{\lceil log_2(N-1) \rceil}$ urinals. Rather than one out of two, requiring 2N-1 urinals. What is most funny in this simple exercise is the connection exposed in the Riddler with an Xkcd blag written a few years go about the topic.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

rquery: SQL from R

Thu, 05/10/2018 - 21:23

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

My BARUG rquery talk went very well, thank you very much to the attendees for being an attentive and generous audience.


(John teaching rquery at BARUG, photo credit: Timothy Liu)

I am now looking for invitations to give a streamlined version of this talk privately to groups using R who want to work with SQL (with databases such as PostgreSQL or big data systems such as Apache Spark). rquery has a number of features that greatly improve team productivity in this environment (strong separation of concerns, strong error checking, high usability, specific debugging features, and high performance queries).

If your group is in the San Francisco Bay Area and using R to work with a SQL accessible data source, please reach out to me at jmount@win-vector.com, I would be honored to show your team how to speed up their project and lower development costs with rquery. If you are a big data vendor and some of your clients use R, I am especially interested in getting in touch: our system can help R users start working with your installation.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Exploratory Factor Analysis in R

Thu, 05/10/2018 - 20:30

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Changing Your Viewpoint for Factors

In real life, data tends to follow some patterns but the reasons are not apparent right from the start of the data analysis. Taking a common example of a demographics based survey, many people will answer questions in a particular ‘way’. For example, all married men will have higher expenses than single men but lower than married men with children. In this case, the driving factor which makes them answer following a pattern is the economic status but these answers may also depend on other factors such as level of education, salary and locality or area. It becomes complicated to assign answers related to multiple factors. One option is to map variables or answers to one of the factors. This process has a lot of drawbacks such as the requirement to ‘guess’ the number of factors, heuristic based or biased manual assignment and non-consideration of influence of other factors to the variable. We have variables and answers in the data defined in a way such that we can understand them and interpret. What if we change our view lens in an alternate reality where all variables are automatically mapped into different new categories with customized weight based on their influence on that category. This is the idea behind factor analysis.

Creating Meaningful Factors

Factor analysis starts with the assumption of hidden latent variables which cannot be observed directly but are reflected in the answers or variables of the data. It also makes the assumption that there are as many factors as there are variables. We then transform our current set of variables into an equal number of variables such that each new variable is a combination of the current ones in some weightage. Hence, we are not essentially adding or removing information in this step but only transforming it. A typical way to make this transformation is to use eigenvalues and eigenvectors. We transform our data in the direction of each eigenvector and represent all the new variables or ‘factors’ using the eigenvalues. An eigenvalue more than 1 will mean that the new factor explains more variance than one original variable. We then sort the factors in decreasing order of the variances they explain. Thus, the first factor will be the most influential factor followed by the second factor and so on. This also helps us think of variable reduction by removing the last few factors. Typically we take factors which collectively explain 90-99% of the variance depending upon the dataset and problem. Thus, there is no need to guess the number of factors required for the data. However, we need to read into each of the factors and the combination of original features out of which they are made of to understand what they represent.

Loading Factors With Factor Loadings

As I mentioned, even after transformation, we retain the weight of each original feature that were combined to obtain the factor. These weights are known as ‘factor loadings’ and help understand what the factors represent. Let’s take a cooked up example of factor loadings for an airlines survey. Let’s say the table looks like this:

I took 10 features originally so it should generate 10 factors. Let’s say our first three factors are as shown in the table. Looking at the values of the factors, the first factor may represent customer experience post on-boarding. The second factor reflects the airline booking experience and related perks. The third factor shows the flight competitive advantage of the airline compared to its competition. We also have a 10th feature which is negatively affecting the second factor (which seems to make sense as a loyal customer will book flights even if they are no longer economic or don’t have as great frequent flyer programs). Thus, we can now understand that the top most important factors for customers filling the survey are customer experience, booking experience and competitive advantage. However, this understanding needs to be developed manually by looking at the loadings. It is the factor loadings and their understanding which are the prime reason which makes factor analysis of such importance followed by the ability to scale down to a few factors without losing much information.

Exploration – How Many Factors?

Factor analysis can be driven by different motivations. One objective of factor analysis can be verifying with the data with what you already know and think about it. This requires a prior intuition on the number of important factors after which the loadings will be low overall as well as an idea of the loadings of each original variable in those factors. This is the confirmatory way of factor analysis where the process is run to confirm with understanding of the data. A more common approach is to understand the data using factor analysis. In this case, you perform factor analysis first and then develop a general idea of what you get out of the results. How do we stop at a specific number of factor in factor analysis when we are exploring? We use the scree plot in this case. The scree plot maps the factors with their eigenvalues and a cut-off point is determined wherever there is a sudden change in the slope of the line.

Going Practical – The BFI Dataset in R

Let’s start with a practical demonstration of factor analysis. We will use the Psych package in R which is a package for personality, psychometric, and psychological research. It consists a dataset – the bfi dataset which represents 25 personality items with 3 additional demographics for 2800 data points. The columns are already classified into 5 factors thus their names start with letters A (Agreeableness), C (Conscientiousness), E (Extraversion), N (Neuroticism) and O (Openness). Let’s perform factor analysis to see if we really have the same association of the variables with each factor.

#Installing the Psych package and loading it install.packages("psych") library(psych) #Loading the dataset bfi_data=bfi

Though we have NA values in our data which need to be handled but we will not perform much processing on the data. To make things simple, we will only take those data points where there are no missing values.

#Remove rows with missing values and keep only complete cases bfi_data=bfi_data[complete.cases(bfi_data),]

This leaves us with 2236 data points down from 2800 which means a reduction of 564 data points. Since 2236 is still a plausible number of data points, we can proceed further. We will use the fa() function for factor analysis and need to calculate the correlation matrix for the function

#Create the correlation matrix from bfi_data bfi_cor <- cor(bfi_data)

The fa() function needs correlation matrix as r and number of factors. The default value is 1 which is undesired so we will specify the factors to be 6 for this exercise.

#Factor analysis of the data factors_data <- fa(r = bfi_cor, nfactors = 6) #Getting the factor loadings and model analysis factors_data Factor Analysis using method =  minres Call: fa(r = bfi_cor, nfactors = 6)

Standardized loadings (pattern matrix) based upon correlation matrix

            MR2   MR3   MR1   MR5   MR4   MR6    h2   u2 com A1         0.11  0.07 -0.07 -0.56 -0.01  0.35 0.379 0.62 1.8 A2         0.03  0.09 -0.08  0.64  0.01 -0.06 0.467 0.53 1.1 A3        -0.04  0.04 -0.10  0.60  0.07  0.16 0.506 0.49 1.3 A4        -0.07  0.19 -0.07  0.41 -0.13  0.13 0.294 0.71 2.0 A5        -0.17  0.01 -0.16  0.47  0.10  0.22 0.470 0.53 2.1 C1         0.05  0.54  0.08 -0.02  0.19  0.05 0.344 0.66 1.3 C2         0.09  0.66  0.17  0.06  0.08  0.16 0.475 0.53 1.4 C3         0.00  0.56  0.07  0.07 -0.04  0.05 0.317 0.68 1.1 C4         0.07 -0.67  0.10 -0.01  0.02  0.25 0.555 0.45 1.3 C5         0.15 -0.56  0.17  0.02  0.10  0.01 0.433 0.57 1.4 E1        -0.14  0.09  0.61 -0.14 -0.08  0.09 0.414 0.59 1.3 E2         0.06 -0.03  0.68 -0.07 -0.08 -0.01 0.559 0.44 1.1 E3         0.02  0.01 -0.32  0.17  0.38  0.28 0.507 0.49 3.3 E4        -0.07  0.03 -0.49  0.25  0.00  0.31 0.565 0.44 2.3 E5         0.16  0.27 -0.39  0.07  0.24  0.04 0.410 0.59 3.0 N1         0.82 -0.01 -0.09 -0.09 -0.03  0.02 0.666 0.33 1.1 N2         0.83  0.02 -0.07 -0.07  0.01 -0.07 0.654 0.35 1.0 N3         0.69 -0.03  0.13  0.09  0.02  0.06 0.549 0.45 1.1 N4         0.44 -0.14  0.43  0.09  0.10  0.01 0.506 0.49 2.4 N5         0.47 -0.01  0.21  0.21 -0.17  0.09 0.376 0.62 2.2 O1        -0.05  0.07 -0.01 -0.04  0.57  0.09 0.357 0.64 1.1 O2         0.12 -0.09  0.01  0.12 -0.43  0.28 0.295 0.70 2.2 O3         0.01  0.00 -0.10  0.05  0.65  0.04 0.485 0.52 1.1 O4         0.10 -0.05  0.34  0.15  0.37 -0.04 0.241 0.76 2.6 O5         0.04 -0.04 -0.02 -0.01 -0.50  0.30 0.330 0.67 1.7 gender     0.20  0.09 -0.12  0.33 -0.21 -0.15 0.184 0.82 3.6 education -0.03  0.01  0.05  0.11  0.12 -0.22 0.072 0.93 2.2 age       -0.06  0.07 -0.02  0.16  0.03 -0.26 0.098 0.90 2.0                        MR2  MR3  MR1  MR5  MR4  MR6 SS loadings           2.55 2.13 2.14 2.03 1.79 0.87 Proportion Var        0.09 0.08 0.08 0.07 0.06 0.03 Cumulative Var        0.09 0.17 0.24 0.32 0.38 0.41 Proportion Explained  0.22 0.18 0.19 0.18 0.16 0.08 Cumulative Proportion 0.22 0.41 0.59 0.77 0.92 1.00 With factor correlations of       MR2   MR3   MR1   MR5   MR4   MR6 MR2  1.00 -0.18  0.24 -0.05 -0.01  0.10 MR3 -0.18  1.00 -0.23  0.16  0.19  0.04 MR1  0.24 -0.23  1.00 -0.28 -0.19 -0.15 MR5 -0.05  0.16 -0.28  1.00  0.18  0.17 MR4 -0.01  0.19 -0.19  0.18  1.00  0.05 MR6  0.10  0.04 -0.15  0.17  0.05  1.00 Mean item complexity =  1.8 Test of the hypothesis that 6 factors are sufficient. The degrees of freedom for the null model are  378  and the objective function was  7.79 The degrees of freedom for the model are 225  and the objective function was  0.57 The root mean square of the residuals (RMSR) is  0.02 The df corrected root mean square of the residuals is  0.03 Fit based upon off diagonal values = 0.98 Measures of factor score adequacy                                                                MR2  MR3  MR1  MR5  MR4  MR6 Correlation of (regression) scores with factors   0.93 0.89 0.89 0.88 0.86 0.77 Multiple R square of scores with factors          0.86 0.79 0.79 0.77 0.74 0.59 Minimum correlation of possible factor scores     0.72 0.57 0.58 0.54 0.49 0.18

The factor loadings show that the first factor represents N followed by C,E,A and O. This means most of the members in the data have Neuroticism in the data. We also notice that the first five factors adequately represent the factor categories as the data is meant for.

Conclusion: A Deeper Insight

As apparent from the bfi survey example, factor analysis is helpful in classifying our current features into factors which represent hidden features not measured directly. It also has an additional advantage of helping reduce our data into a smaller set of features without losing much information. There are a few things to keep in mind before putting factor analysis into action. The first is about the values of factor loadings. We may have datasets where the factor loadings for all factors are low – lower than 0.5 or 0.3. While a factor loading lower than 0.3 means that you are using too many factors and need to re-run the analysis with lesser factors. A range of loadings around 0.5 is satisfactory but indicates poor predicting ability. You should later keep thresholds and discard factors which have a loading lower than the threshold for all features. Factor analysis on dynamic data can also be helpful in tracking changes in the nature of data. In case the data changes significantly, the number of factors in exploratory factor analysis will also change and indicate you to look into the data and check what changes have occurred. The final one of importance is the interpretability of factors. In case you are unable to understand or explain the factor loadings, you are either using a very granular or very generalized set of factors. In this case, you need to find the right number of factors and obtain loadings to features which are both interpretable and beneficial for analysis. There can be a variety of other situations that may occur with factor analysis and are all subject to interpretation.

Keep learning and here is the entire code used in this article.
#Installing the Psych package and loading it install.packages("psych") library(psych) #Loading the dataset bfi_data=bfi #Remove rows with missing values and keep only complete cases bfi_data=bfi_data[complete.cases(bfi_data),] #Create the correlation matrix from bfi_data bfi_cor <- cor(bfi_data) #Factor analysis of the data factors_data <- fa(r = bfi_cor, nfactors = 6) #Getting the factor loadings and model analysis factors_data

Author Bio:

This article was contributed by Perceptive Analytics. Madhur Modi, Prudhvi Sai Ram, Saneesh Veetil and Chaitanya Sagar contributed to this article.

Perceptive Analytics provides Tableau Consulting, data analytics, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Scientific debt

Thu, 05/10/2018 - 16:00

(This article was first published on Variance Explained, and kindly contributed to R-bloggers)

A very useful concept in software engineering is technical debt.

Technical debt occurs when engineers choose a quick but suboptimal solution to a problem, or don’t spend time to build sustainable infrastructure. Maybe they’re using an approach that doesn’t scale well as the team and codebase expand (such as hardcoding “magic numbers”), or using a tool for reasons of convenience rather than appropriateness (“we’ll write the DevOps infrastructure in PHP since that’s what our team already knows”). Either way, it’s something that seems like it’s working at first but causes real challenges in the long-term, in the form of delayed feature launches and hard-to-fix bugs.

In my new job as Chief Data Scientist at DataCamp, I’ve been thinking about the role of data science within a business, and discussing this with other professionals in the field. On a panel earlier this year, I realized that data scientists have a rough equivalent to this concept: “scientific debt.”

Scientific debt is when a team takes shortcuts in data analysis, experimental practices, and monitoring that could have long-term negative consequences. When you hear a statement like:

  • “We don’t have enough time to run a randomized test, let’s launch it”
  • “To a first approximation this effect is probably linear”
  • “This could be a confounding factor, but we’ll look into that later”
  • “It’s directionally accurate at least”

you’re hearing a little scientific debt being “borrowed”.

Example: WidgetCorp

Most engineers have a sense of what it’s like for a company to struggle with technical debt. What would a company struggling with scientific debt look like?

Imagine a small startup WidgetCorp is developing a B2B product, and deciding on their sales strategy. One year they decide to start focusing their sales efforts on larger corporate clients. They notice that as they take on this new strategy, their monthly revenue increases. They’re encouraged by this, and in the following years hire half a dozen salespeople with experience working with large clients, and spend marketing and design effort building that as part of their brand.

Years later, the strategy doesn’t seem to be paying off: their revenue is struggling and the early successes aren’t repeating themselves. They hire an analyst who looks at their sales data, and finds that in fact, it had never been the case that they’d had a higher return-on-investment selling to large companies. In that early year, their revenue had been rising because of a seasonal effect (the demand for widgets goes up in the fall and winter), which was compounded with some random noise and anecdotes (e.g. “SmallCompany.com was a waste of our time, but we just closed a huge deal with Megabiz!”)

WidgetCorp took on too much scientific debt.

 

Some ways this might have happened:

They made irreversible decisions based on flawed analyses. It’s reasonable to take a quick look at metrics and be happy that they’re going in the right direction. But once the company made product, sales and marketing changes, it became difficult to reverse them. Before making a major shift in business, it’s worth making sure that the data supports it: that they’ve accounted for seasonal effects and applied proper statistical tests.

Lack of monitoring. Early on, there may not have been enough data to tell whether larger clients were . But as more data was collected, it would be worth continually testing this assumption, in the form of a dashboard or a quarterly report. If this isn’t tracked, no one will notice that the hypothesis was falsified even once they have the data.

Lack of data infrastructure: Maybe early in the company the leads were locked in a sales CRM, while accounting data was stored in Excel spreadsheets that were emailed around. Even if there were a dedicated analyst within the company, they may not have easy access to the relevant data (linking together sales sucess and company size). Even if it were theoretically possible to combine the datasets with some effort, schlep blindness might have made everyone avoid the analysis entirely. This is an area where technical debt and scientific debt often appear together, since it takes engineering effort to make scientific problems easy to solve.

Spreading inaccurate lore. Suppose that the WidgetCorp CEO had given a series of company-wide talks and public blog posts with the message “The future of WidgetCorp is serving big companies!” Product teams got into the habit of prioritizing features in this direction, and every failure got blamed on “I guess we weren’t focused enough on big clients”. This kind of “cultural inertia” can be very difficult to reverse, even if the executive team is willing to publicly admit their mistake (which isn’t guaranteed!)

Just about every experienced data scientist has at least a few of these stories, even from otherwise successful companies. They are to scientific debt what the Daily WTF is to technical debt.

Is scientific debt always bad?

Not at all!

 

I often take shortcuts in my own analyses. Running a randomized experiment for a feature launch is sometimes too expensive, especially if the number of users is fairly small or the change pretty uncontroversial (you wouldn’t A/B test a typo fix). And while correlation doesn’t imply causation, it’s usually better than nothing when making business decisions.

The comparison to technical debt is useful here: a small engineering team’s first goal is typically to build an minimum viable product quickly, rather than overengineer a system that they think will be robust in the distant future. (The equivalent in scientific debt is typically called overthinking, e.g. “Yes I suppose we could control for weather when we examine what sales deals succeed, but I’m pretty sure you’re overthinking this”). And the comparison to financial debt is meaningful too: companies typically take on debt (or, similarly, give up equity) while they’re growing. Just like you can’t build a company without borrowing money, you can’t build a company while being certain every decision is thoroughly supported by data.

What’s important in both technical and scientific debt is to keep the long-term cost in mind.

It isn't technical debt if you aren't…

1) Leveraging it to get something valuable up front
2) Paying interest on it regularly
3) Treating it as a liability that you may eventually need to pay in full

Code that doesn't meet this criteria isn't debt, it's just low quality work.

— Practicing Developer (@practicingdev) February 26, 2018

Wrong decisions are expensive, and not paying attention to data is a risk. We can do a cost-benefit analysis of whether the risk is worth it, but we shouldn’t write it off as “data scientists always find something to complain about”.

Why even call it “debt”?

To a data scientist or analyst, this post might sound pretty obvious. Of course there are downsides to ignoring statistical rigor, so why bother giving it a “buzzword-y” name? Because it puts the concept in terms executives and managers can easily understand.

Again, let’s go back to technical debt. There are lots of reasons individual engineers may want to write “clean code”: they appreciate its elegance, they want to impress their peers, or they’re perfectionists procrastinating on other work. These reasons don’t generally matter to non-technical employees, who care about product features and reliability. The framing of technical debt helps emphasize what the company loses by not investing in architecture: the idea that even if a product looks like it’s working, the flaws have a long-term cost in actual dollars and time.

Engineer: It bothers me that different internal projects use different naming conventions.

CTO: Sorry it annoys you, but code is code, I don’t see why you should waste time on this.

Engineer: Our inconsistent naming conventions are technical debt: they make it harder for new developers to learn the system.

CTO: I’ve been looking for ways to reduce our onboarding time! Great idea, let me know what you need to fix it.

Similarly, scientists, especially from an academic background, often have a particular interest in discovering truths about reality. So the idea of “I’d like to analyze whether X is a confounding factor here” can sound like an indulgence rather than an immediate business need. Statisticians in particular are often excited by finding flaws in mathematical methods. So when a data scientist says something like “We can’t use that method, Jones et al 2012 proved that it is asymptotically inconsistent,” non-technical colleagues might assume they’re overthinking it or even showing off. Framing it in terms of what we’re actually risking helps communicate why it’s worth spending time on.

How can we manage scientific debt well?
  • Let data scientists “pay interest” on it. Just as not every engineering project will lead to a new feature, not every analysis will lead to an exciting discovery or novel algorithm. Some time needs to be spent confirming or invalidating existing assumptions. Jonathan Nolis has a great article about prioritizing data science work, where he describes this quadrant as “providing proof”.

  • Build data engineering processes: As described earlier, one reason a company might fall into scientific debt is that analysts may not have easy access to the data they need. It could be locked away in a platform that hasn’t been ingested, or in Google sheets that are edited by hand. Ingesting relevant data into a data warehouse or a data lake makes it more likely data scientists can make relevant discoveries.

  • Revisit old analyses: One common reason early-stage companies go into scientific debt is that they don’t yet have enough data to draw robust conclusions. Even if you don’t have enough data yet, that doesn’t mean you should forget about the problem. Sometimes I put time on my calendar to run an analysis once I expect enough data to be available, even if it’s a few months away. This can also help confirm an important analysis is still relevant: just like you’d keep track of a KPI over time, you want to keep track of whether a conclusion remains true.

  • Have data expertise spread throughout the company. Just as someone who can’t program may not recognize technical debt, someone who doesn’t have experience analyzing and understanding data may not recognize scientific debt. This is yet another reason to democratize data science within your company, as we do at DataCamp.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Variance Explained. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Mimicking SQLDF with MonetDBLite

Thu, 05/10/2018 - 06:32

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

Like many useRs, I am also a big fan of the sqldf package developed by Grothendieck, which uses SQL statement for data frame manipulations with SQLite embedded database as the default back-end.

In examples below, I drafted a couple R utility functions with the MonetDBLite back-end by mimicking the sqldf function. There are several interesting observations shown in the benchmark comparison.
– The data import for csv data files is more efficient with MonetDBLite than with the generic read.csv function or read.csv.sql function in the sqldf package.
– The data manipulation for a single data frame, such as selection, aggregation, and subquery, is also significantly faster with MonetDBLite than with the sqldf function.
– However, the sqldf function is extremely efficient in joining 2 data frames, e.g. inner join in the example.

# IMPORT monet.read.csv <- function(file) { monet.con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), ":memory:") suppressMessages(MonetDBLite::monetdb.read.csv(monet.con, file, "file", sep = ",")) result <- DBI::dbReadTable(monet.con, "file") DBI::dbDisconnect(monet.con, shutdown = T) return(result) } microbenchmark::microbenchmark(monet = {df <- monet.read.csv("Downloads/nycflights.csv")}, times = 10) #Unit: milliseconds # expr min lq mean median uq max neval # monet 528.5378 532.5463 539.2877 539.0902 542.4301 559.1191 10 microbenchmark::microbenchmark(read.csv = {df <- read.csv("Downloads/nycflights.csv")}, times = 10) #Unit: seconds # expr min lq mean median uq max neval # read.csv 2.310238 2.338134 2.360688 2.343313 2.373913 2.444814 10 # SELECTION AND AGGREGATION monet.sql <- function(df, sql) { df_str <- deparse(substitute(df)) monet.con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), ":memory:") suppressMessages(DBI::dbWriteTable(monet.con, df_str, df, overwrite = T)) result <- DBI::dbGetQuery(monet.con, sql) DBI::dbDisconnect(monet.con, shutdown = T) return(result) } microbenchmark::microbenchmark(monet = {monet.sql(df, "select * from df sample 3")}, times = 10) #Unit: milliseconds # expr min lq mean median uq max neval # monet 422.761 429.428 439.0438 438.3503 447.3286 453.104 10 microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select * from df order by RANDOM() limit 3")}, times = 10) #Unit: milliseconds # expr min lq mean median uq max neval # sqldf 903.9982 908.256 925.4255 920.2692 930.0934 963.6983 10 microbenchmark::microbenchmark(monet = {monet.sql(df, "select origin, median(distance) as med_dist from df group by origin")}, times = 10) #Unit: milliseconds # expr min lq mean median uq max neval # monet 450.7862 456.9589 458.6389 458.9634 460.4402 465.2253 10 microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select origin, median(distance) as med_dist from df group by origin")}, times = 10) #Unit: milliseconds # expr min lq mean median uq max neval # sqldf 833.1494 836.6816 841.952 843.5569 846.8117 851.0771 10 microbenchmark::microbenchmark(monet = {monet.sql(df, "with df1 as (select dest, avg(distance) as dist from df group by dest), df2 as (select dest, count(*) as cnts from df group by dest) select * from df1 inner join df2 on (df1.dest = df2.dest)")}, times = 10) #Unit: milliseconds # expr min lq mean median uq max neval # monet 426.0248 431.2086 437.634 438.4718 442.8799 451.275 10 microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select * from (select dest, avg(distance) as dist from df group by dest) df1 inner join (select dest, count(*) as cnts from df group by dest) df2 on (df1.dest = df2.dest)")}, times = 10) #Unit: seconds # expr min lq mean median uq max neval # sqldf 1.013116 1.017248 1.024117 1.021555 1.025668 1.048133 10 # MERGE monet.sql2 <- function(df1, df2, sql) { df1_str <- deparse(substitute(df1)) df2_str <- deparse(substitute(df2)) monet.con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), ":memory:") suppressMessages(DBI::dbWriteTable(monet.con, df1_str, df1, overwrite = T)) suppressMessages(DBI::dbWriteTable(monet.con, df2_str, df2, overwrite = T)) result <- DBI::dbGetQuery(monet.con, sql) DBI::dbDisconnect(monet.con, shutdown = T) return(result) } tbl1 <- monet.sql(df, "select dest, avg(distance) as dist from df group by dest") tbl2 <- monet.sql(df, "select dest, count(*) as cnts from df group by dest") microbenchmark::microbenchmark(monet = {monet.sql2(tbl1, tbl2, "select * from tbl1 inner join tbl2 on (tbl1.dest = tbl2.dest)")}, times = 10) #Unit: milliseconds # expr min lq mean median uq max neval # monet 93.94973 174.2211 170.7771 178.487 182.4724 187.3155 10 microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select * from tbl1 inner join tbl2 on (tbl1.dest = tbl2.dest)")}, times = 10) #Unit: milliseconds # expr min lq mean median uq max neval # sqldf 19.49334 19.60981 20.29535 20.001 20.93383 21.51837 10 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

leaflet 2.0.0

Thu, 05/10/2018 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

leaflet 2.0.0 is now on CRAN!

The leaflet R package wraps the Leaflet.js JavaScript library, and this release of the R package marks a major upgrade from the outdated Leaflet.js 0.7.x to the current Leaflet.js 1.x (specifically, 1.3.1).

Leaflet.js 1.x includes some non-backward-compatible API changes versus 0.7.x. If you’re using only R code to create your Leaflet maps, these changes should not affect you. If you are using custom JavaScript, some changes may be required to your code. Please see the Leaflet.js reference documentation and changelog.

leaflet.extras and leaflet.esri

Two additional packages by Bhaskar Karambelkar, leaflet.extras and leaflet.esri, have been updated on CRAN to utilize the latest Leaflet.js library bindings. leaflet.extras extends the Leaflet R package using various Leaflet.js plugins, offering features like heatmaps, additional marker icons, and drawing tools. leaflet.esri provides access to ArcGIS services, based on the ESRI leaflet plugin.

library(leaflet) library(leaflet.extras) leaflet(quakes) %>% addTiles() %>% addHeatmap(lng = ~long, lat = ~lat, radius = 8)



Full changelog Breaking Changes

  • Update to latest Leaflet.js 1.x (v1.3.1). Please see the Leaflet.js reference documentation and change log.
  • Previously, labels were implemented using the 3rd party extension Leaflet.label. Leaflet.js 1.x now provides this functionality naively. There are some minor differences to note:
    • If you are using custom JavaScript to create labels, you’ll need to change references to L.Label to L.Tooltip.
    • Tooltips are now displayed with default Leaflet.js styling.
    • In custom javascript extensions, change all *.bindLabel() to *.bindTooltip().
  • All Leaflet.js plugins updated to versions compatible with Leaflet.js 1.x.
Known Issues
  • The default CSS z-index of the Leaflet map has changed; for some Shiny applications, the map now covers elements that are intended to be displayed on top of the map. This issue has been fixed on GitHub (devtools::install_github("rstudio/leaflet")). For now, you can work around this in the CRAN version by including this line in your application UI:
tags$style(".leaflet-map-pane { z-index: auto; }") Features
  • Added more providers for addProviderTiles(): OpenStreetMap.CH, OpenInfraMap, OpenInfraMap.Power, OpenInfraMap.Telecom, OpenInfraMap.Petroleum, OpenInfraMap.Water, OpenPtMap, OpenRailwayMap, OpenFireMap, SafeCast.
  • Add groupOptions function. Currently the only option is letting you specify zoom levels at which a group should be visible.
  • Added support for drag events.
  • Added method argument to addRasterImage() to enable nearest neighbor interpolation when projecting categorical rasters.
  • Added an 'auto' method for addRasterImage(). Projected factor results are coerced into factors.
  • Added data parameter to remaining addXXX() methods, including addLegend.
  • Added preferCanvas argument to leafletOptions().
    ### Bug Fixes and Improvements
  • Relative protocols are used where possible when adding tiles. In RStudio 1.1.x on linux and windows, a known issue of ‘https://’ routes fail to load, but works within browsers (rstudio/rstudio#2661).
  • L.multiPolyline was absorbed into L.polyline, which also accepts an array of polyline information.
  • Fixed bug where icons were anchored to the top-center by default, not center-center.
  • Fixed bug where markers would not appear in self contained knitr files.
  • L.Label is now L.tooltip in Leaflet.js. labelOptions() now translates old options clickable to interactive and noHide to permanent.
  • Fix a bug where the default addTiles() would not work with .html files served directly from the filesystem.
  • Fix bug with accessing columns in formulas when the data source is a Crosstalk SharedData object wrapping a spatial data frame or sf object.
  • Fix strange wrapping behavior for legend, especially common for Chrome when browser zoom level is not 100%.
  • Fix incorrect opacity on NA entry in legend.
  • Ensure type safety of .indexOf(stamp).
  • validateCoords() warns on invalid polygon data.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Rolling Fama French

Thu, 05/10/2018 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)




























In a previous post, we reviewed how to import the Fama French 3-Factor data, wrangle that data, and then regress our portfolio returns on the factors. Please have a look at that previous post, as the following work builds upon it. For more background on Fama French, see the original article published in The Journal of Financial Economics, Common risk factors in the returns on stocks and bonds.

Today, we will explore the rolling Fama French model and the explanatory power of the 3 factors in different time periods. In the financial world, we often look at rolling means, standard deviations and models to make sure we haven’t missed anything unusual, risky, or concerning during different market or economic regimes. Our portfolio returns history is for the years 2013 through 2017, which is rather a short history, but there still might a be a 24-month period where the Fama French factors were particularly strong, weak, or meaningless. We would like to unearth and hypothesize about what explains them or their future likelihood.

We will be working with our usual portfolio consisting of:

+ SPY (S&P500 fund) weighted 25% + EFA (a non-US equities fund) weighted 25% + IJS (a small-cap value fund) weighted 20% + EEM (an emerging-mkts fund) weighted 20% + AGG (a bond fund) weighted 10%

Before we can run a Fama French model for that portfolio, we need to find portfolio monthly returns, which was covered in this post. I won’t go through the logic again but the code is here:

library(tidyquant) library(tidyverse) library(timetk) symbols <- c("SPY","EFA", "IJS", "EEM","AGG") prices <- getSymbols(symbols, src = 'yahoo', from = "2012-12-31", to = "2017-12-31", auto.assign = TRUE, warnings = FALSE) %>% map(~Ad(get(.))) %>% reduce(merge) %>% `colnames<-`(symbols) w <- c(0.25, 0.25, 0.20, 0.20, 0.10) asset_returns_long <- prices %>% to.monthly(indexAt = "lastof", OHLC = FALSE) %>% tk_tbl(preserve_index = TRUE, rename_index = "date") %>% gather(asset, returns, -date) %>% group_by(asset) %>% mutate(returns = (log(returns) - log(lag(returns)))) %>% na.omit() portfolio_returns_tq_rebalanced_monthly <- asset_returns_long %>% tq_portfolio(assets_col = asset, returns_col = returns, weights = w, col_rename = "returns", rebalance_on = "months")

We also need to import the Fama French factors and combine them into one object with our portfolio returns. We painstakingly covered this in the previous post and the code for doing so is here:

temp <- tempfile() base <- "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/" factor <- "Global_3_Factors" format<- "_CSV.zip" full_url <- glue(base, factor, format, sep ="") download.file( full_url, temp, quiet = TRUE) Global_3_Factors <- read_csv(unz(temp, "Global_3_Factors.csv"), skip = 6) %>% rename(date = X1) %>% mutate_at(vars(-date), as.numeric) %>% mutate(date = rollback(ymd(parse_date_time(date, "%Y%m") + months(1)))) %>% filter(date >= first(portfolio_returns_tq_rebalanced_monthly$date) & date <= last(portfolio_returns_tq_rebalanced_monthly$date)) ff_portfolio_returns <- portfolio_returns_tq_rebalanced_monthly %>% left_join(Global_3_Factors, by = "date") %>% mutate(MKT_RF = Global_3_Factors$`Mkt-RF`/100, SMB = Global_3_Factors$SMB/100, HML = Global_3_Factors$HML/100, RF = Global_3_Factors$RF/100, R_excess = round(returns - RF, 4))

We now have one data frame ff_portfolio_returns that holds our Fama French factors and portfolio returns. Let’s get to the rolling analysis.

We first define a rolling model with the rollify() function from tibbletime. However, instead of wrapping an existing function, such as kurtosis() or skewness(), we will pass in our linear Fama French model.

# Choose a 24-month rolling window window <- 24 library(tibbletime) # define a rolling ff model with tibbletime rolling_lm <- rollify(.f = function(R_excess, MKT_RF, SMB, HML) { lm(R_excess ~ MKT_RF + SMB + HML) }, window = window, unlist = FALSE)

Next, we pass columns from ff_portfolio_returns to the rolling function model.

rolling_ff_betas <- ff_portfolio_returns %>% mutate(rolling_ff = rolling_lm(R_excess, MKT_RF, SMB, HML)) %>% slice(-1:-23) %>% select(date, rolling_ff) head(rolling_ff_betas, 3) # A tibble: 3 x 2 date rolling_ff 1 2014-12-31 2 2015-01-31 3 2015-02-28

We now have a new data frame called rolling_ff_betas, in which the column rolling_ff holds an S3 object of our model results. We can tidy() that column with map(rolling_ff, tidy) and then unnest() the results, very similar to our CAPM work, except we have more than one independent variable.

rolling_ff_betas <- ff_portfolio_returns %>% mutate(rolling_ff = rolling_lm(R_excess, MKT_RF, SMB, HML)) %>% mutate(tidied = map(rolling_ff, tidy, conf.int = T)) %>% unnest(tidied) %>% slice(-1:-23) %>% select(date, term, estimate, conf.low, conf.high) %>% filter(term != "(Intercept)") %>% rename(beta = estimate, factor = term) %>% group_by(factor) head(rolling_ff_betas, 3) # A tibble: 3 x 5 # Groups: factor [3] date factor beta conf.low conf.high 1 2014-12-31 MKT_RF 0.931 0.784 1.08 2 2014-12-31 SMB -0.0130 -0.278 0.252 3 2014-12-31 HML -0.160 -0.459 0.139

We now have rolling betas and confidence intervals for each of our 3 factors. Let’s apply the same code logic and extract the rolling R-squared for our model. The only difference is we call glance() instead of tidy().

rolling_ff_rsquared <- ff_portfolio_returns %>% mutate(rolling_ff = rolling_lm(R_excess, MKT_RF, SMB, HML)) %>% slice(-1:-23) %>% mutate(glanced = map(rolling_ff, glance)) %>% unnest(glanced) %>% select(date, r.squared, adj.r.squared, p.value) head(rolling_ff_rsquared, 3) # A tibble: 3 x 4 date r.squared adj.r.squared p.value 1 2014-12-31 0.898 0.883 4.22e-10 2 2015-01-31 0.914 0.901 8.22e-11 3 2015-02-28 0.919 0.907 4.19e-11

We have extracted rolling factor betas and the rolling model R-squared, now let’s visualize.

Visualizing Rolling Fama French

We start by charting the rolling factor betas with ggplot(). This gives us an view into how the explanatory power of each factor has changed over time.

rolling_ff_betas %>% ggplot(aes(x = date, y = beta, color = factor)) + geom_line() + labs(title= "24-Month Rolling FF Factor Betas", x = "rolling betas") + scale_x_date(breaks = scales::pretty_breaks(n = 10)) + theme_minimal() + theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 90))

The rolling factor beta chart reveals some interesting trends. Both SMB and HML have hovered around zero, while the MKT factor has hovered around 1. That’s consistent with our plot of betas with confidence intervals from last time.

Next, let’s visualize the rolling R-squared with highcharter.

We first convert rolling_ff_rsquared to xts, using the tk_xts() function.

rolling_ff_rsquared_xts <- rolling_ff_rsquared %>% tk_xts(date_var = date, silent = TRUE)

Then pass the xts object to a highchart(type = "stock") code flow, adding the rolling R-squared time series with hc_add_series(rolling_ff_rsquared_xts$r.squared...).

highchart(type = "stock") %>% hc_add_series(rolling_ff_rsquared_xts$r.squared, color = "cornflowerblue", name = "r-squared") %>% hc_title(text = "Rolling FF 3-Factor R-Squared") %>% hc_add_theme(hc_theme_flat()) %>% hc_navigator(enabled = FALSE) %>% hc_scrollbar(enabled = FALSE) %>% hc_exporting(enabled = TRUE)

{"x":{"hc_opts":{"title":{"text":"Rolling FF 3-Factor R-Squared"},"yAxis":{"title":{"text":null}},"credits":{"enabled":false},"exporting":{"enabled":true},"plotOptions":{"series":{"turboThreshold":0},"treemap":{"layoutAlgorithm":"squarified"},"bubble":{"minSize":5,"maxSize":25}},"annotationsOptions":{"enabledButtons":false},"tooltip":{"delayForDisplay":10},"series":[{"data":[[1419984000000,0.898193874076082],[1422662400000,0.913613669331354],[1425081600000,0.919257460405196],[1427760000000,0.919053841151536],[1430352000000,0.920192764383095],[1433030400000,0.920671410263961],[1435622400000,0.919801526746092],[1438300800000,0.908279914990913],[1440979200000,0.92707755328544],[1443571200000,0.916184001290169],[1446249600000,0.926114696965536],[1448841600000,0.925254318145437],[1451520000000,0.923190960293236],[1454198400000,0.929828565005735],[1456704000000,0.922988355712527],[1459382400000,0.938348757232282],[1461974400000,0.937402062865212],[1464652800000,0.936268502221349],[1467244800000,0.912164668775577],[1469923200000,0.916567108467131],[1472601600000,0.918047866998973],[1475193600000,0.917334780359066],[1477872000000,0.938055499197554],[1480464000000,0.939201416001601],[1483142400000,0.938828531422516],[1485820800000,0.939091105917217],[1488240000000,0.934435224922171],[1490918400000,0.934421254157578],[1493510400000,0.934758653807486],[1496188800000,0.936641187924107],[1498780800000,0.934909707340084],[1501459200000,0.943913578310692],[1504137600000,0.932829172819464],[1506729600000,0.926037292248958],[1509408000000,0.911850534203785],[1512000000000,0.910661077924351],[1514678400000,0.916587636494134]],"color":"cornflowerblue","name":"r-squared"}],"navigator":{"enabled":false},"scrollbar":{"enabled":false}},"theme":{"colors":["#f1c40f","#2ecc71","#9b59b6","#e74c3c","#34495e","#3498db","#1abc9c","#f39c12","#d35400"],"chart":{"backgroundColor":"#ECF0F1"},"xAxis":{"gridLineDashStyle":"Dash","gridLineWidth":1,"gridLineColor":"#BDC3C7","lineColor":"#BDC3C7","minorGridLineColor":"#BDC3C7","tickColor":"#BDC3C7","tickWidth":1},"yAxis":{"gridLineDashStyle":"Dash","gridLineColor":"#BDC3C7","lineColor":"#BDC3C7","minorGridLineColor":"#BDC3C7","tickColor":"#BDC3C7","tickWidth":1},"legendBackgroundColor":"rgba(0, 0, 0, 0.5)","background2":"#505053","dataLabelsColor":"#B0B0B3","textColor":"#34495e","contrastTextColor":"#F0F0F3","maskColor":"rgba(255,255,255,0.3)"},"conf_opts":{"global":{"Date":null,"VMLRadialGradientURL":"http =//code.highcharts.com/list(version)/gfx/vml-radial-gradient.png","canvasToolsURL":"http =//code.highcharts.com/list(version)/modules/canvas-tools.js","getTimezoneOffset":null,"timezoneOffset":0,"useUTC":true},"lang":{"contextButtonTitle":"Chart context menu","decimalPoint":".","downloadJPEG":"Download JPEG image","downloadPDF":"Download PDF document","downloadPNG":"Download PNG image","downloadSVG":"Download SVG vector image","drillUpText":"Back to {series.name}","invalidDate":null,"loading":"Loading...","months":["January","February","March","April","May","June","July","August","September","October","November","December"],"noData":"No data to display","numericSymbols":["k","M","G","T","P","E"],"printChart":"Print chart","resetZoom":"Reset zoom","resetZoomTitle":"Reset zoom level 1:1","shortMonths":["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"thousandsSep":" ","weekdays":["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]}},"type":"stock","fonts":[],"debug":false},"evals":[],"jsHooks":[]}

That chart looks choppy, but the R-squared never really left the range between .9 and .95. We can tweak the minimum and maximum y-axis values for some perspective.

highchart(type = "stock") %>% hc_add_series(rolling_ff_rsquared_xts$r.squared, color = "cornflowerblue", name = "r-squared") %>% hc_title(text = "Rolling FF 3-Factor R-Squared") %>% hc_yAxis( max = 2, min = 0) %>% hc_add_theme(hc_theme_flat()) %>% hc_navigator(enabled = FALSE) %>% hc_scrollbar(enabled = FALSE) %>% hc_exporting(enabled = TRUE)

{"x":{"hc_opts":{"title":{"text":"Rolling FF 3-Factor R-Squared"},"yAxis":{"title":{"text":null},"max":2,"min":0},"credits":{"enabled":false},"exporting":{"enabled":true},"plotOptions":{"series":{"turboThreshold":0},"treemap":{"layoutAlgorithm":"squarified"},"bubble":{"minSize":5,"maxSize":25}},"annotationsOptions":{"enabledButtons":false},"tooltip":{"delayForDisplay":10},"series":[{"data":[[1419984000000,0.898193874076082],[1422662400000,0.913613669331354],[1425081600000,0.919257460405196],[1427760000000,0.919053841151536],[1430352000000,0.920192764383095],[1433030400000,0.920671410263961],[1435622400000,0.919801526746092],[1438300800000,0.908279914990913],[1440979200000,0.92707755328544],[1443571200000,0.916184001290169],[1446249600000,0.926114696965536],[1448841600000,0.925254318145437],[1451520000000,0.923190960293236],[1454198400000,0.929828565005735],[1456704000000,0.922988355712527],[1459382400000,0.938348757232282],[1461974400000,0.937402062865212],[1464652800000,0.936268502221349],[1467244800000,0.912164668775577],[1469923200000,0.916567108467131],[1472601600000,0.918047866998973],[1475193600000,0.917334780359066],[1477872000000,0.938055499197554],[1480464000000,0.939201416001601],[1483142400000,0.938828531422516],[1485820800000,0.939091105917217],[1488240000000,0.934435224922171],[1490918400000,0.934421254157578],[1493510400000,0.934758653807486],[1496188800000,0.936641187924107],[1498780800000,0.934909707340084],[1501459200000,0.943913578310692],[1504137600000,0.932829172819464],[1506729600000,0.926037292248958],[1509408000000,0.911850534203785],[1512000000000,0.910661077924351],[1514678400000,0.916587636494134]],"color":"cornflowerblue","name":"r-squared"}],"navigator":{"enabled":false},"scrollbar":{"enabled":false}},"theme":{"colors":["#f1c40f","#2ecc71","#9b59b6","#e74c3c","#34495e","#3498db","#1abc9c","#f39c12","#d35400"],"chart":{"backgroundColor":"#ECF0F1"},"xAxis":{"gridLineDashStyle":"Dash","gridLineWidth":1,"gridLineColor":"#BDC3C7","lineColor":"#BDC3C7","minorGridLineColor":"#BDC3C7","tickColor":"#BDC3C7","tickWidth":1},"yAxis":{"gridLineDashStyle":"Dash","gridLineColor":"#BDC3C7","lineColor":"#BDC3C7","minorGridLineColor":"#BDC3C7","tickColor":"#BDC3C7","tickWidth":1},"legendBackgroundColor":"rgba(0, 0, 0, 0.5)","background2":"#505053","dataLabelsColor":"#B0B0B3","textColor":"#34495e","contrastTextColor":"#F0F0F3","maskColor":"rgba(255,255,255,0.3)"},"conf_opts":{"global":{"Date":null,"VMLRadialGradientURL":"http =//code.highcharts.com/list(version)/gfx/vml-radial-gradient.png","canvasToolsURL":"http =//code.highcharts.com/list(version)/modules/canvas-tools.js","getTimezoneOffset":null,"timezoneOffset":0,"useUTC":true},"lang":{"contextButtonTitle":"Chart context menu","decimalPoint":".","downloadJPEG":"Download JPEG image","downloadPDF":"Download PDF document","downloadPNG":"Download PNG image","downloadSVG":"Download SVG vector image","drillUpText":"Back to {series.name}","invalidDate":null,"loading":"Loading...","months":["January","February","March","April","May","June","July","August","September","October","November","December"],"noData":"No data to display","numericSymbols":["k","M","G","T","P","E"],"printChart":"Print chart","resetZoom":"Reset zoom","resetZoomTitle":"Reset zoom level 1:1","shortMonths":["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"thousandsSep":" ","weekdays":["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]}},"type":"stock","fonts":[],"debug":false},"evals":[],"jsHooks":[]}

Ah, when the y-axis is zoomed out a bit, our R-squared looks consistently near 1 for the life of the portfolio.

That’s all for today. Thanks and see you next time!

_____='https://rviews.rstudio.com/2018/05/10/rolling-fama-french/';

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Format and Interpret Linear Mixed Models

Thu, 05/10/2018 - 02:00

(This article was first published on Dominique Makowski, and kindly contributed to R-bloggers)

You find it time-consuming to manually format, copy and paste output values to your report or manuscript? That time is over: the psycho package is here for you!

The data

Let’s take the example dataset included in the psycho package.

library(psycho) library(tidyverse) df <- psycho::emotion %>% select(Participant_ID, Emotion_Condition, Subjective_Valence, Autobiographical_Link) summary(df) Participant_ID Emotion_Condition Subjective_Valence Autobiographical_Link 10S : 48 Negative:456 Min. :-100.000 Min. : 0.00 11S : 48 Neutral :456 1st Qu.: -65.104 1st Qu.: 0.00 12S : 48 Median : -2.604 Median : 16.15 13S : 48 Mean : -18.900 Mean : 28.99 14S : 48 3rd Qu.: 7.000 3rd Qu.: 59.90 15S : 48 Max. : 100.000 Max. :100.00 (Other):624 NA's :1

Our dataframe (called df) contains data from several participants, exposed to neutral and negative pictures (the Emotion_Condition column). Each row corresponds to a single trial. During each trial, the participant had to rate its emotional valence (Subjective_Valence: positive – negative) experienced during the picture presentation and the amount of personal memories associated with the picture (Autobiographical_Link).

Our dataframe contains, for each of the 48 trials, 4 variables: the name of the participant (Participant_ID), the emotion condition (Emotion_Condition), the valence rating (Subjective_Valence) and the Autobiographical Link (Autobiographical_Link).

Fit the model

Let’s fit a linear mixed model to predict the autobiographical link with the condition and the subjective valence.

library(lmerTest) fit <- lmer(Autobiographical_Link ~ Emotion_Condition * Subjective_Valence + (1|Participant_ID), data=df) summary(fit) Linear mixed model fit by REML. t-tests use Satterthwaite's method [ lmerModLmerTest] Formula: Autobiographical_Link ~ Emotion_Condition * Subjective_Valence + (1 | Participant_ID) Data: df REML criterion at convergence: 8555.5 Scaled residuals: Min 1Q Median 3Q Max -2.2682 -0.6696 -0.2371 0.7052 3.2187 Random effects: Groups Name Variance Std.Dev. Participant_ID (Intercept) 243.1 15.59 Residual 661.4 25.72 Number of obs: 911, groups: Participant_ID, 19 Fixed effects: Estimate Std. Error df (Intercept) 25.52248 4.23991 31.49944 Emotion_ConditionNeutral 6.13715 2.66993 895.13045 Subjective_Valence 0.05772 0.03430 898.46616 Emotion_ConditionNeutral:Subjective_Valence 0.16140 0.05020 896.26695 t value Pr(>|t|) (Intercept) 6.020 1.09e-06 *** Emotion_ConditionNeutral 2.299 0.02176 * Subjective_Valence 1.683 0.09280 . Emotion_ConditionNeutral:Subjective_Valence 3.215 0.00135 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Correlation of Fixed Effects: (Intr) Emt_CN Sbjc_V Emtn_CndtnN -0.459 Sbjctv_Vlnc 0.455 -0.726 Emtn_CN:S_V -0.308 0.301 -0.676 The analyze function

The analyze function, available in the psycho package, transforms a model fit object into user-friendly outputs.

results <- analyze(fit, CI = 95) Summary

Summarizing an analyzed object returns a dataframe, that can be easily saved and included in reports. It also includes standardized coefficients, as well as bootstrapped confidence intervals (CI) and effect sizes.

summary(results) %>% mutate(p = psycho::format_p(p)) Variable Coef SE t df Coef.std SE.std p Effect_Size CI_lower CI_higher (Intercept) 25.52 4.24 6.02 31.50 0.00 0.00 < .001*** Very Small 17.16 33.93 Emotion_ConditionNeutral 6.14 2.67 2.30 895.13 0.10 0.04 < .05* Very Small 0.91 11.37 Subjective_Valence 0.06 0.03 1.68 898.47 0.09 0.06 = 0.09° Very Small -0.01 0.12 Emotion_ConditionNeutral:Subjective_Valence 0.16 0.05 3.22 896.27 0.13 0.04 < .01** Very Small 0.06 0.26 Print

Moreover, the print method return a nicely formatted output that can be almost directly pasted into the manuscript.

print(results) The overall model predicting Autobiographical_Link (formula = Autobiographical_Link ~ Emotion_Condition * Subjective_Valence + (1 | Participant_ID)) successfully converged and explained 32.48% of the variance of the endogen (the conditional R2). The variance explained by the fixed effects was of 7.66% (the marginal R2) and the one explained by the random effects of 24.82%. The model's intercept is at 25.52 (SE = 4.24, 95% CI [17.16, 33.93]). Within this model: - The effect of Emotion_ConditionNeutral is significant (beta = 6.14, SE = 2.67, 95% CI [0.91, 11.37], t(895.13) = 2.30, p < .05*) and can be considered as very small (std. beta = 0.098, std. SE = 0.043). - The effect of Subjective_Valence is significant (beta = 0.058, SE = 0.034, 95% CI [-0.0097, 0.12], t(898.47) = 1.68, p = 0.09°) and can be considered as very small (std. beta = 0.095, std. SE = 0.056). - The effect of Emotion_ConditionNeutral:Subjective_Valence is significant (beta = 0.16, SE = 0.050, 95% CI [0.063, 0.26], t(896.27) = 3.22, p < .01**) and can be considered as very small (std. beta = 0.13, std. SE = 0.041).

The intercept (the baseline level) corresponds, here, to the negative condition with subjective valence at 0 (the average, as the data is standardized). Compared to that, changing the condition from negative to neutral does not induce any significant change to the outcome. However, in the negative condition, there is a trending linear relationship between valence and autobiographical memories: the more an item is positive the more it is related to memories. Finally, the interaction is significant: the relationship between valence autobiographical memories is stronger (more positive) in the neutral condition.

Credits

This package helped you? You can cite psycho as follows:

  • Makowski, (2018). The psycho Package: an Efficient and Publishing-Oriented Workflow for Psychological Science. Journal of Open Source Software, 3(22), 470. https://doi.org/10.21105/joss.00470
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Dominique Makowski. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

The social weather of rOpenSci onboarding system

Thu, 05/10/2018 - 02:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Our onboarding process
ensures that packages contributed by the community undergo a
transparent, constructive, non adversarial and open review process.
Before even submitting my first R package to rOpenSci onboarding system
in December 2015, I spent a fair amount of time reading through previous
issue threads in order to assess whether onboarding was a friendly place
for me: a newbie, very motivated to learn more but a newbie nonetheless.
I soon got the feeling that yes, onboarding would help me make my
package better without ever making me feel inadequate.

More recently, I read Anne Gentle’s
essay

in Open Advice where she mentions the concept of the social weather of
an online community. By listening before I even posted, I was simply
trying to get a good idea of the social weather of onboarding – as a
side-note, how wonderful is it that onboarding’s being open makes it
possible to listen?!

In the meantime, I’ve now only submitted and reviewed a few packages but
also become an associate
editor
. In
general, when one of us editors talks about onboarding, we like to use
examples illustrating the system in a good light: often excerpts from
review processes, or quotes of tweets by authors or reviewers. Would
there be a more quantitative way for us to support our promotion
efforts? In this blog post, I shall explore how a
tidytext analysis of review processes
(via their GitHub threads) might help us characterize the social weather
of onboarding.

Preparing the data A note about text mining

In this blog post, I’ll leverage the tidytext package, with the help
of its accompanying book “Tidy text
mining”
. The authors, Julia Silge and
David Robinson, actually met and started working on this project, at
rOpenSci unconf in 2016!

The “tidy text” of the analysis

I’ve described in the first post of this
series

how I got all onboarding issue threads. Now, I’ll explain how I cleaned
their text. Why does it need to be cleaned? Well, this text contains
elements that I wished to remove: code chunks, as well as parts from our
submission and review
templates
.

My biggest worry was the removal of templates from issues. I was already
picturing my spending hours writing regular expressions to remove these
lines… and then I realized that the word “lines” was the key! I could go
split all issue comments into lines, which is called tokenization in
proper text mining vocabulary, and then remove duplicates! This way, I
didn’t even have to worry about the templates having changed a bit over
time, since each version was used at least twice. A tricky part that
remained was the removal of code chunks since I only wanted to keep
human conversation. In theory, it was easy: code chunks are lines
located between two lines containing ““`”… I’m still not sure I
solved this in the easiest possible way.

library("magrittr") threads <- readr::read_csv("data/clean_data.csv") # to remove code lines between ``` range <- function(x1, x2){ x1:x2 } # I need the indices of lines between ``` split_in_indices <- function(x){ lengthx <- length(x) if(length(x) == 0){ return(0) }else{ if(lengthx > 2){ limits1 <- x[seq(from = 1, to = (lengthx - 1), by = 2)] limits2 <- x[seq(from = 2, to = lengthx, by = 2)] purrr::map2(limits1, limits2, range) %>% unlist() %>% c() }else{ x[1]:x[2] } } } # tokenize by line threads <- tidytext::unnest_tokens(threads, line, body, token = "lines") # remove white space threads <- dplyr::mutate(threads, line = trimws(line)) # remove citations lines threads <- dplyr::filter(threads, !stringr::str_detect(line, "^\\>")) # remove the line from the template that has ``` that used to bother me threads <- dplyr::filter(threads, !stringr::str_detect(line, "bounded by ```")) # correct one line threads <- dplyr::mutate(threads, line = stringr::str_replace_all(line, "`` `", "```")) # group by comment threads <- dplyr::group_by(threads, title, created_at, user, issue) # get indices threads <- dplyr::mutate(threads, index = 1:n()) # get lines limiting chunks threads <- dplyr::mutate(threads, chunk_limit = stringr::str_detect(line, "```")&stringr::str_count(line, "`") %in% c(3, 4)) # special treatment threads <- dplyr::mutate(threads, chunk_limit = ifelse(user == "MarkEdmondson1234" & issue == 127 & index == 33, FALSE, chunk_limit)) threads <- dplyr::mutate(threads, which_limit = list(split_in_indices(which(chunk_limit)))) # weird code probably to get indices of code lines threads <- dplyr::mutate(threads, code = index %in% which_limit[[1]]) threads <- dplyr::ungroup(threads)

Let’s look at what this does in practice, with comments from
gutenbergr
submission
as
example. I chose this submission because the package author, David
Robinson, is one of the two tidytext creators, and because I was the
reviewer, so it’s all very meta, isn’t it?

In the excerpt below, we see the most important variable, the binary
code indicating whether the line is a code line. This excerpt also
shows variables created to help compute code: index is the index of
the line withing a comment, chunk_limit indicates whether the line is
a chunk limit, which_limit indicates which indices in the comment
indicate lines of code.

dplyr::filter(threads, package == "gutenbergr", user == "sckott", !stringr::str_detect(line, "ropensci..footer")) %>% dplyr::mutate(created_at = as.character(created_at)) %>% dplyr::select(created_at, line, code, index, chunk_limit, which_limit) %>% knitr::kable() created_at line code index chunk_limit which_limit 2016-05-02 17:04:56 thanks for your submission @dgrtwo – seeking reviewers now FALSE 1 FALSE 0 2016-05-04 06:09:19 reviewers: @masalmon FALSE 1 FALSE 0 2016-05-04 06:09:19 due date: 2016-05-24 FALSE 2 FALSE 0 2016-05-12 16:16:38 having a quick look over this now… FALSE 1 FALSE 0 2016-05-12 16:45:59 @dgrtwo looks great. just a minor thing: FALSE 1 FALSE 3:7 2016-05-12 16:45:59 gutenberg_get_mirror() throws a warning due to xml2, at this line https://github.com/dgrtwo/gutenbergr/blob/master/r/gutenberg_download.r#l213 FALSE 2 FALSE 3:7 2016-05-12 16:45:59 “` r TRUE 3 TRUE 3:7 2016-05-12 16:45:59 warning message: TRUE 4 FALSE 3:7 2016-05-12 16:45:59 in node_find_one(x*nod*e, *x*doc, xpath = xpath, nsmap = ns) : TRUE 5 FALSE 3:7 2016-05-12 16:45:59 101 matches for .//a: using first TRUE 6 FALSE 3:7 2016-05-12 16:45:59 “` TRUE 7 TRUE 3:7 2016-05-12 16:45:59 wonder if it’s worth a suppresswarnings() there? FALSE 8 FALSE 3:7 2016-05-12 20:42:53 great! FALSE 1 FALSE 3:5 2016-05-12 20:42:53 – add the footer to your readme: FALSE 2 FALSE 3:5 2016-05-12 20:42:53 “` TRUE 3 TRUE 3:5 2016-05-12 20:42:53 “` TRUE 5 TRUE 3:5 2016-05-12 20:42:53 – could you add url and bugreports entries to description, so people know where to get sources and report bugs/issues FALSE 6 FALSE 3:5 2016-05-12 20:42:53 – update installation of dev versions to ropenscilabs/gutenbergr and any urls for the github repo to ropenscilabs instead of dgrtwo FALSE 7 FALSE 3:5 2016-05-12 20:42:53 – go to the repo settings –> transfer ownership and transfer to ropenscilabs – note that all our newer pkgs go to ropenscilabs first, then when more mature we’ll move to ropensci FALSE 8 FALSE 3:5 2016-05-13 01:22:22 nice, builds on at travis https://travis-ci.org/ropenscilabs/gutenbergr/ – you can keep appveyor builds under your acct, or i can start on mine, let me know FALSE 1 FALSE 0 2016-05-13 16:06:31 updated badge link, started an appveyor account with ropenscilabs as account name – sent pr – though the build is failing, something about getting the current gutenberg url https://ci.appveyor.com/project/sckott/gutenbergr/build/1.0.1#l650 FALSE 1 FALSE 0

So as you see now getting rid of chunks is straightforward: the lines
with code == TRUE have to be deleted.

# remove them and get rid of now useless columns threads <- dplyr::filter(threads, !code) threads <- dplyr::select(threads, - code, - which_limit, - index, - chunk_limit)

Now on to removing template parts… I noticed that removing duplicates
was a bit too drastic because sometimes duplicates were poorly formatted
citations, e.g. an author answering a reviewer’s question by
copy-pasting it without Markdown
blockquotes
,
in which case we definitely want to keep the first occurrence. Besides,
duplicates were sometimes very short sentences such as “great!” that are
not templates, that we therefore should keep. Therefore, for each line,
I counted how many times it occurred overall (no_total_occ), and in
how many issues it occurred (no_issues).

Let’s look at Joseph Stachelek’s review of
rrricanes

for instance.

dplyr::filter(threads, user == "jsta", is_review) %>% dplyr::select(line) %>% head() %>% knitr::kable() line ## package review – [x] as the reviewer i confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review). #### documentation the package includes all the following forms of documentation: – [x] a statement of need clearly stating problems the software is designed to solve and its target audience in readme – [x] installation instructions: for the development version of package and any non-standard dependencies in readme

Now if we clean up a bit…

threads <- dplyr::group_by(threads, line) threads <- dplyr::mutate(threads, no_total_occ = n(), no_issues = length(unique(issue)), size = stringr::str_length(line)) threads <- dplyr::ungroup(threads) threads <- dplyr::group_by(threads, issue, line) threads <- dplyr::arrange(threads, created_at) threads <- dplyr::filter(threads, no_total_occ <= 2, # for repetitions in total keep the short ones # bc they are stuff like "thanks" so not template # yes 10 is arbitrary no_issues <= 1 | size < 10) # when there's a duplicate in one issue # it's probably citation # so keep the first occurrence get_first <- function(x){ x[1] } threads <- dplyr::group_by(threads, issue, line) threads <- dplyr::summarise_all(threads, get_first) threads <- dplyr::select(threads, - no_total_occ, - size, - no_issues) threads <- dplyr::mutate(threads, # let code words now be real words line = stringr::str_replace_all(line, "`", ""), # only keep text from links, not the links themselves line = stringr::str_replace_all(line, "\\]\\(.*\\)", ""), line = stringr::str_replace_all(line, "\\[", ""), line = stringr::str_replace_all(line, "blob\\/master", ""), # ’ line = stringr::str_replace_all(line, "’", ""), # remove some other links line = stringr::str_replace_all(line, "https\\:\\/\\/github\\.com\\/", "")) threads <- dplyr::filter(threads, !stringr::str_detect(line, "estimated hours spent reviewing")) threads <- dplyr::filter(threads, !stringr::str_detect(line, "notifications@github\\.com")) threads <- dplyr::filter(threads, !stringr::str_detect(line, "reply to this email directly, view it on")) threads <- dplyr::ungroup(threads)

Here is what we get from the same review.

dplyr::filter(threads, user == "jsta", is_review) %>% dplyr::select(line) %>% head() %>% knitr::kable() line * also, you might consider using the skip_on_cran function for lines that call an external download as recommended by the ropensci packaging guide. * i am having timeout issues with building the getting_started vignette. i wonder if there is a particular year with very few hurricanes that would solve the timeout problem. * i cannot build the data.world vignette. probably because i don’t have an api key set up. you may want to consider setting the code chunks to eval=false. * i really like the tidy_ functions. i wonder if it would make it easier on the end-user to have the get_ functions return tidy results by default with an optional argument to return “messy” results. * missing a maintainer field in the description * there are no examples for knots_to_mph, mb_to_in, status_abbr_to_str, get_discus, get_fstadv, tidy_fstadv, tidy_wr, tidy_fcst. maybe some can be changed to non-exported?

So now, we mostly got the interesting human and original language.

This got me “tidy enough” text. Let’s not mention this package author
who found a way to poorly format their
submission
right
under a guideline explaining how to copy the DESCRIPTION… Yep, that’s
younger me. Oh well.

Computing sentiment

Another data preparation part was to compute the sentiment score of each
line via the sentimentr
package by Tyler Rinker, which computes a score for sentences, not for
single words.

sentiment <- all %>% dplyr::group_by(line, created_at, user, role, issue) %>% dplyr::mutate(sentiment = median(sentimentr::sentiment(line)$sentiment)) %>% dplyr::ungroup() %>% dplyr::select(line, created_at, user, role, issue, sentiment)

This dataset of sentiment will be used later in the post.

Tidy text analysis of onboarding social weather What do reviews talk about?

To find out what reviews deal with as if I didn’t know about our
guidelines, I’ll compute the frequency of words and bigrams, and the
pairwise correlation of words within issue comments.

My using lollipops below was inspired by this fascinating blog post of
Tony ElHabr’s about his Google search
history
.

library("ggplot2") library("ggalt") library("hrbrthemes") stopwords <- rcorpora::corpora("words/stopwords/en")$stopWords word_counts <- threads %>% tidytext::unnest_tokens(word, line) %>% dplyr::filter(!word %in% stopwords) %>% dplyr::count(word, sort = TRUE) %>% dplyr::mutate(word = reorder(word, n)) ggplot(word_counts[1:15,]) + geom_lollipop(aes(word, n), size = 2, col = "salmon") + hrbrthemes::theme_ipsum(base_size = 16, axis_title_size = 16) + coord_flip()

bigrams_counts <- threads %>% tidytext::unnest_tokens(bigram, line, token = "ngrams", n = 2) %>% tidyr::separate(bigram, c("word1", "word2"), sep = " ", remove = FALSE) %>% dplyr::filter(!word1 %in% stopwords) %>% dplyr::filter(!word2 %in% stopwords) %>% dplyr::count(bigram, sort = TRUE) %>% dplyr::mutate(bigram = reorder(bigram, n)) ggplot(bigrams_counts[2:15,]) + geom_lollipop(aes(bigram, n), size = 2, col = "salmon") + hrbrthemes::theme_ipsum(base_size = 16, axis_title_size = 16) + coord_flip()

I’m not showing the first bigram that basically shows I’ve an encoding
issue to solve with a variation of “´”. In any case, both figures show
what we care about, like our guidelines that are mentioned often, and
documentation. I think words absent from the figures such as performance
or speed also highlight what we care less about, following Jeff Leek’s
philosophy
.

Now, let’s move on to a bit more complex visualization of pairwise
correlations between
words

within lines. First, let’s prepare the table of words in lines. Compared
with the book
tutorial
,
we add a condition for eliminating words mentioned in only one
submission, often function names.

users <- unique(threads$user) onboarding_line_words <- threads %>% dplyr::group_by(user, issue, created_at, package, line) %>% dplyr::mutate(line_id = paste(package, user, created_at, line)) %>% dplyr::ungroup() %>% tidytext::unnest_tokens(word, line) %>% dplyr::filter( word != package, !word %in% users, is.na(as.numeric(word)), word != "ldecicco", word != "usgs") %>% dplyr::group_by(word) %>% dplyr::filter(length(unique(issue)) > 1) %>% dplyr::select(line_id, word) onboarding_line_words %>% head() %>% knitr::kable() line_id word rrlite karthik 2015-04-12 20:56:04 – ] add a ropensci footer. add rrlite karthik 2015-04-12 20:56:04 – ] add a ropensci footer. a rrlite karthik 2015-04-12 20:56:04 – ] add a ropensci footer. ropensci rrlite karthik 2015-04-12 20:56:04 – ] add a ropensci footer. footer rrlite karthik 2015-04-12 20:56:04 – ] add an appropriate entry into ropensci.org/packages/index.html add rrlite karthik 2015-04-12 20:56:04 – ] add an appropriate entry into ropensci.org/packages/index.html an

Then, we can compute the correlation.

word_cors <- onboarding_line_words %>% dplyr::group_by(word) %>% dplyr::filter(!word %in% stopwords) %>% dplyr::filter(n() >= 20) %>% widyr::pairwise_cor(word, line_id, sort = TRUE)

For instance, what often goes in the same line as vignette?

dplyr::filter(word_cors, item1 == "vignette") ## # A tibble: 853 x 3 ## item1 item2 correlation ## ## 1 vignette readme 0.176 ## 2 vignette vignettes 0.174 ## 3 vignette chunk 0.145 ## 4 vignette eval 0.120 ## 5 vignette examples 0.108 ## 6 vignette overview 0.0933 ## 7 vignette building 0.0914 ## 8 vignette link 0.0863 ## 9 vignette maps 0.0840 ## 10 vignette package 0.0831 ## # ... with 843 more rows

Now let’s plot the network of these relationships between words, using
the igraph package by Gábor Csárdi and Támas
Nepusz and ggraph package by
Thomas Lin Pedersen.

library("igraph") library("ggraph") set.seed(2016) word_cors %>% dplyr::filter(correlation > .35) %>% graph_from_data_frame() %>% ggraph(layout = "fr") + geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) + geom_node_point(color = "lightblue", size = 5) + geom_node_text(aes(label = name), repel = TRUE) + theme_void()

This figure gives a good sample of things discussed in reviews. Despite
our efforts filtering words specific to issues, some of them remain very
specific, such as country/city/location that are very frequent in
ropenaq review.

How positive is onboarding?

Using sentiment analysis, we can look at how positive comments are.

sentiments %>% dplyr::group_by(role) %>% skimr::skim(sentiment) ## Skim summary statistics ## n obs: 11553 ## n variables: 6 ## group variables: role ## ## Variable type: numeric ## role variable missing complete n mean sd min p25 ## author sentiment 0 4823 4823 0.07 0.21 -1.2 0 ## community_manager sentiment 0 97 97 0.13 0.21 -0.41 0 ## editor sentiment 0 1521 1521 0.13 0.22 -1.63 0 ## other sentiment 0 344 344 0.073 0.2 -0.6 0 ## reviewer sentiment 0 4768 4768 0.073 0.21 -1 0 ## median p75 max hist ## 0 0.17 1.84 ## 0.071 0.23 1 ## 0.075 0.25 1.13 ## 0 0.2 0.81 ## 0 0.17 1.73 summary(sentiments$sentiment) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -1.63200 0.00000 0.00000 0.07961 0.18353 1.84223 sentiments %>% dplyr::filter(!role %in% c("other", "community_manager")) %>% ggplot(aes(role, sentiment)) + geom_boxplot(fill = "salmon") + hrbrthemes::theme_ipsum(base_size = 16, axis_title_size = 16, strip_text_size = 16)

These boxplots seem to indicate that lines are generally positive
(positive mean, zero 25th-quantile), although it’d be better to be able
to compare them with text from traditional review processes of
scientific manuscripts in order to get a better feeling for the meaning
of these numbers.

On these boxplots we also see that we do get lines with a negative
sentiment value… about what? Here are the most common words in negative
lines.

sentiments %>% dplyr::filter(sentiment < 0) %>% tidytext::unnest_tokens(word, line) %>% dplyr::filter(!word %in% stopwords) %>% dplyr::count(word, sort = TRUE) %>% dplyr::mutate(word = reorder(word, n)) %>% dplyr::filter(n > 100) %>% ggplot() + geom_lollipop(aes(word, n), size = 2, col = "salmon") + hrbrthemes::theme_ipsum(base_size = 16, axis_title_size = 16) + coord_flip()

And looking at a sample…

sentiments %>% dplyr::arrange(sentiment) %>% dplyr::select(line, sentiment) %>% head(n = 15) %>% knitr::kable() line sentiment @ultinomics no more things, although do make sure to add more examples – perhaps open an issue ropenscilabs/gtfsr/issues to remind yourself to do that, -1.6320000 not sure what you mean, but i’ll use different object names to avoid any confusion (ropenscilabs/mregions#24) -1.2029767 error in .local(.object, …) : -1.0000000 error: -1.0000000 #### miscellaneous -1.0000000 error: command failed (1) -0.8660254 – get_plate_size_from_number_of_columns: maybe throwing an error makes more sense than returning a string indicating an error -0.7855844 this code returns an error, which is good, but it would be useful to return a more clear error. filtering on a non-existant species results in a 0 “length” onekp object (ok), but then the download_* functions return a curl error due to a misspecified url. -0.7437258 0 errors | 0 warnings | 0 notes -0.7216878 once i get to use this package more, i’m sure i’ll have more comments/issues but for the moment i just want to get this review done so it isn’t a blocker. -0.7212489 – i now realize i’ve pasted the spelling mistakes without thinking too much about us vs. uk english, sorry. -0.7071068 minor issues: -0.7071068 ## minor issues -0.7071068 replicates issue -0.7071068 visualization issue -0.7071068

It seems that negative lines are mostly people discussing bugs and
problems in code, and GitHub issues, and trying to solve them. The kind
of negative lines we’re happy to see in our process, since once solved,
they mean the software got more robust!

Last but not least, I mentioned our using particular cases as examples
of how happy everyone seems to be in the process. To find such examples,
we rely on memory, but what about picking heart-warming lines using
their sentiment score?

sentiments %>% dplyr::arrange(- sentiment) %>% dplyr::select(line, sentiment) %>% head(n = 15) %>% knitr::kable() line sentiment absolutely – it’s really important to ensure it really has been solved! 1.842234 overall, really easy to use and really nicely done. 1.733333 this package is a great and lightweight addition to working with rdf and linked data in r. coming after my review of the codemetar package which introduced me to linked data, i found this a great learning experience into a topic i’ve become really interested in but am still quite novice in so i hope my feedback helps to appreciate that particular pov. 1.463226 i am very grateful for your approval and i very much look forward to collaborating with you and the ropensci community. 1.256935 thank you very much for the constructive thoughts. 1.237437 thanks for the approval, all in all a very helpful and educational process! 1.217567 – really good use of helper functions 1.139013 – i believe the utf note is handled correctly and this is just a snafu in goodpractice, but i will seek a reviewer with related expertise in ensuring that all unicode is handled properly. 1.132201 seem more unified and consistent. 1.126978 very much appreciated! 1.125833 – well organized, readable code 1.100000 – wow very extensive testing! well done, very thorough 1.100000 – i’m delighted that you find my work interesting and i’m very keen to help, contribute and collaborate in any capacity. 1.084493 thank you very much for your thorough and thoughtful review, @batpigandme ! this is great feedback, and i think that visdat will be much improved because of these reviews. 1.083653 great, thank you very much for accepting this package. i am very grateful about the reviews, which were very helpful to improve this package! 1.074281

As you can imagine, these sentences make the whole team very happy! And
we hope they’ll encourage you to contribute to rOpenSci onboarding.

Outlook

This first try at text analysis of onboarding issue threads is quite
promising: we were able to retrieve text and to use natural language
processing to extract most common words and bigrams, and sentiment. This
allowed us to describe the social weather of onboarding: we could see
that this system is about software, and that negative sentiment was
often due to bugs being discussed and solved; and we could extract the
most positive lines where volunteers praised the review system or the
piece of software under review.

One could expand this analysis with a study of emoji use, in the text
using an emoji dictionary as in this blog
post
and around the
text (so-called emoji reactions, present in the API and used in e.g
ghrecipes::get_issues_thumbs).
Another aspect of social weather is maybe the timeliness that’s expected
or implemented at the different parts of the process, but it’d be
covered by other data such as the labelling history of the issues, which
could get extracted from GitHub V4 API as well.

This is the final post of the “Our package review system in review”
series. The first post presented data collection from
GitHub
,
the second post aimed at quantifying the work represented by
onboarding
.
The posts motivated us to keep using data to illustrate and assess the
system. Brace yourself for more onboarding data analyses in the future!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Pages