
Notebooks from the Practical AI Workshop

Thu, 01/03/2019 - 17:38

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Last month, I delivered the one-day workshop Practical AI for the Working Software Engineer at the Artificial Intelligence Live conference in Orlando. As the title suggests, the workshop was aimed at developers, but I didn't assume any particular programming language background. In addition to the lecture slides, the workshop was delivered as a series of Jupyter notebooks. I ran them using Azure Notebooks (which meant the participants had nothing to install and very little to set up), but you can run them in any Jupyter environment you like, as long as it has access to R and Python. You can download the notebooks and slides from this GitHub repository (and feedback is welcome there, too). 

The workshop was divided into five sections, each with its associated Notebook:

  1. The AI behind Seeing AI. Use the web interfaces to Cognitive Services to learn about the AI services behind the "Seeing AI" app
  2. Computer Vision API with R. Use an R script to interact with the Computer Vision API and generate captions for random Wikimedia images (a rough sketch of such a call follows this list).
  3. Custom Vision with R. An R function to classify an image as a "Hot Dog" or "Not Hot Dog", using the Custom Vision service.
  4. MNIST with scikit-learn. Use scikit-learn to build a digit recognizer for the MNIST data using a regression model.
  5. MNIST with Tensorflow. Use Tensorflow (from Python) to build a digit recognizer for the MNIST data using a convolutional neural network.
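
As a rough illustration of the kind of call item 2 makes (not the workshop's exact code; the region, key, and image URL below are placeholders, and the endpoint and response fields follow the Computer Vision v2.0 REST API as I understand it):

library(httr)

vision_endpoint <- "https://westus.api.cognitive.microsoft.com/vision/v2.0/analyze"  # placeholder region
vision_key      <- "YOUR-COGNITIVE-SERVICES-KEY"                                     # placeholder key
img_url         <- "https://upload.wikimedia.org/wikipedia/commons/3/3c/Shaki_waterfall.jpg"

res <- POST(
  vision_endpoint,
  query = list(visualFeatures = "Description"),
  add_headers(`Ocp-Apim-Subscription-Key` = vision_key),
  body = list(url = img_url),
  encode = "json"
)

# the service returns a ranked list of candidate captions; take the top one
content(res)$description$captions[[1]]$text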

The workshop was a practical version of a talk I also gave at AI Live, "Getting Started with Deep Learning", and I've embedded those slides below. 


Azure Notebooks: Practical AI for the Working Software Engineer


‘data:’ Scraping & Chart Reproduction : Arrows of Environmental Destruction

Thu, 01/03/2019 - 16:26

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Today’s RSS feeds picked up this article by Marianne Sullivan, Chris Sellers, Leif Fredrickson, and Sarah Lamdanon on the woeful state of enforcement actions by the U.S. Environmental Protection Agency (EPA). While there has definitely been overreach by the EPA in the past the vast majority of its regulatory corpus is quite sane and has made Americans safer and healthier as a result. What’s happened to an EPA left in the hands of evil (yep, “evil”) in the past two years is beyond lamentable and we likely have two more years of lamenting ahead of us (unless you actually like your water with a coal ash chaser).

The authors of the article made this chart to show the stark contrast between 2017 and 2018 when it comes to regulatory actions for eight acts:

  • Clean Air Act (CAA)
  • Clean Water Act (CWA)
  • Emergency Planning and Community Right to Know Act (EPCRA)
  • Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA)
  • Resource Conservation and Recovery Act (RCRA)
  • Safe Drinking Water Act (SDWA)
  • Toxic Substances Control Act (TSCA)
  • Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA)

They made this arrow chart (via Datawrapper):

For some reason, that chart sparked an “I really need to make that in R” moment, and thus begat this post.

I’ve got a geom for dumbbell charts but that’s not going to work for this arrow chart since I really wanted to (mostly) reproduce it the way it was. Here’s my go at it.

Data First

Datawrapper embeds have a handy “Get the data” link in them but it’s not a link to a file. It’s a javascript-generated data: href so you either need to click on the link and download it or be hard-headed like I am and go the way of pain and scrape it (reproducibility FTW). Let’s get packages and data gathering code out of the way. I’ll exposit a bit more about said data gathering after the code block:

library(stringi)
library(rvest)
library(hrbrthemes) # git[la|hu]b / hrbrmstr / hrbrthemes
library(tidyverse)

article <- read_html("https://theconversation.com/the-epa-has-backed-off-enforcement-under-trump-here-are-the-numbers-108640")

html_node(article, "iframe#psm7n") %>%                      # find the iframe
  html_attr("src") %>%                                      # get iframe URL
  read_html() %>%                                           # read it in
  html_node(xpath=".//script[contains(., 'data: ')]") %>%   # find the javascript section with the data
  html_text() %>%                                           # get that section
  stri_split_lines() %>%                                    # split into lines so we can target the actual data element
  unlist() %>%
  keep(stri_detect_fixed, 'data: "Fiscal') %>%              # just get the data line
  stri_trim_both() %>%                                      # prep it for extraction
  stri_replace_first_fixed('data: "', "") %>%
  stri_replace_last_fixed('"', "") %>%
  stri_replace_all_fixed("\\n", "\n") %>%                   # make lines lines
  stri_split_lines() %>%
  unlist() %>%
  stri_split_fixed("\\t") %>%                               # we now have a list of vectors
  map_dfc(~set_names(list(.x[2:length(.x)]), .x[1])) %>%    # first element of each vector is colname
  type_convert(col_types = "cddn") %>%                      # get real types
  set_names(c("act", "y2018", "y2017", "pct")) -> psm

psm
## # A tibble: 8 x 4
##   act    y2018 y2017   pct
##   <chr>  <dbl> <dbl> <dbl>
## 1 CAA      199   405   -51
## 2 CERCLA   147   194   -24
## 3 CWA      320   565   -43
## 4 EPCRA     56   107   -48
## 5 FIFRA    363   910   -60
## 6 RCRA     149   275   -46
## 7 SDWA     121   178   -32
## 8 TSCA      80   152   -47

Inside the main article URL content there’s an iframe load:

We grab the contents of that iframe link (https://datawrapper.dwcdn.net/psm7n/2/) which has a data: line way down towards the bottom of one of the last javascript blocks:

That ugly line gets transformed into a link that will download as a normal CSV file, but we have to do the above wrangling on it before we can get it into a format we can work with.

Now, we can make the chart.

Chart Time!

Let’s get the Y axis in the right order:

psm %>% arrange(desc(y2017)) %>% mutate(act = factor(act, levels = rev(act))) -> psm

Next, we setup X axis breaks and also get the max value for some positioning calculations (so we don’t hardcode values):

# setup x axis breaks and max value for label position computation
x_breaks <- pretty(c(psm$y2018, psm$y2017))
max_val <- max(x_breaks)

I have two minor nitpicks about the original chart (and changes to them as a result). First, I really don’t like the Y axis gridlines but I do believe we need something to help the eye move horizontally and associate each label to its respective geom. Instead of gridlines I opt for a diminutive dotted line from 0 to the first (min) value.

The second nitpick is that — while the chart has the act information in the caption area — the caption is in alpha order vs the order the act acronyms appear in the data. If it was an alpha bullet list I might not complain, but I chose to modify the order to fit the chart, which we build dynamically with the help of this vector:

# act info for caption
c(
  "CAA" = "Clean Air Act (CAA)",
  "CWA" = "Clean Water Act (CWA)",
  "EPCRA" = "Emergency Planning and Community Right to Know Act (EPCRA)",
  "FIFRA" = "Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA)",
  "RCRA" = "Resource Conservation and Recovery Act (RCRA)",
  "SDWA" = "Safe Drinking Water Act (SDWA)",
  "TSCA" = "Toxic Substances Control Act (TSCA)",
  "CERCLA" = "Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA)"
) -> acts

w125 <- scales::wrap_format(125) # help us word wrap at ~125 chars

# order the vector and turn it into wrapped lines
act_info <- w125(paste0(unname(acts[as.character(psm$act)]), collapse = "; "))

Now, we can generate the geoms. It looks like a lot of code, but I like to use newlines to help structure ggplot2 calls. I still miss my old gg <- gg + idiom but RStudio makes it way too easy to execute the whole expression with just the use of + so I’ve succumbed to their behaviour modification. To break it down w/o code, we essentially need:

  • the arrows for each act
  • the 2017 and 2018 direct label values for each act
  • the 2017 and 2018 top “titles”
  • segments for ^^
  • title, subtitle and caption(s)

We use percent-maths to position labels and other objects so the code can be re-used for other arrow plots (hardcoding to the data values is likely fine, but you’ll end up tweaking the numbers more and wasting ~2-5m per new chart).

ggplot(psm) +
  # dots from 0 to minval
  geom_segment(
    aes(0, act, xend = y2018, yend = act),
    linetype = "dotted", color = "#b2b2b2", size = 0.33
  ) +
  # minval label
  geom_label(
    aes(y2018, act, label = y2018),
    label.size = 0, hjust = 1, size = 3.5, family = font_rc
  ) +
  # maxval label
  geom_label(
    aes(y2017 + (0.0015 * y2017), act, label = y2017),
    label.size = 0, hjust = 0, size = 3.5, family = font_rc
  ) +
  # the measure line+arrow
  geom_segment(
    aes(y2018, act, xend = y2017, yend = act),
    color = "#4a90e2", size = 0.75, # I pulled the color value from the original chart
    arrow = arrow(ends = "first", length = unit(5, "pt"))
  ) +
  # top of chart year (min)
  geom_label(
    data = head(psm, 1),
    aes(y2018, 9, label = "2018"),
    hjust = 0, vjust = 1, label.size = 0, size = 3.75,
    family = font_rc, color = ft_cols$slate
  ) +
  # top of chart year (max)
  geom_label(
    data = head(psm, 1),
    aes(y2017, 9, label = "2017"),
    hjust = 1, vjust = 1, label.size = 0, size = 3.75,
    family = font_rc, color = ft_cols$slate
  ) +
  # bar from top of chart year label to first minval measure
  geom_segment(
    data = head(psm, 1),
    aes(
      y2018 + (0.005 * max_val), 8.5,
      xend = y2018 + (0.005 * max_val), yend = 8.25
    ),
    size = 0.25
  ) +
  # bar from top of chart year label to first maxval measure
  geom_segment(
    data = head(psm, 1),
    aes(
      y2017 - (0.005 * max_val), 8.5,
      xend = y2017 - (0.005 * max_val), yend = 8.25
    ),
    size = 0.25
  ) +
  # fix x axis scale and place breaks
  scale_x_comma(limits = c(0, max_val), breaks = seq(0, max_val, 200)) +
  # make room for top "titles"
  scale_y_discrete(expand = c(0, 1)) +
  labs(
    y = NULL,
    title = "Decline by statute",
    subtitle = "The number of civil cases the EPA brought to conclusion has dropped across a number of federal statutes,\nincluding the Clean Air Act (CAA) and others.",
    x = act_info,
    caption = "Original Chart/Data: The Conversation, CC-BY-ND; Source: Environmental Data & Government Initiative"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(axis.text.x = element_text(color = ft_cols$slate)) +
  theme(axis.title.x = element_text(
    hjust = 0, size = 10, face = "italic",
    color = ft_cols$gray, margin = margin(t = 10)
  )) +
  theme(plot.caption = element_text(hjust = 0))

Here’s the result:

(it even looks ok in “batman” mode):

FIN

With Microsoft owning GitHub I’m not using gists anymore and the GitLab “snippets” equivalent is just too dog-slow to use, so starting in 2019 I’m self-hosting contiguous R example code used in the blog posts. For the moment, that means links to plain R files but I may just setup gitea for them sometime before the end of Q1. You can find a contiguous, commented version of the above code here.

If you do your own makeover don’t forget to drop a link to your creation(s) in the comments!


x-mas tRees with gganimate, ggplot, plotly and friends

Thu, 01/03/2019 - 09:37

(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)

For the last homework before Christmas I asked my students from DataVisTechniques to create a “Christmas style” data visualization in R or Python (based on simulated data).

Libraries like rbokeh, ggiraph, vegalite, shiny+ggplot2 or plotly were popular last year. This year there are also some nice submissions that use gganimate.

Find source codes here. Plots created last year are here.
And here are homeworks from this year.

Trees created with gganimate (and gifski)
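
Since the animated GIFs themselves don’t come through here, below is a toy sketch of the general idea (my own illustration, not one of the student submissions): a point “tree” built up row by row with transition_states() and shadow_mark().

library(ggplot2)
library(gganimate)

# stack rows of points into a rough triangle; each row is one animation state
tree <- data.frame(
  x = unlist(lapply(1:10, function(i) seq(-i, i))),
  y = -rep(1:10, times = 2 * (1:10) + 1)
)

ggplot(tree, aes(x, y)) +
  geom_point(colour = "darkgreen", size = 4) +
  theme_void() +
  transition_states(y) +  # reveal one row per state...
  shadow_mark()           # ...and keep the earlier rows on screen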



Trees created with ggplot2 (and sometimes shiny)









Trees created with plotly


Trees created with Python


Trees created with rbokeh


Trees created with vegalite


Trees created with ggiraph


Check Machin-like formulae with arbitrary-precision arithmetic

Thu, 01/03/2019 - 09:00

(This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)

Happy New Year to all of you! Let us start the year with something for your inner maths nerd.

Source: Wikimedia

For those of you who don’t yet know Rosetta Code: it is a really cool site where you can find lots of interesting code examples in all kinds of different languages for many different tasks. Of course R is also present big time (at the time of writing 426 code examples!): Rosetta Code for R.

The name of the site is inspired by the famous Rosetta Stone of Ancient Egypt, which is inscribed with three different versions of the same text: in Ancient Egyptian hieroglyphs, Demotic script, and Ancient Greek script. It proved invaluable in deciphering Egyptian hieroglyphs and thereby opened the window into ancient Egyptian history.

Now, a few days ago I again added an example (for the other tasks I solved I will write more posts in the future, so stay tuned!). The task is to verify the correctness of Machin-like formulae using exact arithmetic.

A little bit of mathematical background is in order, so Wikipedia to the rescue:

Machin-like formulae are a popular technique for computing π to a large number of digits. They are generalizations of John Machin's formula from 1706:

    π/4 = 4*atan(1/5) - atan(1/239)

which he used to compute π to 100 decimal places.

Machin-like formulae have the form

    c_0 * π/4 = c_1*atan(a_1/b_1) + c_2*atan(a_2/b_2) + … + c_N*atan(a_N/b_N)

where a_n and b_n are positive integers such that a_n < b_n, c_n is a signed non-zero integer, and c_0 is a positive integer.

The exact task is to verify that the following Machin-like formulae are correct by calculating the value of tan (right hand side) for each equation using exact arithmetic and showing they equal one:

    π/4 = atan(1/2) + atan(1/3)
    π/4 = 2*atan(1/3) + atan(1/7)
    π/4 = 4*atan(1/5) - atan(1/239)
    π/4 = 5*atan(1/7) + 2*atan(3/79)
    π/4 = 5*atan(29/278) + 7*atan(3/79)
    π/4 = atan(1/2) + atan(1/5) + atan(1/8)
    π/4 = 4*atan(1/5) - atan(1/70) + atan(1/99)
    π/4 = 5*atan(1/7) + 4*atan(1/53) + 2*atan(1/4443)
    π/4 = 6*atan(1/8) + 2*atan(1/57) + atan(1/239)
    π/4 = 8*atan(1/10) - atan(1/239) - 4*atan(1/515)
    π/4 = 12*atan(1/18) + 8*atan(1/57) - 5*atan(1/239)
    π/4 = 16*atan(1/21) + 3*atan(1/239) + 4*atan(3/1042)
    π/4 = 22*atan(1/28) + 2*atan(1/443) - 5*atan(1/1393) - 10*atan(1/11018)
    π/4 = 22*atan(1/38) + 17*atan(7/601) + 10*atan(7/8149)
    π/4 = 44*atan(1/57) + 7*atan(1/239) - 12*atan(1/682) + 24*atan(1/12943)
The same should be done for the last and most complicated case…

    π/4 = 88*atan(1/172) + 51*atan(1/239) + 32*atan(1/682) + 44*atan(1/5357) + 68*atan(1/12943)

… but it should be confirmed that the following, slightly changed, formula is incorrect by showing tan (right hand side) is not one:

    π/4 = 88*atan(1/172) + 51*atan(1/239) + 32*atan(1/682) + 44*atan(1/5357) + 68*atan(1/12944)
This is what I contributed to Rosetta Code:

library(Rmpfr)
prec <- 1000 # precision in bits

`%:%` <- function(e1, e2) '/'(mpfr(e1, prec), mpfr(e2, prec)) # operator %:% for high precision division

# function for checking identity of tan of expression and 1,
# making use of high precision division operator %:%
tanident_1 <- function(x)
  identical(round(tan(eval(parse(text = gsub("/", "%:%", deparse(substitute(x)))))), (prec/10)), mpfr(1, prec))

tanident_1( 1*atan(1/2) + 1*atan(1/3) )
## [1] TRUE
tanident_1( 2*atan(1/3) + 1*atan(1/7))
## [1] TRUE
tanident_1( 4*atan(1/5) + -1*atan(1/239))
## [1] TRUE
tanident_1( 5*atan(1/7) + 2*atan(3/79))
## [1] TRUE
tanident_1( 5*atan(29/278) + 7*atan(3/79))
## [1] TRUE
tanident_1( 1*atan(1/2) + 1*atan(1/5) + 1*atan(1/8) )
## [1] TRUE
tanident_1( 4*atan(1/5) + -1*atan(1/70) + 1*atan(1/99) )
## [1] TRUE
tanident_1( 5*atan(1/7) + 4*atan(1/53) + 2*atan(1/4443))
## [1] TRUE
tanident_1( 6*atan(1/8) + 2*atan(1/57) + 1*atan(1/239))
## [1] TRUE
tanident_1( 8*atan(1/10) + -1*atan(1/239) + -4*atan(1/515))
## [1] TRUE
tanident_1(12*atan(1/18) + 8*atan(1/57) + -5*atan(1/239))
## [1] TRUE
tanident_1(16*atan(1/21) + 3*atan(1/239) + 4*atan(3/1042))
## [1] TRUE
tanident_1(22*atan(1/28) + 2*atan(1/443) + -5*atan(1/1393) + -10*atan(1/11018))
## [1] TRUE
tanident_1(22*atan(1/38) + 17*atan(7/601) + 10*atan(7/8149))
## [1] TRUE
tanident_1(44*atan(1/57) + 7*atan(1/239) + -12*atan(1/682) + 24*atan(1/12943))
## [1] TRUE
tanident_1(88*atan(1/172) + 51*atan(1/239) + 32*atan(1/682) + 44*atan(1/5357) + 68*atan(1/12943))
## [1] TRUE
tanident_1(88*atan(1/172) + 51*atan(1/239) + 32*atan(1/682) + 44*atan(1/5357) + 68*atan(1/12944))
## [1] FALSE

As you can see all statements are TRUE except for the last one!

In the code I make use of the Rmpfr package (from Martin Maechler of ETH Zürich, Switzerland) which is based on the excellent GMP (GNU Multiple Precision) library. I define a new infix operator %:% for high-precision division and after that convert all standard divisions in the formulae to high-precision divisions and calculate the tan. Before I check if the result is identical to one I round it to 100 decimal places, which is more than enough given the precision of 1,000 bits (about 300 decimal places) used in the example.
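
As a quick aside (mine, not part of the Rosetta Code entry), the bits-to-digits conversion behind that “about 300 decimal places” is just log10(2) ≈ 0.301 decimal digits per bit:

prec <- 1000       # precision in bits, as in the code above
prec * log10(2)    # decimal digits that 1000 bits can represent
## [1] 301.03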

Please let me know in the comments what you think of this approach and whether you see room for improvement for the code – Thank you!


Icon making with ggplot2 and magick

Thu, 01/03/2019 - 01:00

(This article was first published on R on YIHAN WU, and kindly contributed to R-bloggers)

Icon
noun
A person or thing regarded as a representative symbol or as worthy of veneration.
A symbol or graphic representation on a screen of a program, option, or window.
from Oxford English Dictionaries

Fontawesome and the Noun Project, along with other icon providers, produce and distribute beautiful icons for free use. But sometimes, I would like to alter or create my own tiny 16×16 or 32×32 pixel wide icons without having to learn Inkscape or any other SVG editor. The figure should have one or two colors on a transparent background.

So first I will try to make an icon of a normal distribution using ggplot2. We start with a typical normal distribution on the basic ggplot2 theme.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.1
p <- ggplot(data.frame(x = c(-2, 2)), aes(x)) + stat_function(fun = dnorm)
p

In this case, we want a white icon so the curve should be white.

p <- ggplot(data.frame(x = c(-2, 2)), aes(x)) + stat_function(fun = dnorm, color = "white")
p

We can add a title.

p + ggtitle("Icon")

To make the background transparent, we can edit the panel.background and plot.background elements in theme.

p + ggtitle("Icon") +
  theme(panel.background = element_rect(fill = "transparent", colour = NA),
        plot.background = element_rect(fill = "transparent", colour = NA))

This doesn’t look very informative so we can just change one of the backgrounds temporarily so we can see the different elements.

p + ggtitle("Icon") +
  theme(panel.background = element_rect(fill = "transparent", colour = NA),
        plot.background = element_rect(fill = "grey", colour = NA))

The elements outside the actual plot are controlled by text, line, and rect (as explained here). So we can set all three to white as well.

p + ggtitle("Icon") +
  theme(panel.background = element_rect(fill = "transparent", colour = NA),
        plot.background = element_rect(fill = "grey", colour = NA),
        text = element_text(color = "white"),
        line = element_line(color = "white"),
        rect = element_rect(fill = "transparent", colour = NA))

The axis tick marks and grid lines would be extraneous in an icon and can be removed as well.

p + ggtitle("Icon") +
  theme(panel.background = element_rect(fill = "transparent", colour = NA),
        plot.background = element_rect(fill = "grey", colour = NA),
        text = element_text(color = "white"),
        line = element_line(color = "white"),
        rect = element_rect(fill = "transparent", colour = NA),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank())

We can add back the axis lines with axis.line.

p + ggtitle("Icon") +
  theme(panel.background = element_rect(fill = "transparent", colour = NA),
        plot.background = element_rect(fill = "grey", colour = NA),
        text = element_text(color = "white"),
        line = element_line(color = "white"),
        rect = element_rect(fill = "transparent", colour = NA),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        axis.line = element_line(color = "white"))

For the real plot, we want the background to be transparent rather than gray.

icon_plot <- p + ggtitle("Icon") +
  theme(panel.background = element_rect(fill = "transparent", colour = NA),
        plot.background = element_rect(fill = "transparent", colour = NA),
        text = element_text(color = "white"),
        line = element_line(color = "white"),
        rect = element_rect(fill = "transparent", colour = NA),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        axis.line = element_line(color = "white"))

We can save this resulting plot as both SVG and PNG using ggsave. Since ggsave takes width and height in inches/centimeters but we would like to produce images of a given pixel size, we can use the dpi parameter to avoid doing any math.
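
The math being avoided is simply pixels = dpi × inches; a quick check (my aside, not from the original post):

72 * 1    # 72 dpi at width = 1 inch gives a 72 px wide image
## [1] 72
32 / 72   # and a 32 px icon at dpi = 72 would need width = height = 0.444 inches
## [1] 0.4444444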

These transparent plots can also work on colored infographics as abstract figures but convey much more detail. Since ggplot2 allows for detailed control of the plot, changing text and line sizes can generate all kinds of variations.

A 72×72 pixel svg.

ggsave(filename = "icon_72px.svg", icon_plot, dpi=72, width = 1, height = 1)

A 72×72 pixel png icon. The important argument here is passing the bg = "transparent" argument into the png device.

ggsave(filename = "icon_72px.png", icon_plot, dpi=72, width = 1, height = 1, bg = "transparent")

To actually show the white icon on the page, I’m manually adding a background color.

On the other hand, if you already have an image you want to convert to an icon, we can use the magick package to use imagemagick for image processing and scaling.

library(magick)
## Linking to ImageMagick 6.9.9.14
## Enabled features: cairo, freetype, fftw, ghostscript, lcms, pango, rsvg, webp
## Disabled features: fontconfig, x11

We will use the example tiger image from the magick vignette.

tiger <- image_read_svg('http://jeroen.github.io/images/tiger.svg', width = 400)
print(tiger)
##   format width height colorspace matte filesize density
## 1    PNG   400    400       sRGB  TRUE        0   72x72

First, we convert the image to two colors only with image_convert.

tiger %>% image_convert(type = "bilevel")

Since the background is black, and we would like it transparent, we can flip the black and white.

tiger %>% image_convert(type = "bilevel") %>% image_negate()

And lastly, we can scale to pixel size with image_scale.

tiger %>% image_convert(type = "bilevel") %>% image_negate() %>% image_scale("72x")

ImageMagick has much better capabilities, and I feel that saving a full-size plot with ggsave and then scaling with magick will create better icons than using ggsave and specifying dpi as above.
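
A minimal sketch of that save-then-scale workflow, assuming the icon_plot object from earlier (file names are illustrative):

# render a large version first...
ggsave("icon_full.png", icon_plot, width = 5, height = 5, dpi = 300, bg = "transparent")

# ...then let imagemagick do the downscaling
library(magick)
image_read("icon_full.png") %>%
  image_scale("72x72") %>%
  image_write("icon_72px_scaled.png")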


My R Take in Advent of Code – Day 5

Thu, 01/03/2019 - 01:00

(This article was first published on r-tastic, and kindly contributed to R-bloggers)

There’s no time to lose, so here comes another Advent of Code puzzle solved using R. Day 5 challenge, here we come! What are we expected to do?

The polymer is formed by smaller units which, when triggered, react with each other such that two adjacent units of the same type and opposite polarity are destroyed. Units’ types are represented by letters; units’ polarity is represented by capitalization. For instance, r and R are units with the same type but opposite polarity, whereas r and s are entirely different types and do not react.

For example:
In aA, a and A react, leaving nothing behind.
In abBA, bB destroys itself, leaving aA. As above, this then destroys itself, leaving nothing.
In abAB, no two adjacent units are of the same type, and so nothing happens.
In aabAAB, even though aa and AA are of the same type, their polarities match, and so nothing happens.
Now, consider a larger example, dabAcCaCBAcCcaDA:

dabAcCaCBAcCcaDA The first ‘cC’ is removed.
dabAaCBAcCcaDA This creates ‘Aa’, which is removed.
dabCBAcCcaDA Either ‘cC’ or ‘Cc’ are removed (the result is the same).
dabCBAcaDA No further actions can be taken.

After all possible reactions, the resulting polymer contains 10 units.
How many units remain after fully reacting the polymer you scanned?

OK, this one actually doesn’t sound too bad. Basically, we need to keep removing 2-letter combinations of same lowercase and capital letters until the letter sequence stops changing in length.

Let’s have a look at the data:

library(tidyverse)
raw_input <- read_delim('day5-raw-input.txt', delim = '\t', col_names = FALSE) %>%
  as.character()
glimpse(raw_input)
## chr "nVvNOoHhiJjlLSuUvVsHhIpiIPIhHgqNVvnQGaAbBiFfbBFQqKaAkfsNxXnpPrSIikKsBcCbdDaMDdmAkKEebBNnRpPGgxXyYJjrRSvKkoOlLrR"| __truncated__

As expected, just a string of letters. How long?

# original length
nchar(raw_input)
## [1] 50000

50,000 letters long! I wish I knew that was the right length from the beginning – this puzzle took me way longer than necessary to solve because I didn’t copy the complete sequence into the .txt file!

Anyway, instead of coming up with a fancy regex pattern I hard-coded all possible combinations of letters that we want to remove from the sequence. Not elegant but very effective.

pattern <- c('Aa|aA|Bb|bB|Cc|cC|Dd|dD|Ee|eE|Ff|fF|Gg|gG|Hh|hH|Ii|iI|Jj|jJ|Kk|kK|Ll|lL|Mm|mM|Nn|nN|Oo|oO|Pp|pP|Qq|qQ|Rr|rR|Ss|sS|Tt|tT|Uu|uU|Vv|vV|Ww|wW|Xx|xX|Yy|yY|Zz|zZ')
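
As an aside, the same alternation can be built programmatically from the built-in letter vectors rather than typed out by hand (a small sketch of mine, producing an identical pattern):

# interleave "Aa","aA","Bb","bB",... and collapse with "|"
pattern <- paste0(c(rbind(paste0(LETTERS, letters), paste0(letters, LETTERS))), collapse = "|")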

Let’s test on the first example if this pattern works:

## test the example
## dabAcCaCBAcCcaDA  The first 'cC' is removed.
## dabAaCBAcCcaDA    This creates 'Aa', which is removed.
## dabCBAcCcaDA      Either 'cC' or 'Cc' are removed (the result is the same).
## dabCBAcaDA
str_remove_all('dabAcCaCBAcCcaDA', pattern) %>%
  str_remove_all(pattern) %>%
  nchar()
## [1] 10

Yes! We get the right answer after 2 rounds of ‘removals’! Now, we don’t know how many rounds need to be applied to the puzzle dataset in order to reach the final solution, so how do we go about it? It’s a perfect case for a repeat loop. In such a loop we keep repeating a specified operation until a condition is met. In our case, we want to keep applying the removal of ‘reactive units’ until the length of the letter sequence stops changing. And once that happens, we want to know how long the final (shortest) sequence is. This will be our solution to the puzzle:

# now, put it in the repeat loop
# final solution
input <- raw_input
repeat {
  output <- str_remove_all(input, pattern)
  if (nchar(input) == nchar(output)) {
    print(nchar(output)); break
  } else input <- output;
}
## [1] 11194

Now, that’s what I call an ELEGANT solution!


gganimate has transitioned to a state of release

Thu, 01/03/2019 - 01:00

(This article was first published on Data Imaginist, and kindly contributed to R-bloggers)

Just to start off the year in a positive way, I’m happy to announce that
gganimate is now available on CRAN. This release is the result of a pretty
focused development starting in the spring 2018 prior to my useR keynote about
it.

Some History

The gganimate package has been around for quite some time now with David
Robinson making the first commit
in early 2016. David’s vision of gganimate revolved around the idea of
frame-as-an-aesthetic and this easy-to-grasp idea gave it an early success. The
version developed by David never made it to CRAN, and as part of ramping down
his package development he asked me if I was interested in taking over
maintenance. I was initially reluctant because I wanted a completely different
API, but he insisted that he supported a complete rewrite. The last version of
gganimate as maintained by David is still available
but I very quickly made some drastic changes:

While this commit was done in the autumn of 2017, nothing further happened until I
decided to make gganimate the center of my useR 2018 keynote, at which point I
was forced (by myself) to have some sort of package ready by the summer of 2018.

A fair number of users have shown displeasure at the breaking changes this
history has resulted in. Many blog posts have already been written focusing on
the old API, as well as code on numerous computers that will no longer work. I
understand this frustration, of course, but both David and I agreed that doing
it this way was for the best in the end. I’m positive that the new API has
already greatly exceeded the mind-share of the old API and given a year the old
API will be all but a distant memory…

The Grammar

Such drastic breaking changes were required because of a completely different
vision for how animation fitted into the grammar of graphics. David’s idea was
that it was essentially a third dimension in the graphic and the animation was
simply flipping through slices along the third dimension in the same way as you
would look through the output of a CT scan. I, on the other hand, wanted a
grammar that existed in parallel to the grammar of graphics — not as part of it.

My useR keynote goes into a lot of detail about my motivation and inspiration for
taking on this approach, and I’ll not rehash it in this release post. Feel free
to take a 1h break from reading as you watch the talk.


The gist of it all is that animations are a multifaceted beast and require much
more than an additional aesthetic to be tamed. One of the cornerstones of the
talk is the separation of animations into scenes and segues. In short, a segue
is an animated change in the underlying laws of the graphic (e.g. changes to
coordinate systems, scales, mappings, etc.), whereas a scene is a change in
the data on display. Scenes are concerned with what and segues are concerned
with how. This separation is important for several reasons: It gives me a natural
focus area for the current version of gganimate (scenes), it serves as a
theoretical backbone to group animation operations, and it is a central limit in
animation good practices: “You should never change how and what at the same
time”.

So, the version I’m presenting here is a grammar of animation uniquely focused
on scenes. This does not mean that I’ll never look into segues, but they are
both much harder, and less important than getting a scene grammar to make sense,
so segues have to play second fiddle for now.

What’s in a scene

There are two main components to a scene: What we are looking at, and where we
are looking from. The former is handled by transitions and shadows, whereas
the latter is handled by views. In brief (a short sketch combining all three follows the list):

  • transitions populates the frames of the animation with data, based on the
    data assigned to each layer. Several different transitions exist that
    interpret the layer data differently.
  • shadows gives memory to each frame by letting each frame include data from
    prior or future frames.
  • views allow you to modify the range of the positional scales (zoom and
    pan) either directly or as a function of the data assigned to the frame.
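
Here is the short sketch mentioned above, combining all three components in one made-up animation; it is only meant to show where each piece of the grammar goes, not taken from the package docs:

library(ggplot2)
library(gganimate)

ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot() +
  transition_states(gear) +   # what: one state (frame group) per value of gear
  shadow_mark(alpha = 0.3) +  # memory: keep earlier states faintly visible
  view_follow()               # where: let the positional scales follow the data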

On top of these three main grammar components there is a range of functions to
modify how key parts of animations behave — for a general introduction to the
ins and outs of the API, please see the
Getting Started guide.

Grammar vs API

While it may appear that grammar and API are the same, this is not the case. A
grammar is a theoretical construct, a backbone from which an API can be defined.
Several APIs could implement the same grammar in multiple, incompatible, ways.
For gganimate I have tried to align the API as much as possible with the ggplot2
API, so that the line between the two packages becomes blurred. You change a
plot to an animation by adding functions from gganimate to it, and the animation
is rendered when printing the animation object in the same way as ggplots are
rendered when printing the object. An example of this is adding
transition_reveal() to a plot to make it appear gradually along a numeric
variable:

library(ggplot2)
library(gganimate)

ggplot(airquality) +
  geom_line(aes(x = Day, y = Temp, group = Month))

last_plot() + transition_reveal(Day)

For the most part, the marriage between the ggplot2 and gganimate APIs is a happy
one, though it does show at points that the ggplot2 API was never designed with
animation in mind. I am particularly pleased with how powerful the API has
turned out, and I have already seen countless uses I had never anticipated.

The Future

While this release is a milestone for gganimate, it is not a signal of it being
done
as many things are still missing (even if we ignore the whole segue part
of the grammar). It does signal a commitment to stability from now on, though, so
you should feel confident in using this package without fearing that your code
will break in the future. You can follow the state of the package at its
website, where I’ll also try to add additional guides and
tutorials with time. If you create something with gganimate please share it on
Twitter, as I’m eager to see what people will make of it.

I’ll do a sort-of live cookbook talk on gganimate at this year’s RStudio conf in
Austin, so if you are there and interested in learning more about the package, do
swing by.

Now, Go Animate!


Adding Firebase Authentication to Shiny

Thu, 01/03/2019 - 01:00

(This article was first published on Posts on Tychobra, and kindly contributed to R-bloggers)

Firebase and Shiny

Firebase is a mobile and web application development platform owned by Google. Firebase provides front-end solutions for authentication, database storage, object storage, messaging, and more. Firebase drastically reduces the time needed to develop certain types of highly scalable and secure applications.

As heavy R users, Shiny is our favorite web development tool. Shiny allows us to build apps to quickly communicate our R analyses. Shiny’s elegant reactive model makes it simple to construct complex interactive data flows.

Naturally we are interested in how Firebase can supercharge our Shiny applications. There are many ways the two tools can be used together, but, to us, the most obvious was to add Firebase authentication to a Shiny app. Here we do exactly that:

Live Example App

Source Code

and demo Video:


Instructions on how to set up the app to run locally on your computer are available in the source code README.md.

Benefits over building a custom solution in R

Firebase authentication comes with security best practices out of the box. You never send passwords to your Shiny server; they are sent directly from the front-end to Firebase. No need to worry about proper ways to encrypt and store passwords.
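
A minimal sketch of that separation (my own, not the Tychobra implementation; function and input names like signIn and fb_id_token are illustrative): the browser talks to Firebase through its JavaScript SDK, and only the resulting ID token is forwarded to Shiny.

library(shiny)

ui <- fluidPage(
  # the Firebase JS SDK handles the credentials entirely in the browser
  tags$script(src = "https://www.gstatic.com/firebasejs/5.7.0/firebase-app.js"),
  tags$script(src = "https://www.gstatic.com/firebasejs/5.7.0/firebase-auth.js"),
  tags$script(HTML("
    firebase.initializeApp({ apiKey: 'YOUR-API-KEY', authDomain: 'YOUR-PROJECT.firebaseapp.com' });
    // call this from a sign-in form elsewhere on the page
    function signIn(email, password) {
      firebase.auth().signInWithEmailAndPassword(email, password)
        .then(function(cred) { return cred.user.getIdToken(); })
        .then(function(token) { Shiny.setInputValue('fb_id_token', token); })
        .catch(function(err) { alert(err.message); });
    }
  ")),
  uiOutput("secure_content")
)

server <- function(input, output, session) {
  output$secure_content <- renderUI({
    req(input$fb_id_token)  # nothing renders until the browser reports a signed-in user
    h3("Signed in: protected content goes here")
  })
}

shinyApp(ui, server)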

Firebase authentication has built in solutions for common authentication needs (e.g. email verification, two-factor authentication, password reset, provider sign in through Google, Facebook, Twitter, etc.)

Many common authentication errors are checked automatically, and the proper error message is returned. e.g. if the user enters an email address without an “@” symbol, Firebase will respond with error code “auth/invalid-email” and an error message of “The email address is badly formatted.”

When you sign into your Firebase project on https://firebase.google.com, there is a nice UI for managing users. You can easily view registered users and click to disable or delete users. Additionally you can use Firebase Admin to programmatically manage users.

Firebase authentication naturally integrates very nicely with the other Firebase services, and Firebase (as a Google product) can be easily integrated with all of Google Cloud Platform.

Future plans for Shiny and Firebase

We are exploring additional ways to use Firebase with Shiny. Please reach out if you are interested in collaborating or have ideas of your own for how the two tools could be used together.

Disclaimer

Firebase is providing front end authentication. It is up to the developer to make sure not to send sensitive data from the Shiny server to unauthenticated or unauthorized users.


Office for Students report on “grade inflation”

Wed, 01/02/2019 - 23:41

(This article was first published on R – Let's Look at the Figures, and kindly contributed to R-bloggers)

A journalist asked me to look at a recent report, Analysis of degree classifications over time: Changes in graduate attainment.  The report was published by the UK government’s Office for Students (OfS) on 19 December 2018, along with a headline-grabbing press release:

The report uses a statistical method — the widely used method of logistic regression — to devise a yardstick by which each English university (and indeed the English university sector as a whole) is to be measured, in terms of their tendency to award the top degree classes (First Class and Upper Second Class honours degrees).  The OfS report looks specifically at the extent to which apparent “grade inflation” in recent years can be explained by changes in student-attribute data available to OfS (which include grades in pre-university qualifications, and also some other characteristics such as gender and ethnicity).

I write here as an experienced academic, who has worked at the University of Warwick (in England) for the last 15 years.  At the end, below, I will briefly express some opinions based upon that general experience (and it should be noted that everything I write here is my own — definitely not an official view from the University of Warwick!)

My specific expertise, though, is in statistical methods, and this post will focus mainly on that aspect of the OfS report.  (For a more wide-ranging critique, see for example https://wonkhe.com/blogs/policy-watch-ofs-report-on-grade-inflation/)

Parts of what I say below will get a bit technical, but I will aim to write first in a non-technical way about the big issue here, which is just how difficult it is to devise a meaningful measurement of “grade inflation” from available data.  My impression is that, unfortunately, the OfS report has either not recognised the difficulty or has chosen to neglect it.  In my view the methods used in the report are not actually fit for their intended purpose.

1.  Analysis of an idealized dataset

In much the same way as when I give a lecture, I will aim here to expose the key issue through a relatively simple, concocted example.  The real data from all universities over several years are of course quite complex; but the essence can be captured in a much smaller set of idealized data, the advantage of which is that it allows a crucial difficulty to be seen quite directly.

An imagined setup: Two academic years, two types of university, two types of student

Suppose (purely for simplicity) that there are just two identifiable types of university (or, if you prefer, just two distinct universities) — let’s call them A and B.

Suppose also (purely for simplicity) that all of the measurable characteristics of students can be encapsulated in a single binary indicator: every student is known to be either of type h or of type i, say.  (Maybe h for hardworking and i for idle?)

Now let’s imagine the data from two academic years — say the years 2010-11 and 2016-17 as in the OfS report — on the numbers of First Class and Other graduates.

The 2010-11 data looks like this, say:

             University A                 University B
           Firsts   Other              Firsts   Other
   h         1000       0        h        500     500
   i            0    1000        i        500     500

The two universities have identical intakes in 2010-11 (equal numbers of type h and type i students).  Students of type h do a lot better at University A than do students of type i; whereas University B awards a First equally often to the two types of student.

Now let’s suppose that, in the years that follow 2010-11,

  • students (who all know which type they are) learn to target the “right” university for themselves
  • both universities A and B tighten their final degree criteria, so as to make it harder (for both student types h and i) to achieve a First.

As a result of those behavioural changes, the 2016-17 data might look like this:

             University A                 University B
           Firsts   Other              Firsts   Other
   h         1800     200        h          0       0
   i            0       0        i        500    1500

Now we can combine the data from the two universities, so as to look at how degree classes across the whole university sector have changed over time:

Combined data from both universities:

              2010-11                    2016-17
           Firsts   Other             Firsts   Other
   h         1500     500       h       1800     200
   i          500    1500       i        500    1500
          -------------               -------------
 Total       2000    2000     Total     2300    1700
     %         50      50                57.5    42.5

The conclusion (not!)

The last table shown above would be interpreted, according to the methodology of the OfS report, as showing an unexplained increase of 7.5 percentage points in the awarding of first-class degrees.

(It is 7.5 percentage points because that’s the difference between 50% Firsts in 2010-11 and 57.5% Firsts in 2016-17.  And it is unexplained — in the OfS report’s terminology — because the composition of the student body was unchanged, with 50% of each type h and i in both years.)

But such a conclusion would be completely misleading.  In this constructed example, both universities actually made it harder for every type of student to get a First in 2016-17 than in 2010-11.

The real conclusion

The constructed example used above should be enough to demonstrate that the method developed in the OfS report does not necessarily measure what it intends to.

The constructed example was deliberately made both simple and quite extreme, in order to make the point as clearly as possible.  The real data are of course more complex, and patterns such as shifts in the behaviour of students and/or institutions will usually be less severe (and will always be less obvious) than they were in my constructed example.  The point of the constructed example is merely to demonstrate that any conclusions drawn from this kind of combined analysis of all universities will be unreliable, and such conclusions will often be incorrect (sometimes severely so).

That false conclusion is just an instance of Simpson’s Paradox, right?

Yes.

The phenomenon of analysing aggregate data to obtain (usually incorrect) conclusions about disaggregated behaviour is often (in Statistics) called ecological inference or the ecological fallacy.  In extreme cases, even the direction of effects can be apparently reversed (as in the example above) — and in such cases the word “paradox” does seem merited.

Logistic regression

The simple example above was (deliberately) easy enough to understand without any fancy statistical methods.  For more complex settings, especially when there are several “explanatory” variables to take into account, the method of logistic regression is a natural tool to choose (as indeed the authors of the OfS report did).

It might be thought that a relatively sophisticated tool such as logistic regression can solve the problem that was highlighted above.  But that is not the case.  The method of logistic regression, with its results aggregated as described in the OfS report, merely yields the same (incorrect) conclusions in the artificial example above.

For anyone reading this who wants to see the details: here is the full code in R, with some bits of commentary.
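
As a standalone flavour of that point (my own sketch, not the author's linked code), fitting a binomial GLM to the combined data from the constructed example:

d <- data.frame(
  year   = rep(c("2010-11", "2016-17"), each = 2),
  type   = c("h", "i", "h", "i"),
  firsts = c(1500, 500, 1800, 500),
  other  = c(500, 1500, 200, 1500)
)

fit <- glm(cbind(firsts, other) ~ year + type, family = binomial, data = d)
coef(summary(fit))["year2016-17", ]
# the year effect comes out positive and "significant" (apparent unexplained inflation),
# even though each university actually made Firsts harder to obtain in 2016-17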

2.  So, what is a better way?

The above has shown how the application of a statistical method can result in potentially very misleading results.

Unfortunately, it is hard for me (and perhaps just as hard for anyone else?) to come up with a purely statistical remedy — i.e., a better statistical method.

The problem of measuring “grade inflation” is an intrinsically difficult one to solve.  Subject-specific Boards of Examiners — which is where the degree classification decisions are actually made within universities — work very hard (in my experience) to be fair to all students, including those students who have graduated with the same degree title in previous years or decades.  This last point demands attention to the maintenance of standards through time.  Undoubtedly, though, there are other pressures in play — pressures that might still result in “grade inflation” through a gradual lowering of standards, despite the efforts of exam boards to maintain those standards.  (Such pressures could include the publication of %Firsts and similar summaries, in league tables of university courses for example.)   And even if standards are successfully held constant, there could still be apparent grade-inflation wherever actual achievement of graduates is improving over time, due to such things as increased emphasis on high-quality teaching in universities, or improvements in the range of options and the information made available to students (who can then make better choices for their degree courses).

I should admit that I do not have an answer!

3.  A few (more technical) notes

a.  For the artificial example above, I focused on the difficulty caused by aggregating university-level data to draw a conclusion about the whole sector.  But the problem does not go away if instead we want to draw conclusions about individual universities, because each university comprises several subject-specific exam boards (which is where the degree classification decisions are actually made).  Any statistical model that aims to measure successfully an aspect of behaviour (such as grade inflation) would need to consider data at the right level of disaggregation — which in this instance would be the separate Boards of Examiners within each university.

b.  Many (perhaps all?) of the reported standard errors attached to estimates in the OfS report seem, to my eye, unrealistically small.  It is unclear how they were calculated, though, so I cannot judge this reliably.  (A more general point related to this: It would be good if the OfS report’s authors could publish their complete code for the analysis, so that others can check it and understand fully what was done.)

c.  In tables D2 and D3 of the OfS report, the model’s parameterization is not made clear enough to understand it fully.  Specifically, how should the Year estimates be interpreted — do they, for example, relate to one specific university?  (Again, giving access to the analysis code would help with understanding this in full detail.)

d.  In equations E2 and E3 of the OfS report, it seems that some independence assumptions (or, at least, uncorrelatedness)  have been made.  I missed the justification for those; and it is unclear to me whether all of them are indeed justifiable.

e.  The calculation of thresholds for “significance flags” as used in the OfS report is opaque.  It is unclear to me how to interpret such statistical significance, in the present context.

4.  Opinion

This topic seems to me to be a really important one for universities to be constantly aware of, both qualitatively and quantitatively.

Unfortunately I am unconvinced that the analysis presented in this OfS report contributes any reliable insights.  This is worrying (to me, and probably to many others in academia) because the Office for Students is an important government body for the university sector.

It is especially troubling that the OfS appears to base aspects of its regulation of universities upon such a flawed approach to measurement.  As someone who has served in many boards of examiners, at various different universities in the UK and abroad (including as an external examiner when called upon), I cannot help feeling that a lot of careful work by such exam boards is in danger of simply being dismissed as “unexplained”, on the basis of some well-intentioned but inadequate statistical analysis.  The written reports of exam boards, and especially of the external examiners who moderate standards across the sector, would surely be a much better guide than that?

© David Firth, January 2019

To cite this entry:
Firth, D (2019). Office for Students report on “grade inflation”. Weblog entry at URL https://statgeek.net/2019/01/02/office-for-students-report-on-grade-inflation


Apache Drill 1.15.0 + sergeant 0.8.0 = pcapng Support, Proper Column Types & Mounds of New Metadata

Wed, 01/02/2019 - 23:24

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Apache Drill is an innovative distributed SQL engine designed to enable data exploration and analytics on non-relational datastores […] without having to create and manage schemas. […] It has a schema-free JSON document model similar to MongoDB and Elasticsearch; [a plethora of APIs, including] ANSI SQL, ODBC/JDBC, and HTTP[S] REST; [is] extremely user and developer friendly; [and, has a] pluggable architecture enables connectivity to multiple datastores.

To ring in the new year the Drill team knocked out a new 1.15.0 release with a cadre of new functionality including:

One super-helpful new feature of the REST API is that it now returns query results metadata along with the query results themselves. This means REST API endpoints finally know both column order and column type. This gave me cause to re-visit the sergeant package [GL|GH] and make some accommodations for some of these new features.

Ushering In A New Order

Drill REST API queries return a "columns" field and "metadata" field with the data itself. We can use that to force an order to the columns as well as mostly use proper types (vs JSON-parsed/guessed types). I say mostly since the package still uses jsonlite to parse the results and there’s no support for 64-bit integers in jsonlite (more on this later).

We’ll use the example from DRILL-6847 and use the example provided by Charles Givre in his Jira issue since it will let me demonstrate more of that “mostly” comment and show off another new feature:

library(sergeant) # 0.8.0 branch of sergeant on gitlab or github
library(tidyverse)

con <- src_drill("localhost")

x <- tbl(con, "cp.`employee.json`")

mutate(x, employee_id = as.integer64(employee_id)) %>%
  mutate(position_id = as.integer64(position_id)) %>%
  select(
    employee_id, full_name, first_name, last_name,
    position_id, position_title
  ) -> bigint_result

The above is (logically):

SELECT
  CAST (employee_id AS INT) AS employee_id,
  full_name,
  first_name,
  last_name,
  CAST (position_id AS BIGINT) AS position_id,
  position_title
FROM cp.`employee.json`

What do we get when we take a preview of the result?

bigint_result
## # Source:   lazy query [?? x 6]
## # Database: DrillConnection
##    employee_id full_name  first_name last_name position_id position_title
##
##  1           1 Sheri Now… Sheri      Nowmer              1 President
##  2           2 Derrick W… Derrick    Whelply             2 VP Country Man…
##  3           4 Michael S… Michael    Spence              2 VP Country Man…
##  4           5 Maya Guti… Maya       Gutierrez           2 VP Country Man…
##  5           6 Roberta D… Roberta    Damstra             3 VP Information…
##  6           7 Rebecca K… Rebecca    Kanagaki            4 VP Human Resou…
##  7           8 Kim Brunn… Kim        Brunner            11 Store Manager
##  8           9 Brenda Bl… Brenda     Blumberg           11 Store Manager
##  9          10 Darren St… Darren     Stanz               5 VP Finance
## 10          11 Jonathan … Jonathan   Murraiin           11 Store Manager
## # ... with more rows
Warning message:
One or more columns are of type BIGINT. The sergeant package currently uses
jsonlite::fromJSON() to process Drill REST API result sets. Since jsonlite does
not support 64-bit integers BIGINT columns are initially converted to numeric
since that's how jsonlite::fromJSON() works. This is problematic for many
reasons, including trying to use 'dplyr' idioms with said converted
BIGINT-to-numeric columns. It is recommended that you 'CAST' BIGINT columns to
'VARCHAR' prior to working with them from R/'dplyr'. If you really need
BIGINT/integer64 support, consider using the R ODBC interface to Apache Drill
with the MapR ODBC drivers. This informational warning will only be shown once
per R session and you can disable them from appearing by setting the
'sergeant.bigint.warnonce' option to 'FALSE' (i.e.
options(sergeant.bigint.warnonce = FALSE)).

The first thing sergeant users will notice is proper column order (before, it just returned the columns in the order they came back in the JSON rows[] structure). The second thing is that we didn’t get integer64s back. Instead, we got doubles plus an informational warning about why and what you can do about it. Said warning only displays once per session and can be silenced with the sergeant.bigint.warnonce option, i.e. just put:

options(sergeant.bigint.warnonce = FALSE)

in your script or ~/.Rprofile and you won’t hear from it again.

The as.integer64() we used is not from the bit64 package but an internal sergeant package function that knows how to translate said operation to, e.g. CAST( employee_id AS BIGINT ).
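Because sergeant sits on top of dbplyr for this interface, you should also be able to inspect the SQL it generates for a lazy query directly (a hedged aside on my part; verify against the 0.8.0 docs):

show_query(bigint_result)
# prints SQL along the lines of the CAST(... AS BIGINT) query shown above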

You can use the ODBC drivers to gain BIGINT support and there are plans for the 0.8.0 branch to eventually use rapidjsonr at the C++-level to provide direct in-package support for BIGINTs as well.

Better Error Messages

Drill query errors that the sergeant package bubbled up through its various interfaces have not been pretty or all that useful. This has changed with the 0.8.0 branch. Let’s take a look:

tbl(con, "cp.employees.json")
## # Source:   table [?? x 4]
## # Database: DrillConnection

Warning message:
VALIDATION ERROR: From line 2, column 6 to line 2, column 24: Object 'cp.employees.json' not found

Original Query:

  1: SELECT *
  2: FROM `cp.employees.json`
  3: LIMIT 10

Query Profile Error Link:
http://localhost:8047/profiles/079fc8cf-19c6-4c78-95a9-0b949a3ecf4c

As you can see in the above output, you now get a highly-formatted return value with the original SQL query broken into lines (with line numbers) and a full link to the Drill query profile so you can dig into the gnarly details of complex query issues. As you work with this and find edge cases I missed in the messages, drop an issue on your social-coding site of choice.

SUPPORT ALL THE PCAPs!

Drill has had packet capture (PCAP) file support for a while now and 1.15.0 adds support for the more modern/rich pcapng format. To enable support for this you need to add "pcapng": {"type": "pcapng", "extensions": ["pcapng"] }, to the "formats" section of your storage plugins and also configure a workspace directory to use that as the default (the principle of which is covered here).

We’ll use one of the Wireshark example captures to demonstrate:

pcaps <- tbl(con, "dfs.caps.`*.pcapng`") glimpse(pcaps) ## Observations: ?? ## Variables: 25 ## $ tcp_flags_ece_ecn_capable 0, 0, 0, 0, 0, 0, 0, 0, 0... ## $ tcp_flags_ece_congestion_experienced 0, 0, 0, 0, 0, 0, 0, 0, 0... ## $ tcp_flags_psh 0, 0, 0, 0, 0, 0, 0, 0, 0... ## $ type "TCP", "TCP", "TCP", "TCP... ## $ tcp_flags_cwr 0, 0, 0, 0, 0, 0, 0, 0, 0... ## $ dst_ip "74.125.28.139", "10.254.... ## $ src_ip "10.254.157.208", "74.125... ## $ tcp_flags_fin 1, 1, 0, 0, 0, 0, 0, 0, 0... ## $ tcp_flags_ece 0, 0, 0, 0, 0, 0, 0, 0, 0... ## $ tcp_flags 17, 17, 16, 16, 16, 0, 0,... ## $ tcp_flags_ack 1, 1, 1, 1, 1, 0, 0, 0, 0... ## $ src_mac_address "00:05:9A:3C:7A:00", "00:... ## $ tcp_flags_syn 0, 0, 0, 0, 0, 0, 0, 0, 0... ## $ tcp_flags_rst 0, 0, 0, 0, 0, 0, 0, 0, 0... ## $ timestamp 2015-04-14 07:19:25, 201... ## $ tcp_session 8.353837e+17, 8.353837e+1... ## $ packet_data "\"3DU... "ACK|FIN", "ACK|FIN", "AC... ## $ tcp_flags_ns 0, 0, 0, 0, 0, 0, 0, 0, 0... ## $ src_port 60268, 443, 60268, 58382,... ## $ packet_length 54, 54, 54, 55, 66, 78, 7... ## $ tcp_flags_urg 0, 0, 0, 0, 0, 0, 0, 0, 0... ## $ tcp_ack 662445631, 1496589825, 66... ## $ dst_port 443, 60268, 443, 29216, 5... ## $ dst_mac_address "00:11:22:33:44:55", "00:... count(pcaps, src_ip, dst_ip, sort=TRUE) ## # Source: lazy query [?? x 3] ## # Database: DrillConnection ## # Groups: src_ip ## # Ordered by: desc(n) ## src_ip dst_ip n ## ## 1 10.254.157.208 10.254.158.25 298 ## 2 10.254.158.25 10.254.157.208 204 ## 3 174.137.42.81 10.254.157.208 76 ## 4 10.254.157.208 10.254.158.8 54 ## 5 10.254.158.8 10.254.157.208 49 ## 6 74.125.28.102 10.254.157.208 49 ## 7 10.254.157.208 74.125.28.102 44 ## 8 10.254.157.208 174.137.42.81 41 ## 9 54.84.98.25 10.254.157.208 25 ## 10 157.55.56.168 10.254.157.208 25 ## # ... with more rows

More work appears to be planned by the Drill team to enable digging into the packet (binary) contents.

Drill Metadata As Data

Drill has provided ways to look up Drill operational information as actual tables, but the Drill team has now added support for even more metadata-as-data queries.

First up is finally having better access to filesystem information. Prior to 1.15.0 one could get file and path attributes as part of other queries, but now we can treat filesystems as actual data. Let’s list all the PCAPs in the above workspace:

tbl(con, "information_schema.`schemata`") %>% filter(SCHEMA_NAME == "dfs.caps") %>% print() %>% pull(SCHEMA_NAME) -> pcap_schema ## # Source: lazy query [?? x 9] ## # Database: DrillConnection ## CATALOG_NAME SCHEMA_NAME SCHEMA_OWNER TYPE IS_MUTABLE ## ## 1 DRILL dfs.caps file NO tbl(con, "information_schema.`files`") %>% filter(schema_name == pcap_schema) %>% glimpse() ## Observations: ?? ## Variables: 13 ## $ SCHEMA_NAME "dfs.caps" ## $ ROOT_SCHEMA_NAME "dfs" ## $ WORKSPACE_NAME "caps" ## $ FILE_NAME "dof-short-capture.pcapng" ## $ RELATIVE_PATH "dof-short-capture.pcapng" ## $ IS_DIRECTORY FALSE ## $ IS_FILE TRUE ## $ LENGTH 634280 ## $ OWNER "hrbrmstr" ## $ GROUP "staff" ## $ PERMISSION "rw-r--r--" ## $ ACCESS_TIME 1969-12-31 19:00:00 ## $ MODIFICATION_TIME 2019-01-01 19:12:17

The Drill system options table now has full descriptions for the options, and Drill also provides a new table that knows about all of Drill’s functions and all of your custom UDFs. drill_opts() and drill_functions() return a data frame of all this info and have an optional browse parameter which, if set to TRUE, will show an interactive DT data table for them. I find this especially handy when I forget something like regexp_like syntax (I use a lot of back-ends and many are wildly different) and can now do this:
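A minimal sketch of what that looks like (hedged: whether these helpers take the connection object as their first argument is my assumption, so check the 0.8.0 documentation):

# browse every built-in Drill function (plus your UDFs) in an interactive DT table
# and search for, e.g., "regexp_like" instead of spelunking through the Drill docs
drill_functions(con, browse = TRUE)

# same treatment for the (now fully described) system options
drill_opts(con, browse = TRUE)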

FIN

Keep on the lookout for the rapidjsonr/BIGINT integration and more new features of the sergeant package. NOTE: The better error messages have been ported over to the sergeant.caffeinated package (the RJDBC interface) and the other niceties will make their way into that package soon as well.

So, make sure you’re using the 0.8.0 branch (GL / GH), kick the tyres, and file issues wherever you’re most comfortable working.

May your queries all be optimized and your result sets complete in the new year!

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

Dataviz Course Packet Quickstart

Wed, 01/02/2019 - 21:20

(This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers)

Chapter 2 of Data Visualization walks you through setting up an R Project, and takes advantage of R Studio’s support for RMarkdown templates. That is, once you’ve created your project in R Studio, you can choose File > New File > R Markdown, like this:

Select R Markdown …

And then choose “From Template” on the left side of the dialog box that pops up, and select the “Data Visualization Notes” option on the right:

Pick a Template

Unfortunately, this option isn’t showing up for some users, I think due to a bug in one of the libraries used to install the socviz package. While that’s getting sorted out (hopefully soon), there’s also an alternative and very quick way to get a project and notes files up and running. From the console, first make sure the socviz package is loaded:

library(socviz)

Then, do this:

setup_course_notes()

This will copy a zipped folder to your Desktop and expand it into an R project containing a set of R Markdown files ready for taking notes. You’ll get a message that looks something like this:

Copied dataviz_course_notes.zip to /Users/kjhealy/Desktop and expanded it into /Users/kjhealy/Desktop/dataviz_course_notes

Your user name will most likely be different, and the destination shown may be different also depending on what kind of computer you are using.

Once it has been created, you can navigate to that dataviz_course_notes folder and open it. Inside will be a dataviz folder that looks like this:

Contents of the course packet.

Double-click on the dataviz.Rproj file and you should be good to go.

To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.

My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon

Wed, 01/02/2019 - 13:58

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

Are you wondering whether to get into the ‘R’ bus or ‘Python’ bus?
My suggestion to you is: “Why not get into the ‘R and Python’ train?”

The third edition of my book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($12.99) and Kindle ($8.99/Rs449) versions. In the third edition all code sections have been re-formatted to use the fixed-width font ‘Consolas’. This neatly aligns output that has columns (confusion matrices, data frames, etc.), making the code more readable. There is a science to formatting too, which improves the look and feel; it is little wonder that Steve Jobs had a keen passion for calligraphy! Additionally, some typos have been fixed.

In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code.
1. Practical Machine Learning with R and Python: Third Edition – Machine Learning in Stereo (Paperback: $12.99)
2. Practical Machine Learning with R and Python: Third Edition – Machine Learning in Stereo (Kindle: $8.99/Rs449)

This book is ideal both for beginners and experts in R and/or Python. Those starting their journey into data science and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in the two languages. Those who are expert in either R or Python will find the equivalent code ideal for brushing up on the other language. And finally, those who are proficient in both languages can use the R and Python implementations to internalize the ML algorithms better.

Here is a look at the topics covered

Table of Contents
Preface …………………………………………………………………………….4
Introduction ………………………………………………………………………6
1. Essential R ………………………………………………………………… 8
2. Essential Python for Datascience ……………………………………………57
3. R vs Python …………………………………………………………………81
4. Regression of a continuous variable ……………………………………….101
5. Classification and Cross Validation ………………………………………..121
6. Regression techniques and regularization ………………………………….146
7. SVMs, Decision Trees and Validation curves ………………………………191
8. Splines, GAMs, Random Forests and Boosting ……………………………222
9. PCA, K-Means and Hierarchical Clustering ………………………………258
References ……………………………………………………………………..269

Pick up your copy today!!
Hope you have a great time learning as I did while implementing these algorithms!

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….

Music listener statistics: last.fm’s last.year as an R package

Wed, 01/02/2019 - 01:36

(This article was first published on Stories by Sebastian Wolf on Medium, and kindly contributed to R-bloggers)

When I started analyzing last.fm scrobbles with the last.week and last.year functions I was always missing some plots or a plain data table. That is why I developed the package "analyzelastfm" as a simple R6 implementation. I wanted to have different album statistics, e.g. the # of plays per album divided by the # of tracks on the album. This is now implemented. To get music listening statistics you would start with:

# If you don't have it, install devtools
# install.packages("devtools")
devtools::install_github("zappingseb/analyze_last_fm")
library(analyzelastfm)

First it allows you to import your last.year data by using the last.fm REST API. For this you need a last.fm API key, which can be obtained by simply going to the last.fm API website. From there you will get the 32-character key. After receiving the key I got my last.fm data by:

api_key <- "PLEASE ENTER YOUR KEY HERE"
data <- UserData$new("zappingseb",api_key,2018)

The data object now contains the function albumstats that allows me to get the specific information I wanted. I’ll exclude the artist "Die drei ???" as it’s a kids’ detective story from Germany that I listen to a lot and whose albums are split into 1-minute tracks, which really screws up my statistics.

View(data$albumstats(
sort_by="by_album_count", # album track plays / nr tracks on album
exclude_artist="Die drei ???", # exclude my audio book favorite
exclude_album=c(""), # exclude tracks without album name
min_tracks=5) # have minimum 5 tracks on the album (NO EPs)
)

The result looks like this:

Album statistics of 2018 of zappingseb

The statistic shows n (the number of plays), count (the number of tracks on the album) and count_by_track = n/count.
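For the curious, the core of that statistic can be sketched with dplyr; this is just an illustration of the formula, not the package internals, and the scrobble column names are assumptions:

library(dplyr)

scrobbles %>%                    # assumed columns: artist, album, track
  group_by(artist, album) %>%
  summarise(n = n(), count = n_distinct(track)) %>%
  mutate(count_by_track = n / count) %>%
  arrange(desc(count_by_track))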

The top 5 albums you can find here:

In addition to such calculations, I was also interested in when I was listening to music. Therefore I added some plots to the package. The first one is the listening clock:

data$clock.plot()

Clock plot for my music history

My most important statistic, though, is which time of the year I spent listening to music. I’m most interested in some specific days and the average plays/month, which can tell me what mood I was in. Therefore I created the function daily_month_plot. It plots the average plays per month and the daily plays in one plot. The average plays per month are visualized behind the daily spikes. Here you can see that February was a really quiet month for me.

data$daily_month_plot()

For some simple statistics, I also included an aggregation plotting function called barplots. It can plot aggregations per weekday, week, month or day. The data for this function is provided by the function bar_data.

data$barplots("weekdays")

weekly (starting with week 52 of 2017):

data$barplots("week")

Here I can clearly see that I was not really listening to a lot of music during the beginning of the year and that my listening activity increased dramatically by week 18.

This post is a bit outside my usual spectrum, which mostly deals with R packages useful for Pharma. But I really like music and think music and data science make a good combination. They sound great together!

Music listener statistics: last.fm’s last.year as an R package was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

To leave a comment for the author, please follow the link and comment on their blog: Stories by Sebastian Wolf on Medium.

2018 R Views Review and Highlights

Wed, 01/02/2019 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

2018 was a good year for R Views. With a total of sixty-three posts for the year, we exceeded the pace of at least one post per week. But, it wasn’t quantity we were shooting for. Our main goal was, and continues to be, featuring thoughtful commentary on topics of interest to the R Community and in-depth technical elaboration of R language applications.

Before highlighting a few of my favorite posts for 2018, I would like to express my profound gratitude to our guest bloggers (R Community members who are not employed at RStudio), our regular RStudio contributors who sparkled with creativity while meeting committed deadlines, and you, our readers, who made it all worthwhile.

At the top of my list for 2018 posts is the interview with mathematician Noah Giansiracusa: A Mathematician’s Perspective on Topological Analysis and R. Although my name is on the byline, this is really a guest post. Professor Giansiracusa’s openness, enthusiasm, lucidity and excitement about doing mathematics and finding genuine value in using R make this post outstanding.

The three-part series of posts Statistics in Glaucoma (Part I, Part II and Part III) by statisticians Sam Berchuck and Joshua Warren establishes a new standard for posts introducing academic research. It is a masterful exposition of their quest to find a predictive model of disease progression. More than a tutorial on the underlying R packages, the series prepares interested readers to read the authors’ academic research papers.

Other noteworthy guest posts included:
* Eric Anderson: Alternative Design for Shiny
* Anqi Fu, Balasubramanian Narasimhan, Stephen Boyd: CVXR: A Direct Standardization Example
* David Kane: Player Data for the 2018 FIFA World Cup
* Roland Stevenson: In-database xgboost predictions with R
* Sebastian Wolf: How to Build a Shiny “Truck”!

Regular R Views readers will know that RStudio’s Jonathan Regenstein’s posts in his series Reproducible Finance with R are always dependable “good reads”. In addition to Jonathan’s financial analyses, his posts feature Shiny applications and elegant code applicable to many every-day programming tasks. If you have any interest at all in doing Finance with R, I highly recommend Jonathan’s new book Reproducible Finance with R: Code Flows and Shiny Apps for Portfolio Analysis which is based on his R Views posts.

The 2018 “R for the Enterprise” series featured posts by RStudio solution engineers Cole Arendt, James Blair, Kelly O’Briant, Edgar Ruiz, Nathan Stephens, and Andrie de Vries that provided sophisticated, in-depth coverage of a variety of enterprise-level topics such as Slack and Plumber Part One and Part Two and Enterprise Dashboards with R Markdown.

Finally, if in an idle moment you would like to review some of the more interesting new packages that made it to CRAN in 2018, you can peruse my monthly Top 40 picks.

All of us at R Views wish you a Happy and Prosperous New Year!

To leave a comment for the author, please follow the link and comment on their blog: R Views.

Entering and Exiting 2018

Wed, 01/02/2019 - 01:00

(This article was first published on Data Imaginist, and kindly contributed to R-bloggers)

The year is nearly over and it is the time for reflection and navel-gazing. I
don’t have incredibly profound things to say, but a lot of things happened in
2018 and this is as good a time as any to go through it all…

Picking Myself Up

The prospects of my “2017 in review”
post were not particularly rosy… I had hit somewhat of a burnout in terms of
programming, but was none the less positive and had a great job and a lot of
positive feedback on patchwork.
Further, I had RStudio::conf to look forward to, which would be my first IRL
head-to-head with the R community at large. I had also promised to present a
fully-fledged tidy approach to network analysis and while both
ggraph and
tidygraph had already been released
there were things I wanted to develop prior to presenting it. All-in-all there
was a great impediment to pick myself up and get on with developing (not arguing
that this is a fail-safe way to deal with burnout by the way).

RStudio::conf(2018L)

My trip to San Diego was amazing. If you ever get to go to an RStudio conference
I don’t think you will be disappointed (full disclosure and spoiler-alert: I now
work for RStudio). My suspicion that the R community is as amazing in real life
as on Twitter was confirmed and it was great to finally get to see all those
people I admire and look up to. My talk went fairly well I think — I haven’t
watched the recordings as I don’t particularly enjoy watching myself talk,
but you can,
if you are so inclined. At the conference I got to chat a bit with Jenny Bryan
(one of the admire/look-up-to people referenced above) and we discussed what we
were going to talk about in our respective keynotes at useR in Brisbane in the
summer. I half-jokingly said that I might talk about gganimate
because that would give me the required push to actually begin developing it…

Talk-Driven Development

Around April Dianne Cook was getting pushy with getting at least a talk title
for my keynote, and at that point I had already imagined a couple of slides on
gganimate and thought “to heck-with-it” and responded with the daunting title of
The Grammar of Animation. At that point I had still not written a single line
of code for gganimate, and knew that tweenr
would need a serious update to support what I had in mind. In addition, I knew
I had to develop what ended up as transformr
before I could begin with gganimate proper. All-in-all my talk title could not
be more stress-inducing…

Thankfully I had a pretty clear vision in my head (which was also why I wanted
to talk about it) so the motivation was there to drag me along for the ride.
Another great benefit of developing tools for data visualisation in general and
animation in particular, is that it sets Twitter on fire. After getting tweenr
and transformr into a shape sufficient to support gganimate, I began to create
the backbone of the package, and once I shared the first animation created with
it, it was clear that I was in the pursuit of something that resonated with a
lot of people.

To my great surprise I was able to get gganimate to a state where it actually
supported the main grammar I had in mind prior to useR, and I could begin to make
the presentation I had in mind:


useR was a great experience, not only because I was able to give the talk I had
hoped for, but also due to the greatness of the organisers and the attendees. I
was able to get to meet a lot of the members of R Core for the first time and
they were very supportive of my quest to improve the performance of the R
graphic stack (last slide of my talk), so I had high hopes that this might be
achievable within the next 5-10 years (it is no small task). I had been
surprised about the support for my ideas about animations and their relevance
within the R community, so in general the conference left my invigorated and
with the stamina to complete gganimate.

Intermezzo

I managed to release a couple of packages that do not fit into the narrative I’m
trying to create for this year, but they deserve a mention none the less.

In the beginning of the year I was able to finish off
particles, a port and extension of the
d3-force algorithm developed by Mike Bostock. It can be used for both great fun
and work and did among other things result in this beautiful pixel-decomposition
of Hadley:

While making improvements to tweenr in anticipation of gganimate it became clear
that colour conversion was a main bottleneck and I ended up developing
farver to improve on this. Beyond very
fast colour conversion it also allows a range of different colour distance
calculations to be performed. Some of the discussion that followed the
development of this package led to Brodie Gaslam improving the colour conversion
performance in base R and while it is not as fast as farver, it is pretty close
and future versions of R will definitely benefit from his great contribution.

I haven’t had much time to make generative art this year, but I did manage to
find time for some infrastructure work that will support my endeavours in this
space in the future. The ambient package
is able to produce all sorts of multidimensional noise in a very performant way
due to the speed of the underlying C++ library. I’m planning to expand on this
package quite a bit when I get the time as I have lots of cool ideas for how to
manipulate noise in a tidy manner.

How you use colours in data visualisation is extremely important, which is also
why the data visualisation community has embraced the viridis colour scale to
the extent that they have. I’ve personally grown tired of the aesthetic though,
so when I saw a range of perceptually uniform palettes developed by Fabio Crameri
I was quick to bring them to R with the scico
package. To my surprise the development of a colour palette package became my
most contentious contribution this year (that I know of), so I welcome everyone
who is tired of colour palette packages to ignore it altogether.

transition_hobby_work()

Prior to useR I had begun to receive some cryptic questions from Hadley and it
was clear that he was either trolling me or that something was brewing. During
the late summer it became clear that it was the latter (thankfully), as RStudio
wanted me to work full time on improving the R graphic stack. Working for
RStudio on something so aligned with my own interest is beyond what I had hoped
for, so despite my joy in working for the Danish tax authorities the switch was
a no-brainer. I wish my former office all the best — they are doing incredible
work – and look forward to seeing some of them at RStudio conf in Austin later
in the month.

Being part of the tidyverse team has so far been a great experience. I’ve been
lucky enough to meet several of them already as part of the different
conferences I attended this year, so working remotely with them doesn’t feel
that strange. It can be intimidating to work with such a talented team, but if
that is the least of my concerns I’m pretty sure I can manage that.

I look forward to share the performance improvements I’m making with all of you
throughout the coming years, and hopefully I’ll have time to also improve on
some of my packages that have received less attention during the development of
gganimate.

Happy New Year!

To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist.

Considering sensitivity to unmeasured confounding: part 1

Wed, 01/02/2019 - 01:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

Principled causal inference methods can be used to compare the effects of different exposures or treatments we have observed in non-experimental settings. These methods, which include matching (with or without propensity scores), inverse probability weighting, and various g-methods, help us create comparable groups to simulate a randomized experiment. All of these approaches rely on a key assumption of no unmeasured confounding. The problem is, short of subject matter knowledge, there is no way to test this assumption empirically.

The general approach to this problem has been to posit a level of unmeasured confounding that would be necessary to alter the conclusions of a study. The classic example (which also is probably the first) comes from the debate on the effects of smoking on lung cancer. There were some folks who argued that there was a genetic factor that was leading people to smoke and was simultaneously the cause of cancer. The great statistician Jerome Cornfield (who, I just saw on Wikipedia, happens to have shared my birthday), showed that an unobserved confounder (like a particular genetic factor) would need to lead to a 9-fold increase in the odds of smoking to explain away the association between smoking and cancer. Since such a strong factor was not likely to exist, he argued, the observed association was most likely real. (For a detailed discussion on various approaches to these kinds of sensitivity analyses, look at this paper by Liu, Kuramoto, and Stuart.)

My goal here is to think a bit more about what it means for a measured association to be sensitive to unmeasured confounding. When I originally started thinking about this, I thought that an association will be sensitive to unmeasured confounding if the underlying data generation process (DGP) actually includes an unmeasured confounder. Sure, if this is the case – that there actually is unmeasured confounding – then it is more likely that a finding will be sensitive to unmeasured confounding. But, this isn’t really that interesting, because we can’t observe the underlying DGP. And it is not necessarily the case that data sensitive to unmeasured confounding was in fact generated through some process with an unmeasured confounder.

Is there room for an alternative data generation process?

When considering sensitivity, it may be more useful to talk about the plausibility of alternative models. In this context, sensitivity is inherently related to the (1) the strength of the association of the observed exposure and outcome, and (2) the uncertainty (i.e. variability) around that association. Put succinctly, a relatively weak association with a lot of variability will be much more sensitive to unmeasured confounding than a strong association with little uncertainty. If you think in visual terms, when thinking about sensitivity, you might ask “do the data provide room for an alternative model?”

An alternative model

Let’s say we observe some exposure \(D\) and we are interested in its causal relationship with an outcome \(Y\), which we also observe. I am assuming \(D\) and \(Y\) are both continuous and normally distributed, which makes all of this work, but also limits how far I can take this. (To be more general, we will ultimately need more powerful tools, such as the R package treatSens, but more on that later.) Also, let’s assume for simplicity’s sake that there are no measured confounders – though that is not a requirement here.

With this observed data, we can go ahead and fit a simple linear regression model:

\[ Y = k_0 + k_1D,\]
where \(k_1\) is the parameter of interest, the measured association of exposure \(D\) with the outcome \(Y\). (Again for simplicity, I am assuming \(k_1 > 0\).)

The question is, is there a possible underlying data generating process where \(D\) plays a minor role or none at all? For example, is there a possible DGP that looks like this:

\[
\begin{aligned}
D &= \alpha_0 + \alpha_1 U + \epsilon_d \\
Y &= \beta_0 + \beta_1 D + \beta_2 U + \epsilon_y,
\end{aligned}
\]

where \(\beta_1 << k_1\), or perhaps \(\beta_1 = 0\)? That is, is there a process that generates the same observed distribution even though \(D\) is not a cause of \(Y\)? If so, how can we characterize that process, and is it plausible?

Simulation

The observed DGP can be defined using simstudy. We can assume that the continuous exposure \(D\) can always be normalized (by centering and dividing by the standard deviation). In this example, the coefficients \(k_0 = 0\) and \(k_1 = 1.5\), so that a unit change in the normalized exposure leads, on average, to a positive change in \(Y\) of 1.5 units:

defO <- defData(varname = "D", formula = 0, variance = 1)
defO <- defData(defO, varname = "ey", formula = 0, variance = 25)
defO <- defData(defO, varname = "Y", formula = "1.5 * D + ey", dist = "nonrandom")

We can generate the data and take a look at it:

set.seed(20181201)
dtO <- genData(1200, defO)
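The original post shows a scatterplot of the generated data at this point; a quick way to reproduce that look (the plotting choices here are mine, not the author’s) is:

library(ggplot2)

ggplot(dtO, aes(x = D, y = Y)) +
  geom_point(alpha = 0.4) +
  theme_minimal()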

 

Can we specify another DGP that removes \(D\) from the process that defines \(Y\)? The answer in this case is “yes.” Here is one such example where both \(D\) and \(Y\) are a function of some unmeasured confounder \(U\), but \(Y\) is a function of \(U\) alone. The variance and coefficient specifications for this DGP may seem a bit arbitrary (and maybe even lucky), but how I arrived at these quantities will be the focus of the second part of this post, coming soon. (My real goal here is to pique your interest.)

defA1 <- defData(varname = "U", formula = 0, variance = 1)
defA1 <- defData(defA1, varname = "ed", formula = 0, variance = 0.727)
defA1 <- defData(defA1, varname = "D", formula = "0.513 * U + ed", dist = "nonrandom")
defA1 <- defData(defA1, varname = "ey", formula = 0, variance = 20.412)
defA1 <- defData(defA1, varname = "Y", formula = "2.715 * U + ey", dist = "nonrandom")

After generating this second data set, we can see that they look pretty similar to each other:

set.seed(20181201)
dtO <- genData(1200, defO)
dtA1 <- genData(1200, defA1)

If the data are indeed similar, the covariance matrices generated by each of the data sets should also be similar, and they do appear to be:

dtO[, round(var(cbind(Y, D)), 1)]
##      Y   D
## Y 27.8 1.4
## D  1.4 1.0

dtA1[, round(var(cbind(Y, D)), 1)]
##      Y   D
## Y 26.8 1.3
## D  1.3 1.0

Non-unique data generating process

The DGP defined by defA1 is not a unique alternative. There are actually an infinite number of alternatives – here are two more, what I am calling “Alternative 2” and “Alternative 3” to go along with the first.

defA2 <- defData(varname = "U", formula = 0, variance = 1)
defA2 <- defData(defA2, varname = "ed", formula = 0, variance = 0.794)
defA2 <- defData(defA2, varname = "D", formula = "0.444 * U + ed", dist = "nonrandom")
defA2 <- defData(defA2, varname = "ey", formula = 0, variance = 17.939)
defA2 <- defData(defA2, varname = "Y", formula = "3.138 * U + ey", dist = "nonrandom")

defA3 <- defData(varname = "U", formula = 0, variance = 1)
defA3 <- defData(defA3, varname = "ed", formula = 0, variance = 0.435)
defA3 <- defData(defA3, varname = "D", formula = "0.745 * U + ed", dist = "nonrandom")
defA3 <- defData(defA3, varname = "ey", formula = 0, variance = 24.292)
defA3 <- defData(defA3, varname = "Y", formula = "1.869 * U + ey", dist = "nonrandom")

Rather than looking at plots of the four data sets generated by these equivalent processes, I fit four linear regression models based on the observed \(D\) and \(Y\). The parameter estimates and residual standard error estimates are quite close for all four:

Comparison of different data generating processes
                                  Observed    Alt 1    Alt 2    Alt 3
D                                     1.41     1.41     1.41     1.37
                                    (0.15)   (0.15)   (0.15)   (0.15)
Constant                              0.38    -0.33    -0.32    -0.33
                                    (0.15)   (0.14)   (0.14)   (0.15)
Residual Std. Error (df = 1198)       5.08     4.99     4.98     5.02
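For reference, fits like those summarized above can be produced along these lines (a sketch rather than the author’s code; rounding and layout will differ):

fits <- lapply(
  list(Observed = dtO, `Alt 1` = dtA1, `Alt 2` = dtA2, `Alt 3` = dtA3),
  function(dat) lm(Y ~ D, data = dat)
)
lapply(fits, function(f) round(coef(summary(f)), 2)) # coefficients and standard errors
sapply(fits, function(f) round(summary(f)$sigma, 2)) # residual standard errors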

 

Characterizing each data generation process

While each of the alternate DGPs lead to the same (or very similar) observed data distribution, the underlying relationships between \(U\), \(D\), and \(Y\) are quite different. In particular, if we inspect the correlations, we can see that they are quite different for each of the three alternatives. In fact, as you will see next time, all we need to do is specify a range of correlations for \(U\) and \(D\) to derive a curve that defines all the alternatives for a particular value of \(\beta_1\).

dtA1[, .(cor(U, D), cor(U, Y))]
##       V1    V2
## 1: 0.511 0.496

dtA2[, .(cor(U, D), cor(U, Y))]
##       V1    V2
## 1: 0.441 0.579

dtA3[, .(cor(U, D), cor(U, Y))]
##       V1    V2
## 1: 0.748 0.331

Less sensitivity

So, what does it mean for an observed data set to be sensitive to unmeasured confounding? I would suggest that if an equivalent derived alternative DGP is based on “lower” correlations of \(U\) and \(D\) and/or \(U\) and \(Y\), then the observed data are more sensitive. What “low” correlation is will probably depend on the subject matter. I would say that the data we have been looking at above is moderately sensitive to unmeasured confounding.

Here is an example of an observed data that might be considerably less sensitive:

defS <- updateDef(defO, changevar = "ey", newvariance = 4)

defAS <- defData(varname = "U", formula = 0, variance = 1)
defAS <- defData(defAS, varname = "ed", formula = 0, variance = 0.414)
defAS <- defData(defAS, varname = "D", formula = "0.759 * U + ed", dist = "nonrandom")
defAS <- defData(defAS, varname = "ey", formula = 0, variance = 2.613)
defAS <- defData(defAS, varname = "Y", formula = "1.907 * U + ey", dist = "nonrandom")

set.seed(20181201)
dtS <- genData(1200, defS)
dtAS <- genData(1200, defAS)

The plots look similar, as do the covariance matrices describing the observed data:

dtS[, round(var(cbind(Y, D)), 1)]
##     Y   D
## Y 6.3 1.4
## D 1.4 1.0

dtAS[, round(var(cbind(Y, D)), 1)]
##     Y   D
## Y 6.0 1.4
## D 1.4 1.0

In this case, both of the correlations in the alternative DGP are quite high, suggesting a higher bar is needed to remove the association between \(D\) and \(Y\) entirely:

dtAS[, .(cor(U, D), cor(U, Y))]
##       V1    V2
## 1: 0.762 0.754

In the second part of this post I will show how I derived the alternative DGPs, and then use that derivation to create an R function to generate sensitivity curves that allow us to visualize sensitivity in terms of the correlation parameters \(\rho_{UD}\) and \(\rho_{UY}\).

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

Nimble tweak to use specific python version or virtual environment in RStudio

Tue, 01/01/2019 - 18:58

(This article was first published on Coastal Econometrician Views, and kindly contributed to R-bloggers)

Reticulate made switching between R & Python easy, and is doing its best to facilitate both worlds of data science. Meanwhile, I noticed that most of my followers or students raised the issue of the uneasy switch between different Python versions or virtual environments in RStudio while using repl_python(). Here is my nimble tweak; I hope RStudio will come up with a better solution in the coming days, as in Spyder (pointing to an interpreter). As most of us use Rprofile.site (found in the ~/R/etc folder) for setting R environment variables, exploit the same file for setting the needed Python version or virtual environment:

# python version
Sys.setenv(RETICULATE_PYTHON = "~/python37/python.exe")

# python virtual environment
Sys.setenv(RETICULATE_PYTHON = "~/envs/stan_env/python.exe")

Currently, commenting and un-commenting the needed version in Rprofile.site and restarting the session seems to be not a bad nimble tweak for bringing up the needed Python environment in RStudio.
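After restarting the R session, you can confirm which interpreter reticulate actually picked up:

reticulate::py_config() # shows the python binary and virtualenv in use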

To leave a comment for the author, please follow the link and comment on their blog: Coastal Econometrician Views.

Seeing the wood for the trees

Tue, 01/01/2019 - 12:36

(This article was first published on R – thinkr, and kindly contributed to R-bloggers)

Visualising “bigger data”

In the blog post Criminal goings-on in a random forest, we used supervised machine learning to see how well we could predict crime in London. We began by rendering and exploring some of the many facets of the recorded crime summary data at London borough level.

There comes a point though where the many faces of the data require something more than a static visualisation. And there are alternative options. We can make “bigger data” visualisations more consumable and engaging. In this post we’ll go deeper into the original data with a more interactive and flexible approach.

Framing the questions

Suppose we want to explore the lower-level minor crime categories by borough. That would entail 1,024 mini visualisations. That’s a lot of scrolling. Or a dizzying set of Powerpoint slides.

Suppose too that we want to know which of these mini plots exhibits either the steepest increase or decrease over time. We may want to do this selectively by borough. Or by major crime category.

With a “trelliscope” approach, we can create a set of filterable and sortable panels wrapped in an engaging experience. And by anticipating the questions uppermost in the minds of the data’s ultimate consumers, we can create appropriate “cognostics” tailored to the task.
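To make that concrete, the shape of such a pipeline with trelliscopejs looks roughly like the sketch below (the crime data frame and its column names are placeholders, and the real app’s cognostics are richer than this single slope):

library(dplyr)
library(tidyr)
library(ggplot2)
library(trelliscopejs)

crime %>%                                   # placeholder columns: borough, minor_category, month, offences
  group_by(borough, minor_category) %>%
  nest() %>%
  mutate(
    cogs  = map_cog(data, ~ tibble(
      slope = cog(coef(lm(offences ~ as.numeric(month), .x))[2],
                  desc = "linear trend in monthly offences")
    )),
    panel = map_plot(data, ~ ggplot(.x, aes(month, offences)) + geom_line())
  ) %>%
  trelliscope(name = "London crime", nrow = 2, ncol = 4)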

Play with the app

trelliscopejs & rbokeh

Play with the app full-screen, and see the code, here.

R toolkit

R packages and functions used.

Package: functions used
purrr: map[1]; set_names[1]
furrr: future_map2_dfr[1]
future: plan[1]
readr: read_csv[1]; read_lines[1]
dplyr: mutate[6]; filter[4]; group_by[3]; if_else[3]; tibble[3]; summarise[2]; arrange[1]; as_tibble[1]; desc[1]; select[1]
tidyr: nest[1]; unnest[1]
stringr: str_c[4]; fixed[1]; str_count[1]; str_detect[1]; str_remove[1]; str_remove_all[1]
lubridate: month[1]; year[1]; ymd[1]
tibble: enframe[1]
rebus: lookahead[3]; whole_word[2]; lookbehind[1]; one_or_more[1]
stats: coef[1]
base: library[8]; c[2]; function[2]; list[2]; sum[2]; conflicts[1]; cumsum[1]; max[1]; min[1]; months[1]; Negate[1]; search[1]
ggplot2: element_text[3]; element_rect[2]; labs[2]; facet_wrap[1]; geom_line[1]; ggplot[1]; ggtitle[1]; theme[1]
trelliscopejs: cog[3]; map_cog[1]; map_plot[1]; trelliscope[1]
rbokeh: figure[1]; ly_lines[1]; ly_points[1]; theme_plot[1]
ggthemes: scale_colour_economist[1]; theme_economist[1]
kableExtra: kable[1]; kable_styling[1]

The post Seeing the wood for the trees appeared first on thinkr.

To leave a comment for the author, please follow the link and comment on their blog: R – thinkr.

Introducing RcppDynProg

Mon, 12/31/2018 - 21:02

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

RcppDynProg is a new Rcpp based R package that implements simple, but powerful, table-based dynamic programming. This package can be used to optimally solve the minimum cost partition into intervals problem (described below) and is useful in building piecewise estimates of functions (shown in this note).

The abstract problem

The primary problem RcppDynProg::solve_dynamic_program() is designed to solve is formally given as follows.

Minimum cost partition into intervals.

Given: a positive integer n and an n by n matrix called costs.

Find: an increasing sequence of integers soln with length(soln)==k (>=2), soln[1] == 1, and soln[k] == n+1 such that sum[i=1,...,k-1] costs[soln[i], soln[i+1]-1] is minimized.

To rephrase: costs[i,j] specifies the cost of taking the interval of integers {i,...,j} (inclusive) as a single element of our solution. The problem is to find the minimum cost partition of the set of integers {1,...,n} as a sequence of intervals. A user supplies a matrix of costs of every possible interval of integers, and the solver then finds which disjoint set of intervals covering {1,...,n} has the lowest sum of costs. The user encodes their optimization problem as a family of interval costs (n(n-1)/2 of them, which is a lot, but is tractable) and the algorithm quickly finds the best simultaneous set of intervals (there are 2^(n-1) partitions into intervals, so exhaustive search would not be practical).

We can illustrate this abstract problem as follows (if this is too abstract, please skip forward to the concrete application).

Suppose we have the following cost matrix.

costs <- matrix(c(1.5, NA, NA, 1, 0, NA, 5, -1, 1), nrow = 3)
print(costs)
#      [,1] [,2] [,3]
# [1,]  1.5    1    5
# [2,]   NA    0   -1
# [3,]   NA   NA    1

Then the optimal partition is found as follows.

library("RcppDynProg")

soln <- solve_dynamic_program(costs, nrow(costs))
print(soln)
# [1] 1 2 4

The sequence [1, 2, 4] is a just compact representation for the following sequence of intervals.

lapply(
  seq_len(length(soln) - 1),
  function(i) {
    soln[i]:(soln[i+1] - 1)
  })
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2 3

This is saying the optimal partition into intervals is the sequence of sets [{1}, {2, 3}], which has total cost costs[1,1] + costs[2,3]. The dynamic programming solver knew to take the expensive set {1} to allow the cheap set {2, 3} to be in its chosen partition. This is the essence of dynamic programming: finding an optimal global solution, even if it requires odd-looking local choices.

An application

The intended application of RcppDynProg is to find optimal piecewise solutions to single-variable modeling problems. For example consider the following data.

In the above we have an input (or independent variable) x and an observed outcome (or dependent variable) y_observed (portrayed as points). y_observed is the unobserved ideal value y_ideal (portrayed by the dashed curve) plus independent noise. The modeling goal is to get close to the y_ideal curve using the y_observed observations. Obviously this can be done with a smoothing spline, but let’s use RcppDynProg to find a piecewise linear fit.

To encode this as a dynamic programming problem we need to build a cost matrix that, for every consecutive interval of x-values, estimates the out-of-sample quality of fit. This is supplied by the function RcppDynProg::lin_costs() (using the PRESS statistic), but let’s take a quick look at the idea.

The following interval is a good interval, as all the chosen points (shown in dark blue) are in a nearly linear arrangement. The in-sample price of the interval would be the sum of squared residuals of a linear model fit on the selected region (and the out-of-sample price would be given by the PRESS statistic).

The "cost" (or loss) of this interval can be estimated as shown.

print(good_interval_indexes) # interval
# [1]  94 139

print(1 + good_interval_indexes[2] - good_interval_indexes[1]) # width
# [1] 46

fit <- lm(y_observed ~ x,
          data = d[good_interval_indexes[1]:good_interval_indexes[2], ])
sum(fit$residuals^2) # cost for interval
# [1] 2.807998

The following interval is a bad interval, as all the chosen points (shown in dark blue) are not in a nearly linear arrangement.

print(bad_interval_indexes) # interval
# [1] 116 161

print(1 + bad_interval_indexes[2] - bad_interval_indexes[1]) # width
# [1] 46

fit <- lm(y_observed ~ x,
          data = d[bad_interval_indexes[1]:bad_interval_indexes[2], ])
sum(fit$residuals^2) # cost for interval
# [1] 5.242647

The user would price all of the intervals individually, and then ask the solver to find the best simultaneous set of intervals.

The complete solution is worked as follows (using the RcppDynProg::solve_for_partition() function which wraps all the steps together, converting from indices to x-coordinates).

x_cuts <- solve_for_partition(d$x, d$y_observed, penalty = 1)
print(x_cuts)
#        x       pred group  what
# 1   0.05 -0.1570880     1  left
# 2   4.65  1.1593754     1 right
# 3   4.70  1.0653666     2  left
# 4   6.95 -0.9770792     2 right
# 5   7.00 -1.2254925     3  left
# 6   9.20  0.8971391     3 right
# 7   9.25  1.3792437     4  left
# 8  11.10 -1.1542021     4 right
# 9  11.15 -1.0418353     5  left
# 10 12.50  1.1519490     5 right
# 11 12.55  1.3964906     6  left
# 12 13.75 -1.2045219     6 right
# 13 13.80 -1.3791405     7  left
# 14 15.00  1.0195679     7 right

d$estimate <- approx(x_cuts$x, x_cuts$pred, xout = d$x,
                     method = "linear", rule = 2)$y
d$group <- as.character(
  findInterval(d$x, x_cuts[x_cuts$what == "left", "x"]))

plt2 <- ggplot(data = d, aes(x = x)) +
  geom_line(aes(y = y_ideal), linetype = 2) +
  geom_point(aes(y = y_observed, color = group)) +
  geom_line(aes(y = estimate, color = group)) +
  ylab("y") +
  ggtitle("RcppDynProg piecewise linear estimate",
          subtitle = "dots: observed values, segments: observed group means, dashed line: unobserved true values") +
  theme(legend.position = "none") +
  scale_color_brewer(palette = "Dark2")
print(plt2)

RcppDynProg::solve_for_partition() finds a partition of a relation into a number of linear estimates. Each interval is priced using out-of-sample cost via the PRESS statistic plus the specified penalty (to discourage small intervals). Notice, however, the user did not have to specify a k (or number of intervals) to get a good result.
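The penalty is the main tuning knob: it is added to each interval’s cost, so larger values discourage short segments and the solver returns fewer, longer pieces. A quick way to see the effect (illustrative only; the exact segment counts depend on the data):

nrow(solve_for_partition(d$x, d$y_observed, penalty = 1))  # more, shorter segments
nrow(solve_for_partition(d$x, d$y_observed, penalty = 10)) # fewer, longer segments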

The entire modeling procedure is wrapped as a vtreat custom-coder in the function RcppDynProg::piecewise_linear(). This allows such variable treatments to be easily incorporated into modeling pipelines (example here).

In addition to a piecewise linear solver we include a piecewise constant solver, which is demonstrated here. Other applications can include peak detection, or any other application where the per-segment metrics are independent.

The methodology

The solver is fast thanks to the use of 3 techniques:

  1. RcppDynProg::solve_for_partition() includes a problem reduction heuristic in the spirit of the parameterized complexity methodology.
  2. Ordered (or interval) partition problems are amenable to dynamic programming because initial segments of an interval partition have succinct summaries (just the right-most index and how many segments were used to get to this point).
  3. RcppDynProg is a fast C++ implementation using Rcpp.

Some basic timings show the C++ implementation can be over 200 times faster than a direct R transliteration of the same code (so not vectorized, not fully R idiomatic, some time lost to the seqi() abstraction), and over 400 times faster than a direct Python transliteration of the same code (so not optimized, and not "Pythonic"). The non-optimized and non-adapted nature of the code translations unfortunately exaggerates the speedup; however, Rcpp is likely buying us a solid factor of over 100, as C++ is going to be much more efficient at all of the index-chasing this dynamic programming solution is based on.

A note on problem complexity: general partition problems (where we do not restrict the subsets to be intervals) are NP-hard, so not thought to be amenable to efficient general solutions at scale (subset sum problems being good examples).

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Exploring 2018 R-bloggers & R Weekly Posts with Feedly & the ‘seymour’ package

Mon, 12/31/2018 - 13:36

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Well, 2018 has flown by and today seems like an appropriate time to take a look at the landscape of R bloggerdom as seen through the eyes of readers of R-bloggers and R Weekly. We’ll do this via a new package designed to make it easier to treat Feedly as a data source: seymour [GL | GH] (which is a pun-ified name based on a well-known phrase from Little Shop of Horrors).

The seymour package builds upon an introductory Feedly API blog post from back in April 2018 and covers most of the “getters” in the API (i.e. you won’t be adding anything to or modifying anything in Feedly through this package unless you PR into it with said functions). An impetus for finally creating the package came about when I realized that you don’t need a Feedly account to use the search or stream endpoints. You do get more data back if you have a developer token and can also access your own custom Feedly components if you have one. If you are a “knowledge worker” and do not have a Feedly account (and, really, a Feedly Pro account) you are missing out. But, this isn’t a rah-rah post about Feedly, it’s a rah-rah post about R! Onwards!

Feeling Out The Feeds

There are a bunch of different ways to get Feedly metadata about an RSS feed. One easy way is to just use the RSS feed URL itself:

library(seymour) # git[la|hu]b/hrbrmstr/seymour library(hrbrthemes) # git[la|hu]b/hrbrmstr/hrbrthemes library(lubridate) library(tidyverse) r_bloggers <- feedly_feed_meta("http://feeds.feedburner.com/RBloggers") r_weekly <- feedly_feed_meta("https://rweekly.org/atom.xml") r_weekly_live <- feedly_feed_meta("https://feeds.feedburner.com/rweeklylive") glimpse(r_bloggers) ## Observations: 1 ## Variables: 14 ## $ feedId "feed/http://feeds.feedburner.com/RBloggers" ## $ id "feed/http://feeds.feedburner.com/RBloggers" ## $ title "R-bloggers" ## $ subscribers 24518 ## $ updated 1.546227e+12 ## $ velocity 44.3 ## $ website "https://www.r-bloggers.com" ## $ topics data sci.... ## $ partial FALSE ## $ iconUrl "https://storage.googleapis.com/test-site-assets/X... ## $ visualUrl "https://storage.googleapis.com/test-site-assets/X... ## $ language "en" ## $ contentType "longform" ## $ description "Daily news and tutorials about R, contributed by ... glimpse(r_weekly) ## Observations: 1 ## Variables: 13 ## $ feedId "feed/https://rweekly.org/atom.xml" ## $ id "feed/https://rweekly.org/atom.xml" ## $ title "RWeekly.org - Blogs to Learn R from the Community" ## $ subscribers 876 ## $ updated 1.546235e+12 ## $ velocity 1.1 ## $ website "https://rweekly.org/" ## $ topics data sci.... ## $ partial FALSE ## $ iconUrl "https://storage.googleapis.com/test-site-assets/2... ## $ visualUrl "https://storage.googleapis.com/test-site-assets/2... ## $ contentType "longform" ## $ language "en" glimpse(r_weekly_live) ## Observations: 1 ## Variables: 9 ## $ id "feed/https://feeds.feedburner.com/rweeklylive" ## $ feedId "feed/https://feeds.feedburner.com/rweeklylive" ## $ title "R Weekly Live: R Focus" ## $ subscribers 1 ## $ updated 1.5461e+12 ## $ velocity 14.7 ## $ website "https://rweekly.org/live" ## $ language "en" ## $ description "Live Updates from R Weekly"

Feedly uses some special terms, one of which (above) is velocity. “Velocity” is simply the average number of articles published weekly (Feedly’s platform updates that every few weeks for each feed). R-bloggers has over 24,000 Feedly subscribers so any post-rankings we do here should be fairly representative. I included both the “live” and the week-based R Weekly feeds as I wanted to compare post coverage between R-bloggers and R Weekly in terms of raw content.
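Velocity also gives a rough way to estimate how much stream paging we'll need to do below. This is just a back-of-the-envelope sketch; the ~1,000-items-per-call page size is an assumption (it's what the maths in the next section relies on), not a documented limit:

# rough estimate of how many feedly_stream() pages a year of R-bloggers
# posts will take, assuming ~1,000 items come back per call
ceiling((r_bloggers$velocity * 52) / 1000)
## [1] 3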

On the other hand, R Weekly’s “weekly” RSS feed has less than 1,000 subscribers. WAT?! While I have mostly nothing against R-bloggers-proper I heartily encourage ardent readers to also subscribe to R Weekly and perhaps even consider switching to it (or at least adding the individual blog feeds they monitor to your own Feedly). It wasn’t until the Feedly API that I had any idea of how many folks were really viewing my R blog posts since we must provide a full post RSS feed to R-bloggers and get very little in return (at least in terms of data). R Weekly uses a link counter but redirects all clicks to the blog author’s site where we can use logs or analytics platforms to measure engagement. R Weekly is also run by a group of volunteers (more eyes == more posts they catch!) and has a Patreon where the current combined weekly net is likely not enough to buy each volunteer a latte. No ads, a great team and direct engagement stats for the community of R bloggers seems like a great deal for $1.00 USD. If you weren’t persuaded by the above rant, then perhaps at least consider installing this (from source that you control).

Lastly, I believe I’m that “1” subscriber to R Weekly Live O_o. But, I digress.

We’ve got the feedIds (which can be used as “stream” ids) so let’s get cracking!

Binding Up The Posts

We need to use the feedId in calls to feedly_stream() to get the individual posts. The API claims there’s a temporal parameter that allows one to get posts only after a certain date but I couldn’t get it to work (PRs are welcome on any community source code portal you’re most comfortable in if you’re craftier than I am). As a result, we need to make a guess as to how many calls we need to make for two of the three feeds. Basic maths of 44 * 52 / 1000 suggests ~3 should suffice for R Weekly (live) and R-bloggers but let’s do 5 to be safe. We should be able to get R Weekly (weekly) in one go.

r_weekly_wk <- feedly_stream(r_weekly$feedId)

range(r_weekly_wk$items$published) # my preview of this said it got back to 2016!
## [1] "2016-05-20 20:00:00 EDT" "2018-12-30 19:00:00 EST"

# NOTE: If this were more than 3 I'd use a loop/iterator
# In reality, I should make a helper function to do this for you (PRs welcome)
r_blog_1 <- feedly_stream(r_bloggers$feedId)
r_blog_2 <- feedly_stream(r_bloggers$feedId, continuation = r_blog_1$continuation)
r_blog_3 <- feedly_stream(r_bloggers$feedId, continuation = r_blog_2$continuation)

r_weekly_live_1 <- feedly_stream(r_weekly_live$feedId)
r_weekly_live_2 <- feedly_stream(r_weekly_live$feedId, continuation = r_weekly_live_1$continuation)
r_weekly_live_3 <- feedly_stream(r_weekly_live$feedId, continuation = r_weekly_live_2$continuation)

bind_rows(r_blog_1$items, r_blog_2$items, r_blog_3$items) %>%
  filter(published >= as.Date("2018-01-01")) -> r_blog_stream

bind_rows(r_weekly_live_1$items, r_weekly_live_2$items, r_weekly_live_3$items) %>%
  filter(published >= as.Date("2018-01-01")) -> r_weekly_live_stream

r_weekly_wk_stream <- filter(r_weekly_wk$items, published >= as.Date("2018-01-01"))
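(As hinted in the comment above, a paging helper would be nicer than copy/paste. Here's a minimal sketch of one, built only on feedly_stream() and the continuation token used above; the max_pages guard and the assumption that continuation comes back absent/NULL on the last page are mine, not part of seymour.)

# sketch: page through a Feedly stream until the continuation token runs out
# (or max_pages is hit), then row-bind all the items
feedly_stream_all <- function(stream_id, max_pages = 10L) {
  out <- vector("list", max_pages)
  res <- feedly_stream(stream_id)
  out[[1]] <- res$items
  i <- 1L
  while (!is.null(res$continuation) && i < max_pages) {
    i <- i + 1L
    res <- feedly_stream(stream_id, continuation = res$continuation)
    out[[i]] <- res$items
  }
  dplyr::bind_rows(out)
}

# e.g. r_blog_stream could then be built with:
# feedly_stream_all(r_bloggers$feedId, max_pages = 5L) %>%
#   filter(published >= as.Date("2018-01-01"))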

Let’s take a look:

glimpse(r_weekly_wk_stream)
## Observations: 54
## Variables: 27
## $ id                   "2nIALmjjlFcpPJKakm2k8hjka0FzpApixM7HHu8B0...
## $ originid             "https://rweekly.org/2018-53", "https://rw...
## $ fingerprint          "114357f1", "199f78d0", "9adc236e", "63f99...
## $ title                "R Weekly 2018-53 vroom, Classification", ...
## $ updated              2018-12-30 19:00:00, 2018-12-23 19:00:00,...
## $ crawled              2018-12-31 00:51:39, 2018-12-23 23:46:49,...
## $ published            2018-12-30 19:00:00, 2018-12-23 19:00:00,...
## $ alternate            ["https://rweekly.org/2018-53.html", "https...
## $ unread               TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ categories           [1, 5, 5, 3, 2, 3, 1, 2, 3, 2, 4, 3, 2, 2, ...
## $ engagementrate       0.33, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ recrawled            NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ tags                 [NULL, NULL, NULL, NULL, NULL, NULL, NULL...
## $ content_content      "Hello and welcome to this new issue! ...
## $ content_direction    "ltr", "ltr", "ltr", "ltr", "ltr", "ltr", ...
## $ origin_streamid      "feed/https://rweekly.org/atom.xml", "feed...
## $ origin_title         "RWeekly.org - Blogs to Learn R from the C...
## $ origin_htmlurl       "https://rweekly.org/", "https://rweekly.o...
## $ visual_processor     "feedly-nikon-v3.1", "feedly-nikon-v3.1", ...
## $ visual_url           "https://github.com/rweekly/image/raw/mast...
## $ visual_width         372, 672, 1000, 1000, 1000, 1001, 1000, 10...
## $ visual_height        479, 480, 480, 556, 714, 624, 237, 381, 36...
## $ visual_contenttype   "image/png", "image/png", "image/gif", "im...
## $ webfeeds_icon        "https://storage.googleapis.com/test-site-...
## $ decorations_dropbox  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

glimpse(r_weekly_live_stream)
## Observations: 1,333
## Variables: 27
## $ id                   "rhkRVQ8KjjGRDQxeehIj6RRIBGntdni0ZHwPTR8B3...
## $ originid             "https://link.rweekly.org/ckb", "https://l...
## $ fingerprint          "c11a0782", "c1897fc3", "c0b36206", "7049e...
## $ title                "Top Tweets of 2018", "My #Best9of2018 twe...
## $ crawled              2018-12-29 11:11:52, 2018-12-28 11:24:22,...
## $ published            2018-12-28 19:00:00, 2018-12-27 19:00:00,...
## $ canonical            [...
## $ unread               FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ categories           [...
## $ ampurl               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ cdnampurl            NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ engagement           NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ summary_content      "maraaverick.rbind.io ...
## $ summary_direction    "ltr", "ltr", "ltr", "ltr", "ltr", "ltr", ...
## $ origin_streamid      "feed/https://feeds.feedburner.com/rweekly...
## $ origin_title         "R Weekly Live: R Focus", "R Weekly Live: ...
## $ origin_htmlurl       "https://rweekly.org/live", "https://rweek...
## $ visual_url           NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ visual_processor     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ visual_width         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ visual_height        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ visual_contenttype   NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ decorations_dropbox  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ decorations_pocket   NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

glimpse(r_blog_stream)
## Observations: 2,332
## Variables: 34
## $ id                   "XGq6cYRY3hH9/vdZr0WOJiPdAe0u6dQ2ddUFEsTqP...
## $ keywords             ["R bloggers", "R bloggers", "R bloggers"...
## $ originid             "https://datascienceplus.com/?p=19513", "h...
## $ fingerprint          "2f32071a", "332f9548", "2e6f8adb", "3d7ed...
## $ title                "Leaf Plant Classification: Statistical Le...
## $ crawled              2018-12-30 22:35:22, 2018-12-30 19:01:25,...
## $ published            2018-12-30 19:26:20, 2018-12-30 13:18:00,...
## $ canonical            [...
## $ author               "Giorgio Garziano", "Sascha W.", "Economet...
## $ alternate            [...
## $ unread               TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ categories           [...
## $ engagement           50, 39, 482, 135, 33, 12, 13, 41, 50, 31, ...
## $ engagementrate       1.43, 0.98, 8.76, 2.45, 0.59, 0.21, 0.22, ...
## $ enclosure            [NULL, NULL, NULL, NULL, [NULL, NULL, NULL, NULL, NULL, NULL, NULL...
## $ recrawled            NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ updatecount          NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ content_content      "...
## $ content_direction    "ltr", "ltr", "ltr", "ltr", "ltr", "ltr", ...
## $ summary_content      "CategoriesAdvanced Modeling\nTags\nLinear...
## $ summary_direction    "ltr", "ltr", "ltr", "ltr", "ltr", "ltr", ...
## $ origin_streamid      "feed/http://feeds.feedburner.com/RBlogger...
## $ origin_title         "R-bloggers", "R-bloggers", "R-bloggers", ...
## $ origin_htmlurl       "https://www.r-bloggers.com", "https://www...
## $ visual_processor     "feedly-nikon-v3.1", "feedly-nikon-v3.1", ...
## $ visual_url           "https://i0.wp.com/datascienceplus.com/wp-...
## $ visual_width         383, 400, NA, 286, 456, 250, 450, 456, 397...
## $ visual_height        309, 300, NA, 490, 253, 247, 450, 253, 333...
## $ visual_contenttype   "image/png", "image/png", NA, "image/png",...
## $ webfeeds_icon        "https://storage.googleapis.com/test-site-...
## $ decorations_dropbox  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ decorations_pocket   NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

And also check how far into December each feed got as of this post (I'll check again after the 31st and update if needed):

range(r_weekly_wk_stream$published)
## [1] "2018-01-07 19:00:00 EST" "2018-12-30 19:00:00 EST"

range(r_blog_stream$published)
## [1] "2018-01-01 11:00:27 EST" "2018-12-30 19:26:20 EST"

range(r_weekly_live_stream$published)
## [1] "2018-01-01 19:00:00 EST" "2018-12-28 19:00:00 EST"

Digging Into The Weeds Feeds

In the above glimpses there’s another special term, engagement. Feedly defines this as an “indicator of how popular this entry is. The higher the number, the more readers have read, saved or shared this particular entry”. We’ll use this to look at the most “engaged” content in a bit. What’s noticeable from the start is that R Weekly Live has 1,333 entries and R-bloggers has 2,330 entries (so, nearly double the number of entries). Those counts are a bit of “fake news” when it comes to overall unique posts as can be seen by:

bind_rows(
  mutate(r_weekly_live_stream, src = "R Weekly (Live)"),
  mutate(r_blog_stream, src = "R-bloggers")
) %>%
  mutate(wk = lubridate::week(published)) -> y2018

filter(y2018, title == "RcppArmadillo 0.9.100.5.0") %>%
  select(src, title, originid, published) %>%
  gt::gt()

src             | title                     | originid                                                           | published
R Weekly (Live) | RcppArmadillo 0.9.100.5.0 | https://link.rweekly.org/bg6                                       | 2018-08-17 07:55:00
R Weekly (Live) | RcppArmadillo 0.9.100.5.0 | https://link.rweekly.org/bfr                                       | 2018-08-16 21:20:00
R-bloggers      | RcppArmadillo 0.9.100.5.0 | https://www.r-bloggers.com/?guid=f8865e8a004f772bdb64e3c4763a0fe5  | 2018-08-17 08:00:00
R-bloggers      | RcppArmadillo 0.9.100.5.0 | https://www.r-bloggers.com/?guid=3046299f73344a927f787322c867233b  | 2018-08-16 21:20:00

Feedly has many processes going on behind the scenes to identify new entries and update existing entries as original sources are modified. This "duplication" (thankfully) doesn't happen a lot:

count(y2018, src, wk, title, sort=TRUE) %>%
  filter(n > 1) %>%
  arrange(wk) %>%
  gt::gt() %>%
  gt::fmt_number(c("wk", "n"), decimals = 0)

src             | wk | title                                                                 | n
R-bloggers      |  3 | conapomx data package                                                 | 2
R Weekly (Live) |  5 | R in Latin America                                                    | 2
R Weekly (Live) | 12 | Truncated Poisson distributions in R and Stan by @ellis2013nz         | 2
R Weekly (Live) | 17 | Regression Modeling Strategies                                        | 2
R Weekly (Live) | 18 | How much work is onboarding?                                          | 2
R Weekly (Live) | 18 | Survey books, courses and tools by @ellis2013nz                       | 2
R-bloggers      | 20 | Beautiful and Powerful Correlation Tables in R                        | 2
R Weekly (Live) | 24 | R Consortium is soliciting your feedback on R package best practices  | 2
R Weekly (Live) | 33 | RcppArmadillo 0.9.100.5.0                                             | 2
R-bloggers      | 33 | RcppArmadillo 0.9.100.5.0                                             | 2
R-bloggers      | 39 | Individual level data                                                 | 2
R Weekly (Live) | 41 | How R gets built on Windows                                           | 2
R Weekly (Live) | 41 | R Consortium grant applications due October 31                        | 2
R Weekly (Live) | 41 | The Economist’s Big Mac Index is calculated with R                    | 2
R Weekly (Live) | 42 | A small logical change with big impact                                | 2
R Weekly (Live) | 42 | Maryland’s Bridge Safety, reported using R                            | 2
R-bloggers      | 47 | OneR – fascinating insights through simple rules                      | 2

In fact, it happens infrequently enough that I’m going to let the “noise” stay in the data since Feedly technically is tracking some content change.
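(If you did want to drop these near-duplicates instead of keeping the noise, a minimal sketch using the columns shown above would be to keep only the most recently crawled entry per source/title pair; that's my own suggestion and not something the rest of this analysis does.)

# sketch: keep only the most recently crawled entry per (src, title) pair
y2018_dedup <- y2018 %>%
  group_by(src, title) %>%
  arrange(desc(crawled), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()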

Let's look at the week-over-week curation counts (neither source publishes original content, so using the term "published" seems ill-fitting) for each:

count(y2018, src, wk) %>%
  ggplot(aes(wk, n)) +
  geom_segment(aes(xend=wk, yend=0, color = src), show.legend = FALSE) +
  facet_wrap(~src, ncol=1, scales="free_x") +
  labs(
    x = "Week #", y = "# Posts",
    title = "Weekly Post Curation Stats for R-bloggers & R Weekly (Live)"
  ) +
  theme_ft_rc(grid="Y")

Despite R-bloggers having curated more overall content, there’s plenty to read each week for consumers of either/both aggregators.

Speaking of consuming, let’s look at the distribution of engagement scores for both aggregators:

group_by(y2018, src) %>%
  summarise(v = list(broom::tidy(summary(engagement)))) %>%
  unnest()
## # A tibble: 2 x 8
##   src             minimum    q1 median  mean    q3 maximum    na
## 1 R Weekly (Live)       0     0    0     0       0       0   1060
## 2 R-bloggers            1    16   32.5  58.7    70    2023     NA

Well, it seems that it’s more difficult for Feedly to track engagement for the link-only R Weekly (Live) feed, so we’ll have to focus on R-bloggers for engagement views. Summary values are fine, but we can get a picture of the engagement distribution (we’ll do it monthly to get a bit more granularity, too):

filter(y2018, src == "R-bloggers") %>%
  mutate(month = lubridate::month(published, label = TRUE, abbr = TRUE)) %>%
  ggplot(aes(month, engagement)) +
  geom_violin() +
  ggbeeswarm::geom_quasirandom(
    groupOnX = TRUE, size = 2, color = "#2b2b2b", fill = ft_cols$green,
    shape = 21, stroke = 0.25
  ) +
  scale_y_comma(trans = "log10") +
  labs(
    x = NULL, y = "Engagement Score",
    title = "Monthly Post Engagement Distributions for R-bloggers Curated Posts",
    caption = "NOTE: Y-axis log10 Scale"
  ) +
  theme_ft_rc(grid="Y")

I wasn't expecting each month's distribution to be so similar. There are definitely outliers in terms of positive engagement, so we should be able to see what types of R-focused content pique the interest of the ~25,000 Feedly subscribers of R-bloggers.

filter(y2018, src == "R-bloggers") %>%
  group_by(author) %>%
  summarise(
    n_posts = n(),
    total_eng = sum(engagement),
    avg_eng = mean(engagement),
    med_eng = median(engagement)
  ) %>%
  arrange(desc(n_posts)) %>%
  slice(1:20) %>%
  gt::gt() %>%
  gt::fmt_number(c("n_posts", "total_eng", "avg_eng", "med_eng"), decimals = 0)

author                                   | n_posts | total_eng | avg_eng | med_eng
David Smith                              |     116 |     9,791 |      84 |      47
John Mount                               |      94 |     4,614 |      49 |      33
rOpenSci – open tools for open science   |      89 |     2,967 |      33 |      19
Thinking inside the box                  |      85 |     1,510 |      18 |      14
R Views                                  |      60 |     4,142 |      69 |      47
hrbrmstr                                 |      55 |     1,179 |      21 |      16
Dr. Shirin Glander                       |      54 |     2,747 |      51 |      25
xi’an                                    |      49 |       990 |      20 |      12
Mango Solutions                          |      42 |     1,221 |      29 |      17
Econometrics and Free Software           |      33 |     2,858 |      87 |      60
business-science.io – Articles           |      31 |     4,484 |     145 |      70
NA                                       |      31 |     1,724 |      56 |      40
statcompute                              |      29 |     1,329 |      46 |      33
Ryan Sheehy                              |      25 |     1,271 |      51 |      45
Keith Goldfeld                           |      24 |     1,305 |      54 |      43
free range statistics – R                |      23 |       440 |      19 |      12
Jakob Gepp                               |      21 |       348 |      17 |      13
Tal Galili                               |      21 |     1,587 |      76 |      22
Jozef’s Rblog                            |      18 |     1,617 |      90 |      65
arthur charpentier                       |      16 |     1,320 |      82 |      68

It is absolutely no surprise David comes in at number one in both post count and almost every engagement summary statistic since he's a veritable blogging machine and creates + curates some super interesting content (whereas yours truly doesn't even make the median engagement cut).

What were the most engaging posts?

filter(y2018, src == "R-bloggers") %>%
  arrange(desc(engagement)) %>%
  mutate(published = as.Date(published)) %>%
  select(engagement, title, published, author) %>%
  slice(1:50) %>%
  gt::gt() %>%
  gt::fmt_number(c("engagement"), decimals = 0)

engagement | title                                                                                       | published  | author
2,023      | Happy Birthday R                                                                            | 2018-08-27 | eoda GmbH
1,132      | 15 Types of Regression you should know                                                      | 2018-03-25 | ListenData
697        | R and Python: How to Integrate the Best of Both into Your Data Science Workflow            | 2018-10-08 | business-science.io – Articles
690        | Ultimate Python Cheatsheet: Data Science Workflow with Python                              | 2018-11-18 | business-science.io – Articles
639        | Data Analysis with Python Course: How to read, wrangle, and analyze data                   | 2018-10-31 | Andrew Treadway
617        | Machine Learning Results in R: one plot to rule them all!                                  | 2018-07-18 | Bernardo Lares
614        | R tip: Use Radix Sort                                                                       | 2018-08-21 | John Mount
610        | Data science courses in R (/python/etc.) for $10 at Udemy (Sitewide Sale until Aug 26th)   | 2018-08-24 | Tal Galili
575        | Why R for data science – and not Python?                                                   | 2018-12-02 | Learning Machines
560        | Case Study: How To Build A High Performance Data Science Team                              | 2018-09-18 | business-science.io – Articles
516        | R 3.5.0 is released! (major release with many new features)                                | 2018-04-24 | Tal Galili
482        | R or Python? Why not both? Using Anaconda Python within R with {reticulate}                | 2018-12-30 | Econometrics and Free Software
479        | Sankey Diagram for the 2018 FIFA World Cup Forecast                                        | 2018-06-10 | Achim Zeileis
477        | 5 amazing free tools that can help with publishing R results and blogging                  | 2018-12-22 | Jozef’s Rblog
462        | What’s the difference between data science, machine learning, and artificial intelligence? | 2018-01-09 | David Robinson
456        | XKCD “Curve Fitting”, in R                                                                  | 2018-09-28 | David Smith
450        | The prequel to the drake R package                                                         | 2018-02-06 | rOpenSci – open tools for open science
449        | Who wrote that anonymous NYT op-ed? Text similarity analyses with R                        | 2018-09-07 | David Smith
437        | Elegant regression results tables and plots in R: the finalfit package                     | 2018-05-16 | Ewen Harrison
428        | How to implement neural networks in R                                                      | 2018-01-12 | David Smith
426        | Data transformation in #tidyverse style: package sjmisc updated #rstats                    | 2018-02-06 | Daniel
413        | Neural Networks Are Essentially Polynomial Regression                                      | 2018-06-20 | matloff
403        | Custom R charts coming to Excel                                                            | 2018-05-11 | David Smith
379        | A perfect RStudio layout                                                                   | 2018-05-22 | Ilya Kashnitsky
370        | Drawing beautiful maps programmatically with R, sf and ggplot2 — Part 1: Basics            | 2018-10-25 | Mel Moreno and Mathieu Basille
368        | The Financial Times and BBC use R for publication graphics                                 | 2018-06-27 | David Smith
367        | Dealing with The Problem of Multicollinearity in R                                         | 2018-08-16 | Perceptive Analytics
367        | Excel is obsolete. Here are the Top 2 alternatives from R and Python.                      | 2018-03-13 | Appsilon Data Science Blog
365        | New R Cheatsheet: Data Science Workflow with R                                             | 2018-11-04 | business-science.io – Articles
361        | Tips for analyzing Excel data in R                                                         | 2018-08-30 | David Smith
360        | Importing 30GB of data in R with sparklyr                                                  | 2018-02-16 | Econometrics and Free Software
358        | Scraping a website with 5 lines of R code                                                  | 2018-01-24 | David Smith
356        | Clustering the Bible                                                                       | 2018-12-27 | Learning Machines
356        | Finally, You Can Plot H2O Decision Trees in R                                              | 2018-12-26 | Gregory Kanevsky
356        | Geocomputation with R – the afterword                                                      | 2018-12-12 | Rstats on Jakub Nowosad’s website
347        | Time Series Deep Learning: Forecasting Sunspots With Keras Stateful LSTM In R              | 2018-04-18 | business-science.io – Articles
343        | Run Python from R                                                                          | 2018-03-27 | Deepanshu Bhalla
336        | Machine Learning Results in R: one plot to rule them all! (Part 2 – Regression Models)     | 2018-07-24 | Bernardo Lares
332        | R Generation: 25 Years of R                                                                | 2018-08-01 | David Smith
329        | How to extract data from a PDF file with R                                                 | 2018-01-05 | Packt Publishing
325        | R or Python? Python or R? The ongoing debate.                                              | 2018-01-28 | tomaztsql
322        | How to perform Logistic Regression, LDA, & QDA in R                                        | 2018-01-05 | Prashant Shekhar
321        | Who wrote the anti-Trump New York Times op-ed? Using tidytext to find document similarity  | 2018-09-06 | David Robinson
311        | Intuition for principal component analysis (PCA)                                           | 2018-12-06 | Learning Machines
310        | Packages for Getting Started with Time Series Analysis in R                                | 2018-02-18 | atmathew
309        | Announcing the R Markdown Book                                                             | 2018-07-13 | Yihui Xie
307        | Automated Email Reports with R                                                             | 2018-11-01 | JOURNEYOFANALYTICS
304        | future.apply – Parallelize Any Base R Apply Function                                       | 2018-06-23 | JottR on R
298        | How to build your own Neural Network from scratch in R                                     | 2018-10-09 | Posts on Tychobra
293        | RStudio 1.2 Preview: SQL Integration                                                       | 2018-10-02 | Jonathan McPherson

Weekly & monthly curated post descriptive statistic patterns haven't changed much since the April post:

filter(y2018, src == "R-bloggers") %>%
  mutate(wkday = lubridate::wday(published, label = TRUE, abbr = TRUE)) %>%
  count(wkday) %>%
  ggplot(aes(wkday, n)) +
  geom_col(width = 0.5, fill = ft_cols$slate, color = NA) +
  scale_y_comma() +
  labs(
    x = NULL, y = "# Curated Posts",
    title = "Day-of-week Curated Post Count for the R-bloggers Feed"
  ) +
  theme_ft_rc(grid="Y")

filter(y2018, src == "R-bloggers") %>%
  mutate(month = lubridate::month(published, label = TRUE, abbr = TRUE)) %>%
  count(month) %>%
  ggplot(aes(month, n)) +
  geom_col(width = 0.5, fill = ft_cols$slate, color = NA) +
  scale_y_comma() +
  labs(
    x = NULL, y = "# Curated Posts",
    title = "Monthly Curated Post Count for the R-bloggers Feed"
  ) +
  theme_ft_rc(grid="Y")

Surprisingly, monthly post count consistency (or even posting something each month) is not a common trait amongst the top 20 (by total engagement) authors:

w20 <- scales::wrap_format(20)

filter(y2018, src == "R-bloggers") %>%
  filter(!is.na(author)) %>% # some posts don't have author attribution
  mutate(author_t = map_chr(w20(author), paste0, collapse="\n")) %>% # we need to wrap for facet titles (below)
  count(author, author_t, wt=engagement, sort=TRUE) %>% # get total author engagement
  slice(1:20) %>% # top 20
  { .auth_ordr <<- . ; . } %>% # we use the order later
  left_join(filter(y2018, src == "R-bloggers"), "author") %>%
  mutate(month = lubridate::month(published, label = TRUE, abbr = TRUE)) %>%
  count(month, author_t, sort = TRUE) %>%
  mutate(author_t = factor(author_t, levels = .auth_ordr$author_t)) %>%
  ggplot(aes(month, nn, author_t)) +
  geom_col(width = 0.5) +
  scale_x_discrete(labels=substring(month.abb, 1, 1)) +
  scale_y_comma() +
  facet_wrap(~author_t) +
  labs(
    x = NULL, y = "Curated Post Count",
    title = "Monthly Curated Post Counts-per-Author (Top 20 by Engagement)",
    subtitle = "Arranged by Total Author Engagement"
  ) +
  theme_ft_rc(grid="yY")

Overall, most authors favor shorter titles for their posts:

filter(y2018, src == "R-bloggers") %>%
  mutate(
    `Character Count Distribution` = nchar(title),
    `Word Count Distribution` = stringi::stri_count_boundaries(title, type = "word")
  ) %>%
  select(id, `Character Count Distribution`, `Word Count Distribution`) %>%
  gather(measure, value, -id) %>%
  ggplot(aes(value)) +
  ggalt::geom_bkde(alpha=1/3, color = ft_cols$slate, fill = ft_cols$slate) +
  scale_y_continuous(expand=c(0,0)) +
  facet_wrap(~measure, scales = "free") +
  labs(
    x = NULL, y = "Density",
    title = "Title Character/Word Count Distributions",
    subtitle = "~38 characters/11 words seems to be the sweet spot for most authors",
    caption = "Note Free X/Y Scales"
  ) +
  theme_ft_rc(grid="XY")

This post is already kinda tome-length so I’ll leave it to y’all to grab the data and dig in a bit more.

A Word About Using The content_content Field For R-bloggers Posts

Since R-bloggers requires a full feed from contributors, they, in turn, post a "kinda" full feed back out. I say "kinda" as they still haven't fixed a reported bug in their processing engine which causes issues in (at least) Feedly's RSS processing engine. If you use Feedly, take a look at the R-bloggers RSS feed entry for the recent "R or Python? Why not both? Using Anaconda Python within R with {reticulate}" post. It cuts off near "Let's check its type:". This is due to the way the < character is processed by the R-bloggers ingestion engine: it mangles the ## console output in the original post, doesn't display correctly on the R-bloggers page itself, and turns the descriptive output into an actual HTML tag. It's really an issue on both sides, but R-bloggers is doing the mangling and should seriously consider addressing it in 2019.

Since it is still not fixed, it forces you to go to R-bloggers (clicks FTW? and may partly explain why that example post has a 400+ engagement score) unless you scroll back up to the top of the Feedly view and go to the author's blog page. Given that tibble output invariably has a < right up top, your best bet for getting more direct views of your own content is to get a code-block with printed ## < output in it as close to the beginning as possible (perhaps start each post with a print(tbl_df(mtcars))?).

Putting post-view-hacking levity aside, this content mangling means you can’t trust the content_content column in the stream data frame to have all the content; that is, if you were planning on taking the provided data and doing some topic clustering or content-based feature extraction for other stats/ML ops you’re out of luck and need to crawl the original site URLs on your own to get the main content for such analyses.
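(If you do end up crawling, a minimal sketch of that fallback might look like the following. It assumes the first element of the alternate list-column holds the original post URL, as used elsewhere in this post, and it leans on xml2/rvest, which are not part of seymour; the "article" selector is a generic guess you'd want to tune per site.)

library(rvest) # assumed helper package, not part of seymour

# sketch: fetch an original post page and pull out its main text
fetch_post_text <- function(url) {
  pg <- tryCatch(xml2::read_html(url), error = function(e) NULL)
  if (is.null(pg)) return(NA_character_)
  nodes <- html_nodes(pg, "article") # generic guess; many blogs need a site-specific selector
  if (length(nodes) == 0) nodes <- html_nodes(pg, "body")
  paste(html_text(nodes, trim = TRUE), collapse = "\n")
}

# alternate is a list-column; the first element holds the post URL
# alt_urls <- purrr::map_chr(r_blog_stream$alternate, ~.x[[1]])
# full_text <- purrr::map_chr(head(alt_urls, 5), fetch_post_text)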

A Bit More About seymour

The seymour package has the following API functions:

  • feedly_access_token: Retrieve the Feedly Developer Token
  • feedly_collections: Retrieve Feedly Collections
  • feedly_feed_meta: Retrieve Metadata for a Feed
  • feedly_opml: Retrieve Your Feedly OPML File
  • feedly_profile: Retrieve Your Feedly Profile
  • feedly_search_contents: Search content of a stream
  • feedly_search_title: Find feeds based on title, url or ‘#topic’
  • feedly_stream: Retrieve contents of a Feedly “stream”
  • feedly_tags: Retrieve List of Tags

along with the following helper function (which we'll introduce in a minute):

  • render_stream: Render a Feedly Stream Data Frame to RMarkdown

and, the following helper reference (as Feedly has some “universal” streams):

  • global_resource_ids: Global Resource Ids Helper Reference
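As a quick illustration of the discovery side of that list: feed search doesn't require a token, so something along these lines should surface R-related feeds to add to your Feedly. Treat the call as a sketch; the exact arguments aren't shown in this post, so check the package help for the real signature.

# sketch: find feeds whose title/url/#topic mention R
found <- feedly_search_title("rstats")
# str(found, max.level = 1)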

The render_stream() function is semi-useful on its own but was designed as more of a “you may want to replicate this on your own” (i.e. have a look at the source code and riff off of it). “Streams” are individual feeds, collections or even “boards” you make and with this new API package and the power of R Markdown, you can make your own “newsletter” like this:

fp <- feedly_profile() # get profile to get my id

# use the id to get my "security" category feed in my feedly
fs <- feedly_stream(sprintf("user/%s/category/security", fp$id))

# get the top 10 items with engagement >= third quartile of all posts
# and don't include duplicates in the report
mutate(fs$items, published = as.Date(published)) %>%
  filter(published >= as.Date("2018-12-01")) %>%
  filter(engagement > fivenum(engagement)[4]) %>%
  filter(!is.na(summary_content)) %>%
  mutate(alt_url = map_chr(alternate, ~.x[[1]])) %>%
  distinct(alt_url, .keep_all = TRUE) %>%
  slice(1:10) -> for_report

# render the report
render_stream(
  feedly_stream = for_report,
  title = "Cybersecurity News",
  include_visual = TRUE,
  browse = TRUE
)

Which makes the following Rmd and HTML. (So, no need to “upgrade” to “Teams” to make newsletters!).
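The same pattern works with the streams pulled earlier in this post. For example, a year-end "best of R-bloggers" report could be as simple as the sketch below, which only reuses columns already shown in r_blog_stream (the title and the top-25 cutoff are mine):

# sketch: render the 25 most-engaging 2018 R-bloggers posts as a report
mutate(r_blog_stream, published = as.Date(published)) %>%
  filter(!is.na(summary_content)) %>%
  arrange(desc(engagement)) %>%
  mutate(alt_url = map_chr(alternate, ~.x[[1]])) %>%
  distinct(alt_url, .keep_all = TRUE) %>%
  slice(1:25) -> best_of_2018

render_stream(
  feedly_stream = best_of_2018,
  title = "Best of R-bloggers 2018",
  include_visual = TRUE,
  browse = TRUE
)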

FIN

As noted, the 2018 data for R Weekly (Live) & R-bloggers is available and you can find the seymour package on [GL | GH].

If you’re not a Feedly user I strongly encourage you to give it a go! And, if you don’t subscribe to R Weekly, you should make that your first New Year’s Resolution.

Here’s looking to another year of great R content across the R blogosphere!


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
