R news and tutorials contributed by (750) R bloggers

### In case you missed it: July 2017 roundup

Fri, 08/11/2017 - 17:08

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from July of particular interest to R users.

A tutorial on using the rsparkling package to apply H2O's algorithms to data in HDInsight.

Several exercises to learn parallel programming with the foreach package.

A presentation on the R6 class system, by Winston Chang.

Introducing "joyplots", a ggplot2 extension for visualizing multiple time series or distributions (with a nod to Joy Division).

SQL Server 2017, with many new R-related capabilities, is nearing release.

Ali Zaidi on using neural embeddings with R and Spark to analyze Github comments.

R ranks #6 in the 2017 IEEE Spectrum Top Programming Languages.

Course materials on "Data Analysis for the Life Sciences", from Rafael Irizarry.

How to securely store API keys in R scripts with the "secret" package.

An in-depth tutorial on implementing neural network algorithms in R.

Recordings from the useR!2017 conference in Brussels are now available. The conference was also livestreamed.

Recording of my lightning talk, "R in Minecraft".

Uwe Ligges' useR!2017 keynote, "20 years of CRAN".

A framework for implementing a credit risk prediction system with Microsoft R.

High school students build a facial-recognition drone using Microsoft Cognitive Services and Raspberry Pi.

The R Consortium is conducting a survey of R users.

The R GUI framework "rattle" now supports XGBoost models.

My presentation at useR!2017, "R, Then and Now", on how perceptions of R have changed over the years.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Fri, 08/11/2017 - 09:54

(This article was first published on Blog, and kindly contributed to R-bloggers)

Nathan Eastwood, Data Scientist

I was recently introduced to the Twitter Waterflow Problem and I decided it was interesting enough to try and complete the challenge in R.

Consider the following picture:

This plot shows a series of walls and empty valleys. We can represent this picture by an array of integers, where the value at each index is the height of the wall. So in this case, the array of integers can be defined as:

wall <- c(2, 5, 1, 2, 3, 4, 7, 7, 6)

Now imagine it rains. How much water will accumulate in puddles between the walls? No puddles form at the edges of the wall; water there is considered to simply run off the edge. We count volume in square blocks of 1×1. Thus, we are left with a puddle between column 2 and column 7, and the volume is 10.

The Approach

The approach I took was one of many ways you could solve this problem. I chose to treat each column (wall) as an index and then for each index I implement a loop:

• Find the height of the current index
• Find the maximum height of the walls to the left of the current index
• Find the maximum height of the walls to the right of the index

I then work out the smaller of the two maximum heights to the left and right of the current index. If this smaller height minus the current index height is greater than zero, then I know how many blocks will fill with water at the current index. Of course, if the smaller maximum height to the left or right is less than the current height, then I get run-off. Converting this to code looks like this:

len <- length(wall)
# pre-allocate memory to make the loop more efficient
water <- rep(0, len)
for (i in seq_along(wall)) {
  currentHeight <- wall[i]
  maxLeftHeight <- if (i > 1) {
    max(wall[1:(i - 1)])
  } else {
    0
  }
  maxRightHeight <- if (i == len) {
    0
  } else {
    max(wall[(i + 1):len])
  }
  smallestMaxHeight <- min(maxLeftHeight, maxRightHeight)
  water[i] <- if (smallestMaxHeight - currentHeight > 0) {
    smallestMaxHeight - currentHeight
  } else {
    0
  }
}

water
[1] 0 0 4 3 2 1 0 0 0

This returns a vector of the number of blocks of water at each index.
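As an aside (not part of the original approach), the same answer can be computed without an explicit loop: the water level at each index is the smaller of the running maxima scanning from the left and from the right, minus the wall height. A vectorized sketch using cummax():

```r
wall <- c(2, 5, 1, 2, 3, 4, 7, 7, 6)
# highest wall seen so far, scanning from the left and from the right
left_max  <- cummax(wall)
right_max <- rev(cummax(rev(wall)))
# water sits up to the lower of the two bounding maxima
water <- pmin(left_max, right_max) - wall
water       # 0 0 4 3 2 1 0 0 0
sum(water)  # 10
```

Because the edge elements are their own running maxima, the subtraction yields zero there, matching the "water runs off the edges" rule.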

The R6 Class

For this problem I chose to use the R6 class system because it is very self contained. The R6 class system is different from the functional S3 and S4 class systems found in base R in that it is an encapsulated class system. Some key differences between the two are:
Functional:

• Objects contain data
• Class methods are separate from objects
• Objects are immutable – they cannot be changed after they have been created

Encapsulated:

• Objects contain methods and data
• Objects are mutable
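To illustrate the mutability difference with a minimal sketch (my example, not from the post): an R6 object's method can change the object's own fields in place via self, and assignment creates a reference rather than a copy.

```r
library(R6)

Counter <- R6Class("Counter",
  public = list(
    n = 0,
    # add() mutates the object in place through self
    add = function() {
      self$n <- self$n + 1
      invisible(self)
    }
  )
)

x <- Counter$new()
y <- x    # y is a reference to the same object, not a copy
x$add()
y$n       # 1 -- the mutation is visible through both references
```

With a functional (S3/S4) design, the equivalent "method" would return a modified copy and `y` would still hold the old value.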

Winston Chang details these differences very well in his useR!2017 talk.

I wanted the user to be able to acquire two key pieces of information: the total blocks of water and the filled plot. I therefore created three public methods inside an R6 class and placed the class in a simple package which can be found here. Public methods are a list of functions (and/or non-functions) that are essentially methods (or data) of the class, intended to be used by the user of the class.

When writing a new R6 class, we often want to perform some initial work when we instantiate an object of the class. This is true in our case: when we instantiate a new object of class waterflow, we want it to calculate the heights of the water straight away. We therefore turn our earlier code into a function called initialize and place it inside the public list.

initialize = function(wall = NULL) {
  if (is.null(wall)) {
    stop("Please provide some wall heights")
  }
  if (!is.numeric(wall)) {
    stop("Please provide a numeric vector")
  }
  len <- length(wall)
  water <- rep(0, len)
  for (i in seq_along(wall)) {
    currentHeight <- wall[i]
    maxLeftHeight <- if (i > 1) {
      max(wall[1:(i - 1)])
    } else {
      0
    }
    maxRightHeight <- if (i == len) {
      0
    } else {
      max(wall[(i + 1):len])
    }
    smallestMaxHeight <- min(maxLeftHeight, maxRightHeight)
    water[i] <- if (smallestMaxHeight - currentHeight > 0) {
      smallestMaxHeight - currentHeight
    } else {
      0
    }
  }
  private$heights <- private$tidyWater(water, wall)
}

Note: there is one key difference here: we are assigning the results to members of the private list within our object. Once we have calculated our water heights, we use them along with our wall heights in the function private$tidyWater() and assign the resulting data.frame to private$heights. The private argument is a list containing methods (and/or data) that are internal to the class and not intended to be used by the user. We do similar things when writing packages: we explicitly export the functionality we want other people to use and don't export functionality that is only used within the package itself.

So to use the class, we first instantiate it with some data in a call to $new(), which in turn runs the initialize function above.

library(waterflow)
wall <- c(2, 5, 1, 2, 3, 4, 7, 7, 6)
p <- waterflow$new(wall)

So now we have an object called p which is of class waterflow (and R6). p contains the data, as well as the (public) methods we can perform on that data.

Typically, R6 objects have a default print method that lists all members of the object, but here there is a custom $print() function that returns heights instead.

p
   pos  type val
1    1 water   0
2    2 water   0
3    3 water   4
4    4 water   3
5    5 water   2
6    6 water   1
7    7 water   0
8    8 water   0
9    9 water   0
10   1  wall   2
11   2  wall   5
12   3  wall   1
13   4  wall   2
14   5  wall   3
15   6  wall   4
16   7  wall   7
17   8  wall   7
18   9  wall   6

We can still inspect the members of the object by looking at its structure.

str(p)
Classes 'waterflow', 'R6'
  Public:
    clone: function (deep = FALSE)
    initialize: function (wall = NULL)
    plot: function ()
    print: function ()
    total: function ()
  Private:
    heights: data.frame
    tidyWater: function (water, wall)

To calculate the total water that fills the valleys, we sum the water values in the heights object.

total = function() {
  sum(private$heights[private$heights$type %in% "water", "val"])
}

Calling the public method $total() gives the expected result.

p$total()
[1] 10

To call the plot, we access the public method $plot(). p$plot()

The completed class looks like this

waterflow <- R6Class(
  "waterflow",
  public = list(
    initialize = function(wall = NULL) {
      if (is.null(wall)) {
        stop("Please provide some wall heights")
      }
      if (!is.numeric(wall)) {
        stop("Please provide a numeric vector")
      }
      len <- length(wall)
      water <- rep(0, len)
      for (i in seq_along(wall)) {
        currentHeight <- wall[i]
        maxLeftHeight <- if (i > 1) {
          max(wall[1:(i - 1)])
        } else {
          0
        }
        maxRightHeight <- if (i == len) {
          0
        } else {
          max(wall[(i + 1):len])
        }
        smallestMaxHeight <- min(maxLeftHeight, maxRightHeight)
        water[i] <- if (smallestMaxHeight - currentHeight > 0) {
          smallestMaxHeight - currentHeight
        } else {
          0
        }
      }
      private$heights <- private$tidyWater(water, wall)
    },
    plot = function() {
      ggplot(private$heights) +
        geom_col(
          aes(x = pos + 1 / 2, y = val, fill = type),
          width = 1,
          show.legend = FALSE
        ) +
        scale_fill_manual(values = c("dodgerblue2", "grey50")) +
        scale_x_continuous(breaks = seq(0, max(private$heights$pos), 1)) +
        theme(
          panel.background = element_blank(),
          panel.ontop = TRUE,
          panel.grid.minor.x = element_blank(),
          panel.grid.minor.y = element_line(colour = "white", size = 0.5),
          panel.grid.major.x = element_line(colour = "white", size = 0.5),
          panel.grid.major.y = element_line(colour = "white", size = 0.5),
          axis.ticks = element_blank(),
          text = element_blank()
        )
    },
    print = function() print(private$heights),
    total = function() {
      sum(private$heights[private$heights$type %in% "water", "val"])
    }
  ),
  private = list(
    heights = NULL,
    tidyWater = function(water, wall) {
      data.frame(
        pos = seq_along(wall),
        type = factor(
          rep(c("water", "wall"), each = length(wall)),
          levels = c("water", "wall")
        ),
        val = c(water, wall)
      )
    }
  )
)

Disadvantages of R6

Overall, R6 is a great package with one or two downsides, but these are mostly not too worrying. For one thing, R6 is a separate package that users have to load or use as an import, but given that it is such a lightweight package and doesn't have any imports of its own, it's nothing to be concerned about.
Also, R6 doesn't have any strict type checking, but this is remedied by including your own type checking in your methods. The main bugbear I have with R6 is the lack of out-of-the-box support from roxygen2, and this has been an open issue for a while.

To leave a comment for the author, please follow the link and comment on their blog: Blog.

### #8: Customizing Spell Checks for R CMD check

Fri, 08/11/2017 - 04:29

(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)

Welcome to the eighth post in the ramblingly random R rants series, or R4 for short. We took a short break over the last few weeks due to some conferencing followed by some vacationing and general chill. But we're back now, and this post gets us back to the initial spirit of (hopefully) quick and useful posts.
Perusing yesterday's batch of CRANberries posts, I noticed a peculiar new directory shown in the diffstat output we use to compare two subsequent source tarballs. It was entitled .aspell/, in the top-level directory, and appeared in two new packages by R Core member Kurt Hornik himself.

The context is, of course, the not-infrequently-expressed desire to customize the spell checking done on CRAN incoming packages; see e.g. this r-package-devel thread. And now we can, as I verified with (the upcoming next release of) RcppArmadillo, along with a recent-enough (i.e. last few days) version of r-devel.

Just copying what Kurt did is all it takes: add a file .aspell/defaults.R, and in it point to an rds file (named after the package) containing a character vector of words added to the spell checker's universe. For my package, see here for the particulars. Or see here:

edd@bud:~/git/rcpparmadillo/.aspell(master)$ cat defaults.R
Rd_files <- vignettes <- R_files <- description <-
  list(encoding = "UTF-8", language = "en",
       dictionaries = c("en_stats", "RcppArmadillo"))
edd@bud:~/git/rcpparmadillo/.aspell(master)$ r -p -e 'readRDS("RcppArmadillo.rds")'
[1] "MPL"            "Sanderson"      "Templated"
[4] "decompositions" "onwards"        "templated"
edd@bud:~/git/rcpparmadillo/.aspell(master)$
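One way to produce such an rds file (a sketch; the word list is from the listing above, and the file name must match the package name) is simply:

```r
# build the custom word list and serialize it next to defaults.R
words <- c("MPL", "Sanderson", "Templated",
           "decompositions", "onwards", "templated")
saveRDS(words, file = ".aspell/RcppArmadillo.rds")
```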

And now R(-devel) CMD check --as-cran ... is silent about spelling. Yay!

But take this with a grain of salt as this does not yet seem to be "announced" as e.g. yesterday’s change in the CRAN Policy did not mention it. So things may well change — but hey, it worked for me.

And since this is all about aspell, here is something topical about a spell to close the post:

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

### RStudio v1.1 Preview: Terminal

Fri, 08/11/2017 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Today we’re excited to announce availability of our first Preview Release for RStudio 1.1, a major new release which includes the following new features:

• A Connections tab which makes it easy to connect to, explore, and view data in a variety of databases.
• A Terminal tab which provides fluid shell integration with the IDE, xterm emulation, and even support for full-screen terminal applications.
• An Object Explorer which can navigate deeply nested R data structures and objects.
• A new, modern dark theme and Retina-quality icons throughout.
• Improvements to the RStudio API which add power and flexibility to RStudio add-ins and packages.
• RStudio Server Pro support for floating licensing, notifications, self-service session management, a library of professional ODBC drivers, and more.
• Dozens of other small improvements and bugfixes.

You can try out these new features in the RStudio Preview Release.

Terminal

Over the next few weeks we’ll be blogging about each of these new features. We start today with an overview of the integrated support for full-featured system terminals via the Terminal tab.

The Terminal tab provides access to the system shell within the RStudio IDE. Potential uses include advanced source-control operations, execution of long-running jobs, remote logins, and interactive full-screen terminal applications (e.g. text editors, terminal multiplexers).

Opening Terminals

The Terminal tab is next to the Console tab. Switch to the Terminal tab to automatically create a new terminal, ready to accept commands. If the tab isn’t visible, show it via Shift+Alt+T or the Tools -> Terminal -> New Terminal menu. Here’s a terminal with the output of some simple commands:

Support for xterm enables use of full-screen programs:

Additional terminal sessions can be started using the New Terminal command on the terminal drop-down menu, or via Shift+Alt+T.

Each terminal session is independent, with its own system shell and buffer. Switch between them using the arrows next to the drop-down menu or by clicking on the terminal’s name in that menu.

Programs running in a terminal do not block the rest of the RStudio user-interface, so you can continue working in RStudio even when the terminal is busy. On Mac, Linux, or Server, a busy terminal will have (busy) next to its name, and the close [x] changes to a stop button:

If there is a busy terminal (Mac, Linux, or Server), trying to exit RStudio (or performing any other operation that will stop the current R session) will give a warning. Proceeding will kill the running programs.

Run in Terminal

When editing a shell script in RStudio, the Run Selected Line(s) command (Cmd+Enter on Mac / Ctrl+Enter on others) executes the current line, or selection, in the current terminal. This can be used to step through a shell script line-by-line and observe the results in the terminal.

Here’s an example where Cmd+Enter was hit three times, with focus on the editor and the cursor starting on the first line.

In other text file types, including R source files, the new Send to Terminal command (Cmd+Alt+Enter on Mac, Ctrl+Alt+Enter on others) may be invoked to send the current selection to the current terminal. This can be handy for other languages with a command-line interpreter. Below, Python was started in the current terminal, then Cmd+Alt+Enter was used to step through each line of the Python source file.

Closing Terminals

To close a terminal session, use the Close Terminal command on the Terminal dropdown menu, click the [x] on the far-right of the Terminal pane toolbar, or exit from within the shell itself.

If the Terminal tab is not useful to your workflows, simply click the [x] on the tab itself to close it, and it will not reappear in future RStudio sessions. To restore the tab, start a terminal via the Tools/Terminal/New Terminal menu command.

Terminal Options

Various aspects of the terminal can be configured with the new Terminal Options pane. Invoke with Tools/Global Options… and click on the Terminal icon.

Windows-Specific Shell Options

On the RStudio IDE for Microsoft Windows, you can select between Git-Bash, Command Prompt, Windows PowerShell, or the Windows Subsystem for Linux. The choices available depend on what is installed on the system.

We look forward to seeing how people use the Terminal tab in RStudio 1.1. If you want to give it a test drive, please download the RStudio Preview Release.

We hope you try out the preview and let us know how we can make it better.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

### New R Course: Spatial Statistics in R

Thu, 08/10/2017 - 21:00

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

Hey R users! Here’s another course launched this week: Spatial Statistics in R by Barry Rowlingson.

Everything happens somewhere, and increasingly the place where all these things happen is being recorded in a database. There is some truth behind the oft-repeated statement that 80% of data have a spatial component. So what can we do with this spatial data? Spatial statistics, of course! Location is an important explanatory variable in so many things – be it a disease outbreak, an animal’s choice of habitat, a traffic collision, or a vein of gold in the mountains – that we would be wise to include it whenever possible. This course will start you on your journey of spatial data analysis. You’ll learn what classes of statistical problems present themselves with spatial data, and the basic techniques of how to deal with them. You’ll see how to look at a mess of dots on a map and bring out meaningful insights.

Take me to chapter 1!

Spatial Statistics in R features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you an expert in spatial statistics!

What you’ll learn:

Chapter 1: Introduction

After a quick review of spatial statistics as a whole, you’ll go through some point-pattern analysis. You’ll learn how to recognize and test different types of spatial patterns.

Chapter 2: Point Pattern Analysis

Point Pattern Analysis answers questions about why things appear where they do. The things could be trees, disease cases, crimes, lightning strikes – anything with a point location.

Chapter 3: Areal Statistics

So much data is collected in administrative divisions that there are specialized techniques for analyzing them. This chapter presents several methods for exploring data in areas.

Chapter 4: Geostatistics

Originally developed for the mining industry, geostatistics covers the analysis of location-based measurement data. It enables model-based interpolation of measurements with uncertainty estimation.

Learn all there is to know about Spatial Statistics in R today!


To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog.

### The Trapezoidal Rule of Numerical Integration in R

Thu, 08/10/2017 - 20:00

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

Part 8 in the series Numerical Analysis

The Trapezoidal Rule is another of the Closed Newton-Cotes formulas for approximating the definite integral of a function. The trapezoidal rule is so named because the area under the integrand is approximated by a trapezoid. Although much more accurate quadrature methods exist, the trapezoidal rule converges rather quickly by comparison and is known to be extremely accurate when approximating the definite integral of periodic functions.

The Trapezoidal Rule

The trapezoidal rule is a 2-point Closed Newton-Cotes formula that is based somewhat on the midpoint rule, in which the interval is divided into subintervals of equal width. On a single interval [a, b] with width h = b − a, the area of the trapezoid under the endpoints is:

(h/2) (f(a) + f(b))

Summed over the points, the approximation becomes:

∫ₐᵇ f(x) dx ≈ (h/2) (f(a) + f(b))

Which is the definition of the Trapezoidal Rule for a uniformly-spaced grid of points. The error term is −(h³/12) f''(ξ) for some ξ ∈ (a, b), which indicates the error can be computed exactly if f ∈ C²[a, b] (the function is twice differentiable on the interval). The Trapezoidal Rule with the error term applied becomes:

∫ₐᵇ f(x) dx = (h/2) (f(a) + f(b)) − (h³/12) f''(ξ)

Trapezoidal Rule Example

We will use the Trapezoidal Rule to approximate the following definite integral:

∫₀^(π/4) x sin(x) dx

The following image depicts how the trapezoidal rule approximates the integral of the function in the interval. The darker area represents the actual area under the function. The trapezoid rule approximates the area under the function by constructing the trapezoid and calculating its area.

# Replicate the function in R
f <- function(x) {
  return(x * sin(x))
}

# Create a data frame to store all the points needed for the approximation trapezoid
df <- data.frame(cbind(c(0, pi/4, pi/4, 0), c(0, f(pi/4), 0, 0)))

# Plot the function and its approximation by the trapezoidal rule
ggplot(data = data.frame(x = 0), mapping = aes(x = x)) +
  stat_function(fun = f, size = 1.05, alpha = 0.75, color = 'blue') +
  geom_segment(aes(x = 0, y = 0, xend = pi/4, yend = f(pi/4))) +
  geom_segment(aes(x = pi/4, y = 0, xend = pi/4, yend = f(pi/4))) +
  geom_polygon(data = df, aes(x = X1, y = X2), fill = 'blue', alpha = 0.2) +
  geom_area(stat = 'function', fun = f, fill = 'black', alpha = 0.3, xlim = c(0, pi/4)) +
  xlim(c(-0.5, 1))

Approximating this integral with the Trapezoidal Rule:

(pi/4)/2 * ((0 * sin(0)) + (pi/4 * sin(pi/4)))
## [1] 0.2180895

Integrating the function over the interval yields sin(π/4) − (π/4) cos(π/4) ≈ 0.15175, which gives an error of 0.0663395.
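As a quick sanity check (my addition, not from the original post), R's built-in adaptive quadrature via integrate() confirms both the exact value and the size of the trapezoidal error:

```r
f <- function(x) x * sin(x)

# "exact" value via adaptive quadrature
exact <- integrate(f, 0, pi/4)$value   # ~0.15175

# trapezoidal approximation and its error
approx <- (pi/4) / 2 * (f(0) + f(pi/4))
approx - exact                         # ~0.0663
```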

For a second example, apply the Trapezoidal Rule to approximate the definite integral:

∫₀¹ x² e^(−x) dx

Visualizing how the trapezoidal rule approximates the integral:

f2 <- function(x) {
  return(x^2 * exp(-x))
}

df <- data.frame(cbind(c(0, 1, 1, 0), c(0, f2(1), 0, 0)))

ggplot(data = data.frame(x = 0), mapping = aes(x = x)) +
  stat_function(fun = f2, size = 1.05, alpha = 0.75, color = 'blue') +
  geom_segment(aes(x = 0, y = 0, xend = 1, yend = f2(1))) +
  geom_segment(aes(x = 1, y = 0, xend = 1, yend = f2(1))) +
  geom_polygon(data = df, aes(x = X1, y = X2), fill = 'blue', alpha = 0.2) +
  geom_area(stat = 'function', fun = f2, fill = 'black', alpha = 0.3, xlim = c(0, 1)) +
  xlim(c(0, 1))

0.5 * ((0^2 * exp(-0)) + (1^2 * exp(-1)))
## [1] 0.1839397

2 - 5/exp(1)
## [1] 0.1606028

The function integrated over the interval is 2 − 5/e ≈ 0.1606028. Therefore the error is 0.0233369.

Trapezoidal Rule in R

The Trapezoidal Rule is comparatively straightforward to implement in R.

trapezoid <- function(f, a, b) {
  if (is.function(f) == FALSE) {
    stop('f must be a function with one parameter (variable)')
  }
  h <- b - a
  fxdx <- (h / 2) * (f(a) + f(b))
  return(fxdx)
}

Test our implementation using the functions in the examples above.

trapezoid(f, 0, pi/4)
## [1] 0.2180895

trapezoid(f2, 0, 1)
## [1] 0.1839397

We see the function reports the same values as our manual calculations. Though the Trapezoidal Rule is straightforward to implement and is rather accurate in its approximations, it begins to fail (as do the other Newton-Cotes formulas) when applied over large integration intervals. A quick example using the function f2 integrated over the interval [0, 10]:

trapezoid(f2, 0, 10)
## [1] 0.02269996

# The function integrated over the interval
2 - (122/exp(10))
## [1] 1.994461

The error becomes much higher as the integration interval increases. Composite methods, which implement a piecewise approach to numerical integration, are therefore generally preferred and used in actual practice.

Composite Trapezoidal Rule

The Composite Trapezoidal Rule, similar to other composite methods, divides the integration interval [a, b] into n subintervals. The Trapezoidal Rule is then performed on each of those subintervals. As before, we let the function be twice differentiable in the interval, or more formally, f ∈ C²[a, b]. We also let h = (b − a)/n and x_j = a + jh for each j = 0, 1, …, n. The Composite Trapezoidal Rule, with its error term, is defined as:

∫ₐᵇ f(x) dx = (h/2) [f(a) + 2 Σⱼ₌₁ⁿ⁻¹ f(x_j) + f(b)] − ((b − a)/12) h² f''(μ)

Where μ ∈ (a, b). As the Trapezoidal Rule only requires one interval in each iteration over the subintervals, n can be either even or odd.

Let's say we are interested in approximating the following integral using the Composite Trapezoidal Rule with n = 8 subintervals:

∫₀² e^(2x) sin(3x) dx

Replicate the function in R.

f3 <- function(x) {
  return(exp(2 * x) * sin(3 * x))
}

Visualize the composite trapezoid rule and how it divides the interval into subintervals.

# Break the interval into n subintervals
seg <- seq.int(0, 2, length.out = 9)

# Initialize an empty vector to store function outputs
fx <- vector(length = length(seg))

# For each subinterval, calculate the function
for (i in 1:length(seg)) {
  fx[i] <- f3(seg[i])
}

# Collect the needed points into a data.frame
df <- data.frame(xend = seg,
                 y = rep(0, 9),
                 yend = fx,
                 yend1 = c(fx[2:9], fx[9]),
                 xend1 = c(seg[2:9], seg[9]))

# Plot the function and its approximation
ggplot(data = df) +
  stat_function(fun = f3, size = 1.05, alpha = 0.75, color = 'blue') +
  geom_segment(aes(x = xend, y = y, xend = xend, yend = yend)) +
  geom_segment(aes(x = xend, y = yend, xend = xend1, yend = yend1)) +
  geom_ribbon(aes(x = xend, ymin = y, ymax = yend), fill = 'blue', alpha = 0.3) +
  geom_area(stat = 'function', fun = f3, fill = 'black', alpha = 0.3, xlim = c(0, 2)) +
  xlim(c(-0.5, 2))

The following function is an implementation of the composite trapezoidal rule.

composite.trapezoid <- function(f, a, b, n) {
  if (is.function(f) == FALSE) {
    stop('f must be a function with one parameter (variable)')
  }
  h <- (b - a) / n
  j <- 1:(n - 1)  # interior points x_1, ..., x_(n-1)
  xj <- a + j * h
  approx <- (h / 2) * (f(a) + 2 * sum(f(xj)) + f(b))
  return(approx)
}

Use the function to perform the composite trapezoid procedure.

composite.trapezoid(f3, 0, 2, 8)
## [1] -13.57598

Integrating the function over the interval gives the actual value:

∫₀² e^(2x) sin(3x) dx = [e^(2x) (2 sin(3x) − 3 cos(3x)) / 13]₀² ≈ −14.214

The error is therefore approximately 0.638.

Though the error is relatively large in this case, the composite trapezoid rule was more accurate than the trapezoid rule.

trapezoid(f3, 0, 2)
## [1] -15.25557

The error is approximately 1.042. The wider the integration interval, the less accurate the trapezoid rule becomes. Since this example considered the interval [0, 2], the difference in accuracy between the composite and standard trapezoidal rules is not as drastic as it would be with a wider interval.
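To make the comparison concrete (a sketch of my own, not from the original post), we can measure both errors against R's adaptive quadrature, restating the two rules compactly so the block is self-contained:

```r
# compact restatements of the two rules from the post
trapezoid <- function(f, a, b) (b - a) / 2 * (f(a) + f(b))
composite.trapezoid <- function(f, a, b, n) {
  h  <- (b - a) / n
  xj <- a + (1:(n - 1)) * h          # interior grid points
  (h / 2) * (f(a) + 2 * sum(f(xj)) + f(b))
}

f3 <- function(x) exp(2 * x) * sin(3 * x)
exact <- integrate(f3, 0, 2)$value   # ~ -14.214

abs(composite.trapezoid(f3, 0, 2, 8) - exact)  # ~0.64
abs(trapezoid(f3, 0, 2) - exact)               # ~1.04
```

Increasing n in composite.trapezoid() shrinks the error roughly with h², as the error term suggests.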

References

Burden, R. L., & Faires, J. D. (2011). Numerical analysis (9th ed.). Boston, MA: Brooks/Cole, Cengage Learning.

Weisstein, Eric W. “Trapezoidal Rule.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/TrapezoidalRule.html

The post The Trapezoidal Rule of Numerical Integration in R appeared first on Aaron Schlegel.


To leave a comment for the author, please follow the link and comment on their blog: R – Aaron Schlegel.

### How to create interactive data visualizations with ggvis

Thu, 08/10/2017 - 18:36

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

INTRODUCTION

The ggvis package is used to make interactive data visualizations. The fact that it combines shiny’s reactive programming model and dplyr’s grammar of data transformation makes it a useful tool for data scientists.

The package allows us to implement features like interactivity; on the other hand, every interactive ggvis plot must be connected to a running R session.

PACKAGE INSTALLATION & DATA FRAME

The first thing you have to do is install and load the ggvis package with:
install.packages("ggvis")
library(ggvis)

We also need a data set to work with. The dataset we chose is “Cars93”, which contains data on 93 cars on sale in the USA in 1993; it can be found in the MASS package, which of course must be installed and loaded too. To install and load that package and attach the “Cars93” dataset, use:
install.packages("MASS")
library(MASS)
data("Cars93")
attach(Cars93)

You can use head(Cars93) in order to see the variables of your dataset.

Furthermore, you need to install and load the shiny and magrittr packages with:
install.packages("shiny")
library(shiny)
install.packages("magrittr")
library(magrittr)

NOTE: Because all ggvis graphics are web graphics, if you’re not using RStudio (which provides a built-in viewer), you’ll notice that plots open in your web browser.

The ggvis() function

The first thing we have to do is call ggvis(). The first argument is the data set that we want to plot, and the second describes which variables we will use. Look at the example below.
plot1 <- ggvis(Cars93, x = ~Length, y = ~Wheelbase)

This doesn’t plot anything because you haven’t set how to display your data. The example below creates a scatterplot:
layer_points(plot1)

There is an alternative, and arguably more practical, way to produce the same result with the %>% pipe operator from the magrittr package, as in the example below:
Cars93 %>%
ggvis(x = ~Length, y = ~Wheelbase) %>%
layer_points()

We use ~ before a variable name to indicate that we don’t want to use the literal value of the variable, but rather the variable of that name inside the dataset. We will use this notation from now on. Since the first two arguments are positional, it also lets us drop the x = and y = names.

You can add more variables to the plot by mapping them to other visual properties like fill, stroke, size and shape. Look at the examples below.
Cars93 %>% ggvis(~Length, ~Wheelbase, stroke = ~EngineSize) %>% layer_points() #stroke
Cars93 %>% ggvis(~Length, ~Wheelbase, fill = ~EngineSize) %>% layer_points() #fill
Cars93 %>% ggvis(~Length, ~Wheelbase, size = ~EngineSize) %>% layer_points() #size
Cars93 %>% ggvis(~Length, ~Wheelbase, shape = ~factor(Passengers)) %>% layer_points()#shape

If you want to make the points a fixed colour, size or shape, you need to use := instead of =. Look at the examples below.
Cars93 %>% ggvis(~Length, ~Wheelbase, fill := "yellow", stroke := "black") %>% layer_points() #fill
Cars93 %>% ggvis(~Length, ~Wheelbase, size := 350, opacity := 0.5) %>% layer_points() #size
Cars93 %>% ggvis(~Length, ~Wheelbase, shape := "cross") %>% layer_points() #shape

Learn more about using ggvis in the online course R: Complete Data Visualization Solutions. In this course you will learn how to:

• Work extensively with the ggvis package and its functionality
• Learn what visualizations exist for your specific use case
• And much more

Interaction

You can map visual properties to variables or set them to specific values, but it is far more interesting to connect them to interactive controls. Look at the example with two sliders below:
Cars93 %>%
ggvis(~Length, ~Wheelbase,
size := input_slider(1, 100),
opacity := input_slider(0, 1)
) %>%
layer_points()

You can also add interactivity to other plot parameters, such as the width and center of histogram bins, as in the example below:
Cars93%>%
ggvis(~Length) %>%
layer_histograms(width = input_slider(0, 2, step = 0.10, label = "width"),
center = input_slider(0, 2, step = 0.05, label = "center"))

Besides input_slider(), ggvis also provides input_checkbox(), input_checkboxgroup(), input_numeric(), input_radiobuttons(), input_select() and input_text().
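As a sketch of how one of these other controls slots in, the example below (mine, not from the original post) uses input_select() to let the viewer pick a fixed point colour from a dropdown; like every interactive ggvis plot, it needs a running R session to render:

```r
library(MASS)   # for Cars93
library(ggvis)

# a fixed property (:=) driven by a dropdown instead of a slider
Cars93 %>%
  ggvis(~Length, ~Wheelbase,
        fill := input_select(c("red", "green", "blue"),
                             label = "Point colour")) %>%
  layer_points()
```

The same pattern works for any of the input_* controls: pass the control where a fixed value or mapped variable would otherwise go.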

You can also use keyboard controls with left_right() and up_down(). The following example shows how to control the size of the points by pressing the left and right keyboard controls.
arrow <- left_right(10, 1000, step = 50)  # range and step chosen for illustration
Cars93 %>% ggvis(~Length, ~Wheelbase, size := arrow, opacity := 0.5) %>% layer_points()

With tooltips you can add more complex interactivity:
Cars93 %>% ggvis(~Length, ~Wheelbase) %>%
layer_points() %>%
add_tooltip(function(df) df$Length)

### dplyrXdf 0.10.0 beta prerelease

Thu, 08/10/2017 - 18:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

I’m happy to announce that version 0.10.0 beta of the dplyrXdf package is now available. You can get it from Github:
install_github("RevolutionAnalytics/dplyrXdf", build_vignettes=FALSE)

This is a major update to dplyrXdf that adds the following features:

• Support for the tidyeval framework that powers the latest version of dplyr
• Works with Spark and Hadoop clusters and files in HDFS
• Several utility functions to ease working with files and datasets
• Many bugfixes and workarounds for issues with the underlying RevoScaleR functions

This (pre-)release of dplyrXdf requires Microsoft R Server or Client version 8.0 or higher, and dplyr 0.7 or higher. If you’re using R Server, dplyr 0.7 won’t be in the MRAN snapshot that is your default repo, but you can get it from CRAN:
install.packages("dplyr", repos="https://cloud.r-project.org")

The tidyeval framework

This completely changes the way in which dplyr handles standard evaluation. Previously, if you wanted to program with dplyr pipelines, you had to use special versions of the verbs ending with "_": mutate_, select_, and so on.
You then provided inputs to these verbs via formulas or strings, in a way that was almost but not quite entirely unlike normal dplyr usage. For example, if you wanted to programmatically carry out a transformation on a given column in a data frame, you did the following:

x <- "mpg"
transmute_(mtcars, .dots=list(mpg2=paste0("2 * ", x)))
# mpg2
#1 42.0
#2 42.0
#3 45.6
#4 42.8
#5 37.4

This is prone to errors, since it requires creating a string and then parsing it. Worse, it's also insecure, as you can't always guarantee that the input string won't be malicious.

The tidyeval framework replaces all of that. In dplyr 0.7, you call the same functions for both interactive use and programming. The equivalent of the above in the new framework would be:

# the rlang package implements the tidyeval framework used by dplyr
library(rlang)
x_sym <- sym(x)
transmute(mtcars, mpg2=2 * (!!x_sym))
# mpg2
#1 42.0
#2 42.0
#3 45.6
#4 42.8
#5 37.4

Here, the !! symbol is a special operator that means to get the column name from the variable to its right. The verbs in dplyr 0.7 understand the special rules for working with quoted symbols introduced in the new framework. The same code also works in dplyrXdf 0.10:

# use the new as_xdf function to import to an Xdf file
mtx <- as_xdf(mtcars)
transmute(mtx, mpg2=2 * (!!x_sym)) %>% as.data.frame
# mpg2
#1 42.0
#2 42.0
#3 45.6
#4 42.8
#5 37.4

For more information about tidyeval, see the dplyr vignettes on programming and compatibility.

New features in dplyrXdf

Copy, move and delete Xdf files

The following functions let you manipulate Xdf files as files:

• copy_xdf and move_xdf copy and move an Xdf file, optionally renaming it as well.
• rename_xdf does a strict rename, ie without changing the file’s location.
• delete_xdf deletes the Xdf file.
HDFS file transfers

The following functions let you transfer files and datasets to and from HDFS, for working with a Spark or Hadoop cluster:

• copy_to uploads a dataset (a data frame or data source object) from the native filesystem to HDFS, saving it as an Xdf file.
• collect and compute do the reverse, downloading an Xdf file from HDFS.
• hdfs_upload and hdfs_download transfer arbitrary files and directories to and from HDFS.

Uploading and downloading works (or should work) both from the edge node and from a remote client. The interface is the same in both cases: no need to remember when to use rxHadoopCopyFromLocal and rxHadoopCopyFromClient. The hdfs_* functions mostly wrap the rxHadoop* functions, but also add extra functionality in some cases (eg vectorised copy/move, test for directory existence, etc).

HDFS file management

The following functions are for file management in HDFS, and mirror similar functions in base R for working with the native filesystem:

• hdfs_dir lists files in a HDFS directory, like dir() for the native filesystem.
• hdfs_dir_exists and hdfs_file_exists test for existence of a directory or file, like dir.exists() and file.exists().
• hdfs_file_copy, hdfs_file_move and hdfs_file_remove copy, move and delete files in a vectorised fashion, like file.copy(), file.rename() and unlink().
• hdfs_dir_create and hdfs_dir_remove make and delete directories, like dir.create() and unlink(recursive=TRUE).
• in_hdfs returns whether a data source is in HDFS or not.

As far as possible, the functions avoid reading the data via rxDataStep and so should be more efficient. The only times when rxDataStep is necessary are when importing from a non-Xdf data source, and converting between standard and composite Xdfs.

Miscellaneous functions

• as_xdf imports a dataset or data source into an Xdf file, optionally as composite. as_standard_xdf and as_composite_xdf are shortcuts for creating standard and composite Xdfs respectively.
• is_xdf and is_composite_xdf return whether a data source is a (composite) Xdf.
• local_exec runs an expression in the local compute context: useful for when you want to work with local Xdf files while connected to a remote cluster.

Wrapup

For more information, check out the package vignettes, in particular "Using the dplyrXdf package" and "Working with HDFS". dplyrXdf 0.10 is tentatively scheduled for a final release at the same time as the next version of Microsoft R Server, or shortly afterwards. In the meantime, please download this and give it a try; if you run into any bugs, or if you have any feedback, you can email me or log an issue at the Github repo.

### New Course – Supervised Learning in R: Regression

Thu, 08/10/2017 - 15:38

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

Hello R users, new course hot off the press today by Nina Zumel – Supervised Learning in R: Regression! From a machine learning perspective, regression is the task of predicting numerical outcomes from various inputs. In this course, you’ll learn about different regression models, how to train these models in R, how to evaluate the models you train and use them to make predictions.

Take me to chapter 1!
Supervised Learning in R: Regression features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a master in supervised learning with R!

What you’ll learn:

Chapter 1: What is Regression? In this chapter you are introduced to the concept of regression from a machine learning point of view. We present the fundamental regression method: linear regression. You will learn how to fit a linear regression model and to make predictions from that model.

Chapter 2: Training and Evaluating Regression Models You will now learn how to evaluate how well your models perform. You will review how to evaluate a model graphically, and look at two basic metrics for regression models. You will also learn how to train a model that will perform well in the wild, not just on training data. Although we will demonstrate these techniques using linear regression, all these concepts apply to models fit with any regression algorithm.

Chapter 3: Issues to Consider Before moving on to more sophisticated regression techniques, you will look at some other modeling issues: modeling with categorical inputs, interactions between variables, and when you might consider transforming inputs and outputs before modeling. While more sophisticated regression techniques manage some of these issues automatically, it’s important to be aware of them, in order to understand which methods best handle various issues — and which issues you must still manage yourself.

Chapter 4: Dealing with Non-Linear Responses Now that you have mastered linear models, you will begin to look at techniques for modeling situations that don’t meet the assumptions of linearity. This includes predicting probabilities and frequencies (values bounded between 0 and 1); predicting counts (nonnegative integer values, and associated rates); and responses that have a nonlinear but additive relationship to the inputs.
These algorithms are variations on the standard linear model.

Chapter 5: Tree-Based Methods In this chapter, you will look at modeling algorithms that do not assume linearity or additivity, and that can learn limited types of interactions among input variables. These algorithms are *tree-based* methods that work by combining ensembles of *decision trees* that are learned from the training data.

Get started with Supervised Learning in R: Regression today!

### Data Science for Water Utilities Using R

Thu, 08/10/2017 - 05:28

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Data Science for Water Utilities will be a new book that explains how to use R to undertake analytics for problems typical to water utilities. Continue reading →

The post Data Science for Water Utilities Using R appeared first on The Devil is in the Data.
### Extend the tidyverse workshop

Thu, 08/10/2017 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Have you embraced the tidyverse? Do you now want to expand it to meet your needs? Then this is a NEW two-day hands-on workshop designed for you! The goal of this workshop is to take you from someone who uses tidyverse functions to someone who can extend the tidyverse by:

• Writing expressive code using advanced functional programming techniques
• Designing consistent APIs using analogies to existing tools
• Using the S3 object system to make user-friendly values
• Bundling functions with documentation and tests into a package to share with others

The class is taught by Hadley Wickham, Chief Scientist at RStudio, a member of the R Foundation, and Adjunct Professor at Stanford University and the University of Auckland. He builds tools (both computational and cognitive) to make data science easier, faster, and more fun. Much of the material for the course is drawn from two of his existing books, Advanced R and R Packages, but the course also includes a lot of new material that will eventually become a book called “Tidy tools”.

As of today, there are just 30+ seats left. Discounts are still available for academics (students or faculty) and for 5 or more attendees from any organization. Email training@rstudio.com if you have any questions about the workshop that you don’t find answered on the registration page.
### sequence riddle

Thu, 08/10/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

The riddle this week on The Riddler was about finding the largest sequence of integers between 1 and 100 such that each integer is only used once and is always followed by a multiple or a factor. A basic R code searching at random [and programmed during a massive downpour on Skye] led to a solution of 69: although there is no certainty this is the best p… And the solutions posted the next week showed sequences with length 77! [Interestingly, both posted solutions have a sequence starting with 87. And they seem to exploit the graph of connections between integers in a much more subtle way than my random exploration of subsequences.]

Filed under: Kids, R Tagged: division, mathematical puzzle, rain, Scotland, Skye, The Riddler
### Boston EARL Keynote speaker announcement: Mara Averick

Wed, 08/09/2017 - 15:37

(This article was first published on Blog, and kindly contributed to R-bloggers)

Mango Solutions are excited to announce that Mara Averick, who many know as @dataandme on Twitter, will be joining us at Boston EARL on 1-3 November as one of our keynote speakers!

Mara is a polymath and self-confessed data nerd. With a strong background in research, she has a breadth of experience in data analysis, visualization, and applications thereof. Currently, by day, she’s a Consultant at TCB Analytics. By night, you’ll find her sharing dope R related stuff on Twitter and translating heavily technical subject matter into easy reading for a non-technical audience. When she’s not talking data, she’s diving into NBA stats, exploring weird and wonderful words, and/or indulging in her obsession with all things Archer.

We’re big fans of Mara here in the Mango office – we love her passion for data and R, and her ability to make them easily accessible. We can’t wait to see what she has to share with us in November.

Want to join Mara at EARL Boston?

Speak

Abstract submissions are now open for EARL Boston. Share your R adventures and innovations with fellow R users. All accepted speakers receive a 1-day Conference pass and a ticket to the evening networking reception.

Buy a ticket

Early bird tickets are now available! Save more than $100 on a Full Conference pass.



### NYC Airbnb Insights

Wed, 08/09/2017 - 07:30

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Motivation

“Hey, Yu-Han. I am going to Times Square to celebrate New Year this December. Do you have any recommendations about hotels or short-term rentals?” Once I heard my friend’s question, I intuitively replied, “Why don’t you check out Airbnb?” Helping my friend find a great place to stay for a week triggered my interest in exploring Airbnb’s listings in New York City. Is there a borough with more budget-friendly and highly rated rooms? When is the best time to place your order in high season?

Feel free to play around with my shiny app to check out some cool functions. Also, all the code is available on my GitHub.

The Datasets

To acquire the most up-to-date data, I used Airbnb New York City’s datasets from “Inside Airbnb”, which is an independent website offering non-commercial sets of data. For the interactive map, I applied the full 2017 data that includes over 40,500 listings, composed of entire houses, private rooms, and shared rooms. As to analyzing the listings and price changes over time, I collected data from 2015 to 2017, which contains more than 45 million observations.

Research Questions

What is the overall location distribution of Airbnb NYC?

Which borough has a better price performance ratio? (i.e. being able to enjoy a great experience even on a limited budget)

What are Airbnb NYC’s price changes over years and months?

Functions of my Shiny App

The Overview Of The Dataset

There are over 340,000 hosts and 40,000 listings in the May 2017 NYC dataset. The overall market value is estimated by multiplying minimum nights by 12% of the listing price: Airbnb receives a 6 to 12% booking fee from guests and charges hosts a 3% commission on every successful transaction, so the average share Airbnb earns is about 12% of a complete transaction.

NYC Interactive Map

To have a quick glance at the overall location distribution in NYC, we can use the “CLUSTER” function to group up nearby listings. In this interactive map, users are able to filter boroughs, room types, price range, rating score, and the number of reviews. Meanwhile, there are bar charts illustrating the count of each room type and average price based on users’ requirements on the left-hand side. After checking the big picture, this shiny app allows users to zoom in the map and take a deeper look at every listing by selecting “CIRCLE” button and clicking on specific dots.

Listing, Boroughs, and Price Changes

As to exploring detailed information in each borough, users can slide the price and rating score bar to check which borough has more listings meeting your expectations.

Last but not least, finding a great time to book your room is another factor that travelers may consider. By choosing the price change over “month”, you can see that the average price drops a little bit in July.

Fun Facts

1. More than 80% of Airbnb properties in New York City are outside of the mid-Manhattan area, especially in the East Village and Nolita.
2. Brooklyn is a great neighborhood with regard to its cheaper cost per night and higher rating scores. If you are short on time and want to narrow down your options, Brooklyn is worth considering.
3. According to the previous booking prices (from May 2015) and future listing prices (to May 2018), the overall price for Airbnb NYC has dropped. There are several likely reasons. First, the New York State Senate passed new laws against short-term rentals in 2016, which was a huge blow for Airbnb. On top of this regulatory issue, growing competition is another explanation for the decrease in price. Third, more private rooms and shared rooms, which are much cheaper than entire apartments, are being posted on Airbnb; their increase pulls the average price down.
4. There are price peaks in May, June, August, and December. During the summer, however, there is a dip in July.

Summary: Suggestions for Airbnb, Hosts, and Guests

Airbnb

New York City is the biggest market for Airbnb. According to the 2017 data, Airbnb NYC’s estimated market value is over 27 million, and its profit in New York City alone is about 3.24 million. To expand the market and retain users, Airbnb not only has to deal with regulatory problems but also needs to keep improving the user experience.

Hosts

For hosts who post listings in competitive areas, it is always smart to keep an eye on competitors’ prices. Offering extra services, such as breakfast or bike rentals, can also attract more guests.

Guests

During summertime, July is a great time to place an order. Besides, there are more highly-rated yet budget-friendly listings in Brooklyn.

The post NYC Airbnb Insights appeared first on NYC Data Science Academy Blog.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

### Should we be concerned about incidence – prevalence bias?

Wed, 08/09/2017 - 02:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

Recently, we were planning a study to evaluate the effect of an intervention on outcomes for very sick patients who show up in the emergency department. My collaborator had concerns about a phenomenon that she had observed in other studies that might affect the results – patients measured earlier in the study tend to be sicker than those measured later in the study. This might not be a problem, but in the context of a stepped-wedge study design (see this for a discussion that touches on this type of study design), this could definitely generate biased estimates: when the intervention occurs later in the study (as it does in a stepped-wedge design), the “exposed” and “unexposed” populations could differ, and in turn so could the outcomes. We might mistake an artificial effect for an intervention effect.

What could explain this phenomenon? The title of this post provides a hint: cases earlier in a study are more likely to be prevalent ones (i.e. they have been sick for a while), whereas later in the study cases tend to be incident (i.e. they only recently became sick). Even though both prevalent and incident cases are sick, the former may be sicker on average than the latter, simply because their condition has had more time to develop.

We didn’t have any data to test out this hypothesis (if our grant proposal is funded, we will be able to do that), so I decided to see if I could simulate this phenomenon. In my continuing series exploring simulation using Rcpp, simstudy, and data.table, I am presenting some code that I used to do this.

Generating a population of patients

The first task is to generate a population of individuals, each of whom starts out healthy and potentially becomes sicker over time. Time starts in month 1 and ends at some fixed point – in the first example, I end at 400 months. Each individual has a starting health status and a start month. In the examples that follow, health status is 1 through 4, with 1 being healthy, 3 quite sick, and 4 death. And, you can think of the start month as the point where the individual ages into the study. (For example, if the study includes only people 65 and over, the start month is the month the individual turns 65.) If an individual starts in month 300, she will have no measurements in periods 1 through 299 (i.e. health status will be 0).

The first part of the simulation generates a start month and starting health status for each individual, and then generates a health status for each individual until the end of time. Some individuals may die, while others may go all the way to the end of the simulation in a healthy state.

Rcpp function to generate health status for each period

While it is generally preferable to avoid loops in R, sometimes it cannot be avoided. I believe generating a health status that depends on the previous health status (a Markov process) is one of those situations. So, I have written an Rcpp function to do this – it is orders of magnitude faster than doing this in R:

#include <RcppArmadilloExtensions/sample.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector MCsim( unsigned int nMonths, NumericMatrix P,
                     int startStatus, unsigned int startMonth ) {

  IntegerVector sim( nMonths );
  IntegerVector healthStats( P.ncol() );
  NumericVector currentP;
  IntegerVector newstate;

  unsigned int q = P.ncol();
  healthStats = Rcpp::seq(1, q);

  sim[startMonth - 1] = startStatus;

  /* Loop through each month for each individual */
  for (unsigned int i = startMonth; i < nMonths; i++) {

    /* new state based on health status of last period and
       probability of transitioning to different state */
    newstate = RcppArmadillo::sample( healthStats, 1, TRUE,
                                      P.row(sim(i-1) - 1) );
    sim(i) = newstate(0);
  }

  return sim;
}

Generating the data

The data generation process is shown below. The general outline of the process is (1) define transition probabilities, (2) define starting health status distribution, (3) generate starting health statuses and start months, and (4) generate health statuses for each follow-up month.

# Transition matrix for moving through health statuses
P <- matrix(c(0.985, 0.015, 0.000, 0.000,
              0.000, 0.950, 0.050, 0.000,
              0.000, 0.000, 0.850, 0.150,
              0.000, 0.000, 0.000, 1.000),
            nrow = 4, byrow = TRUE)

maxFU = 400
nPerMonth = 350
N = maxFU * nPerMonth

ddef <- defData(varname = "sHealth",
                formula = "0.80; 0.15; 0.05",
                dist = "categorical")

# generate starting health values (1, 2, or 3) for all individuals
set.seed(123)
did <- genData(n = N, dtDefs = ddef)

# each month, 350 age in to the sample
did[, sMonth := rep(1:maxFU, each = nPerMonth)]

# show table for 10 randomly selected individuals
did[id %in% sample(x = did$id, size = 10, replace = FALSE)]

##         id sHealth sMonth
##  1:  15343       2     44
##  2:  19422       2     56
##  3:  41426       1    119
##  4:  50050       1    143
##  5:  63042       1    181
##  6:  83584       1    239
##  7:  93295       1    267
##  8: 110034       1    315
##  9: 112164       3    321
## 10: 123223       1    353

# generate the health status history based on the transition matrix
dhealth <- did[, .(sHealth, sMonth,
                   health = MCsim(maxFU, P, sHealth, sMonth)),
               keyby = id]
dhealth[, month := c(1:.N), by = id]
dhealth

##              id sHealth sMonth health month
##        1:      1       1      1      1     1
##        2:      1       1      1      1     2
##        3:      1       1      1      1     3
##        4:      1       1      1      1     4
##        5:      1       1      1      1     5
##       ---
## 55999996: 140000       1    400      0   396
## 55999997: 140000       1    400      0   397
## 55999998: 140000       1    400      0   398
## 55999999: 140000       1    400      0   399
## 56000000: 140000       1    400      1   400

Simulation needs burn-in period

The simulation process itself is biased in its early phases, as there are too many individuals in the sample who have just aged in compared to those who are “older”. (This is sort of the reverse of the incidence–prevalence bias.) Since individuals tend to have better health status when they are “younger”, the average health status of the simulation in its early phases is biased downwards by the preponderance of young individuals in the population. This suggests that any evaluation of the simulated data needs to account for a “burn-in” period that ensures there is a mix of “younger” and “older” individuals.
To show this, I have calculated an average health score for each period of the simulation and plotted the results. You can see that the sample stabilizes after about 200 months in this simulation.

# count number of individuals with a particular health status each month
cmonth <- dhealth[month > 0, .N, keyby = .(month, health)]
cmonth

##       month health      N
##    1:     1      0 139650
##    2:     1      1    286
##    3:     1      2     47
##    4:     1      3     17
##    5:     2      0 139300
##   ---
## 1994:   399      4 112203
## 1995:   400      1  18610
## 1996:   400      2   6515
## 1997:   400      3   2309
## 1998:   400      4 112566

# transform data from "long" form to "wide" form and calculate average
mtotal <- dcast(data = cmonth, formula = month ~ health,
                fill = 0, value.var = "N")
mtotal[, total := `1` + `2` + `3`]
mtotal[, wavg := (`1` + 2*`2` + 3*`3`)/total]
mtotal

##      month      0     1    2    3      4 total     wavg
##   1:     1 139650   286   47   17      0   350 1.231429
##   2:     2 139300   558  106   32      4   696 1.244253
##   3:     3 138950   829  168   45      8  1042 1.247601
##   4:     4 138600  1104  215   66     15  1385 1.250542
##   5:     5 138250  1362  278   87     23  1727 1.261726
##  ---
## 396:   396   1400 18616 6499 2351 111134 27466 1.407813
## 397:   397   1050 18613 6537 2321 111479 27471 1.406938
## 398:   398    700 18587 6561 2323 111829 27471 1.407957
## 399:   399    350 18602 6541 2304 112203 27447 1.406201
## 400:   400      0 18610 6515 2309 112566 27434 1.405810

ggplot(data = mtotal, aes(x = month, y = wavg)) +
  geom_line() +
  ylim(1.2, 1.5) +
  geom_hline(yintercept = 1.411, lty = 3) +
  geom_vline(xintercept = 200, lty = 3) +
  xlab("Month") +
  ylab("Average health status") +
  theme(panel.background = element_rect(fill = "grey90"),
        panel.grid = element_blank(),
        plot.title = element_text(size = 12, vjust = 0.5, hjust = 0)) +
  ggtitle("Average health status of simulated population")

Generating monthly study cohorts

Now we are ready to see if we can simulate the incidence–prevalence bias. The idea here is to find the first month during which an individual (1) is “active” (i.e. the period being considered is on or after the individual’s start period), (2) has an emergency department visit, and (3) has a health status at or above a specified threshold. We can set a final parameter that looks back some number of months (say 6 or 12) to see if there have been any previous qualifying emergency room visits before the study start period (which in our case will be month 290, to mitigate the burn-in bias identified above). This “look-back” will be used to mitigate some of the bias by creating a washout period that makes the prevalent cases look more like incident cases. The look-back quantity is calculated each month for each individual using an Rcpp function that loops through each period:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector cAddPrior(IntegerVector idx,
                        IntegerVector event,
                        int lookback) {

  int nRow = idx.length();
  IntegerVector sumPrior(nRow, NA_INTEGER);

  for (unsigned int i = lookback; i < nRow; i++) {
    IntegerVector seqx = Rcpp::seq(i - lookback, i - 1);
    IntegerVector x = event[seqx];
    sumPrior[i] = sum(x);
  }

  return(sumPrior);
}

Generating a single cohort

The following code (1) generates a population (as we did above), (2) generates emergency department visits that depend on health status (the sicker an individual is, the more likely she is to go to the ED), (3) calculates the number of eligible ED visits during the look-back period, and (4) creates the monthly cohorts based on the selection criteria. At the end, we calculate the average health status of the cohort by month of cohort – this will be used to illustrate the bias.
maxFU = 325
nPerMonth = 100
N = maxFU * nPerMonth

START = 289    # to allow for adequate burn-in
HEALTH = 2
LOOKBACK = 6   # how far to look back

set.seed(123)
did <- genData(n = N, dtDefs = ddef)
did[, sMonth := rep(1:maxFU, each = nPerMonth)]

healthStats <- did[, .(sHealth, sMonth,
                       health = MCsim(maxFU, P, sHealth, sMonth)),
                   keyby = id]
healthStats[, month := c(1:.N), by = id]

# eliminate period without status measurement (0) & death (4)
healthStats <- healthStats[!(health %in% c(0, 4))]

# ensure burn-in by starting with observations far
# into simulation
healthStats <- healthStats[month > (START - LOOKBACK)]

# set probability of emergency department visit
healthStats[, pED := (health == 1) * 0.02 +
                     (health == 2) * 0.10 +
                     (health == 3) * 0.20]

# generate emergency department visit
healthStats[, ed := rbinom(.N, 1, pED)]
healthStats[, edAdj := ed * as.integer(health >= HEALTH)] # if you want to restrict

healthStats[, pSum := cAddPrior(month, edAdj, lookback = LOOKBACK), keyby = id]

# look at one individual
healthStats[id == 28069]

##        id sHealth sMonth health month  pED ed edAdj pSum
##  1: 28069       1    281      1   284 0.02  0     0   NA
##  2: 28069       1    281      1   285 0.02  0     0   NA
##  3: 28069       1    281      1   286 0.02  0     0   NA
##  4: 28069       1    281      1   287 0.02  0     0   NA
##  5: 28069       1    281      2   288 0.10  0     0   NA
##  6: 28069       1    281      2   289 0.10  0     0   NA
##  7: 28069       1    281      2   290 0.10  1     1    0
##  8: 28069       1    281      2   291 0.10  0     0    1
##  9: 28069       1    281      2   292 0.10  0     0    1
## 10: 28069       1    281      2   293 0.10  0     0    1
## 11: 28069       1    281      2   294 0.10  0     0    1
## 12: 28069       1    281      3   295 0.20  1     1    1

# cohort includes individuals with a qualifying ED visit and no
# qualifying visits in the previous 6 months
cohort <- healthStats[edAdj == 1 & pSum == 0]
cohort <- cohort[, .(month = min(month)), keyby = id]
cohort

##        id month
##    1:    53   306
##    2:    82   313
##    3:   140   324
##    4:   585   291
##    5:   790   299
##   ---
## 3933: 31718   324
## 3934: 31744   325
## 3935: 31810   325
## 3936: 31860   325
## 3937: 31887   325

# estimate average health status of monthly cohorts
cohortStats <- healthStats[cohort, on = c("id", "month")]
sumStats <- cohortStats[, .(avghealth = mean(health), n = .N), keyby = month]
head(sumStats)

##    month avghealth   n
## 1:   290  2.248175 137
## 2:   291  2.311765 170
## 3:   292  2.367347 147
## 4:   293  2.291925 161
## 5:   294  2.366906 139
## 6:   295  2.283871 155

Exploring bias

Finally, we are at the point where we can see what bias, if any, results from selecting our cohorts under the scenario I’ve outlined above. We start by generating multiple iterations of populations and cohorts, and estimating average health status by month, under the assumption of a look-back period of 0. That is, we accept an individual into the first possible cohort regardless of her previous emergency department visit history. The plot below shows the average across 1000 iterations. What we see is that the average health status of the cohorts in the first 20 months or so exceeds the long-run average. The incidence–prevalence bias is extremely strong if we ignore prior ED history!

Taking history into account

Once we start to incorporate ED history by using look-back periods greater than 0, we can reduce the bias considerably. The two plots below show the results of using look-back periods of 6 and 12 months. Both reduce the bias, but only at 12 months are we approaching something that actually looks desirable. In fact, under this scenario, we’d probably want to go back 24 months to eliminate the bias entirely. Of course, these particular results depend on the simulation assumptions, so determining an appropriate look-back period will certainly depend on the actual data. (When we do finally get the actual data, I will follow up to let you know what kind of adjustment we needed to make in the real, non-simulated world.)
To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Portfolio Volatility Shiny App

Wed, 08/09/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

In our 3 previous posts, we walked through how to calculate portfolio volatility, then how to calculate rolling volatility, and then how to visualize rolling volatility. Today, we will wrap all of that work into a Shiny app that allows a user to construct his or her own five-asset portfolio, choose a benchmark and a time period, and visualize the rolling volatilities over time. Here is the final app:

There will be a slight departure in form today because we will use a helpers.r file to hold our functions – those same functions that we worked so hard to create in the previous three posts. There are a few reasons to put them into a helper file.

1. The end user won’t be able to see them, which leads to a tangent on the idea of reproducibility. We need to ask, “Reproducible by whom?”. In this case, the Shiny app is 100% reproducible by anyone who has access to that helper file, which would be my colleagues with access to my files or Github repository. But, if the end user is an external client, for example, that end user wouldn’t have access, and the analytic functions would remain a black box. In the world of finance, that’s necessary most of the time.

2. From a workflow perspective, that helper file allows us to test those functions in different formats. I can create an R Markdown report that uses the file, test different Shiny apps, or tweak the functions themselves without having to alter the actual application code.

3. From a stylistic perspective, the helper file keeps the Shiny app code a bit cleaner and much shorter.

It’s not right or wrong to use a helpers file, but it’s worth thinking about as our Shiny apps get more involved. Here’s how we load that file and have access to the functions and objects in it:

source("function-folder/simple-vol-helpers.r")

Since this is a blog post and we want to be complete, the code chunk below contains all the code from that simple-vol-helpers.r file.

# Calculate component returns
componentReturns_df <- function(stock1, stock2, stock3, stock4, stock5, start_date){

  symbols <- c(stock1, stock2, stock3, stock4, stock5)

  prices <-
    getSymbols(symbols, src = 'yahoo', from = start_date,
               auto.assign = TRUE, warnings = FALSE) %>%
    map(~Cl(get(.))) %>%
    reduce(merge) %>%
    `colnames<-`(symbols)

  # generate daily return series for funds
  prices_monthly <- to.monthly(prices, indexAt = "first", OHLC = FALSE)
  returns <- na.omit(ROC(prices_monthly, 1, type = "continuous"))

  returns_df <- returns %>%
    as_tibble(preserve_row_names = TRUE) %>%
    mutate(date = ymd(row.names)) %>%
    select(-row.names) %>%
    select(date, everything())
}

# Calculate rolling Portfolio Standard Deviation
rolling_portfolio_sd <- function(returns_df, start = 1, window = 6, weights){

  start_date <- returns_df$date[start]
  end_date <- returns_df$date[c(start + window)]

  interval_to_use <- returns_df %>%
    filter(date >= start_date & date < end_date)

  returns_xts <- interval_to_use %>% as_xts(date_col = date)
  w <- weights

  results_as_xts <- StdDev(returns_xts, weights = w, portfolio_method = "single")
  results_as_xts <- round(results_as_xts, 4) * 100

  results_to_tibble <- as_tibble(t(results_as_xts[,1])) %>%
    mutate(date = ymd(end_date)) %>%
    select(date, everything())
}

# Look how long this code chunk is. Easier to stash this in a helpers.r file!

All of those functions were explained and constructed in our previous Notebooks, so we won’t dwell on them today. Let’s move on to the appearance of the app itself!

First, we need to create an input sidebar where the user can choose assets, weights, a date, and a benchmark for comparison.

# This creates the sidebar input for the first stock and its weight.
# We'll need to copy and paste this fluidRow for each of the assets in our portfolio.
fluidRow(
  column(6, textInput("stock1", "Stock 1", "SPY")),
  column(4, numericInput("w1", "Portf. %", 40, min = 1, max = 100))
)

# Let the user choose a benchmark to compare to the portfolio volatility.
# We'll default to the Russell 2000 small cap index.
textInput("benchmark", "Benchmark for Comparison", "^RUT")

fluidRow(
  column(6, dateInput("start_date", "Start Date", value = "2013-01-01")),
  column(3, numericInput("window", "Window", 6, min = 3, max = 20, step = 1))
)

# This action button is important for user experience and server resources.
actionButton("go", "Submit")

That last line creates an actionButton, which is important for the end user. We have more than 10 user inputs in that sidebar, and without that actionButton, the app would start firing and reloading every time a user changed any of the inputs. That would be annoying for the user and taxing on the server! We will make sure the reactives wait for the user to click that button by using eventReactive. For example, in the lines below, the app will wait to calculate the rolling portfolio volatility because portfolio_rolling_vol is an eventReactive that won’t fire until input$go is true.

portfolio_rolling_vol <- eventReactive(input$go, {

  returns_df <-
    componentReturns_df(input$stock1, input$stock2, input$stock3,
                        input$stock4, input$stock5, input$start_date) %>%
    mutate(date = ymd(date))

  weights <- c(input$w1/100, input$w2/100, input$w3/100, input$w4/100, input$w5/100)

  window <- input$window

  roll_portfolio_result <-
    map_df(1:(nrow(returns_df) - window), rolling_portfolio_sd,
           returns_df = returns_df, window = window, weights = weights) %>%
    mutate(date = ymd(date)) %>%
    select(date, everything()) %>%
    as_xts(date_col = date) %>%
    `colnames<-`("Rolling Port SD")
  # an xts comes out of this
})

The user is going to choose a benchmark for comparison, and we need another eventReactive to take that input and calculate rolling volatility for the benchmark. The asset is passed via input$benchmark from the sidebar.

benchmark_rolling_vol <- eventReactive(input$go, {

  benchmark_prices <-
    getSymbols(input$benchmark, src = 'yahoo', from = input$start_date,
               auto.assign = TRUE, warnings = FALSE)
  benchmark_close <- Cl(get(benchmark_prices))

  benchmark_prices_monthly <- to.monthly(benchmark_close, indexAt = "first", OHLC = FALSE)
  benchmark_returns <- na.omit(ROC(benchmark_prices_monthly, 1, type = "continuous"))

  benchmark_rolling_sd <- rollapply(benchmark_returns, input$window, function(x) StdDev(x))
  benchmark_rolling_sd <- round(benchmark_rolling_sd, 4) * 100
})

Finally, when we visualize, it’s nice to include the chosen benchmark in the title. Thankfully, that is a simple eventReactive.

benchmark <- eventReactive(input$go, {input$benchmark})

We have now calculated three reactive objects: portfolio_rolling_vol(), benchmark_rolling_vol(), and benchmark(). We pass them to highcharter and tweak aesthetics on the y-axis.

renderHighchart({
  highchart(type = "stock") %>%
    hc_title(text = paste("Portfolio Volatility vs", benchmark(), "Volatility", sep = " ")) %>%
    hc_yAxis(title = list(text = "Vol Percent"),
             labels = list(format = "{value}%"),
             opposite = FALSE) %>%
    hc_add_series(portfolio_rolling_vol(), name = "Portfolio Vol", color = "blue") %>%
    hc_add_series(benchmark_rolling_vol(),
                  name = paste(benchmark(), "Vol", sep = " "),
                  color = "green") %>%
    hc_add_theme(hc_theme_flat()) %>%
    hc_navigator(enabled = FALSE) %>%
    hc_scrollbar(enabled = FALSE)
})

We’ve presented nothing new of substance today, as those analytical functions were all built in previous posts. However, this app does allow the user to build a custom portfolio and compare to a benchmark of his or her choosing. Have fun with it, and try to find some assets whose volatility has been increasing since the election last November.

Next time, we’ll take a closer look at the VIX and how it compares to realized volatility. Until then!



To leave a comment for the author, please follow the link and comment on their blog: R Views.

### Tutorial: Publish an R function as a SQL Server stored procedure with the sqlrutils package

Tue, 08/08/2017 - 21:58

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In SQL Server 2016 and later, you can publish an R function to the database as a stored procedure. This makes it possible to run your R function on the SQL Server itself, which makes the power of that server available for R computations, and also eliminates the time required to move data to and from the server. It also makes your R function available as a resource to DBAs for use in SQL queries, even if they don't know the R language.

Niels Berglund recently posted a detailed tutorial on using the sqlrutils package to publish an R function as a stored procedure. There are several steps to the process, but ultimately it boils down to calling registerStoredProcedure on your R function (and providing the necessary credentials).

If you don't have a connection (or the credentials) to publish to the database yourself, Niels also explains how to use the StoredProcedure function to generate a SQL Server source file defining a stored procedure including your R function, which you can provide to a DBA to deploy.

You can find the complete tutorial, including detailed explanations of the various parameters, at the link below.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

### Soccer data sparring: Scraping, merging and analyzing exercises

Tue, 08/08/2017 - 19:03

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

While understanding and spending time improving specific techniques, and strengthening individual muscles, is important, occasionally it is necessary to do some rounds of actual sparring to see your flow and spot weaknesses. This exercise set forces you to use everything you have practiced: scraping links, downloading data, regular expressions, merging data, and then analyzing it.

We will download data from the website football-data.co.uk that has data on some football/soccer leagues results and odds quoted by bookmakers where you can bet on the results.

Exercise 1

Use R to scan the German section on football-data.co.uk for any links and save them in a character vector called all_links. There are many ways to accomplish this.

Exercise 2

Among the links you found, there should be a number pointing to comma-separated-values (csv) files with data on Bundesliga 1 and 2, separated by season. Now update the all_links vector so that only links to csv files remain. Use regular expressions.
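One possible approach to the filtering step, shown here on a small made-up vector (the real all_links comes from Exercise 1), is grepl with an anchored pattern:

```r
# Hypothetical example vector -- the real all_links is scraped in Exercise 1
all_links <- c("mmz4281/1617/D1.csv", "mmz4281/1617/D2.csv",
               "germanym.php", "notes.txt")

# keep only links that end in .csv
all_links <- all_links[grepl("\\.csv$", all_links)]
all_links
```

The `$` anchor ensures only links ending in .csv survive, rather than any link that merely contains the string "csv".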


Exercise 3

Again, update all_links so that only links to csv tables from Bundesliga 1, from season 1993/1994 to 2013/2014 inclusive, remain.

Exercise 4

Import to a list in your workspace all 21 remaining csv files in all_links, each one as a data.frame. Use read.csv with the url and na.strings = c("", "NA"). Note that you might need to add a prefix to the links so that they are complete.

Exercise 5

Take the list and generate one big data.frame containing all the data.frames previously imported. One way to do this is with the rbind.fill function from a well-known package. Name the new data.frame bundesl.

Exercise 6

Take a good look at the new dataset. Our read.csv did not work perfectly on this data: it turns out that there are some empty rows and empty columns; identify and count them. Then update bundesl so that it no longer has any empty rows or columns.

Exercise 7

Format the Date column so that R understands it as a date, using as.Date().
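A minimal sketch of the parsing step, assuming the files store dates as day/month/year (the format string is an assumption; inspect the raw data before committing to it):

```r
# parse a day/month/year string; "%d/%m/%y" is an assumed format
d <- as.Date("28/08/93", format = "%d/%m/%y")
class(d)   # "Date"
```

If the format string does not match the data, as.Date silently returns NA, so it is worth checking sum(is.na(...)) after converting the whole column.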

Exercise 8

Remove all columns which are not 100% complete, and the variable Div as well.

Exercise 9

Which are the top 3 teams in terms of number of wins in Bundesliga 1 for our period? You are free to use base-R functions or any package. Be warned that this task is not as simple as it seems, due to small inconsistencies in the data.

Exercise 10

Which team has held the longest winning streak in our data?


To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

### Understanding the Math of Correspondence Analysis with Examples in R

Tue, 08/08/2017 - 06:13

Correspondence analysis is a popular tool for visualizing the patterns in large tables. To many practitioners it is probably a black box: table goes in, chart comes out. In this post I explain the mathematics of correspondence analysis. I show each step of the calculation, and I illustrate all of the steps using R. If you’ve ever wanted a deeper understanding of what’s going on behind the scenes of correspondence analysis, then this post is for you.

Correspondence Analysis in R: A case study

The data that I analyze show the relationship between thoroughness of newspaper readership and education level. It is a contingency table, which is to say that each number in the table represents the number of people in each pair of categories. For example, the cell in the top-left corner tells us that 5 people with some primary education glanced at the newspaper. The table shows the data for a sample of 312 people (which is also the sum of the numbers displayed).

I show the code for generating this table below. I have named the resulting table N.

N = matrix(c(5, 18, 19, 12, 3, 7, 46, 29, 40, 7, 2, 20, 39, 49, 16),
           nrow = 5,
           dimnames = list(
             "Level of education" = c("Some primary", "Primary completed",
                                      "Some secondary", "Secondary completed",
                                      "Some tertiary"),
             "Category of readership" = c("Glance", "Fairly thorough",
                                          "Very thorough")))

Computing the observed proportions (P) in R

The first step in correspondence analysis is to sum up all the values in the table. I’ve called this total n.

n = sum(N)

Then, we compute the table of proportions, P. It is typical to use this same formula with other types of tables, even if the resulting numbers are not, strictly speaking, proportions. Examples include correspondence analysis of tables of means or multiple response data.

P = N / n

This gives us the following table. To make it easy to read, I have done all the calculations in Displayr, which automatically formats R tables using HTML. If you do the calculations in normal R, you will instead get a text-based table like the one above. Sign in to Displayr to view the document that contains all the R calculations in this post.

Row and column masses

In the language of correspondence analysis, the sums of the rows and columns of the table of proportions are called masses. These are the inputs to lots of different calculations. The column masses in this example show that Glance, Fairly thorough, and Very thorough describe the reading habits of 18.3%, 41.3%, and 40.4% of the sample respectively. We can compute the column masses using the following R code:

column.masses = colSums(P)

The row masses are Some primary (4.5%), Primary completed (26.9%), Some secondary (27.9%), Secondary completed (32.4%), and Some tertiary (8.3%). These are computed using:

row.masses = rowSums(P)

Expected proportions (E)

Referring back to the original table of proportions, 1.6% of people glanced and had some primary education. Is this number big or small? We can compute the value that we would expect to see under the assumption that there is no relationship between education and readership. The proportion that glance at a newspaper is 18.3%, and 4.5% have only Some primary education. Thus, if there is no relationship between education and readership, we would expect that 4.5% of 18.3% of people (i.e., 0.008 = 0.8%) have both glanced and have only some primary education. We can compute the expected proportions for all the cells in the table in the same way.

The following R code computes all the values in a single line of code, where %o% means that a table is created by multiplying each of the row totals (row masses) by each of the column totals.

E = row.masses %o% column.masses

Residuals (R)

We compute the residuals by subtracting the expected proportions from the observed proportions. Residuals in correspondence analysis have a different role to that which is typical in statistics. Typically in statistics the residuals quantify the extent of error in a model. In correspondence analysis, by contrast, the whole focus is on examining the residuals.

The residuals quantify the difference between the observed data and the data we would expect under the assumption that there is no relationship between the row and column categories of the table (i.e., education and readership, in our example).

R = P - E

The biggest residual is -0.045 for Primary completed and Very thorough. That is, the observed proportion of people that only completed primary school and are very thorough is 6.4%, and this is 4.5% lower than the expected proportion of 10.9%, which is computed under the assumption of no relationship between newspaper readership and education. Thus, the tentative conclusion that we can draw from this is that there is a negative association between having completed primary education and reading very thoroughly. That is, people with only primary school education are less likely to read very thoroughly than the average person.
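The position of the largest residual can also be found programmatically. A self-contained check, rebuilding N, P, E, and R exactly as defined above:

```r
N <- matrix(c(5, 18, 19, 12, 3, 7, 46, 29, 40, 7, 2, 20, 39, 49, 16), nrow = 5)
P <- N / sum(N)                    # observed proportions
E <- rowSums(P) %o% colSums(P)     # expected proportions
R <- P - E                         # residuals

# largest absolute residual: row 2 (Primary completed), column 3 (Very thorough)
which(abs(R) == max(abs(R)), arr.ind = TRUE)
round(R[2, 3], 3)                  # -0.045
```

This reproduces the -0.045 residual for Primary completed and Very thorough discussed in the text.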

Indexed residuals (I)

Take a look at the top row of the residuals shown in the table above. All of the numbers are close to 0. The obvious explanation for this – that having some primary education is unrelated to reading behavior – is not correct. The real explanation is that all the observed proportions (P) and the expected proportions (E) are small, because only 4.5% of the sample had this level of education. This highlights a problem with looking at residuals from a table: by ignoring the number of people in each of the rows and columns, we end up being most likely to find results only in rows and columns with larger totals (masses). We can solve this problem by dividing the residuals by the expected values, which gives us a table of indexed residuals (I).

I = R / E

The indexed residuals have a straightforward interpretation. The further the value is from 0, the larger the observed proportion relative to the expected proportion. We can now see a clear pattern. The biggest value in the table is the .95 for Some primary and Glance. This tells us that people with some primary education are almost twice as likely to Glance at a newspaper as we would expect if there were no relationship between education and reading. In other words, the observed value is 95% higher than the expected value. Reading along this first row, we see that there is a weaker, but positive, indexed residual of 0.21 for Fairly thorough and Some primary. This tells us that people with some primary education were 21% more likely to be fairly thorough readers than we would expect. And a score of -.65 for Very thorough tells us that people with Some primary education were 65% less likely to be Very thorough readers than expected. Reading through all the numbers in the table, the overall pattern is that higher levels of education equate to more thorough readership.
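The values quoted in this paragraph can be verified directly from the definitions above:

```r
N <- matrix(c(5, 18, 19, 12, 3, 7, 46, 29, 40, 7, 2, 20, 39, 49, 16), nrow = 5)
P <- N / sum(N)
E <- rowSums(P) %o% colSums(P)
I <- (P - E) / E                   # indexed residuals

# Some primary row: 0.95 (Glance), 0.21 (Fairly thorough), -0.65 (Very thorough)
round(I[1, ], 2)
```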

As we will see later, correspondence analysis is a technique designed for visualizing these indexed values.

Reconstituting indexed residuals from a map

The chart below is a correspondence analysis with the coordinates computed using row principal normalization. I will explain its computation later. Now, I am going to show how we can work backwards from this map to the indexed residuals, in much the same way that we can recreate orange juice from orange juice concentrate.  Some Primary has coordinates of (-.55, -.23) and Glance’s coordinates are (-.96, -1.89). We can compute the indexed value by multiplying together the two x coordinates and the two y coordinates, and summing them up. Thus we have -.55*-.96 + -.23 * -1.89 = .53 + .44 = .97. Taking rounding errors into account, this is identical to the value of .95 shown in the table above.

Unless you have studied some linear algebra, there is a good chance that this calculation, known as the dot product (or scalar product, or inner product), is not intuitive. Fortunately, it can be computed in a different way that makes it more intuitive.

To compute the indexed residual for a pair of points, we start by measuring the distance between each of the points and the origin (see the image to the right). In the case of Some primary the distance is .59. Then, we compute the distance for Glance, which is 2.12. Then we compute the angle formed when we draw lines from each of the points to the origin. This is 41 degrees. Lastly, we multiply together each of these distances with the cosine of the angle. This gives us .59*2.12*cos(41°) = .59*2.12*.76 = .95, which is the correct value from the table.
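The same calculation can be sketched in R using this distance-and-angle form, again starting from the approximate coordinates read off the map:

```r
some.primary <- c(-0.55, -0.23)
glance       <- c(-0.96, -1.89)

d1 <- sqrt(sum(some.primary^2))  # distance of Some primary from the origin (~0.60)
d2 <- sqrt(sum(glance^2))        # distance of Glance from the origin (~2.12)

# Cosine of the angle between the two lines drawn through the origin
cos.angle <- sum(some.primary * glance) / (d1 * d2)
acos(cos.angle) * 180 / pi       # the angle, roughly 40 degrees

d1 * d2 * cos.angle              # identical to the dot product
```

Because `cos.angle` is defined via the dot product, the last line recovers the dot product exactly; the point of the exercise is that distances from the origin and the angle between the lines are things you can eyeball directly on the map.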

Now, perhaps this new formula looks no simpler than the dot product, but if you look at it a bit more closely, it becomes pretty straightforward. The first two parts of the formula are the distances of each point from the origin (i.e., the (0,0) coordinate). Thus, all else being equal, the further a point is from the origin, the stronger the associations between that point and the other points on the map. So, looking at the map, we can see that the column category of Glance is the one which is most discriminating in terms of the readership categories.

The second part of the interpretation, which will likely bring you back to high school, is the meaning of the cosine. If two points are in exactly the same direction from the origin (i.e., they are on the same line), the cosine of the angle is 1. The bigger the angle, the smaller the cosine, until we get to a right angle (90° or 270°), at which point the cosine is 0. And, when the lines are going in exactly opposite directions (i.e., so the line between the two points goes through the origin), the cosine of the angle is -1. So, when there is a small angle between the lines connecting the points to the origin, the association is relatively strong (i.e., a positive indexed residual). When there is a right angle there is no association (i.e., no residual). When there is a wide angle, a negative residual is the outcome.

Putting all this together allows us to work out the following things from the row principal correspondence analysis map above, which I have reproduced below to limit scrolling:

• People with only Primary completed are relatively unlikely to be Very thorough.
• Those with Some primary are more likely to Glance.
• People with Primary completed are more likely to be Fairly thorough.
• The more education somebody has, the more likely they are to be Very thorough.

Reconstituting residuals from bigger tables

If you look at the chart above, you can see that it shows percentages in the x and y axis labels. (I will describe how these are computed below.) They indicate how much of the variation in the indexed residuals is explained by the horizontal and vertical coordinates. As these add up to 100%, we can perfectly reconstitute the indexed residuals from the coordinates. For most tables, however, they add up to less than 100%. This means that there is some degree of information missing from the map. This is not unlike reconstituted orange juice, which falls short of fresh orange juice.

The post How to Interpret Correspondence Analysis Plots (It Probably Isn’t the Way You Think) provides a much more thorough (but un-mathematical) description of issues arising with the interpretation of correspondence analysis.

Singular values, eigenvalues, and variance explained

In the previous two sections I described the relationship between the coordinates on the map and the indexed residuals. In this section I am going to explain how the coordinates are computed from the indexed residuals.

The first step in computing the coordinates is to do a near-magical bit of mathematics called a Singular Value Decomposition (SVD). I have had a go at expressing this in layperson's language in my post An Intuitive Explanation of the Singular Value Decomposition (SVD): A Tutorial in R, which works through the same example that I have used in this post.

The code that I used for performing the SVD of the indexed residuals is shown below. The first line computes Z, by multiplying each of the indexed residuals by the square root of its corresponding expected value. This seems a bit mysterious at first, but two interesting things are going on here.

First, Z is a standardized residual, which is a rather cool type of statistic in its own right. Second, and more importantly from the perspective of correspondence analysis, this causes the singular value decomposition to be weighted, such that cells with a higher expected value are given a higher weight in the data. As the expected values are often related to the sample size, this weighting means that smaller cells in the table, for which the sampling error will be larger, are down-weighted. In other words, this weighting makes correspondence analysis relatively robust to outliers caused by sampling error, when the table being analyzed is a contingency table.
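As a sketch of how Z is built up from the raw table, the steps defined earlier in the post can be chained together as below. The counts in N are my transcription of the education-by-readership table used in this example, so treat them as illustrative:

```r
# Illustrative transcription of the education-by-readership contingency table
N <- matrix(c( 5,  7,  2,
              18, 46, 20,
              19, 29, 39,
              12, 40, 49,
               3,  7, 16),
            nrow = 5, byrow = TRUE)

P <- N / sum(N)                     # table of proportions
row.masses <- rowSums(P)            # row masses
column.masses <- colSums(P)         # column masses
E <- row.masses %o% column.masses   # expected proportions under independence
I <- (P - E) / E                    # indexed residuals
Z <- I * sqrt(E)                    # standardized residuals, the input to the SVD
```

Note that the rows of E sum to the row masses (and its columns to the column masses), which is exactly what independence of the two variables requires.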

Z = I * sqrt(E)
SVD = svd(Z)
rownames(SVD$u) = rownames(P)
rownames(SVD$v) = colnames(P)

A singular value decomposition has three outputs:

• A vector, d, containing the singular values.
• A matrix, u, containing the left singular vectors.
• A matrix, v, containing the right singular vectors.

The left singular vectors correspond to the categories in the rows of the table and the right singular vectors correspond to the columns. Each of the singular values, and the corresponding vectors (i.e., columns of u and v), correspond to a dimension. As we will see, the coordinates used to plot row and column categories are derived from the first two dimensions.

Squared singular values are known as eigenvalues. The eigenvalues in our example are .0704, .0129, and .0000.

eigenvalues = SVD$d^2

Each of these eigenvalues is proportional to the amount of variance explained by the corresponding dimension. By summing them up and expressing each as a proportion, which is done by the R function prop.table(eigenvalues), we compute that the first dimension of our correspondence analysis explains 84.5% of the variance in the data and the second 15.5%. These are the numbers shown in the x and y labels of the scatterplot shown earlier. The third dimension explains 0.0% of the variance, so we can ignore it entirely. This is why we are able to perfectly reconstitute the indexed residuals from the correspondence analysis plot.

Standard coordinates

As mentioned, we weighted the indexed residuals prior to performing the SVD. So, in order to get coordinates that represent the indexed residuals, we now need to unweight the SVD's outputs. We do this by dividing each row of the left singular vectors by the square root of the row masses (defined near the beginning of this post):

standard.coordinates.rows = sweep(SVD$u, 1, sqrt(row.masses), "/")

This gives us the standard coordinates of the rows:

We do the same process for the right singular vectors, except we use the column masses:

standard.coordinates.columns = sweep(SVD$v, 1, sqrt(column.masses), "/")

This gives us the standard coordinates of the columns, shown below. These are the coordinates that have been used to plot the column categories on the maps in this post.

Principal coordinates

The principal coordinates are the standard coordinates multiplied by the corresponding singular values:

principal.coordinates.rows = sweep(standard.coordinates.rows, 2, SVD$d, "*")

The positions of the row categories shown on the earlier plots are these principal coordinates. The principal coordinates for the education levels (rows) are shown in the table below.

The principal coordinates represent the distances between the row profiles of the original table. The row profiles are shown in the table below. They are the raw data (N) divided by the row totals. Outside of correspondence analysis they are more commonly referred to as the row percentages of the contingency table. The more similar two rows' principal coordinates, the more similar their row profiles. More precisely, when we plot the principal coordinates, the distances between the points are chi-square distances. These are the distances between the rows weighted by the column masses. You can find the R calculations for the chi-square distances here.
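The chi-square distance between two row profiles can be sketched in a few lines of R. The profiles and column masses below are made-up numbers purely for illustration:

```r
# Chi-square distance between two row profiles, weighting by the column masses
chisq.dist <- function(profile1, profile2, column.masses) {
  sqrt(sum((profile1 - profile2)^2 / column.masses))
}

# Illustrative profiles (each sums to 1) and illustrative column masses
chisq.dist(c(0.50, 0.50), c(0.25, 0.75), c(0.5, 0.5))  # 0.5
```

Dividing each squared difference by the column mass is what distinguishes this from ordinary Euclidean distance: differences in rare columns count for more.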

The principal coordinates for the columns are computed in the same way:

principal.coordinates.columns = sweep(standard.coordinates.columns, 2, SVD$d, "*")

In the row principal plot shown earlier, the row categories’ positions are the principal coordinates. The column categories are plotted based on the standard coordinates. This means that it is valid to compare row categories based on their proximity to each other. It is also valid to understand the relationship between the row and column coordinates based on their dot products. But, it is not valid to compare the column points based on their position. I discuss this in more detail in a post called Normalization and the Scaling Problem in Correspondence Analysis: A Tutorial Using R.

Quality

We have already looked at one metric of the quality of a correspondence analysis: the proportion of the variance explained. We can also compute the quality of the correspondence analysis for each of the points on a map. Recall that the further a point is from the origin, the better that point is explained by the correspondence analysis. When we square the principal coordinates and express these as row proportions, we get measures of the quality of each dimension for each point. Sometimes these are referred to as squared correlations or squared cosines.

pc = rbind(principal.coordinates.rows, principal.coordinates.columns)
prop.table(pc ^ 2, 1)

The quality of the map for a particular category is usually defined as the sum of the scores it gets for the two dimensions that are plotted. In our example, these all add up to 100%.
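For a single point, the calculation looks like the sketch below, using a hypothetical set of principal coordinates on all three dimensions:

```r
# Hypothetical principal coordinates of one point on all three dimensions
pc.point <- c(-0.55, -0.23, 0)

quality.by.dimension <- prop.table(pc.point^2)  # squared coordinates as row proportions
sum(quality.by.dimension[1:2])                  # quality on the two plotted dimensions
```

Because the third coordinate here is zero, the two plotted dimensions account for 100% of this point, mirroring the situation in our example where the third dimension explains nothing.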

Acknowledgments

The data in the example comes from Greenacre and Hastie’s 1987 paper “The geometric interpretation of correspondence analysis”, published in the Journal of the American Statistical Association.

Where practical, I have used the notation and terminology used in Michael Greenacre’s (2016) third edition of Correspondence Analysis in Practice. This excellent book contains many additional calculations for correspondence analysis diagnostics. The only intentional large deviation from Greenacre’s terminology relates to the description of the normalizations (I discuss the differences in terminology in Normalization and the Scaling Problem in Correspondence Analysis: A Tutorial Using R).

This post is partly based on a paper that I wrote for the International Journal of Market Research, “Improving the display of correspondence analysis using moon plots”, in 2011.

TRY IT OUT

You can sign in to Displayr and view the document that contains all the R calculations in this post.


### Seeking guidance in choosing and evaluating R packages

Tue, 08/08/2017 - 02:00

(This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

At useR!2017 in Brussels last month, I contributed to an organized session focused on navigating the 11,000+ packages on CRAN. My collaborators on this session and I recently put together an overall summary of the session and our goals, and now I’d like to talk more about the specific issue of learning about R packages and deciding which ones to use. John and Spencer will write more soon about the two other issues of our focus:

• meta-packages that can unify multiple packages within domains and
• searching for packages.

In preparation for this session, I ran a brief online survey in the spring of 2017 to ask R users how they currently discover and learn about R packages. The results of this survey are available in an R package (SO META) on GitHub.

library(packagesurvey)
data(package_survey)

There were 1039 respondents to the survey. You can easily explore how many respondents chose each answer to the single question on the survey, “How do you currently discover and learn about R packages?”

library(tidyverse)

package_survey %>%
  mutate(total = n_distinct(respondent)) %>%
  count(answer, total) %>%
  arrange(desc(n)) %>%
  mutate(proportion = scales::percent(n / total)) %>%
  select(-total, -n) %>%
  kable(col.names = c("How do you currently discover and learn about R packages?",
                      "% of respondents who chose each answer"))

How do you currently discover and learn about R packages? | % of respondents who chose each answer
Social media such as blogs, R-bloggers, Twitter, Slack, or GitHub contacts | 79.8%
General search websites such as Google and Yahoo | 57.0%
Your personal network, such as colleagues and professors | 41.6%
Books, textbooks, or journal articles (JSS, JOSS, R-Journal) | 31.9%
Conferences, meet-ups, or seminars | 24.1%
CRAN Task Views | 21.8%
Email lists such as r-help, r-packages, or r-pkg-devel | 15.3%
R-specific search websites such as METACRAN (www.r-pkg.org) or Rdocumentation (https://www.rdocumentation.org/) | 11.1%
Other (send ideas to @juliasilge on Twitter!) | 4.2%
R packages built for search such as the sos package | 2.2%

Responses to this survey were fielded from R email help lists, local R meetup groups, social media such as Twitter, and affinity groups such as R-Ladies. The respondents to this survey overwhelmingly look to social media including blogs and Twitter to learn about R packages, and also make use of general search sites and their personal network. I know this aligns with how I personally learn about R packages!

I heard some great and insightful answers from people contributing to the “other” option. R users use Stack Overflow to learn about R packages, as well as options like CRANberries and crantastic, both of which have RSS feeds that users follow. Other users mentioned learning by reading code on GitHub (this is one I have done too!), and other search websites including rpackages.io.

You might also be interested in when R users responded to the survey.

package_survey %>%
  distinct(respondent, .keep_all = TRUE) %>%
  ggplot(aes(response_time)) +
  geom_histogram(fill = "midnightblue") +
  labs(x = NULL, y = "Number of R users",
       title = "Responses to survey on package discovery over time")

At useR, after the large combined session, we broke out into three smaller sessions for discussion and brainstorming. I facilitated the breakout session focused on guidance for package choice and package evaluation. We had about 40 participants in our discussion on choosing and evaluating R packages. It was a fruitful discussion and several important themes emerged.

The Value of Personal Impact

Participants in our session emphasized how impactful personal relationships can be in how packages are shared and evaluated. Some participants discussed how building local networks of R users may be more important in this effort than top-down, technological solutions. Our survey does show that personal recommendations have been important for many individuals in evaluating R packages. This is yet another area where local user groups can continue to have important impact. Some ways to share this experience more broadly would be online video series or live data analysis, such as those by Sean Taylor and Roger Peng.

Some participants wondered whether the idea of a CRAN Task View is outdated in the current climate with so many packages, and whether it is even possible for one person to maintain one effectively. Others responded that CTVs are all about curation, which is still important, perhaps even more important now. We had at least one CTV maintainer present in our breakout session, and several things were presented as important in order for CTV maintainers to do their jobs:

• Package maintainers should update their NEWS files.
• Package maintainers need to write good documentation.

These are helpful for all R users, of course, but also for maintainers of CRAN Task Views. The pkgdown package was mentioned as a great way to make documentation visible.

CRAN and You

Participants had several ideas about how things are done on CRAN now and adjustments that might be made in the interest of discovering and evaluating packages. One idea that came up several times was the possibility of keywords or tagging for packages. I have since learned that there is support for some tagging architecture for packages on CRAN (for example, see here) in the DESCRIPTION file using ACM, JEL, or MSC classifications. These are fairly unwieldy lists currently and something like an RStudio addin could be used to navigate them, if they were widely used.

Another desire participants voiced was for more information directly on CRAN, such as the number of downloads for packages. Participants also suggested that vignettes for context-specific tasks like the Bioconductor Workflows would be helpful for package discovery and evaluation, either associated with CRAN or perhaps the R Journal. Finally, there was some discussion about whether the very minimal gate-keeping on CRAN was good or bad for the community, although the general feeling was that efforts to keep packages off CRAN would not be positive.

More data, more problems

Some of the package developers at the session wondered why, when R is a data-centric language, developers have such primitive analytics about their users. Issues of user privacy are central here, but there might be opt-in options that could help both package developers and users make better decisions. The idea of a recommender system for R packages was brought up multiple times, perhaps a Tinder for R packages like papr, the Tinder for academic preprints. Both the users and developers present thought that data on package use (instead of package downloads alone) would be helpful in evaluating how important or helpful R packages are. Participants also discussed the possibility of a linter for analysis scripts, similar in concept to linters for code, that would suggest packages and good practice. Such a linter would necessarily be opinionated, but almost all of the efforts to suggest and evaluate R packages are at some level.

Moving forward

You can look to hear more from my collaborators on this session soon. I am really happy that this discussion is happening in our community. One thing that I am taking from it is increased respect and value for the work done by local meetup group organizers and individuals who contribute to spreading the R love, both online and in their communities. Turns out that is how people learn! Something else I am moving forward with is a continued commitment to growing my skills as a package developer. Writing good documentation and adopting best practices makes this whole challenge better for everyone, from the CRAN Task View maintainer to the new R user. I am very happy to hear feedback or questions from you!


To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.