R-bloggers
R news and tutorials contributed by hundreds of R bloggers
Updated: 2 hours 1 min ago

Celebrating 20 years of CRAN

Mon, 10/09/2017 - 18:04

(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

In April I spoke for the Royal Statistical Society (Glasgow branch) at their event celebrating 20 years of CRAN. The other speakers were Charis Chanialidis and Colin Gillespie. Like most people, I find watching myself speak really cringeworthy, but perhaps it’s not so bad! The three talks are here:


RStudio v1.1 Released

Mon, 10/09/2017 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re excited to announce the general availability of RStudio 1.1. Highlights include:

  • A connections tab which makes it easy to connect to, explore, and view data in a variety of databases.
  • A terminal tab which provides fluid shell integration with the IDE, xterm emulation, and even support for full-screen terminal applications.
  • An object explorer which can navigate deeply nested R data structures and objects.
  • A new, modern dark theme and Retina-quality icons throughout.
  • Dozens of other small improvements and bugfixes.

RStudio Server Pro 1.1 is also now available. Some of the new Pro features include:

  • Support for floating licenses, which make it easy to run RStudio Server Pro in Docker containers, virtual machines, and cloud computing instances.
  • Improved session management, which allows analysts to label, multi-select, force quit and otherwise self-manage their R sessions.
  • Tools for administrators, including the ability to send users notifications and automatically clean up unused sessions, freeing disk space and resources.

You can download RStudio v1.1 today, and let us know what you think on the RStudio IDE community forum.


How to sample from multidimensional distributions using Gibbs sampling?

Mon, 10/09/2017 - 02:00

(This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers)

We will show how to perform multivariate random sampling using one of the Markov Chain Monte Carlo (MCMC) algorithms, called the Gibbs sampler.

Random sampling with rabbit on the bed plane (animated GIF via GIPHY)

To start, what are MCMC algorithms and what are they based on?

Suppose we are interested in generating a random variable X with distribution \pi over \mathcal{X}.
If we are not able to do this directly, we will settle for generating a sequence of random variables X_0, X_1, \ldots which, in some sense, tends to the distribution \pi. The MCMC approach explains how to do so:

Build a Markov chain X_0, X_1, \ldots on \mathcal{X} whose stationary distribution is \pi.
If the construction is correct, we should expect the random variables to converge to \pi.

In addition, we can expect that,
for a function f: \mathcal{X} \rightarrow \mathbb{R},
\mathbb{E}(f) = \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=0}^{n-1}f(X_{i})

with probability equal to 1.

In essence, we want the above equality to hold for an arbitrary starting value X_0.

One of the MCMC algorithms that guarantee the above properties is the so-called Gibbs sampler.

Gibbs Sampler – description of the algorithm.

Assumptions:

  • \pi is defined on the product space \mathcal{X} = \Pi_{i=1}^{d}\mathcal{X_{i}}
  • We are able to draw from the conditional distributions
    \pi(X_{i}|X_{-i}),
    where
    X_{-i} = (X_{1}, \ldots X_{i-1}, X_{i+1}, \ldots X_{d})

Algorithm steps:

  1. Select the initial values X_{k}^{(0)}, k = 1, \ldots d
  2. For t=1,2,\ldots repeat:

    For k = 1, \ldots d sample X_{k}^{(t)} from distribution \pi(X_{k}^{(t)}\rvert X_{1}^{(t)}, \ldots X_{k-1}^{(t)}, X_{k+1}^{(t-1)}, \ldots X_{d}^{(t-1)})

  3. Repeat step 2 until the distribution of vector (X_{1}^{(t)}, \ldots X_{d}^{(t)}) stabilizes.
  4. Subsequent iterations of step 2 then provide draws from \pi.

How should we understand step 3?
In practice, we are satisfied once the running means of the coordinates stabilize to some desired accuracy.

Intuition in two-dimensional case:

Source: [3]

Gibbs sampling from a two-dimensional normal distribution.

We will sample from the distribution of \theta \sim \mathcal{N}(0, [\sigma_{ij}]), where
\sigma_{11} = \sigma_{22} = 1 and \sigma_{12} = \sigma_{21} = \rho.

Knowing the joint density of \theta, it’s easy to show that:
\theta_{1}|\theta_{2} \sim \mathcal{N}(\rho\theta_{2}, 1-\rho^{2})
\theta_{2}|\theta_{1} \sim \mathcal{N}(\rho\theta_{1}, 1-\rho^{2})

R implementation:

gibbs_normal_sampling <- function(n_iteration, init_point, ro) {
  # init_point is a numeric vector of length 2
  theta_1 <- c(init_point[1], numeric(n_iteration))
  theta_2 <- c(init_point[2], numeric(n_iteration))
  for (i in 2:(n_iteration + 1)) {
    theta_1[i] <- rnorm(1, ro * theta_2[i - 1], sqrt(1 - ro^2))
    theta_2[i] <- rnorm(1, ro * theta_1[i], sqrt(1 - ro^2))
  }
  list(theta_1, theta_2)
}

Using the above function, let’s see how 10 points were sampled with \rho = 0.5:

And for 10000 iterations:

We leave it to you to compare how the stability of the parameters changes depending on the chosen \rho; one possible starting point is sketched below.
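Below is a minimal sketch of one way to make that comparison, reusing the gibbs_normal_sampling() function defined above; the choice of \rho values and the plotting code are our own illustration rather than part of the original post.

# Illustrative sketch (not from the original post): running means of theta_1
# for a few values of rho, using gibbs_normal_sampling() from above
set.seed(123)
rhos <- c(0.1, 0.5, 0.9)
running_means <- sapply(rhos, function(r) {
  draws <- gibbs_normal_sampling(n_iteration = 5000, init_point = c(0, 0), ro = r)
  cumsum(draws[[1]]) / seq_along(draws[[1]])
})
matplot(running_means, type = "l", lty = 1,
        xlab = "Iteration", ylab = "Running mean of theta_1")
legend("topright", legend = paste("rho =", rhos), col = 1:3, lty = 1)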

Let’s move on to using the Gibbs sampler to estimate density parameters.

We will show how the Gibbs sampler and Bayesian statistics can be used to estimate the mean parameters in a mixture of normal distributions.

Assumptions (simplified case):

  • iid. sample Y=Y_{1},\ldots Y_{N} comes from a mixture of normal distributions (1-\pi)\mathcal{N}(\mu_{1}, \sigma_{1}^{2})+\pi\mathcal{N}(\mu_{2}, \sigma_{2}^{2}), where \pi, \sigma_{1} and \sigma_{2} are known.
  • For i=1,2, \mu_{i} \sim \mathcal{N}(0, 1) (prior distributions), and \mu_{1} and \mu_{2} are independent.
  • \Delta = (\Delta_{1}, \ldots, \Delta_{N}) – the classification vector (unobserved) from which Y is derived (when \Delta_{i} = 0, Y_{i} is drawn from \mathcal{N}(\mu_{1}, \sigma_{1}^{2}), when \Delta_{i} = 1 then drawn from \mathcal{N}(\mu_{2}, \sigma_{2}^{2})).
  • \mathbb{P}(\Delta_{i} = 1) = \pi

With the above assumptions it can be shown that:

  • f(\Delta) = (1-\pi)^{N-\sum_{i=1}^{N}\Delta_{i}}\cdot\pi^{\sum_{i=1}^{N}\Delta_{i}}
  • (\mu_{1}\rvert\Delta,Y) \sim \mathcal{N}(\frac{\sum_{i=1}^{N}(1-\Delta_{i})y_{i}}{1 + \sum_{i=1}^{N}(1-\Delta_{i})}, \frac{1}{1+\sum_{i=1}^{N}(1-\Delta_{i})})
  • (\mu_{2}\rvert\Delta,Y) \sim \mathcal{N}(\frac{\sum_{i=1}^{N}\Delta_{i}y_{i}}{1 + \sum_{i=1}^{N}\Delta_{i}}, \frac{1}{1+\sum_{i=1}^{N}\Delta_{i}})
  • \mathbb{P} (\Delta_{i} = 1\rvert\mu_{1}, \mu_{2}, Y) = \frac{\pi \phi_{(\mu_{2}, \sigma_{2})}(y_{i})}{(1-\pi) \phi_{(\mu_1, \sigma_{1})}(y_{i})+\pi \phi_{(\mu_2, \sigma_{2})}(y_{i})}

The form of the algorithm:

  1. Choose a starting point for the means (\mu_{1}^{0}, \mu_{2}^{0})
  2. In the k-th iteration do:
    • With the probability:
      \frac{\pi \phi_{(\mu_{2}^{(k-1)}, \sigma_{2})}(y_{i})}{(1-\pi) \phi_{(\mu_{1}^{(k-1)}, \sigma_{1})}(y_{i})+\pi \phi_{(\mu_{2}^{(k-1)}, \sigma_{2})}(y_{i})} draw \Delta_{i}^{(k)} for i = 1, \ldots, N
    • Calculate:
      \hat{\mu_{1}} = \frac{\sum_{i=1}^{N}(1-\Delta_{i}^{(k)})y_{i}}{1 + \sum_{i=1}^{N}(1-\Delta_{i}^{(k)})}
      \hat{\mu_{2}} = \frac{\sum_{i=1}^{N}\Delta_{i}^{(k)}y_{i}}{1 + \sum_{i=1}^{N}\Delta_{i}^{(k)}}
    • Generate:
      \mu_{1}^{(k)} \sim \mathcal{N}(\hat{\mu_{1}}, \frac{1}{1 + \sum_{i=1}^{N}(1-\Delta_{i}^{(k)})})
      \mu_{2}^{(k)} \sim \mathcal{N}(\hat{\mu_{2}}, \frac{1}{1 + \sum_{i=1}^{N}\Delta_{i}^{(k)}})
  3. Repeat step 2 until the distribution of the vector (\Delta, \mu_{1}, \mu_{2}) stabilizes.

How to do this in R?

To start, let’s generate a random sample from a mixture of normals with parameters (\pi, \mu_1, \mu_2, \sigma_1, \sigma_2) = (0.7, 10, 2, 1, 2).

set.seed(12345)
mu_1 <- 10
mu_2 <- 2
sigma_1 <- 1
sigma_2 <- 2
pi_known <- 0.7
n <- 2000
pi_sampled <- rbinom(n, 1, pi_known)
y_1 <- rnorm(n, mu_1, sigma_1)
y_2 <- rnorm(n, mu_2, sigma_2)
y <- (1 - pi_sampled) * y_1 + pi_sampled * y_2

Take a quick look at a histogram of our data:

The next task is to estimate \mu_1 and \mu_2 in the above model.

# Not loaded in the original snippet: needed for %>% and map_int()
library(magrittr)
library(purrr)

mu_init <- c(12, 0)
n_iterations <- 300
delta <- vector("list", n_iterations)
mu_1_vec <- c(mu_init[1], numeric(n_iterations))
mu_2_vec <- c(mu_init[2], numeric(n_iterations))

delta_probability <- function(i, y, mu_1_vec, mu_2_vec, sigma_1, sigma_2, pi_known) {
  pi_known * dnorm(y, mu_2_vec[i - 1], sigma_2) /
    ((1 - pi_known) * dnorm(y, mu_1_vec[i - 1], sigma_1) + pi_known * dnorm(y, mu_2_vec[i - 1], sigma_2))
}
mu_1_mean <- function(delta, i, y) {
  sum((1 - delta[[i - 1]]) * y) / (1 + sum(1 - delta[[i - 1]]))
}
mu_2_mean <- function(delta, i, y) {
  sum(delta[[i - 1]] * y) / (1 + sum(delta[[i - 1]]))
}

for (j in 2:(n_iterations + 1)) {
  delta[[j - 1]] <- y %>%
    map_int(~ rbinom(1, 1, delta_probability(j, ., mu_1_vec, mu_2_vec, sigma_1, sigma_2, pi_known)))
  mu_1_vec[j] <- rnorm(1, mu_1_mean(delta, j, y), sqrt(1 / (1 + sum(1 - delta[[j - 1]]))))
  mu_2_vec[j] <- rnorm(1, mu_2_mean(delta, j, y), sqrt(1 / (1 + sum(delta[[j - 1]]))))
}

Let’s see how the sampled values relate to the known ones:

The following plot presents the mean of the \Delta vector at each iteration:

Let’s check how the means of the parameters \mu_1 and \mu_2 stabilize over the iterations of the algorithm:

Note how few iterations were needed to stabilize the parameters.

Finally, let’s compare the estimated density with our initial sample:

Readers interested in the topic can refer to [1], which generalizes the normal mixture model by placing prior distributions on the other parameters as well.

It is also worth comparing the above algorithm with its deterministic counterpart, the Expectation-Maximization (EM) algorithm; see [2].

Write your question and comments below. We’d love to hear what you think.

Resources:

[1] http://gauss.stat.su.se/rr/RR2006_1.pdf

[2] http://web.stanford.edu/~hastie/ElemStatLearn/

[3] http://vcla.stat.ucla.edu/old/MCMC/MCMC_tutorial.htm


mea culpa!

Mon, 10/09/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

An entry about our Bayesian Essentials book on X validated alerted me to a typo in the derivation of the Gaussian posterior..! When deriving the posterior (which was left as an exercise in the Bayesian Core), I just forgot the term expressing the divergence between the prior mean and the sample mean. Mea culpa!!!

Filed under: Books, Kids, R, Statistics, University life Tagged: Bayesian Analysis, Bayesian Core, Bayesian Essentials with R, Book, cross validated, Gaussian model, typo


A step change in managing your calendar, without social media

Sun, 10/08/2017 - 19:36

(This article was first published on DanielPocock.com - r-project, and kindly contributed to R-bloggers)

Have you been to an event recently involving free software or a related topic? How did you find it? Are you organizing an event and don’t want to fall into the trap of using Facebook or Meetup or other services that compete for a share of your community’s attention?

Are you keen to find events in foreign destinations related to your interest areas to coincide with other travel intentions?

Have you been concerned when your GSoC or Outreachy interns lost a week of their project going through the bureaucracy to get a visa for your community’s event? Would you like to make it easier for them to find the best events in the countries that welcome and respect visitors?

In many recent discussions about free software activism, people have struggled to break out of the illusion that social media is the way to cultivate new contacts. Wouldn’t it be great to make more meaningful contacts by attending a more diverse range of events rather than losing time on social media?

Making it happen

There are already a number of tools (for example, Drupal plugins and WordPress plugins) for promoting your events on the web and in iCalendar format. There are also a number of sites like Agenda du Libre and GriCal who aggregate events from multiple communities where people can browse them.

How can we take these concepts further and make a convenient, compelling and global solution?

Can we harvest event data from a wide range of sources and compile it into a large database using something like PostgreSQL or a NoSQL solution or even a distributed solution like OpenDHT?

Can we use big data techniques to mine these datasources and help match people to events without compromising on privacy?

Why not build an automated iCalendar “to-do” list of deadlines for events you want to be reminded about, so you never miss the deadlines for travel sponsorship or submitting a talk proposal?

I’ve started documenting an architecture for this on the Debian wiki and proposed it as an Outreachy project. It will also be offered as part of GSoC in 2018.

Ways to get involved

If you would like to help this project, please consider introducing yourself on the debian-outreach mailing list and helping to mentor or refer interns for the project. You can also help contribute ideas for the specification through the mailing list or wiki.

Mini DebConf Prishtina 2017

This weekend I’ve been at the MiniDebConf in Prishtina, Kosovo. It has been hosted by the amazing Prishtina hackerspace community.

Watch out for future events in Prishtina, the pizzas are huge, but that didn’t stop them disappearing before we finished the photos:


Spatial networks – case study St James centre, Edinburgh (1/3)

Sun, 10/08/2017 - 14:00

(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

Last year I spent a bit of time learning about routing and network analysis. I created a map of distance from each GB postcode to the nearest railway station. At the time my local council were also looking at changing school catchments and building new schools. This also seemed like an excellent project for routing analysis. I made some progress on this, but never wrote it up.

One aspect I was really interested in was common routes. When you create a shortest path network from many points to a single destination you get a lot of lines which converge on the same place. Can we use this for anything else? Can we count how many lines share the same pathways? This would be particularly useful for prioritising investment in safe routes to school, or walking and cycling infrastructure.

I was spurred to look at this analysis again from the tweet below:

Have to question the merits of people coming from the west driving across the city centre to get to St James’s Ctr.

— Cllr Scott Arthur (@ProfScottThinks) October 6, 2017

For background, the St James’s Centre in Edinburgh is being demolished and a new shopping centre built to replace it. There has been a lot of (justified) complaint about the new road layout at Piccadilly Place, because it prioritises motorised traffic over more efficient and effective ways of transporting people in cities (foot, bike, bus and tram). On top of this, the new St James’ centre will have a greatly increased number of car parking spaces (1,800, up from 550, of which 300 are for residents).

Scott’s tweet made me think about catchment area for the St James’ centre. How far away are different parts of Edinburgh? Which roads would be most travelled to get there? If one increases the number of parking spaces at a location, surely the number of cars travelling to that location will increase. What will the impact on the road network be?

There are some great resources to help us with this. The data source I’m going to use is Ordnance Survey OpenData. In particular we need Open Roads and CodePoint (postcodes). Software I’ll use for this task are QGIS, GRASS GIS and R. These are open source and require a little practice to use, but the effort pays off!

After reading data into GRASS I calculated the distance network from all postcodes to the St James’ Centre:

# connect postcodes to streets as layer 2
v.net --overwrite input=roads points=postcodes output=roads_net1 operation=connect thresh=400 arc_layer=1 node_layer=2
# connect st james to streets as layer 3
v.net --overwrite input=roads_net1 points=stjames output=roads_net2 operation=connect thresh=400 arc_layer=1 node_layer=3
# shortest paths from postcodes (points in layer 2) to st james (points in layer 3)
v.net.distance in=roads_net2 out=pc_2_stjames flayer=2 to_layer=3
# Join postcode and distance tables
v.db.join map=postcodes column=cat other_table=pc_2_stjames other_column=cat
# Make a km column
v.db.addcolumn map=postcodes columns="dist_km double precision"
v.db.update map=postcodes column=dist_km qcol="dist/1000"
# Write to shp (for mapping in QGIS)
v.out.ogr -s input=postcodes output=pc_2_stjames format=ESRI_Shapefile output_layer=pc_2_station

We can see the results of this below:

Shortest road distance between each EH postcode and the St James’ Centre.

We can also zoom this map on the area inside the bypass and adjust the colouring to give us more resolution for shorter distances.

Shortest road distance between each EH postcode and the St James’ Centre, zoomed on the area within the Edinburgh bypass.

Now we can use R to investigate how many postcodes are within these distances. R can talk directly to GRASS, or we can use the exported shp file.
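As an aside, the direct route might look roughly like the sketch below, using the rgrass7 package; this assumes R is started from within a GRASS session whose mapset contains the postcodes vector with the joined dist_km column (the rest of this post uses the exported shapefile instead).

# Sketch only: read the GRASS vector directly into R instead of exporting a shapefile
# (assumes an active GRASS session; not the route used in the rest of this post)
library(rgrass7)
postcodes_direct <- readVECT("postcodes")
hist(postcodes_direct$dist_km, breaks = 50)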

library(rgdal)

# Read shp file
# Note OGR wants the full path to the file, you can't use '~'
postcodes = readOGR("/home/me/Projects/network/data/pc_2_stjames.shp")

hist(postcodes$dist_km, breaks=50)

# Rank the distances
x = sort(postcodes$dist_km)

png("~/Cloud/Michael/Projects/network/figures/stjames_postcode-distance.png",
    height=600, width=800)
par(cex=1.5)
plot(x, type="l",
     main="EH postcode: shortest road distances to St James centre",
     xlab="Number of postcodes", ylab="Distance (km)", lwd=3)
points(max(which(x<2)), 2, pch=19, cex=2, col="purple4")
points(max(which(x<5)), 5, pch=19, cex=2, col="darkorange")
legend("topleft",
       c(paste(max(which(x<2)), "postcodes within 2 km (walking distance)"),
         paste(max(which(x<5)), "postcodes within 5 km (cycling distance)")),
       col=c("purple4", "darkorange"), pch=19, pt.cex=2)
dev.off()

# Get percentages
round(100 * max(which(x<2)) / length(x))
round(100 * (max(which(x<5)) - max(which(x<2))) / length(x))

Ranked distance between EH postcodes and the St James centre, with markers for walking and cycling distance.

As we can see, 1868 postcodes are within walking distance of the St James’ Centre and 7044 within cycling distance. If we

  1. Ignore the other shopping centres around Edinburgh,
  2. Assume that population density is consistent between postcodes,
  3. Take the complete EH postcode locations as the catchment area,

then 8% of potential customers are within walking distance (under 2 km), a further 22% are within cycling distance (2-5 km) and the remaining 70% would need to travel by other means. I wonder if there will be 330 cycle parking spaces at the St James centre? That would be proportional to the number of car parking spaces. (This assumes that those beyond cycling distance need to drive, but much of the population will be near a bus stop.)

Additional work:

  • Create an allocation network for all Edinburgh shopping centres with car parks. What is St James’ catchment from that?
  • Use the census data to compare car ownership with the St James’ allocation network. Car ownership is probably quite low in the city centre!
  • Consider travel time, not just distance.
  • How many of those outside walking/cycling distance are near a bus stop?


It’s tibbletime v0.0.2: Time-Aware Tibbles, New Functions, Weather Analysis and More

Sun, 10/08/2017 - 02:00

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

Today we are introducing tibbletime v0.0.2, and we’ve got a ton of new features in store for you. We have functions for converting to flexible time periods with the ~period formula~ and making/calculating custom rolling functions with rollify() (plus a bunch more new functionality!). We’ll take the new functionality for a spin with some weather data (from the weatherData package). However, the new tools make tibbletime useful in a number of broad applications such as forecasting, financial analysis, business analysis and more! We truly view tibbletime as the next phase of time series analysis in the tidyverse. If you like what we do, please connect with us on social media to stay up on the latest Business Science news, events and information!

Introduction

We are excited to announce the release of tibbletime v0.0.2 on CRAN. Loads of new
functionality has been added, including:

  • Generic period support: Perform time-based calculations by a number
    of supported periods using a new ~period formula~.

  • Creating series: Use create_series() to quickly create a tbl_time
    object initialized with a regular time series.

  • Rolling calculations: Turn any function into a rolling version of itself with
    rollify().

  • A number of smaller tweaks and helper functions to make life easier.

As we further develop tibbletime, it is becoming clearer that the package
is a tool that should be used in addition to the rest of the tidyverse.
The combination of the two makes time series analysis in the tidyverse much easier to do!

In this post

Today we will take a look at weather data for New York and San
Francisco from 2013. It will be an exploratory analysis
to show off some of the new features in tibbletime, but the package
itself has much broader application. As we will see, tibbletime’s time-based
functionality can be a valuable data manipulation tool to help with:

  • Product and sales forecasting

  • Financial analysis with custom rolling functions

  • Grouping data into time buckets to analyze change over time, which is great for any part of an organization including sales, marketing, manufacturing, and HR!

Data and packages

The datasets used are from a neat package called weatherData. While weatherData has functionality to pull weather data for a number of cities, we will use the built-in datasets. We encourage you to explore the weatherData API if you’re interested in collecting weather data.

To get started, load the following packages:

  • tibbletime: Time-aware tibbles for the tidyverse
  • tidyverse: Loads packages including dplyr, tidyr, purrr, and ggplot
  • corrr: Tidy correlations
  • weatherData: Slick package for getting weather data

Also, load the datasets from weatherData, “NewYork2013” and “SFO2013”.

# Load libraries
library(tibbletime) # Make sure you have 0.0.2 from CRAN!
library(tidyverse)
library(corrr)
library(weatherData)

# Load weatherData datasets
NYC <- NewYork2013
SFO <- SFO2013

Combine and convert

To tidy up, we first join our data sets together using bind_rows(). Passing
a named list of tibbles along with specifying the .id argument allows
bind_rows() to create a new City reference column for us.

# Tidying up the weather data
weather <- bind_rows(list(NYC = NYC, SFO = SFO), .id = "City") %>%
  as_tibble()
weather
## # A tibble: 19,706 x 3
##    City                Time Temperature
##
##  1  NYC 2013-01-01 00:51:00        41.0
##  2  NYC 2013-01-01 01:51:00        39.9
##  3  NYC 2013-01-01 02:51:00        41.0
##  4  NYC 2013-01-01 03:51:00        41.0
##  5  NYC 2013-01-01 04:51:00        41.0
##  6  NYC 2013-01-01 05:51:00        39.9
##  7  NYC 2013-01-01 06:51:00        39.9
##  8  NYC 2013-01-01 07:51:00        39.9
##  9  NYC 2013-01-01 08:51:00        39.9
## 10  NYC 2013-01-01 09:51:00        39.9
## # ... with 19,696 more rows

Next, we will convert to tbl_time and group by our City variable. Note that we know this is a tbl_time object by Index: Time that gets printed along with the tibble.

# Convert to tbl_time class
weather <- weather %>%
  mutate(Time = as.POSIXct(Time)) %>%
  as_tbl_time(Time) %>%
  group_by(City)
weather
## # A time tibble: 19,706 x 3
## # Index: Time
## # Groups: City [2]
##    City                Time Temperature
##  *
##  1  NYC 2013-01-01 00:51:00        41.0
##  2  NYC 2013-01-01 01:51:00        39.9
##  3  NYC 2013-01-01 02:51:00        41.0
##  4  NYC 2013-01-01 03:51:00        41.0
##  5  NYC 2013-01-01 04:51:00        41.0
##  6  NYC 2013-01-01 05:51:00        39.9
##  7  NYC 2013-01-01 06:51:00        39.9
##  8  NYC 2013-01-01 07:51:00        39.9
##  9  NYC 2013-01-01 08:51:00        39.9
## 10  NYC 2013-01-01 09:51:00        39.9
## # ... with 19,696 more rows

Period conversion

The first new idea to introduce is the ~period formula~. This tells the tibbletime functions how you want to time-group your data. It is specified
as multiple ~ period, with examples being 1~d for “every 1 day,” and
4~m for “every 4 months.”

# Changing to 1 row every 2 days.
# The minimum date per interval is selected by default
as_period(weather, 2~d)
## # A time tibble: 366 x 3
## # Index: Time
## # Groups: City [2]
##    City                Time Temperature
##  *
##  1  NYC 2013-01-01 00:51:00        41.0
##  2  NYC 2013-01-03 00:51:00        30.0
##  3  NYC 2013-01-05 00:51:00        36.0
##  4  NYC 2013-01-07 00:51:00        42.1
##  5  NYC 2013-01-09 00:51:00        39.2
##  6  NYC 2013-01-11 00:51:00        39.0
##  7  NYC 2013-01-13 00:46:00        42.8
##  8  NYC 2013-01-15 00:51:00        39.0
##  9  NYC 2013-01-17 00:51:00        39.0
## 10  NYC 2013-01-19 00:51:00        30.9
## # ... with 356 more rows

In our original data, it looks like weather is an hourly dataset, with each new
data point coming in on the 51st minute of the hour for NYC and the 56th minute
for SFO. Unfortunately, a number of points don’t follow this. Check out the following rows:

# Problem: Some timestamp points don't follow hourly pattern
slice(weather, 12:14)
## # A time tibble: 6 x 3
## # Index: Time
## # Groups: City [2]
##   City                Time Temperature
##  *
## 1  NYC 2013-01-01 11:51:00        39.9
## 2  NYC 2013-01-01 12:18:00        37.4
## 3  NYC 2013-01-01 12:51:00        37.9
## 4  SFO 2013-01-01 08:56:00        45.0
## 5  SFO 2013-01-01 09:56:00        46.9
## 6  SFO 2013-01-01 10:56:00        46.0

What we want is 1 row per hour, and in this case we get two rows for NYC hour 12.
We can use as_period() to ensure that we only have 1 row for each hour.

# Get 1 row per hour with as_period()
weather <- as_period(weather, 1~h)
slice(weather, 12:14)
## # A time tibble: 6 x 3
## # Index: Time
## # Groups: City [2]
##   City                Time Temperature
##  *
## 1  NYC 2013-01-01 11:51:00        39.9
## 2  NYC 2013-01-01 12:18:00        37.4
## 3  NYC 2013-01-01 13:51:00        37.9
## 4  SFO 2013-01-01 11:56:00        48.9
## 5  SFO 2013-01-01 12:56:00        51.1
## 6  SFO 2013-01-01 13:56:00        52.0

Now that we have our data in an hourly format, we probably don’t care about
which minute it came in on. We can floor the date column using a helper function,
time_floor(). Credit to Hadley Wickham because this is essentially a convenient
wrapper around lubridate::floor_date(). Setting the period to 1~h floors
each row to the beginning of the last hour.
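For comparison, a rough equivalent using lubridate directly might look like the snippet below (a sketch only; time_floor() handles the tbl_time bookkeeping for you).

# Sketch: roughly what time_floor(weather, 1~h) does under the hood
library(lubridate)
weather_floored <- weather %>%
  mutate(Time = floor_date(Time, unit = "hour"))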

# Time floor: Shift timestamps to a time-based floor
weather <- time_floor(weather, 1~h)
weather
## # A time tibble: 17,489 x 3
## # Index: Time
## # Groups: City [2]
##    City                Time Temperature
##  *
##  1  NYC 2013-01-01 00:00:00        41.0
##  2  NYC 2013-01-01 01:00:00        39.9
##  3  NYC 2013-01-01 02:00:00        41.0
##  4  NYC 2013-01-01 03:00:00        41.0
##  5  NYC 2013-01-01 04:00:00        41.0
##  6  NYC 2013-01-01 05:00:00        39.9
##  7  NYC 2013-01-01 06:00:00        39.9
##  8  NYC 2013-01-01 07:00:00        39.9
##  9  NYC 2013-01-01 08:00:00        39.9
## 10  NYC 2013-01-01 09:00:00        39.9
## # ... with 17,479 more rows

Visualize the data

Now that we have cleaned up a bit, let’s visualize the data.

# Yikes: Hourly is a bit too much data for the chart
ggplot(weather, aes(x = Time, y = Temperature, color = City)) +
  geom_line() +
  theme_minimal()

Seems like hourly data is a bit overwhelming for this kind of chart. Let’s
convert to daily and try again.

# Convert to daily makes the plot much more readable
weather %>%
  as_period(1~d) %>%
  ggplot(aes(x = Time, y = Temperature, color = City)) +
  geom_line() +
  theme_minimal()

That’s better. It looks like NYC has a much wider range of temperatures than
SFO. Both seem to be hotter in summer months.

Period-based summaries

The dplyr::summarise() function is very useful for taking grouped summaries.
time_summarise() takes this a step further by allowing you to summarise by
period.

Below we take a look at the average and standard deviation of the temperatures
calculated at monthly and bimonthly intervals.

# Weather average by 1 month (monthly)
weather_avg <- weather %>%
  # Monthly average / sd
  time_summarise(1~m,
                 avg = mean(Temperature),
                 sd = sd(Temperature))
weather_avg
## # A time tibble: 24 x 4
## # Index: Time
## # Groups: City [2]
##    City                Time      avg       sd
##  *
##  1  NYC 2013-01-31 23:00:00 35.91238 9.855091
##  2  NYC 2013-02-28 23:00:00 34.28445 6.670289
##  3  NYC 2013-03-31 23:00:00 39.96095 5.977762
##  4  NYC 2013-04-30 23:00:00 52.08597 8.452899
##  5  NYC 2013-05-31 23:00:00 62.65565 9.884137
##  6  NYC 2013-06-30 23:00:00 73.25931 7.583734
##  7  NYC 2013-07-31 23:00:00 80.70498 7.268836
##  8  NYC 2013-08-31 23:00:00 75.01752 4.783213
##  9  NYC 2013-09-30 23:00:00 67.88597 8.102304
## 10  NYC 2013-10-31 23:00:00 60.51425 8.165931
## # ... with 14 more rows

# Weather average by 2 months (bimonthly)
weather_2m_avg <- weather %>%
  # Bimonthly average / sd
  time_summarise(2~m,
                 avg = mean(Temperature),
                 sd = sd(Temperature))
weather_2m_avg
## # A time tibble: 12 x 4
## # Index: Time
## # Groups: City [2]
##    City                Time      avg        sd
##  *
##  1  NYC 2013-02-28 23:00:00 35.14108  8.532226
##  2  NYC 2013-04-30 23:00:00 45.94041  9.491227
##  3  NYC 2013-06-30 23:00:00 67.87056 10.295737
##  4  NYC 2013-08-31 23:00:00 77.86316  6.777490
##  5  NYC 2013-10-31 23:00:00 64.13969  8.928570
##  6  NYC 2013-12-31 23:00:00 41.69274 10.711184
##  7  SFO 2013-02-28 23:00:00 49.26967  4.901310
##  8  SFO 2013-04-30 23:00:00 54.79945  6.072042
##  9  SFO 2013-06-30 23:00:00 59.95865  6.529238
## 10  SFO 2013-08-31 23:00:00 61.63802  5.163107
## 11  SFO 2013-10-31 23:00:00 61.38558  6.923694
## 12  SFO 2013-12-31 23:00:00 53.05468  6.346301

A closer look at July

July seemed to be one of the hottest months for NYC, so let’s take a closer look at it.

To grab just the July dates, use time_filter(). If you haven’t seen this before, a time formula is used to specify the dates to filter for. The one-sided formula below expands to include dates between 2013-07-01 00:00:00 and 2013-07-31 23:59:59.

july <- weather %>%
  time_filter(~2013-7)
july
## # A time tibble: 1,486 x 3
## # Index: Time
## # Groups: City [2]
##    City                Time Temperature
##  *
##  1  NYC 2013-07-01 00:00:00        73.4
##  2  NYC 2013-07-01 01:00:00        73.9
##  3  NYC 2013-07-01 02:00:00        73.4
##  4  NYC 2013-07-01 03:00:00        73.9
##  5  NYC 2013-07-01 04:00:00        73.9
##  6  NYC 2013-07-01 05:00:00        73.9
##  7  NYC 2013-07-01 06:00:00        75.9
##  8  NYC 2013-07-01 07:00:00        75.9
##  9  NYC 2013-07-01 08:00:00        75.2
## 10  NYC 2013-07-01 09:00:00        77.0
## # ... with 1,476 more rows

To visualize July’s weather, we will make a boxplot of the temperatures.
Specifically, we will slice July into intervals of 2 days, and create a series
of boxplots based on the data inside those intervals. To do this, we will
use time_collapse(), which collapses a column of dates into a column of the same
length, but where every row in a time interval shares the same date. You can use this resulting
column for further grouping or labeling operations.

# Every row where the date falls between
# (2013-07-01 00:00:00, 2013-07-02 23:59:59)
# shares the same date, and so on for the entire series
july_collapsed <- july %>%
  time_collapse(2~d)
july_collapsed
## # A time tibble: 1,486 x 3
## # Index: Time
## # Groups: City [2]
##    City                Time Temperature
##  *
##  1  NYC 2013-07-02 23:00:00        73.4
##  2  NYC 2013-07-02 23:00:00        73.9
##  3  NYC 2013-07-02 23:00:00        73.4
##  4  NYC 2013-07-02 23:00:00        73.9
##  5  NYC 2013-07-02 23:00:00        73.9
##  6  NYC 2013-07-02 23:00:00        73.9
##  7  NYC 2013-07-02 23:00:00        75.9
##  8  NYC 2013-07-02 23:00:00        75.9
##  9  NYC 2013-07-02 23:00:00        75.2
## 10  NYC 2013-07-02 23:00:00        77.0
## # ... with 1,476 more rows

Let’s visualize to see if we can gain any insights. Wow, San Fran maintained a constant cool average of 60 degrees in the hottest month
of the year!

# Plot Temperature in July
july_collapsed %>%
  ggplot(aes(x = reorder(format(Time, '%b-%d'), Time), y = Temperature, color = City)) +
  geom_boxplot() +
  labs(x = "", title = "Temperature in July, 2013") +
  theme_minimal()

Period and rolling correlations

Finally, we will look at the correlation of temperatures in our two cities in a few different ways.

First, let’s look at the overall correlation. The corrr package provides a nice way to accomplish this with data frames.

weather %>%
  spread(key = City, value = Temperature) %>%
  select(NYC, SFO) %>%
  corrr::correlate()
## # A tibble: 2 x 3
##   rowname       NYC       SFO
##
## 1     NYC        NA 0.6510299
## 2     SFO 0.6510299        NA

Next, let’s look at monthly correlations. The general idea will be
to nest each month into its own data frame, apply correlate() to each
nested data frame, and then unnest the results. We will use time_nest() to easily perform the monthly nesting.

monthly_nest <- weather %>%
  spread(key = City, value = Temperature) %>%
  # Nest by month, don't add the original dates to the inner tibbles
  time_nest(1~m, keep_inner_dates = FALSE)
monthly_nest
## # A time tibble: 12 x 2
## # Index: Time
##                   Time data
##  *
##  1 2013-01-31 23:00:00
##  2 2013-02-28 23:00:00
##  3 2013-03-31 23:00:00
##  4 2013-04-30 23:00:00
##  5 2013-05-31 23:00:00
##  6 2013-06-30 23:00:00
##  7 2013-07-31 23:00:00
##  8 2013-08-31 23:00:00
##  9 2013-09-30 23:00:00
## 10 2013-10-31 23:00:00
## 11 2013-11-30 23:00:00
## 12 2013-12-31 23:00:00

For each month, calculate the correlation tibble and then focus() on the NYC column. Then unnest and floor the results.

monthly_nest %>%
  mutate(monthly_cor = map(data,
                           ~corrr::correlate(.x) %>% corrr::focus(NYC))) %>%
  unnest(monthly_cor) %>%
  time_floor(1~d)
## # A time tibble: 12 x 4
## # Index: Time
##          Time data rowname         NYC
##  *
##  1 2013-01-31          SFO -0.10281153
##  2 2013-02-28          SFO  0.38288119
##  3 2013-03-31          SFO  0.52432022
##  4 2013-04-30          SFO  0.34258085
##  5 2013-05-31          SFO  0.07814153
##  6 2013-06-30          SFO  0.52024900
##  7 2013-07-31          SFO  0.29163801
##  8 2013-08-31          SFO  0.45479643
##  9 2013-09-30          SFO  0.48056194
## 10 2013-10-31          SFO  0.59429495
## 11 2013-11-30          SFO  0.35513490
## 12 2013-12-31          SFO  0.17559596

It seems that summer and fall months tend to have higher correlation than colder months.

And finally we will calculate the rolling correlation of NYC and SFO temperatures. The “width” of our roll will be monthly, or 360 hours since we are in hourly format.

There are a number of ways to do this, but for this example
we introduce rollify(), which takes any function that you give it and creates a rolling version of it. The first argument to rollify() is the function that you want to modify, and it is passed to rollify() in the same way that you would pass a function to purrr::map(). The second argument is the window size. Call the rolling function just as you would call a non-rolling version
of cor() from inside mutate().

# Rolling custom functions with rollify()
rolling_cor <- rollify(~cor(.x, .y, use = "pairwise.complete.obs"), window = 360)

weather_rol_cor <- weather %>%
  spread(key = City, value = Temperature) %>%
  # Mutate with a rolling function!
  mutate(rolling_cor = rolling_cor(NYC, SFO))

# Plot it!
ggplot(weather_rol_cor, aes(x = Time, y = rolling_cor)) +
  geom_line() +
  labs(x = "Date", y = "Rolling correlation",
       title = "1 month rolling correlation of NYC and SFO temperatures") +
  theme_minimal()

It looks like the correlation is definitely not stable throughout the year,
so that initial correlation value of .65 definitely has to be taken
with a grain of salt!

Rolling Functions: Pros/Cons and Recommendations

There are a number of ways to do rolling functions, and our recommendation depends on your needs. If you are interested in:

  • Flexibility: Use rollify(). You can literally turn any function into a “tidy” rolling function. Think everything from rolling statistics to rolling regressions. Whatever you can dream up, it can do. The speed is fast, but not quite as fast as other Rcpp based libraries.

  • Performance: Use the roll package, which uses RcppParallel as its backend, making it the fastest option available. The only downside is flexibility, since you cannot create custom rolling functions and are bound to those available (a rough sketch of this option follows below).
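For reference, here is only a rough sketch of the roll-based alternative for the rolling correlation computed above; it assumes roll::roll_cor(x, y, width) as the interface and that the result can be coerced to a plain vector, so check the package documentation before relying on it.

# Sketch only: the 360-hour rolling correlation via the roll package
# (assumed interface: roll_cor(x, y, width); coercion may need adjusting)
library(roll)

weather_rol_cor_roll <- weather %>%
  spread(key = City, value = Temperature) %>%
  mutate(rolling_cor = as.numeric(roll_cor(matrix(NYC), matrix(SFO), width = 360)))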

Wrapping up

We’ve touched on a few of the new features in tibbletime v0.0.2. Notably:

  • rollify() for rolling functions

  • as_period() with generic periods

  • time_collapse() for collapsing date columns

A full change log can be found in the NEWS file on Github or CRAN.

We are always open to new ideas and encourage you to submit an issue on our
Github repo here.

Have fun with tibbletime!

Final thoughts

Mind you this is only v0.0.2. We have a lot of work to do, but we couldn’t
wait any longer to share this. Feel free to kick the tires on tibbletime, and let us know your thoughts. Please submit any comments, issues or bug reports to us on GitHub here. Enjoy!

About Business Science

Business Science takes the headache out of data science. We specialize in applying machine learning and data science in business applications. We help businesses that seek to build out this capability but may not have the resources currently to implement predictive analytics. Business Science works with clients as diverse as startups to Fortune 500 and seeks to guide organizations in expanding predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!

Connect with Business Science

Connect, communicate and collaborate with us! The easiest way to do so is via social media. Connect with us out on:


NFL Series

Sun, 10/08/2017 - 02:00

(This article was first published on StatOfMind, and kindly contributed to R-bloggers)

If you have previously attempted to analyze NFL data, it is likely that you have tried to scrape ESPN or football-reference, which provide a wealth of statistics surrounding game data. However, if you ever wanted to obtain truly in-depth data, then it is likely that you found yourself leveraging the API maintained by NFL.com. Unfortunately, its data is surfaced in a JSON format that leaves a lot to be desired (i.e. it’s a nightmare). Lucky for me, I was recently scrolling through my Twitter feed and came across an interesting mention of a new R package called nflscrapR. After some exploration, I came across this quote from the author of the package:

NFL fans and sports enthusiastic alike, I would like to introduce the nflscrapR package, an American football data aggregator that will scrape, clean, and parse play-by-play data across games, seasons, and careers.

The nflscrapR package essentially surfaces all play-by-play data for the last 7 seasons, and this has motivated me to start a deep dive into NFL data. I plan to cover two main topics, one that focuses on players at specific positions, and another focusing on team dynamics and patterns. To kick this series off, we will begin with an exploration of the performance of the NFL’s best running backs.

1. Prerequisites

In order to reproduce the figures below, you will need to have R (v3.2.3 or above) installed on your machine. There are also a couple of additional libraries that will be required. Details on how to install those are shown in the commands below.

# install.packages('devtools')
library(devtools)
#> Skipping install for github remote, the SHA1 (05815ef8) has not changed since last install.
#> Use `force = TRUE` to force installation
devtools::install_github(repo = "maksimhorowitz/nflscrapR", force=TRUE)

2. Collecting the Data

With the nflscrapR library now installed, you are now ready to collect play-by-play data for the 2016-2017 NFL season. Start by loading the library and collect the data using the command below:

# Load the packages
library(nflscrapR)
library(ggplot2)
library(dplyr)
library(pheatmap)
# The code below also uses these packages (not loaded in the original post)
library(ggjoy)        # geom_joy() / theme_joy()
library(reshape2)     # dcast()
library(RColorBrewer) # brewer.pal()
library(htmlwidgets)  # saveWidget()

# Collect data for 2016 NFL season
pbp_2016 <- season_play_by_play(2016)

Overall the pbp_2016 dataframe contains 100 variables for 45,737 plays, but for the purpose of this post, we will be focusing exclusively on fields related to running backs (in future posts, we will explore data relevant to other positions on the football field). In addition, we’ll focus primarily on frequently used running backs, which we empirically define as any player with at least 200 rushes over the course of the 2016-2017 season.

# Get all players with at least 200 rushes during the season
min_rush_cnt <- 200
rush_cnt <- pbp_2016 %>%
  filter(PlayType == 'Run') %>%
  group_by(Rusher) %>%
  summarise(rush_cnt = n(),
            total_yards = sum(Yards.Gained),
            mean_yards = round(mean(Yards.Gained), 2)) %>%
  filter(rush_cnt >= min_rush_cnt) %>%
  arrange(desc(rush_cnt))

# Get all rushing data for eligible players
rushing_stats <- pbp_2016 %>%
  filter(PlayType == 'Run' & Rusher %in% rush_cnt$Rusher & Yards.Gained <= 50) %>%
  filter(down != 4 & !is.na(down)) %>%
  filter(!is.na(RunLocation))

Altogether, we find that a total of 19 players rushed at least 200 times during the 2016-2017 season. A short summary of their performance is shown below.



3. Who are the most consistent and productive running backs?

When talking about the overall performance of running backs, it is common for people to report the total number of yards that were rushed for, or the average yards per run. While these are perfectly acceptable numbers to share, I’ve always felt like they did not tell the whole story. For example, a player could have a high average yards per run, only for us to realize that he actually often loses yards on a run but makes up for it with a few very long runs. Therefore, I started by looking at the overall distribution of number of yards gained/lost for each play, with the hope that this would reveal whether some players were more consistent on a play-by-play basis than others. We can use the ggplot2 library to generate a density plot of yards gained per play for each of our eligible players:

# Compare distribution of rushes for eligible players
ggplot(rushing_stats, aes(x = Yards.Gained, y = Rusher, fill = Rusher)) +
  geom_joy(scale = 3) +
  theme_joy() +
  scale_fill_manual(values = rep(c('gray', 'lightblue'), length(rushing_stats$Rusher) / 2)) +
  scale_y_discrete(expand = c(0.01, 0)) +
  scale_x_continuous(expand = c(0, 0)) +
  theme(legend.position = "none") +
  labs(x = "Yards gained per play", y = "")

Overall, we see that most running backs have a similar distribution of yards gained by run. However, we can see that LeSean McCoy (7th distribution from the top) has a much “flatter” distribution (i.e. more variance in the distribution of yards gained per run), meaning his performance can be a lot more variable/unpredictable for any given run.

4. When are running backs used?

Another statement that is commonly made is that running backs are primarily used on early downs. To verify whether this is generally true, we can compute the total number of runs that each player made across different downs, and go even further by breaking this down by quarter too. The code chunk below counts the number of runs that each running back made for each combination of down and quarter.

# Compare when rushers are used usage_stats <- pbp_2016 %>% filter(!is.na(down) & Rusher %in% rush_cnt$Rusher & qtr!=5) %>% group_by(Rusher, down, qtr) %>% summarise(cnt = n()) %>% mutate(qtr_down = paste("Q", qtr, "- Down: ", down, sep=""))

We can then leverage the d3heatmap package to quickly generate a simple heatmap of how often running backs are used during specific downs and quarters.

library(d3heatmap)

# Pivot dataframe
usage <- usage_stats %>%
  dcast(Rusher ~ qtr_down, value.var = "cnt")

# Clean data
row.names(usage) <- usage$Rusher
usage <- usage %>% select(-Rusher)
usage[is.na(usage)] <- 0

# Normalize data
usage_norm <- t(apply(usage, 1, function(x) x / sum(x)))

# Plot heatmap of proportions of rushes by different field locations and gaps
p <- d3heatmap(usage_norm, colors = "Blues", Colv = FALSE, show_grid = 3)
saveWidget(p, file = "rusher_usage_down_quarter.html")

Proportion of rushes per quarter and downs for NFLs best running backs


In the plot above, we are essentially plotting the usage of each running back as a function of what stage of the game we are in. As we can see, it is abundantly clear that running backs are primarily used on the first two downs, and rarely during the third and fourth downs. Overall, there do not appear to be significant differences in how players are used. However, this does not answer whether some running backs perform better on certain downs, which is what we will address now.

5. Are some running backs better on certain downs?

Another question we can ask ourselves is whether some running backs perform better on later downs. To visualize this data, we can again generate a density plot of yards gained per play for each of our eligible players, while also facetting the data by downs.

# Compare distribution of rushes by downs
ggplot(rushing_stats, aes(x = Yards.Gained, y = down)) +
  geom_joy(scale = 1, rel_min_height = .03, fill = 'black') +
  scale_y_discrete(expand = c(0.01, 0)) +
  xlab('Value') +
  facet_wrap(~Rusher, scales = 'free', ncol = 3) +
  theme_joy() +
  theme(axis.title.y = element_blank()) +
  labs(x = "Yards gained per play", y = "Down")

Again, we do not see any striking differences between players and the distribution of yards gained by down. However, it is interesting to note that most “long runs” (10 yards or above) tend to occur on the first down. When we look closely, we can also see that some rushers such as DeMarco Murray do exhibit visual differences between yards gained by downs. In this case, the “mass” of yards gained on the third down is much closer to zero than when compared to the “mass” for the first and second downs, which suggests that he may struggle during this down (this could be attributable to many factors: stamina, weaker offensive line on 3rd downs, etc…)

6. Where do the best running backs run?

It is fairly well accepted that the performance of a running back will be heavily influenced by the strength of the offensive line in front of them. With that in mind, let’s start by looking at the field location in which different running backs prefer to run. The plot below shows the number of yards gained by each running back based on which side of the field they ran towards (left, middle or right).

ggplot(data = rushing_stats, aes(x = RunLocation, y = Yards.Gained, color = RunLocation)) +
  geom_jitter(position = position_jitter(0.2)) +
  stat_summary(fun.data = mean_sdl, mult = 1, geom = "pointrange", color = "black") +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal() +
  facet_wrap(~Rusher)

We can take this further by looking at the field location in which different running backs prefer to run. This can be achieved by generating a matrix that contains the proportion of rushes by field location for each player.

# Get proportions of rushes on different field locations
rush_locations <- rushing_stats %>%
  filter(PlayType == 'Run') %>%
  filter(!is.na(RunLocation)) %>%
  group_by(Rusher, RunLocation) %>%
  summarise(rush_cnt = n()) %>%
  mutate(freq = rush_cnt / sum(rush_cnt))

loc_mat <- rush_locations %>%
  dcast(Rusher ~ RunLocation, value.var = "freq")
row.names(loc_mat) <- loc_mat$Rusher
loc_mat <- loc_mat %>% select(-Rusher)

The content of the loc_mat matrix contains the preferred rush locations of each running back, and can be plotted as a clustered heatmaps using the pheatmap library in R.

# Plot heatmap of proportions of rushes by different field locations
pheatmap(loc_mat, border = "white", color = brewer.pal(9, "Blues"), cluster_cols = FALSE)

The plot above highlights which running backs are most similar in their run locations. Rushers such as J. Ajayi, E. Elliott and M. Ingram clearly avoid running in the middle of the field. On the flip side, rushers such as M. Gordon, D. Johnson, S. Ware and F. Gore clearly prefer running down the middle rather than to the sides. These patterns could be attributed to the running style of each rusher (speed, mobility, strength, etc.), but also to the strength of the offensive line at particular positions.

7. Who creates the gaps for the running backs?

We can also explore the number of yards gained by each running back based on the offensive line positions that created space for them.

ggplot(data = rushing_stats %>% filter(!is.na(RunGap)),
       aes(x = RunGap, y = Yards.Gained, color = RunGap)) +
  geom_jitter(position = position_jitter(0.2)) +
  stat_summary(fun.data = mean_sdl, mult = 1, geom = "pointrange", color = "black") +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal() +
  facet_wrap(~Rusher)

The proportions of run opportunities that was enabled by each offensive line position can also be summarized in a matrix using the command below.

# Get proportions of gaps created by different offensive line positions
rush_gaps <- rushing_stats %>%
  filter(!is.na(RunGap)) %>%
  group_by(Rusher, RunGap) %>%
  summarise(rush_cnt = n()) %>%
  mutate(freq = rush_cnt / sum(rush_cnt))

gap_mat <- rush_gaps %>%
  dcast(Rusher ~ RunGap, value.var = "freq")
row.names(gap_mat) <- gap_mat$Rusher
gap_mat <- gap_mat %>% select(-Rusher)

# Plot heatmap of proportions of rushes by different field gaps
pheatmap(gap_mat, border = "white", color = brewer.pal(9, "Blues"), cluster_cols = FALSE)

Again, we see many differences among the league’s top running backs. Unsurprisingly, a number of players have the most run opportunities created by the guard position, but a few players such as F. Gore, L. McCoy and D. Johnson run in gaps created by the tackle position. Finally, S. Ware from the Kansas City Chiefs often runs through gaps created by tight ends, which may be more representative of the team’s formation.

Conclusion

In this introductory post, we have explored the performance and patterns of some of the NFL’s best running backs. Overall, it was a fairly superficial analysis, as it never considered interactions with other components of the team, or temporal patterns, but it does show the depth and power of this data. In the next part of the series, I will dive into the performance and behavior of wide receivers, so stay tuned!


Linear / Logistic Regression in R: Dealing With Unknown Factor Levels in Test Data

Sat, 10/07/2017 - 21:11

(This article was first published on R – Discovering Python & R, and kindly contributed to R-bloggers)

Let’s say you have data containing a categorical variable with 50 levels. When you divide the data into train and test sets, chances are you don’t have all 50 levels featuring in your training set.

This often happens when you divide the data set into train and test sets according to the distribution of the outcome variable. In doing so, chances are that our explanatory categorical variable might not be distributed exactly the same way in the train and test sets – so much so that certain levels of this categorical variable are missing from the training set. The more levels a categorical variable has, the harder it is for that variable to be similarly represented when splitting the data.
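For concreteness, this kind of split often comes from an outcome-stratified partition like the sketch below (purely illustrative; full_data, outcome, train_set and test_set are placeholder names, not objects from this post):

# Illustrative sketch of an outcome-stratified train/test split
# (placeholder names; not from the original post)
library(caret)
set.seed(1)
idx <- createDataPartition(full_data$outcome, p = 0.7, list = FALSE)
train_set <- full_data[idx, ]
test_set <- full_data[-idx, ]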

Take for instance this example data set (train.csv + test.csv) which contains a categorical variable var_b that takes 349 unique levels. Our train data has 334 of these levels – on which the model is built – and hence 15 levels are excluded from our trained model. If you try making predictions on the test set with this model in R, it throws an error:
factor var_b has new levels 16060, 17300, 17980, 19060, 21420, 21820,
25220, 29340, 30300, 33260, 34100, 38340, 39660, 44300, 45460

If you've used R to fit models from the generalized linear class, such as linear, logit or probit models, then chances are you've come across this problem – especially when you're validating your trained model on test data.

The workaround to this problem comes in the form of a function, remove_missing_levels, that I found here, written by pat-s. You need the magrittr library installed, and it works only on lm, glm and glmmPQL objects.

Once you've sourced the above function in R, you can seamlessly proceed with using your trained model to make predictions on the test set. The code below demonstrates this for the data set shared above. You can find the code in one of my github repos and try it out yourself.
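If you just want the gist of the workaround, here is a minimal sketch of the idea (a hypothetical helper, not pat-s's implementation): coerce each factor in the test set to the levels seen during training and drop the rows carrying unseen levels, so that predict() no longer complains.

# Hypothetical sketch of the idea behind the workaround (not the original function):
# keep only test rows whose factor levels were seen during model training.
drop_new_levels <- function(fit, test_data) {
  # factor variables that entered the model frame
  factor_vars <- names(Filter(is.factor, fit$model))
  for (v in factor_vars) {
    if (!v %in% names(test_data)) next
    known <- levels(fit$model[[v]])
    # unseen levels become NA, and those rows are dropped
    test_data[[v]] <- factor(test_data[[v]], levels = known)
    test_data <- test_data[!is.na(test_data[[v]]), , drop = FALSE]
  }
  test_data
}

# Usage sketch (object names are illustrative):
# fit   <- lm(y ~ var_b, data = train)
# preds <- predict(fit, newdata = drop_new_levels(fit, test))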


To leave a comment for the author, please follow the link and comment on their blog: R – Discovering Python & R.

Data Visualization Course for First-Year Students

Sat, 10/07/2017 - 17:26

(This article was first published on R Project – Citizen-Statistician, and kindly contributed to R-bloggers)

A little over a year ago, we decided to propose a data visualization course at the first-year level. We had been thinking about this for a while, but never had the time to teach it given the scheduling constraints we had. When one of the other departments on campus was shut down and the faculty merged in with other departments, we felt that the time was ripe to make this proposal.

Course description of the EPsy 1261 data visualization course

In putting together the proposal, we knew that:

  • The course would be primarily composed of social science students. My department, Educational Psychology, attracts students from the College of Education and Human Development (e.g., Child Psychology, Social Work, Family Social Science).
  • To attract students, it would be helpful if the course would fulfill the University’s Liberal Education (LE) requirement for Mathematical Thinking.

This led to several challenges and long discussions about the curriculum for this course. For example:

  • Should the class focus on producing data visualizations (very exciting for the students) or on understanding/interpreting existing visualizations (useful for most social science students)?
  • If we were going to produce data visualizations, which software tool would we use? Could this level of student handle R?
  • In order to meet the LE requirement, the curriculum for the course would need to show a rigorous treatment of students actually “doing” mathematics. How could we do this?
  • Which types of visualizations would we include in the course?
  • Would we use a textbook? How might this inform the content of the course?
Software and Content

After several conversations among the teaching team, with stakeholder departments, and with colleagues teaching data visualization courses at other universities, we eventually proposed that the course:

  • Focus both on students’ being able to read and understand existing visualizations and produce a subset of these visualizations, and
  • Use R (primary tool) and RAWGraphs for the production of these plots.
Software: Use ggplot2 in R

The choice to use R was not an immediate one. We initially looked at using Tableau, but the default choices made by the software (e.g., to immediately plot summaries rather than raw data) and the cost for students after matriculating from the course eventually sealed its fate (we don’t use it). We contemplated using Excel for a minute (gasp!), but we vetoed that even quicker than Tableau. The RAWGraphs website, we felt, held a lot of promise as a software tool for the course. It had an intuitive drag-and-drop interface, and could be used to create many of the plots we wanted students to produce. Unfortunately, we were not able to get the bar graph widget to produce side-by-side bar plots easily (actually at all). The other drawback was that the drag-and-drop interactions made it a harder sell to the LE committee as a method of building students’ computational and mathematical thinking if we used it as the primary tool.

Once we settled on using R, we had to decide between using the suite of base plots, or ggplot2 (lattice was not in the running). We decided that ggplot made the most sense in terms of thinking about extensibility. Its syntax was based on a theoretical foundation for creating and thinking about plots, which also made it a natural choice for a data visualization course. The idea of mapping variables to aesthetics was also consistent with the language used in RAWGraphs, so it helped reinforce core ideas across the tools. Lastly, we felt that using the ggplot syntax would also help students transition to other tools (such as ggvis or plotly) more easily.
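For readers who have not seen the grammar-of-graphics idea in action, here is a tiny illustration of mapping variables to aesthetics in ggplot2 (using the built-in mtcars data rather than any course data):

library(ggplot2)
# Map weight to x, fuel economy to y, and number of cylinders to colour
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()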

One thing that the teaching team completely agreed on (and was mentioned by almost everyone who we talked to who taught data visualization) was that we wanted students to be producing graphs very early in the course, giving them a sense of power and reinforcing that they could be successful. We felt this might be difficult for students with the ggplot syntax. To ameliorate this, we wrote a course-specific R package (epsy1261; available on github) that allows students to create a few simple plots interactively by employing functionality from the manipulate package. (We could have also done this via Shiny, but I am not as well-versed in Shiny and only had a few hours to devote to this over the summer given other responsibilities.)

Interactive creation of the bar chart using the epsy1261 package. This allows students to input minimal syntax, barchart(data), and then use interaction to create plots.

Course Content

We decided on a three-pronged approach to the course content. The first prong would be based on the production of common statistical plots: bar charts, scatterplots, and maps, and some variations of these (e.g., donut plots, treemaps, bubble charts). The second prong was focused on reading more complex plots (e.g., networks, alluvial plots), but not producing them, except maybe by hand. The third prong was a group project. This would give students a chance to use what they had learned, and also, perhaps, access other plots we had not covered. In addition, we wanted students to consider narrative in the presentation of these plots—to tell a data-driven story.

Along with this, we had hoped to introduce students to computational skills such as data summarization, tidying, and joining data sets. We also wanted to introduce concepts such as smoothing (especially for helping describe trend in scatterplots), color choice, and projection and coordinate systems (in maps). Other things we thought about were using R Markdown and data scraping.

Reality

The reality, as we are finding now that we are over a third of the way through the course, is that this amount of content was over-ambitious. We grossly under-estimated the amount of practice time these students would need, especially working with R. Two things play a role in this:

  1. The course attracted way more students than we expected for the first offering (our class size is 44) and there is a lot of heterogeneity of students’ experiences and academic background. For example, we have graduate students from the School of Design, some first years, and mostly sophomores and juniors. We also have a variety of majors including design, the social sciences, and computer science.
  2. We hypothesize that students are not practicing much outside of class. This means they are essentially only using R twice a week for 75 minutes when they are in class. This amount of practice is too infrequent for students to really learn the syntax.

Most of the students’ computational experiences are minimal prior to taking this course. They are very competent at using point-and-click software (e.g., Google Docs), but have an abundance of trouble when forced to use syntax. The precision of case-sensitivity, commas, and parentheses is outside their wheelhouse.

I would go so far as to say that several of these students are intimidated by the computation, and completely panic on facing an error message. This has led to us having to really think through and spend time discussing computational workflows and dealing with how to “de-bug” syntax to find errors. All of this has added more time than we anticipated on the actual computing. (While this may add time, it is still educationally useful for these students.)

The teaching team meets weekly for 90 minutes to discuss and reflect on what happened in the course. We also plan what will happen in the upcoming week based on what we observed and what we see in students’ homework. As of now, we clearly see that students need more practice, and we have begun giving students the end result of a plot and asking them to re-create these.

I am still hoping to get to scatterplots and maps in the course. However, some of the other computational ideas (scraping, joining) may have to be relegated to conceptual ideas in a reading. We are also considering scrapping the project, at least for this semester. At the very least, we will change it to a more structured set of plots they need to produce rather than letting them choose the data sets, etc. Live and learn. Next time we offer the course it will be better.

*Technology note: RAWGraphs can be adapted by designing additional chart types, so in theory, if one had time, we could write our own version to be more compatible with the course. We are also considering using the ggplotgui package, which is a Shiny dashboard for creating ggplot plots.

 

 

To leave a comment for the author, please follow the link and comment on their blog: R Project – Citizen-Statistician.

From Static to Numeric to Single-Choice Exercises in R/exams

Sat, 10/07/2017 - 00:00

(This article was first published on R/exams, and kindly contributed to R-bloggers)

Idea

Our colleagues over at the economics department became interested in using R/exams for generating large-scale exams in their introductory economics courses. However, they face the challenge that so far they had been writing static exercises and modifying them by hand whenever they wanted to reuse them in a different exam in another semester. To let R/exams do this job, it is illustrated below how a static arithmetic exercise can be turned into a dynamic exercise template, either in num format with a numeric solution or in schoice format with a single-choice solution.
The idea for the exercise is a very basic price elasticity of demand task:

Consider the following inverse demand function:

$$p(x) = a - b \cdot x$$

for the price p given the demanded quantity x. What is the price elasticity of demand at a price of p?
The natural candidates for “parameterizing” this exercise are the price p and the parameters a and b of the inverse demand function. Based on these the solution is simply:

First, we obtain the demand function by inverting the inverse demand function:

$$x = D(p) = (a - p)/b$$

Then, at p the price elasticity of demand is

$$\frac{D'(p)}{D(p)} \cdot p = -\frac{1}{b} \cdot \frac{p}{x}$$

Overview

Below various incarnations of this exercise are provided in both R/Markdown Rmd and R/LaTeX Rnw format.
The following table gives a brief overview of all available versions along with a short description of the idea behind it.
More detailed explanations are provided in the subsequent sections.

#  Exercise templates                Dynamic?  Type     Description
1  elasticity1.Rmd, elasticity1.Rnw  No        num      Fixed parameters and numeric solution.
2  elasticity2.Rmd, elasticity2.Rnw  No        schoice  As in #1 but with single-choice solution (five answer alternatives).
3  elasticity3.Rmd, elasticity3.Rnw  Yes       num      Randomly drawn parameters with dynamic computation of correct solution, based on #1.
4  elasticity4.Rmd, elasticity4.Rnw  Yes       schoice  Randomly drawn parameters (as in #3) with dynamically-generated single-choice solution (as in #2), computed by num_to_schoice().
5  elasticity5.Rmd, elasticity5.Rnw  Yes       schoice  As in #4 but with the last alternative: None of the above.

Static numeric

The starting point is a completely static exercise as it had been used in a previous introductory economics exam. The parameters had been set to p = 5, a = 50, and b = 0.5. This implies that x = 90, leading to an elasticity of -0.111111.

The corresponding R/exams templates simply hard-code these numbers into the question/solution
and wrap everything into R/Markdown (elasticity1.Rmd)
or R/LaTeX (elasticity1.Rnw).
Note that LaTeX is used in either case for the mathematical notation. In case you are unfamiliar
with the R/exams format, please check out the First Steps tutorial.

The meta-information simply sets extype to num, supplies the exsolution with a precision of three digits,
and allows an extol tolerance of 0.01. To see what the result looks like download the files linked above
and run exams2html() and/or exams2pdf(). (The examples below always use the R/Markdown version but the
R/LaTeX version can be used in exactly the same way.)
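For orientation, the meta-information block at the end of such an exercise file might look roughly as follows (illustrative values only, assuming the standard R/exams fields; the actual templates linked above are authoritative):

Meta-information
================
exname: Price elasticity of demand
extype: num
exsolution: -0.111
extol: 0.01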

library("exams")
exams2html("elasticity1.Rmd")
exams2pdf("elasticity1.Rmd")

Static single-choice

Single-choice versions of exercises are often desired for use in written exams
because they can be conveniently scanned and automatically evaluated. Thus, we need to come up
with a number of incorrect alternative solutions (or “distractors”). If desired, these could include
typical wrong solutions or a None of the others alternative.

The R/exams templates
elasticity2.Rmd and
elasticity2.Rnw are
essentially copies of the static numeric exercise above but:

  1. Question/solution now contain an answerlist (with five alternatives).
  2. The extype has been changed to schoice.
  3. The exsolution now contains a binary coding of the correct solution (see the sketch after this list).
  4. Furthermore, to obtain some basic randomization we have turned on shuffling by setting exshuffle
    to TRUE. (Subsampling more than five alternatives would also be possible and would add some further randomization.)
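As a sketch of what that binary coding looks like (illustrative values, assuming the first of the five shuffled alternatives is the correct one):

Meta-information
================
exname: Price elasticity of demand
extype: schoice
exsolution: 10000
exshuffle: TRUE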

As above exams2html() and/or exams2pdf() can be used to display the exercise interactively in R/exams.

Dynamic numeric

Next, the static exercise from above is made dynamic by drawing the parameters from a suitable
data-generating process. In this case, the following works well:

## p = a - b * x
p <- sample(5:15, 1)
fctr <- sample(c(2, 4, 5, 10), 1)
x <- sample(5:15, 1) * fctr
b <- sample(1:5, 1) / fctr
a <- p + b * x
## elasticity
sol <- -1/b * p/x

Note that in order to obtain “nice” numbers a common scaling factor fctr is used for both x and b.
Also, while the examinees are presented with parameters a and b and have to compute x,
the data-generating process actually draws x and b and computes a from that. Again, this makes it
easier to obtain “nice” numbers.

The R/exams templates
elasticity3.Rmd and
elasticity3.Rnw include
the data-generating process above as a code chunk either in R/Markdown or R/LaTeX format.
The parameters are then inserted into question/solution/metainformation using `r a` (R/Markdown)
or \Sexpr{a} (R/LaTeX). Sometimes the fmt() function is used for formatting the parameters
with a desired number of digits. (See ?fmt for more details.)
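For example (assuming the usual fmt(x, digits) signature from the exams package):

library("exams")
fmt(2/3, 3)   ## "0.667" - fixed number of digits for display
fmt(2/3)      ## "0.67"  - the default is two digits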

As before exams2html() and/or exams2pdf() can be used to display the random draws from
the exercise templates. For checking that the meta-information is included correctly, it is often
helpful to run

exams_metainfo(exams2html("elasticity3.Rmd"))

Furthermore, some tweaking is usually required when calibrating the parameter ranges in the
data-generating process. The stresstest_exercise() function draws a large number of random replications
and thus can help to spot errors that occur or to find solutions that are “extreme” (e.g., much larger or smaller than usual). To clean up your
global environment and run the function use something like this:

rm(list = ls())
s <- stresstest_exercise("elasticity3.Rmd", n = 200)
plot(s)
plot(s, type = "solution")

The latter command plots the correct solution against the (scalar) parameters that were
generated in the exercise. This might show patterns for which parameters the solution becomes
too large or too small etc. See ?stresstest_exercise for further features/details.

Dynamic single-choice

To go from the dynamic numeric exercise to a dynamic single-choice exercise we need to
extend the data-generating process to also produce a number of wrong alternatives.
The function num_to_schoice() helps with this by providing different sampling
mechanisms. It allows to set a range in which the alternatives have to be, a minimum
distance between all alternatives, possibly include typical wrong solutions, etc.
It also shuffles the resulting alternatives and tries to make sure that the correct
solution is not a certain order statistic (e.g., almost always the largest or
smallest alternative).

The R/exams templates
elasticity4.Rmd and
elasticity4.Rnw first
wrap the data-generating process into a while loop with while(sol > -0.11). This makes sure
that there is enough “space” for four wrong alternatives between -0.11 and 0, using a minimum
distance of 0.017. Subsequently, the four wrong alternatives are generated by:

## single-choice incl. typical errors
err <- c(1/sol, sol/p, p/sol)
err <- err[(err > -5) & (err < -0.2) & abs(err - sol) > 0.01]
rng <- c(min(1.5 * sol, -1), -0.01)
sc <- num_to_schoice(sol, wrong = err, range = rng, delta = 0.017, method = "delta", digits = 3)

This suggests a number of typical wrong solutions err (provided that they are not too small
or too large) and makes sure that the range rng is large enough. With these arguments
num_to_schoice() is run, see ?num_to_schoice for the details. The resulting sc list
contains suitable $questions and $solutions that can be easily embedded in an answerlist()
and in the meta-information. Sometimes it is useful to wrap num_to_schoice() into another
while() loop to make sure a valid result is found. See the deriv2
template for illustration.
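As a rough sketch of that pattern (assuming that num_to_schoice() returns NULL when it cannot find a valid set of alternatives):

sc <- NULL
while (is.null(sc)) {
  sc <- num_to_schoice(sol, wrong = err, range = rng, delta = 0.017, method = "delta", digits = 3)
}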

As before exams2html() and/or exams2pdf() can be used to display the random draws from
this exercise template. And some stress-testing could be carried out by:

rm(list = ls())
s <- stresstest_exercise("elasticity4.Rmd", n = 200)
plot(s)
plot(s, type = "ordering")

As a final variation we could include None of the above as the last alternative. This is
very easy because we can simply replace the fifth element of the question list with the corresponding
string:

sc$questions[5] <- "None of the above."

No matter whether this was actually the correct alternative or one of the incorrect alternatives, the
corresponding solution will stay the same. The R/exams templates
elasticity5.Rmd and
elasticity5.Rnw incorporate
this additional line of code in the data-generating process.

To leave a comment for the author, please follow the link and comment on their blog: R/exams.

getSymbols and Alpha Vantage

Fri, 10/06/2017 - 23:12

(This article was first published on FOSS Trading, and kindly contributed to R-bloggers)

Thanks to Paul Teetor, getSymbols() can now import data from Alpha Vantage!  This feature is part of the quantmod 0.4-11 release, and provides another data source to avoid any Yahoo Finance API changes*.

Alpha Vantage is a free web service that provides real-time and historical equity data. They provide daily, weekly, and monthly history for both domestic and international markets, with up to 20 years of history. Dividend and split adjusted close prices are available for daily data. They also provide near real-time price bars at a resolution of 1 minute or more, for up to 10 recent days. All you need to get started is a one-time registration for an API key. Alpha Vantage has a clean, documented, public API that returns either JSON-encoded data or a CSV file. The arguments to getSymbols.av() closely follow the native API, so be sure to use their documentation!

To get started, install the latest quantmod from CRAN. Then you call:

getSymbols("MSFT", src = "av", api.key = "[your key]")

where you replace "[your key]" with the API key you receive after registration. You can use setDefaults() to set your API key one time, and use it for all getSymbols.av() calls:

setDefaults("getSymbols.av", api.key = "[your key]")

*Speaking of API changes, this release also includes a fix for a Yahoo Finance change (#174).

To leave a comment for the author, please follow the link and comment on their blog: FOSS Trading.

Stan Biweekly Roundup, 6 October 2017

Fri, 10/06/2017 - 22:40

(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

I missed last week and almost forgot to add this week’s.

  • Jonah Gabry returned from teaching a one-week course for a special EU research institute in Spain.

  • Mitzi Morris has been knocking out bug fixes for the parser and some pull requests to refactor the underlying type inference to clear the way for tuples, sparse matrices, and higher-order functions.

  • Michael Betancourt with help from Sean Talts spent last week teaching an intro course to physicists about Stan. Charles Margossian attended and said it went really well.

  • Ben Goodrich, in addition to handling a slew of RStan issues, has been diving into the math library to define derivatives for Bessel functions.

  • Aki Vehtari has put us in touch with the MxNet developers at Amazon UK and we had our first conference call with them to talk about adding sparse matrix functionality to Stan (Neil Lawrence is working there now).

  • Aki is also working on revising the EP as a way of life paper and finalizing other Stan-related papers.

  • Bob Carpenter and Andrew Gelman have recruited Advait Rajagopal to help us with the Coursera specialization we’re going to offer (contingent on coming to an agreement with Columbia). The plan’s to have four courses: Intro to BDA (Andrew), Stan (Bob), MCMC (Bob), and Regression and other stories (Andrew).

  • Ben Bales finished the revised pull request for vectorized RNGS. Turns out these things are much easier to write than they are to test thoroughly. Pesky problems with instantiations by integers and what not turn up.

  • Daniel Lee is getting ready for ACoP, which Bill Gillespie and Charles Margossian will also be presenting at.

  • Steven Bronder and Rok Češnovar, with some help from Daniel Lee, are going to merge the ViennaCL library for GPU matrix ops with their own specializations for derivatives in Stan into the math library. This is getting close to being real for users.

  • Sean Talts when he wasn’t teaching or learning physics has been refactoring the Jenkins test facilities. As our tests get bigger and we get more developers, it’s getting harder and harder to maintain stable continuous integration testing.

  • Breck Baldwin is taking over dealing with StanCon. Our goal is to get up to 150 registrations.

  • Breck Baldwin has also been working with Andrew Gelman and Jonathan Auerbach on non-conventional statistics training (like at Maker Fairs)—they have the beginnings of a paper. Breck’s highly recommending the math museum in NY to see how this kind of thing’s done.

  • Bob Carpenter published a Wiki page on a Stan 3 model concept, which is probably what we’ll be going with going forward. It’s pretty much like what we have now with better const correctness and some better organized utility functions.

  • Imad Ali went to the New England Sports Stats conference. Expect to see more models of basketball using Stan soon.

  • Ben Goodrich fixed the problem with exception handling in RStan on some platforms (always a pain because it happened on Macs and he’s not a Mac user).

  • Advait Rajagopal has been working with Imad Ali on adding ARMA and ARIMA time-series functions to rstanarm.

  • Aki Vehtari enhanced the loo package with automated code for K-fold cross validation.

  • Lizzie Wolkovich visited us for a meeting (she’s on our NumFOCUS leadership body), where she reported that she and a postdoc have been working on calibrating Stan models for phenology (look it up).

  • Krzysztof Sakrejda has been working on proper standalone function generation for Rcpp. Turns out to be tricky with their namespace requirements, but I think we have it sorted out as of today.

  • Michael Andreae has kicked off his meta-analysis and graphics project at Penn State with Jonah Gabry and Ben Goodrich chipping in.

  • Ben Goodrich also fixed the infrastructure for RStan so that multiple models may be supported more easily, which should make it much easier for R package writers to incorporate Stan models.

  • Yuling Yao gave us the rundown on where ADVI testing stands. It may falsely report convergence when it’s not at a maximum, it may converge to a local minimum, or it may converge but the Gaussian approximation may be terrible, either in terms of the posterior means or the variances. He and Andrew Gelman are looking at using Pareto smoothed importance sampling (a la the loo package) to try to sort out the quality of the approximation. Yuling thinks convergence is mostly scaling issues and preconditioning along with natural gradients may solve the problem. It’s nice to see grad students sink their teeth into a problem! It’d be great if we could come up with a more robust ADVI implementation that had diagnostic warnings if the approximation wasn’t reliable.

The post Stan Biweekly Roundup, 6 October 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science.

A cRossword about R

Fri, 10/06/2017 - 18:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The members of the R Ladies DC user group put together an R-themed crossword for a recent networking event. It's a fun way to test out your R knowledge. (Click to enlarge, or download a printable version here.)

If you get stuck, you can find the answers here or at the link below. And if you'd like to join your local R-Ladies chapter, you can find a list of meetups here.

Github (rladies): meetup-presentations_dc / NetworkingCrosswordPuzzle-2017

 

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Practical Machine Learning with R and Python – Part 1

Fri, 10/06/2017 - 16:07

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

Introduction

This is the 1st part of a series of posts I intend to write on some common Machine Learning Algorithms in R and Python. In this first part I cover the following Machine Learning Algorithms

  • Univariate Regression
  • Multivariate Regression
  • Polynomial Regression
  • K Nearest Neighbors Regression

The code includes the implementation in both R and Python. This series of posts is based on the following 2 MOOC courses I did at Stanford Online and at Coursera

  1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Online Stanford
  2. Applied Machine Learning in Python Prof Kevyn-Collin Thomson, University Of Michigan, Coursera

I have used data sets from the UCI Machine Learning repository (Communities and Crime, and Auto MPG). I also use the Boston data set from the MASS package.

While coding in R and Python I found that some aspects were more convenient in one language and some in the other. For example, plotting the fit is straightforward in R, while helpers for computing the R squared, splitting into train and test sets, etc. are already available in Python. In any case, whatever is missing can easily be implemented in either language.

In R, the R squared is computed with a small helper function, Rsquared(), sourced from RFunctions.R.
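A minimal sketch of such a helper (an assumption; the author's exact implementation in RFunctions.R may differ) is the standard 1 - RSS/TSS computation on the test set:

# Sketch (assumption, not necessarily the author's exact RFunctions.R code)
Rsquared <- function(fit, newdata, y) {
  yhat <- predict(fit, newdata = newdata)
  1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
}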


Note: You can download this R Markdown file and the associated data sets from Github at MachineLearning-RandPython
Note 1: This post was created as an R Markdown file in RStudio, which has a cool feature for including R and Python snippets. Plotting with matplotlib needs a workaround, but otherwise this is a really cool feature of RStudio!

1.1a Univariate Regression – R code

Here a simple linear regression line is fitted between a single input feature and the target variable

# Source in the R function library
source("RFunctions.R")
# Read the Boston data file
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - Statistical Learning
# Split the data into training and test sets (75:25)
train_idx <- trainTestSplit(df,trainPercent=75,seed=5)
train <- df[train_idx, ]
test <- df[-train_idx, ]
# Fit a linear regression line between 'Median value of owner occupied homes' vs 'lower status of
# population'
fit=lm(medv~lstat,data=df)
# Display details of fit
summary(fit)
##
## Call:
## lm(formula = medv ~ lstat, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -15.168  -3.990  -1.318   2.034  24.500
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
# Display the confidence intervals
confint(fit)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505
plot(df$lstat,df$medv, xlab="Lower status (%)", ylab="Median value of owned homes ($1000)",
     main="Median value of homes ($1000) vs Lower status (%)")
abline(fit)
abline(fit,lwd=3)
abline(fit,lwd=3,col="red")

rsquared=Rsquared(fit,test,test$medv)
sprintf("R-squared for uni-variate regression (Boston.csv) is : %f", rsquared)
## [1] "R-squared for uni-variate regression (Boston.csv) is : 0.556964"

1.1b Univariate Regression – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
#os.chdir("C:\\software\\machine-learning\\RandPython")
# Read the CSV file
df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
# Select the feature variable
X=df['lstat']
# Select the target
y=df['medv']
# Split into train and test sets (75:25)
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0)
X_train=X_train.values.reshape(-1,1)
X_test=X_test.values.reshape(-1,1)
# Fit a linear model
linreg = LinearRegression().fit(X_train, y_train)
# Print the training and test R squared score
print('R-squared score (training): {:.3f}'.format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(linreg.score(X_test, y_test)))
# Plot the linear regression line
fig=plt.scatter(X_train,y_train)
# Create a range of points. Compute yhat=coeff1*x + intercept and plot
x=np.linspace(0,40,20)
fig1=plt.plot(x, linreg.coef_ * x + linreg.intercept_, color='red')
fig1=plt.title("Median value of homes ($1000) vs Lower status (%)")
fig1=plt.xlabel("Lower status (%)")
fig1=plt.ylabel("Median value of owned homes ($1000)")
fig.figure.savefig('foo.png', bbox_inches='tight')
fig1.figure.savefig('foo1.png', bbox_inches='tight')
print "Finished"
## R-squared score (training): 0.571
## R-squared score (test): 0.458
## Finished

1.2a Multivariate Regression – R code

# Read crimes data
crimesDF <- read.csv("crimes.csv",stringsAsFactors = FALSE)
# Remove the 1st 7 columns which do not impact output
crimesDF1 <- crimesDF[,7:length(crimesDF)]
# Convert all to numeric
crimesDF2 <- sapply(crimesDF1,as.numeric)
# Check for NAs
a <- is.na(crimesDF2)
# Set to 0 as an imputation
crimesDF2[a] <-0
# Create as a dataframe
crimesDF2 <- as.data.frame(crimesDF2)
# Create a train/test split
train_idx <- trainTestSplit(crimesDF2,trainPercent=75,seed=5)
train <- crimesDF2[train_idx, ]
test <- crimesDF2[-train_idx, ]
# Fit a multivariate regression model between crimesPerPop and all other features
fit <- lm(ViolentCrimesPerPop~.,data=train)
# Compute and print R Squared
rsquared=Rsquared(fit,test,test$ViolentCrimesPerPop)
sprintf("R-squared for multi-variate regression (crimes.csv) is : %f", rsquared)
## [1] "R-squared for multi-variate regression (crimes.csv) is : 0.653940"

1.2b Multivariate Regression – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Read the data
crimesDF =pd.read_csv("crimes.csv",encoding="ISO-8859-1")
# Remove the 1st 7 columns
crimesDF1=crimesDF.iloc[:,7:crimesDF.shape[1]]
# Convert to numeric
crimesDF2 = crimesDF1.apply(pd.to_numeric, errors='coerce')
# Impute NA to 0s
crimesDF2.fillna(0, inplace=True)
# Select the X (feature variables - all)
X=crimesDF2.iloc[:,0:120]
# Set the target
y=crimesDF2.iloc[:,121]
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0)
# Fit a multivariate regression model
linreg = LinearRegression().fit(X_train, y_train)
# Compute and print the R Square
print('R-squared score (training): {:.3f}'.format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(linreg.score(X_test, y_test)))
## R-squared score (training): 0.699
## R-squared score (test): 0.677

1.3a Polynomial Regression – R

For polynomial regression, polynomials of degree 1, 2 and 3 are used and the R squared is computed. It can be seen that the quadratic model provides the best R squared score and hence the best fit.

# Polynomial degree 1
df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))
# Select key columns
df2 <- df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg)
df3 <- df2[complete.cases(df2),]
# Split as train and test sets
train_idx <- trainTestSplit(df3,trainPercent=75,seed=5)
train <- df3[train_idx, ]
test <- df3[-train_idx, ]
# Fit a model of degree 1
fit <- lm(mpg~. ,data=train)
rsquared1 <-Rsquared(fit,test,test$mpg)
sprintf("R-squared for Polynomial regression of degree 1 (auto_mpg.csv) is : %f", rsquared1)
## [1] "R-squared for Polynomial regression of degree 1 (auto_mpg.csv) is : 0.763607"

# Polynomial degree 2 - Quadratic
x = as.matrix(df3[1:6])
# Make a polynomial of degree 2 for feature variables before split
df4=as.data.frame(poly(x,2,raw=TRUE))
df5 <- cbind(df4,df3[7])
# Split into train and test set
train_idx <- trainTestSplit(df5,trainPercent=75,seed=5)
train <- df5[train_idx, ]
test <- df5[-train_idx, ]
# Fit the quadratic model
fit <- lm(mpg~. ,data=train)
# Compute R squared
rsquared2=Rsquared(fit,test,test$mpg)
sprintf("R-squared for Polynomial regression of degree 2 (auto_mpg.csv) is : %f", rsquared2)
## [1] "R-squared for Polynomial regression of degree 2 (auto_mpg.csv) is : 0.831372"

# Polynomial degree 3
x = as.matrix(df3[1:6])
# Make polynomial of degree 3 of feature variables before split
df4=as.data.frame(poly(x,3,raw=TRUE))
df5 <- cbind(df4,df3[7])
train_idx <- trainTestSplit(df5,trainPercent=75,seed=5)
train <- df5[train_idx, ]
test <- df5[-train_idx, ]
# Fit a model of degree 3
fit <- lm(mpg~. ,data=train)
# Compute R squared
rsquared3=Rsquared(fit,test,test$mpg)
sprintf("R-squared for Polynomial regression of degree 3 (auto_mpg.csv) is : %f", rsquared3)
## [1] "R-squared for Polynomial regression of degree 3 (auto_mpg.csv) is : 0.773225"

df=data.frame(degree=c(1,2,3),Rsquared=c(rsquared1,rsquared2,rsquared3))
# Make a plot of Rsquared and degree
ggplot(df,aes(x=degree,y=Rsquared)) + geom_point() + geom_line(color="blue") +
  ggtitle("Polynomial regression - R squared vs Degree of polynomial") +
  xlab("Degree") + ylab("R squared")

1.3b Polynomial Regression – Python

For polynomial regression, polynomials of degree 1, 2 and 3 are used and the R squared is computed. It can be seen that the quadratic model provides the best R squared score and hence the best fit.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
# Select key columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
# Convert columns to numeric
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
# Drop NAs
autoDF3=autoDF2.dropna()
autoDF3.shape
X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']]
y=autoDF3['mpg']
# Polynomial degree 1
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)
print('R-squared score - Polynomial degree 1 (training): {:.3f}'.format(linreg.score(X_train, y_train)))
# Compute R squared
rsquared1 =linreg.score(X_test, y_test)
print('R-squared score - Polynomial degree 1 (test): {:.3f}'.format(linreg.score(X_test, y_test)))
# Polynomial degree 2
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_poly, y,random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)
# Compute R squared
print('R-squared score - Polynomial degree 2 (training): {:.3f}'.format(linreg.score(X_train, y_train)))
rsquared2 =linreg.score(X_test, y_test)
print('R-squared score - Polynomial degree 2 (test): {:.3f}\n'.format(linreg.score(X_test, y_test)))
# Polynomial degree 3
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_poly, y,random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)
print('R-squared score - Polynomial degree 3 (training): {:.3f}'.format(linreg.score(X_train, y_train)))
# Compute R squared
rsquared3 =linreg.score(X_test, y_test)
print('R-squared score - Polynomial degree 3 (test): {:.3f}\n'.format(linreg.score(X_test, y_test)))
degree=[1,2,3]
rsquared =[rsquared1,rsquared2,rsquared3]
fig2=plt.plot(degree,rsquared)
fig2=plt.title("Polynomial regression - R squared vs Degree of polynomial")
fig2=plt.xlabel("Degree")
fig2=plt.ylabel("R squared")
fig2.figure.savefig('foo2.png', bbox_inches='tight')
print "Finished plotting and saving"
## R-squared score - Polynomial degree 1 (training): 0.811
## R-squared score - Polynomial degree 1 (test): 0.799
## R-squared score - Polynomial degree 2 (training): 0.861
## R-squared score - Polynomial degree 2 (test): 0.847
##
## R-squared score - Polynomial degree 3 (training): 0.933
## R-squared score - Polynomial degree 3 (test): 0.710
##
## Finished plotting and saving

1.4 K Nearest Neighbors

The code below implements KNN regression in both R and Python. This is done for different numbers of neighbors, and the R squared is computed in each case. This is repeated after performing feature scaling, and it can be seen that the model fit is much better after feature scaling. Normalization (min-max scaling) refers to

$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$

Another technique that is used is standardization, which is

$$x' = \frac{x - \mu}{\sigma}$$
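The R code below also relies on a MinMaxScaler() helper from the author's RFunctions.R; a plausible sketch of such a helper (an assumption, not the author's exact code) is:

# Sketch: min-max scaling applied column-wise to a data frame of numeric features
MinMaxScaler <- function(df) {
  as.data.frame(lapply(df, function(x) (x - min(x)) / (max(x) - min(x))))
}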

1.4a K Nearest Neighbors Regression – R (Unnormalized)

The R code below does not use feature scaling

# KNN regression requires the FNN package
df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))
df2 <- df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg)
df3 <- df2[complete.cases(df2),]
# Split train and test
train_idx <- trainTestSplit(df3,trainPercent=75,seed=5)
train <- df3[train_idx, ]
test <- df3[-train_idx, ]
# Select the feature variables
train.X=train[,1:6]
# Set the target for training
train.Y=train[,7]
# Do the same for test set
test.X=test[,1:6]
test.Y=test[,7]
rsquared <- NULL
# Create a list of neighbors
neighbors <-c(1,2,4,8,10,14)
for(i in seq_along(neighbors)){
  # Perform a KNN regression fit
  knn=knn.reg(train.X,test.X,train.Y,k=neighbors[i])
  # Compute R squared
  rsquared[i]=knnRSquared(knn$pred,test.Y)
}
# Make a dataframe for plotting
df <- data.frame(neighbors,Rsquared=rsquared)
# Plot the number of neighbors vs the R squared
ggplot(df,aes(x=neighbors,y=Rsquared)) + geom_point() + geom_line(color="blue") +
  xlab("Number of neighbors") + ylab("R squared") +
  ggtitle("KNN regression - R squared vs Number of Neighbors (Unnormalized)")

1.4b K Nearest Neighbors Regression – Python (Unnormalized)

The Python code below does not use feature scaling

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
autoDF3=autoDF2.dropna()
autoDF3.shape
X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']]
y=autoDF3['mpg']
# Perform a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
# Create a list of neighbors
rsquared=[]
neighbors=[1,2,4,8,10,14]
for i in neighbors:
    # Fit a KNN model
    knnreg = KNeighborsRegressor(n_neighbors = i).fit(X_train, y_train)
    # Compute R squared
    rsquared.append(knnreg.score(X_test, y_test))
    print('R-squared test score: {:.3f}' .format(knnreg.score(X_test, y_test)))
# Plot the number of neighbors vs the R squared
fig3=plt.plot(neighbors,rsquared)
fig3=plt.title("KNN regression - R squared vs Number of neighbors (Unnormalized)")
fig3=plt.xlabel("Neighbors")
fig3=plt.ylabel("R squared")
fig3.figure.savefig('foo3.png', bbox_inches='tight')
print "Finished plotting and saving"
## R-squared test score: 0.527
## R-squared test score: 0.678
## R-squared test score: 0.707
## R-squared test score: 0.684
## R-squared test score: 0.683
## R-squared test score: 0.670
## Finished plotting and saving

1.4c K Nearest Neighbors Regression – R (Normalized)

It can be seen that R squared improves when the features are normalized.

df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI df1 <- as.data.frame(sapply(df,as.numeric)) df2 <- df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg) df3 <- df2[complete.cases(df2),] # Perform MinMaxScaling of feature variables train.X.scaled=MinMaxScaler(train.X) test.X.scaled=MinMaxScaler(test.X) # Create a list of neighbors rsquared <- NULL neighbors <-c(1,2,4,6,8,10,12,15,20,25,30) for(i in seq_along(neighbors)){ # Fit a KNN model knn=knn.reg(train.X.scaled,test.X.scaled,train.Y,k=i) # Compute R ssquared rsquared[i]=knnRSquared(knn$pred,test.Y) } df <- data.frame(neighbors,Rsquared=rsquared) # Plot the number of neighors vs the R squared ggplot(df,aes(x=neighbors,y=Rsquared)) + geom_point() +geom_line(color="blue") + xlab("Number of neighbors") + ylab("R squared") + ggtitle("KNN regression - R squared vs Number of Neighors(Normalized)")

1.4d K Nearest Neighbors Regression – Python (Normalized)

R squared improves when the features are normalized with MinMaxScaling

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler
autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
autoDF3=autoDF2.dropna()
autoDF3.shape
X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']]
y=autoDF3['mpg']
# Perform a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
# Use MinMaxScaling
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Apply scaling on test set
X_test_scaled = scaler.transform(X_test)
# Create a list of neighbors
rsquared=[]
neighbors=[1,2,4,6,8,10,12,15,20,25,30]
for i in neighbors:
    # Fit a KNN model
    knnreg = KNeighborsRegressor(n_neighbors = i).fit(X_train_scaled, y_train)
    # Compute R squared
    rsquared.append(knnreg.score(X_test_scaled, y_test))
    print('R-squared test score: {:.3f}' .format(knnreg.score(X_test_scaled, y_test)))
# Plot the number of neighbors vs the R squared
fig4=plt.plot(neighbors,rsquared)
fig4=plt.title("KNN regression - R squared vs Number of neighbors (Normalized)")
fig4=plt.xlabel("Neighbors")
fig4=plt.ylabel("R squared")
fig4.figure.savefig('foo4.png', bbox_inches='tight')
print "Finished plotting and saving"
## R-squared test score: 0.703
## R-squared test score: 0.810
## R-squared test score: 0.830
## R-squared test score: 0.838
## R-squared test score: 0.834
## R-squared test score: 0.828
## R-squared test score: 0.827
## R-squared test score: 0.826
## R-squared test score: 0.816
## R-squared test score: 0.815
## R-squared test score: 0.809
## Finished plotting and saving

Conclusion

In this initial post I cover the regression models used when the output is continuous. I intend to touch upon other Machine Learning algorithms in later posts.
Comments, suggestions and corrections are welcome.

Watch this space!

To be continued….

You may like
1. Using Linear Programming (LP) for optimizing bowling change or batting lineup in T20 cricket
2. Neural Networks: The mechanics of backpropagation
3. More book, more cricket! 2nd edition of my books now on Amazon
4. Spicing up a IBM Bluemix cloud app with MongoDB and NodeExpress
5. Introducing cricket package yorkr:Part 4-In the block hole!

To see all posts see Index of posts

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….

ARIMA models and Intervention Analysis

Fri, 10/06/2017 - 14:54

In my previous tutorial Structural Changes in Global Warming I introduced the strucchange package and some basic examples to date structural breaks in time series. In the present tutorial, I am going to show how dating structural changes (if any) and then intervention analysis can help in finding better ARIMA models. Dating structural changes consists of determining whether there are any structural breaks in the time series data generating process and, if so, their dates. Intervention analysis estimates the effect of an external or exogenous intervention on a time series; an example of such an intervention is a permanent level shift, as we will see in this tutorial. In our scenario, the external or exogenous intervention is not known (or supposed to be known) in advance; it is inferred from the structural break we will identify.

The dataset considered for the analysis is the Arbuthnot dataset, containing counts of male and female births in London from 1629 to 1710. Based on that, a metric representing the fractional excess of boys' births over girls' is defined as:

$$
\dfrac{(Boys - Girls)}{Girls}
$$

The time series so defined is analyzed to determine candidate ARIMA models. The present tutorial is organized as follows. First, a brief exploratory analysis is carried out. Then, six ARIMA models are defined, analyzed and compared. A forecast of the time series under analysis is computed. Finally, a short historical background digression describes what was happening in England in the 17th century and cites studies on the matter of sex ratio at birth.

Packages

suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(forecast))
suppressPackageStartupMessages(library(astsa))
suppressPackageStartupMessages(library(lmtest))
suppressPackageStartupMessages(library(fUnitRoots))
suppressPackageStartupMessages(library(FitARMA))
suppressPackageStartupMessages(library(strucchange))
suppressPackageStartupMessages(library(reshape))
suppressPackageStartupMessages(library(Rmisc))
suppressPackageStartupMessages(library(fBasics))

Note: The fUnitRoots package is no longer maintained on CRAN; however, it can be installed from the source available at the following link:

fUnitRoots version 3010.78 sources
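Assuming the downloaded tarball is named as below (the file name is illustrative), installing from source is a one-liner:

install.packages("fUnitRoots_3010.78.tar.gz", repos = NULL, type = "source")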

Exploratory Analysis

Loading the Arbuthnot dataset and showing some basic metrics and plots.

url <- "https://www.openintro.org/stat/data/arbuthnot.csv" abhutondot <- read.csv(url, header=TRUE) nrow(abhutondot) [1] 82 head(abhutondot) year boys girls 1 1629 5218 4683 2 1630 4858 4457 3 1631 4422 4102 4 1632 4994 4590 5 1633 5158 4839 6 1634 5035 4820 abhutondot_rs <- melt(abhutondot, id = c("year")) head(abhutondot_rs) year variable value 1 1629 boys 5218 2 1630 boys 4858 3 1631 boys 4422 4 1632 boys 4994 5 1633 boys 5158 6 1634 boys 5035 tail(abhutondot_rs) year variable value 159 1705 girls 7779 160 1706 girls 7417 161 1707 girls 7687 162 1708 girls 7623 163 1709 girls 7380 164 1710 girls 7288 ggplot(data = abhutondot_rs, aes(x = year)) + geom_line(aes(y = value, colour = variable)) + scale_colour_manual(values = c("blue", "red"))

Gives this plot.

Boys' births appear to be consistently greater than girls'. Let us run a t.test to further verify whether there is a true difference in the means of the two groups, as represented by boys' and girls' births.

t.test(value ~ variable, data = abhutondot_rs)

        Welch Two Sample t-test

data:  value by variable
t = 1.4697, df = 161.77, p-value = 0.1436
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -128.0016  872.9040
sample estimates:
 mean in group boys mean in group girls
           5907.098            5534.646

Based on the p-value, we cannot reject the null hypothesis that the true difference in means is equal to zero.

basicStats(abhutondot[-1])
                     boys         girls
nobs         8.200000e+01  8.200000e+01
NAs          0.000000e+00  0.000000e+00
Minimum      2.890000e+03  2.722000e+03
Maximum      8.426000e+03  7.779000e+03
1. Quartile  4.759250e+03  4.457000e+03
3. Quartile  7.576500e+03  7.150250e+03
Mean         5.907098e+03  5.534646e+03
Median       6.073000e+03  5.718000e+03
Sum          4.843820e+05  4.538410e+05
SE Mean      1.825161e+02  1.758222e+02
LCL Mean     5.543948e+03  5.184815e+03
UCL Mean     6.270247e+03  5.884477e+03
Variance     2.731595e+06  2.534902e+06
Stdev        1.652754e+03  1.592137e+03
Skewness    -2.139250e-01 -2.204720e-01
Kurtosis    -1.221721e+00 -1.250893e+00

p1 <- ggplot(data = abhutondot_rs, aes(x = variable, y = value)) + geom_boxplot()
p2 <- ggplot(data = abhutondot, aes(boys)) + geom_density()
p3 <- ggplot(data = abhutondot, aes(girls)) + geom_density()
multiplot(p1, p2, p3, cols = 3)

Gives this plot.

Let us define the time series to be analysed with frequency = 1, as data is collected yearly; see ref. [5] for details.

excess_frac <- (abhutondot$boys - abhutondot$girls)/abhutondot$girls
excess_ts <- ts(excess_frac, frequency = 1, start = abhutondot$year[1])
autoplot(excess_ts)

Gives this plot.

Basic statistics metrics are reported.

basicStats(excess_frac)
            excess_frac
nobs          82.000000
NAs            0.000000
Minimum        0.010673
Maximum        0.156075
1. Quartile    0.048469
3. Quartile    0.087510
Mean           0.070748
Median         0.064704
Sum            5.801343
SE Mean        0.003451
LCL Mean       0.063881
UCL Mean       0.077615
Variance       0.000977
Stdev          0.031254
Skewness       0.680042
Kurtosis       0.073620

Boys' births were at least 1% higher than girls', reaching a top percentage excess equal to 15%.

Further, unit root tests (run by urdfTest() within the fUnitRoots package) show that we cannot reject the null hypothesis of a unit root being present. See their test statistics compared with critical values below (see ref. [2] for details about the urdfTest() report).

urdftest_lag = floor(12*(length(excess_ts)/100)^0.25)
urdfTest(excess_ts, type = "nc", lags = urdftest_lag, doplot = FALSE)

Title:
 Augmented Dickey-Fuller Unit Root Test

Test Results:
  Test regression none

  Call:
  lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)

  Residuals:
        Min        1Q    Median        3Q       Max
  -0.052739 -0.018246 -0.002899  0.019396  0.069349

  Coefficients:
                Estimate Std. Error t value Pr(>|t|)
  z.lag.1       -0.01465    0.05027  -0.291 0.771802
  z.diff.lag1   -0.71838    0.13552  -5.301 1.87e-06 ***
  z.diff.lag2   -0.66917    0.16431  -4.073 0.000143 ***
  z.diff.lag3   -0.58640    0.18567  -3.158 0.002519 **
  z.diff.lag4   -0.56529    0.19688  -2.871 0.005700 **
  z.diff.lag5   -0.58489    0.20248  -2.889 0.005432 **
  z.diff.lag6   -0.60278    0.20075  -3.003 0.003944 **
  z.diff.lag7   -0.43509    0.20012  -2.174 0.033786 *
  z.diff.lag8   -0.28608    0.19283  -1.484 0.143335
  z.diff.lag9   -0.14212    0.18150  -0.783 0.436769
  z.diff.lag10   0.13232    0.15903   0.832 0.408801
  z.diff.lag11  -0.07234    0.12774  -0.566 0.573346
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

  Residual standard error: 0.03026 on 58 degrees of freedom
  Multiple R-squared: 0.4638, Adjusted R-squared: 0.3529
  F-statistic: 4.181 on 12 and 58 DF, p-value: 0.0001034

  Value of test-statistic is: -0.2914

  Critical values for test statistics:
        1pct  5pct 10pct
  tau1  -2.6 -1.95 -1.61

urdfTest(excess_ts, type = "c", lags = urdftest_lag, doplot = FALSE)

Title:
 Augmented Dickey-Fuller Unit Root Test

Test Results:
  Test regression drift

  Call:
  lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

  Residuals:
        Min        1Q    Median        3Q       Max
  -0.051868 -0.018870 -0.005227  0.019503  0.067936

  Coefficients:
                Estimate Std. Error t value Pr(>|t|)
  (Intercept)    0.02239    0.01789   1.251   0.2159
  z.lag.1       -0.31889    0.24824  -1.285   0.2041
  z.diff.lag1   -0.44287    0.25820  -1.715   0.0917 .
  z.diff.lag2   -0.40952    0.26418  -1.550   0.1266
  z.diff.lag3   -0.34933    0.26464  -1.320   0.1921
  z.diff.lag4   -0.35207    0.25966  -1.356   0.1805
  z.diff.lag5   -0.39863    0.25053  -1.591   0.1171
  z.diff.lag6   -0.44797    0.23498  -1.906   0.0616 .
  z.diff.lag7   -0.31103    0.22246  -1.398   0.1675
  z.diff.lag8   -0.19044    0.20656  -0.922   0.3604
  z.diff.lag9   -0.07128    0.18928  -0.377   0.7079
  z.diff.lag10   0.18023    0.16283   1.107   0.2730
  z.diff.lag11  -0.04154    0.12948  -0.321   0.7495
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

  Residual standard error: 0.03012 on 57 degrees of freedom
  Multiple R-squared: 0.4781, Adjusted R-squared: 0.3683
  F-statistic: 4.352 on 12 and 57 DF, p-value: 6.962e-05

  Value of test-statistic is: -1.2846 0.8257

  Critical values for test statistics:
        1pct  5pct 10pct
  tau2 -3.51 -2.89 -2.58
  phi1  6.70  4.71  3.86

ACF and PACF plots are given.

par(mfrow=c(1,2))
acf(excess_ts)
pacf(excess_ts)

Gives this plot:

We observe the auto-correlation spike at lag = 10 beyond the confidence region, which suggests the presence of a seasonal component with period = 10. Structural changes are now investigated. First let us verify whether a regression against a constant is significant for our time series.

summary(lm(excess_ts ~ 1)) Call: lm(formula = excess_ts ~ 1) Residuals: Min 1Q Median 3Q Max -0.060075 -0.022279 -0.006044 0.016762 0.085327 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.070748 0.003451 20.5 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.03125 on 81 degrees of freedom

The intercept is reported as significant. Let us see if there are any structural breaks.

(break_point <- breakpoints(excess_ts ~ 1))

Optimal 2-segment partition:

Call:
breakpoints.formula(formula = excess_ts ~ 1)

Breakpoints at observation number:
42

Corresponding to breakdates:
1670

plot(break_point)

Gives this plot.

summary(break_point) Optimal (m+1)-segment partition: Call: breakpoints.formula(formula = excess_ts ~ 1) Breakpoints at observation number: m = 1 42 m = 2 20 42 m = 3 20 35 48 m = 4 20 35 50 66 m = 5 17 30 42 56 69 Corresponding to breakdates: m = 1 1670 m = 2 1648 1670 m = 3 1648 1663 1676 m = 4 1648 1663 1678 1694 m = 5 1645 1658 1670 1684 1697 Fit: m 0 1 2 3 4 5 RSS 0.07912 0.06840 0.06210 0.06022 0.05826 0.05894 BIC -327.84807 -330.97945 -330.08081 -323.78985 -317.68933 -307.92410

The BIC minimum value is reached when m = 1, hence just one break point is determined corresponding to year 1670. Let us plot the original time series against its structural break and its confidence interval.

plot(excess_ts)
lines(fitted(break_point, breaks = 1), col = 4)
lines(confint(break_point, breaks = 1))

Gives this plot.

The boys vs. girls excess birth fraction changed from:

fitted(break_point)[1]
0.08190905

to:

fitted(break_point)[length(excess_ts)]
0.05902908

We then run t.test() to further verify that the difference in means is significant across the two time windows identified by the breakpoint date, year 1670.

break_date <- breakdates(break_point)
win_1 <- window(excess_ts, end = break_date)
win_2 <- window(excess_ts, start = break_date + 1)
t.test(win_1, win_2)

	Welch Two Sample t-test

data:  win_1 and win_2
t = 3.5773, df = 70.961, p-value = 0.0006305
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.01012671 0.03563322
sample estimates:
 mean of x  mean of y 
0.08190905 0.05902908

Based on the reported p-value, we reject the null hypothesis that the true difference in means is equal to zero.

ARIMA Models

I am going to compare the following six ARIMA models (represented with the usual (p,d,q)(P,D,Q)[S] notation):

1. non seasonal (1,1,1), as determined by auto.arima() within forecast package
2. seasonal (1,0,0)(0,0,1)[10]
3. seasonal (1,0,0)(1,0,0)[10]
4. seasonal (0,0,0)(0,0,1)[10] with level shift regressor as intervention variable
5. seasonal (1,0,0)(0,0,1)[10] with level shift regressor as intervention variable
6. non seasonal (1,0,0) with level shift regressor as intervention variable

Here we go.

Model #1

The first model is determined by the auto.arima() function within the forecast package, using the options:

a. stepwise = FALSE, which allows for a more in-depth search of potential models
b. trace = TRUE, which prints a list of all the investigated models

Further, using the default inputs to auto.arima():
c. stationary = FALSE, so that the model search is not restricted to stationary models
d. seasonal = TRUE, so that the model search is not restricted to non seasonal models

(model_1 <- auto.arima(excess_ts, stepwise = FALSE, trace = TRUE)) ARIMA(0,1,0) : -301.4365 ARIMA(0,1,0) with drift : -299.3722 ARIMA(0,1,1) : -328.9381 ARIMA(0,1,1) with drift : -326.9276 ARIMA(0,1,2) : -329.4061 ARIMA(0,1,2) with drift : Inf ARIMA(0,1,3) : -327.2841 ARIMA(0,1,3) with drift : Inf ARIMA(0,1,4) : -325.7604 ARIMA(0,1,4) with drift : Inf ARIMA(0,1,5) : -323.4805 ARIMA(0,1,5) with drift : Inf ARIMA(1,1,0) : -312.8106 ARIMA(1,1,0) with drift : -310.7155 ARIMA(1,1,1) : -329.5727 ARIMA(1,1,1) with drift : Inf ARIMA(1,1,2) : -327.3821 ARIMA(1,1,2) with drift : Inf ARIMA(1,1,3) : -325.1085 ARIMA(1,1,3) with drift : Inf ARIMA(1,1,4) : -323.446 ARIMA(1,1,4) with drift : Inf ARIMA(2,1,0) : -317.1234 ARIMA(2,1,0) with drift : -314.9816 ARIMA(2,1,1) : -327.3795 ARIMA(2,1,1) with drift : Inf ARIMA(2,1,2) : -325.0859 ARIMA(2,1,2) with drift : Inf ARIMA(2,1,3) : -322.9714 ARIMA(2,1,3) with drift : Inf ARIMA(3,1,0) : -315.9114 ARIMA(3,1,0) with drift : -313.7128 ARIMA(3,1,1) : -325.1497 ARIMA(3,1,1) with drift : Inf ARIMA(3,1,2) : -323.1363 ARIMA(3,1,2) with drift : Inf ARIMA(4,1,0) : -315.3018 ARIMA(4,1,0) with drift : -313.0426 ARIMA(4,1,1) : -324.2463 ARIMA(4,1,1) with drift : -322.0252 ARIMA(5,1,0) : -315.1075 ARIMA(5,1,0) with drift : -312.7776 Series: excess_ts ARIMA(1,1,1) Coefficients: ar1 ma1 0.2224 -0.9258 s.e. 0.1318 0.0683 sigma^2 estimated as 0.0009316: log likelihood=167.94 AIC=-329.88 AICc=-329.57 BIC=-322.7

As we can see, all investigated models have d=1, which is congruent with the augmented Dickey-Fuller test results.
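As a quick cross-check (a minimal sketch, not part of the original post; it assumes the forecast package already loaded), ndiffs() reports how many first differences a unit root test suggests, which is the quantity auto.arima() uses to pick d:

# number of first differences suggested for excess_ts (KPSS test by default)
ndiffs(excess_ts)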

summary(model_1) Series: excess_ts ARIMA(1,1,1) Coefficients: ar1 ma1 0.2224 -0.9258 s.e. 0.1318 0.0683 sigma^2 estimated as 0.0009316: log likelihood=167.94 AIC=-329.88 AICc=-329.57 BIC=-322.7 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set -0.002931698 0.02995934 0.02405062 -27.05674 46.53832 0.8292635 -0.01005689

The significance of our (1,1,1) model coefficients is further investigated.

coeftest(model_1)

z test of coefficients:
      Estimate Std. Error  z value Pr(>|z|)
ar1   0.222363   0.131778   1.6874  0.09153 .
ma1  -0.925836   0.068276 -13.5603  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Model #2

A spike at lag = 1 in the ACF plot suggests the presence of an auto-regressive component. Our second model candidate takes into account the seasonality observed at lag = 10. As a result, the candidate model (1,0,0)(0,0,1)[10] is investigated.

model_2 <- Arima(excess_ts, order = c(1,0,0),
                 seasonal = list(order = c(0,0,1), period = 10),
                 include.mean = TRUE)
summary(model_2)

Series: excess_ts
ARIMA(1,0,0)(0,0,1)[10] with non-zero mean

Coefficients:
         ar1    sma1    mean
      0.2373  0.3441  0.0708
s.e.  0.1104  0.1111  0.0053

sigma^2 estimated as 0.0008129:  log likelihood=176.23
AIC=-344.46   AICc=-343.94   BIC=-334.83

Training set error measures:
                        ME       RMSE        MAE       MPE     MAPE      MASE         ACF1
Training set -0.0002212383 0.02798445 0.02274858 -21.47547 42.96717 0.7843692 -0.004420048

coeftest(model_2)

z test of coefficients:
            Estimate Std. Error z value  Pr(>|z|)
ar1        0.2372975  0.1104323  2.1488  0.031650 *
sma1       0.3440811  0.1110791  3.0976  0.001951 **
intercept  0.0707836  0.0052663 13.4409 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Model #3

In the third model, I introduce a seasonal auto-regressive component in place of the seasonal moving average component present in model #2.

model_3 <- Arima(excess_ts, order = c(1,0,0),
                 seasonal = list(order = c(1,0,0), period = 10),
                 include.mean = TRUE)
summary(model_3)

Series: excess_ts
ARIMA(1,0,0)(1,0,0)[10] with non-zero mean

Coefficients:
         ar1    sar1    mean
      0.2637  0.3185  0.0705
s.e.  0.1069  0.1028  0.0058

sigma^2 estimated as 0.0008173:  log likelihood=176.1
AIC=-344.19   AICc=-343.67   BIC=-334.56

Training set error measures:
                        ME       RMSE        MAE       MPE     MAPE      MASE         ACF1
Training set -0.0002070952 0.02806013 0.02273145 -21.42509 42.85735 0.7837788 -0.005665764

coeftest(model_3)

z test of coefficients:
            Estimate Std. Error z value  Pr(>|z|)
ar1        0.2636602  0.1069472  2.4653  0.013689 *
sar1       0.3185397  0.1027903  3.0989  0.001942 **
intercept  0.0704636  0.0058313 12.0836 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Model #4

This model takes into account the level shift highlighted by the structural change analysis and the seasonal component at lag = 10 observed in the ACF plot. To represent the structural change as a level shift, a regressor variable named level is defined as equal to zero for the period preceding the breakpoint date and equal to one after that date. In other words, it is a step function representing a permanent level shift. This variable is passed to the Arima() function as the xreg argument. It is one of the simplest representations of an intervention variable; for a more complete overview see ref. [6].

level <- c(rep(0, break_point$breakpoints), rep(1, length(excess_ts) - break_point$breakpoints)) model_4 <- Arima(excess_ts, order = c(0,0,0), seasonal = list(order = c(0,0,1), period = 10), xreg = level, include.mean = TRUE) summary(model_4) Series: excess_ts Regression with ARIMA(0,0,0)(0,0,1)[10] errors Coefficients: sma1 intercept level 0.3437 0.0817 -0.0225 s.e. 0.1135 0.0053 0.0072 sigma^2 estimated as 0.0007706: log likelihood=178.45 AIC=-348.9 AICc=-348.38 BIC=-339.27 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 0.0001703111 0.02724729 0.02184016 -19.82639 41.28977 0.7530469 0.1608774 coeftest(model_4) z test of coefficients: Estimate Std. Error z value Pr(>|z|) sma1 0.3437109 0.1135387 3.0273 0.002468 ** intercept 0.0817445 0.0052825 15.4745 < 2.2e-16 *** level -0.0225294 0.0072468 -3.1089 0.001878 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The ‘level’ regressor coefficient indicates an average decrease of approximately 2.25% in the boys vs. girls excess birth ratio after year 1670.
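As a quick sanity check (a minimal sketch reusing model_4 fitted above; the coefficient names are the ones shown in the summary), the implied pre- and post-break levels can be recovered from the regression coefficients and compared with the fitted values of the breakpoints analysis:

# level before the break (the intercept) and after the break (intercept + level coefficient)
coef(model_4)["intercept"]
coef(model_4)["intercept"] + coef(model_4)["level"]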

Model #5

The present model adds an auto-regressive component to model #4.

model_5 <- Arima(excess_ts, order = c(1,0,0), seasonal = list(order = c(0,0,1), period=10), xreg = level, include.mean = TRUE) summary(model_5) Series: excess_ts Regression with ARIMA(1,0,0)(0,0,1)[10] errors Coefficients: ar1 sma1 intercept level 0.1649 0.3188 0.0820 -0.0230 s.e. 0.1099 0.1112 0.0061 0.0084 sigma^2 estimated as 0.0007612: log likelihood=179.56 AIC=-349.11 AICc=-348.32 BIC=-337.08 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 8.225255e-05 0.02690781 0.02174375 -19.42485 40.90147 0.7497229 0.007234682 coeftest(model_5) z test of coefficients: Estimate Std. Error z value Pr(>|z|) ar1 0.1648878 0.1099118 1.5002 0.133567 sma1 0.3187896 0.1112301 2.8660 0.004156 ** intercept 0.0819981 0.0061227 13.3926 < 2.2e-16 *** level -0.0230176 0.0083874 -2.7443 0.006064 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The auto-regressive coefficient ar1 is reported as not significant, hence this model is not considered further, even though it would provide a (very) slight improvement in the AIC metric.

Model #6

This model considers the structural change as in model #4, but without the seasonal component. The purpose is to evaluate whether any benefit comes from including a seasonal component.

model_6 <- Arima(excess_ts, order = c(1,0,0), xreg = level, include.mean = TRUE)
summary(model_6)

Series: excess_ts
Regression with ARIMA(1,0,0) errors

Coefficients:
         ar1  intercept    level
      0.1820     0.0821  -0.0232
s.e.  0.1088     0.0053   0.0076

sigma^2 estimated as 0.0008369:  log likelihood=175.68
AIC=-343.35   AICc=-342.83   BIC=-333.73

Training set error measures:
                        ME       RMSE        MAE       MPE    MAPE      MASE         ACF1
Training set -7.777648e-05 0.02839508 0.02258574 -21.93086 43.2444 0.7787544 0.0003897161

coeftest(model_6)

z test of coefficients:
             Estimate Std. Error z value  Pr(>|z|)
ar1         0.1820054  0.1088357  1.6723  0.094466 .
intercept   0.0821470  0.0053294 15.4139 < 2.2e-16 ***
level      -0.0232481  0.0076044 -3.0572  0.002234 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Models Residuals Analysis

It is important to verify whether our models' residuals are white noise. If not, the model under analysis has not captured all the structure of the original time series. For this purpose, I will take advantage of the checkresiduals() function within the forecast package and the sarima() function within the astsa package. What I like about sarima() is the residuals Q-Q plot with its confidence region and the Ljung-Box plot used to check the significance of the residuals' correlations. If you want to further verify the Ljung-Box test results, I suggest the LjungBoxTest() function within the FitARMA package, or the base-R alternative sketched below.
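A minimal sketch of that base-R alternative (my own addition, not part of the original workflow), applied for instance to model #1 fitted earlier:

# Ljung-Box portmanteau test on the model #1 residuals at lag 16,
# removing 2 degrees of freedom for the two fitted ARMA coefficients
Box.test(residuals(model_1), lag = 16, type = "Ljung-Box", fitdf = 2)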

Notes:
1. Model #5 was dropped from the rest of the analysis, hence its residuals will not be checked.
2. checkresiduals() and sarima() give textual outputs which are omitted, as equivalent information is already included elsewhere: checkresiduals() reports a short Ljung-Box test result, and sarima() reports a textual model summary showing coefficients and metrics similar to the summaries already shown.

Model #1 Residuals Diagnostic

Checking ARIMA model (1,1,1) residuals.

checkresiduals(model_1)

Gives this plot.

It is important to verify the significance of the residuals' auto-correlation, in order to see whether the spike at lag = 16 is genuine. The purpose of running the Ljung-Box test is to verify whether any auto-correlation beyond the confidence region (as seen in the ACF plot) is a true positive (p-value < 0.05) or a false positive (p-value >= 0.05).

LjungBoxTest(residuals(model_1), k = 2, lag.max = 20) m Qm pvalue 1 0.01 0.92611002 2 0.01 0.91139925 3 0.17 0.68276539 4 0.82 0.66216496 5 1.35 0.71745835 6 1.99 0.73828536 7 3.32 0.65017878 8 3.98 0.67886254 9 5.16 0.64076272 10 13.34 0.10075806 11 15.32 0.08260244 12 15.32 0.12066369 13 15.35 0.16692082 14 15.39 0.22084407 15 15.40 0.28310047 16 23.69 0.04989871 17 25.63 0.04204503 18 27.65 0.03480954 19 30.06 0.02590249 20 30.07 0.03680262

The p-value at lag = 16 is < 0.05, hence the auto-correlation spike at lag = 16 outside the confidence interval is significant. Our model #1 has not captured all the structure of the original time series. The auto-correlations at lags = {17, 18, 19, 20} also have p-values < 0.05, however they stand within the confidence interval.

Now doing the same with sarima() function.

sarima(excess_ts, p = 1, d = 1, q = 1)

Gives this plot.

The sarima() output plot shows results congruent with previous comments.

Model #2 Residuals Diagnostic

Checking ARIMA (1,0,0)(0,0,1)[10] model residuals.

checkresiduals(model_2)

Gives this plot.

We observe that model #2 does not show auto-correlation spikes above the confidence interval, which confirms the presence of seasonality with period = 10.

LjungBoxTest(residuals(model_2), k = 2, lag.max = 20) m Qm pvalue 1 0.00 0.9674875 2 0.00 0.9545816 3 0.30 0.5815774 4 0.69 0.7096699 5 0.82 0.8442508 6 0.98 0.9121811 7 2.01 0.8481327 8 4.24 0.6442302 9 8.56 0.2861290 10 8.63 0.3742209 11 10.06 0.3459882 12 10.13 0.4290298 13 10.15 0.5172343 14 10.20 0.5985932 15 10.44 0.6577499 16 14.30 0.4275627 17 17.14 0.3104476 18 18.86 0.2759461 19 22.35 0.1715052 20 22.41 0.2142307

No p-values below 0.05 are shown in the Ljung-Box test report, so no significant residual auto-correlation is detected. Now showing the sarima() plots.

sarima(excess_ts, p = 1, d = 0, q = 0, P = 0, D = 0, Q = 1, S = 10)

Gives this plot.

The sarima() output plot shows results congruent with previous comments. As a conclusion, model #2 has white noise residuals.

Model #3 Residuals Diagnostic

Checking ARIMA (1,0,0)(1,0,0)[10] model residuals.

checkresiduals(model_3)

Gives this plot.

LjungBoxTest(residuals(model_3), k = 2, lag.max = 20) m Qm pvalue 1 0.00 0.9583318 2 0.00 0.9553365 3 0.18 0.6719971 4 0.37 0.8313599 5 0.46 0.9285781 6 0.63 0.9600113 7 1.63 0.8975737 8 3.90 0.6901431 9 8.23 0.3126132 10 8.34 0.4005182 11 10.36 0.3222430 12 10.36 0.4091634 13 10.39 0.4960029 14 10.52 0.5708185 15 10.78 0.6290166 16 14.81 0.3914043 17 17.91 0.2675070 18 19.69 0.2343648 19 23.37 0.1374412 20 23.70 0.1651785 sarima(excess_ts, p = 1, d = 0, q = 0, P = 1, D = 0, Q = 0, S = 10)

Gives this plot.

Model #3 has white noise residuals.

Model #4 Residuals Diagnostic

Checking residuals of the ARIMA (0,0,0)(0,0,1)[10] model with level shift.

checkresiduals(model_4)

Gives this plot.

LjungBoxTest(residuals(model_4), k = 1, lag.max = 20) m Qm pvalue 1 2.20 0.1379312 2 2.23 0.1349922 3 2.24 0.3262675 4 3.68 0.2977682 5 4.99 0.2884494 6 5.23 0.3884858 7 5.52 0.4787801 8 7.45 0.3837810 9 10.78 0.2142605 10 10.79 0.2905934 11 12.61 0.2465522 12 12.61 0.3198632 13 12.61 0.3979718 14 12.63 0.4769589 15 12.65 0.5538915 16 16.37 0.3578806 17 16.77 0.4005631 18 19.73 0.2882971 19 25.25 0.1182396 20 25.31 0.1504763 sarima(excess_ts, p = 0, d = 0, q = 0, P = 0, D = 0, Q = 1, S = 10, xreg = level)

Gives this plot.

As a conclusion, model #4 has white noise residuals.

Model #6 Residuals Diagnostic

Checking residuals of the ARIMA (1,0,0) model with level shift.

checkresiduals(model_6)

Gives this plot

We can observe ACF spikes above the confidence region for lags = {10, 16}.

LjungBoxTest(residuals(model_6), k = 1, lag.max = 20) m Qm pvalue 1 0.00 0.99713258 2 0.00 0.96036600 3 0.07 0.96612792 4 1.51 0.67947886 5 2.70 0.60998895 6 4.47 0.48361437 7 5.20 0.51895133 8 5.49 0.59997164 9 6.51 0.58979614 10 14.72 0.09890580 11 17.09 0.07239050 12 17.21 0.10175761 13 17.21 0.14179063 14 17.24 0.18844158 15 17.34 0.23872998 16 25.31 0.04596230 17 27.53 0.03591335 18 29.50 0.03021450 19 31.47 0.02537517 20 31.71 0.03370585

The Ljung-Box test reports the residuals' auto-correlation at lag = 10 as not significant (p-value > 0.05), while the lag = 16 auto-correlation is significant (p-value < 0.05). Let us run the same verification with sarima().

sarima(excess_ts, p = 1, d = 0, q = 0, xreg = level)

Gives this plot.

The sarima() plots confirm the presence of significant auto-correlation in the residuals at lag = 16. As a conclusion, this model does not capture all the structure of our original time series.

As an overall conclusion, only the seasonal models have white noise residuals.

Models Summary

Finally, it is time to gather the overall AIC, AICc and BIC metrics of our five candidate models (remember that model #5 was dropped from the analysis) and choose the final model.

df <- data.frame(col_1_res = c(model_1$aic, model_2$aic, model_3$aic, model_4$aic, model_6$aic),
                 col_2_res = c(model_1$aicc, model_2$aicc, model_3$aicc, model_4$aicc, model_6$aicc),
                 col_3_res = c(model_1$bic, model_2$bic, model_3$bic, model_4$bic, model_6$bic))
colnames(df) <- c("AIC", "AICc", "BIC")
rownames(df) <- c("ARIMA(1,1,1)",
                  "ARIMA(1,0,0)(0,0,1)[10]",
                  "ARIMA(1,0,0)(1,0,0)[10]",
                  "ARIMA(0,0,0)(0,0,1)[10] with level shift",
                  "ARIMA(1,0,0) with level shift")
df

                                               AIC      AICc       BIC
ARIMA(1,1,1)                             -329.8844 -329.5727 -322.7010
ARIMA(1,0,0)(0,0,1)[10]                  -344.4575 -343.9380 -334.8306
ARIMA(1,0,0)(1,0,0)[10]                  -344.1906 -343.6711 -334.5637
ARIMA(0,0,0)(0,0,1)[10] with level shift -348.8963 -348.3768 -339.2694
ARIMA(1,0,0) with level shift            -343.3529 -342.8334 -333.7260

The model which provides the best AIC, AICc and BIC metrics at the same time is model #4, ARIMA(0,0,0)(0,0,1)[10] with level shift.

Note: in the presence of structural changes, the (augmented) Dickey-Fuller test is biased toward non-rejection of the null, as ref. [2] explains. This may explain why the null hypothesis of a unit root was not rejected by that test and why model #1 performs worse than the other models in terms of the AIC metric. Further, you may verify that setting d = 1 in models #2 through #6 does not improve the AIC metric; one such check is sketched below.
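A minimal sketch of one such check (reusing the objects fitted above; here model #2 is refitted with one order of differencing and compared on AIC, and the same can be repeated for the other models):

# model #2 with d = 1; the mean term is dropped automatically once the series is differenced
model_2_d1 <- Arima(excess_ts, order = c(1,1,0),
                    seasonal = list(order = c(0,0,1), period = 10))
c(AIC_d0 = model_2$aic, AIC_d1 = model_2_d1$aic)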

Forecast

I take advantage of the forecast() function provided within the forecast package, using model #4 as input. The xreg variable for the forecast horizon is set to a constant level equal to one, consistently with the structural break analysis results.

h_fut <- 20
plot(forecast(model_4, h = h_fut, xreg = rep(1, h_fut)))

Gives this plot.

Historical Background

So we observed a level shift of approximately 2.25% in the boys vs. girls births excess ratio occurring around year 1670. Two questions arise:

1. What could have influenced the sex ratio at birth?
2. What was happening in London during the 17th century?

There are studies supporting the idea that the sex ratio at birth is influenced by war periods. Studies such as refs. [7], [8] and [9] suggest an increase in boys' births during and/or after wartime; the general justification is an adaptive response to external conditions and stresses. In contrast, studies such as ref. [10] state that there is no statistical evidence of a wartime influence on the sex ratio at birth. Under normal circumstances, the boys vs. girls sex ratio at birth is approximately 105:100 on average (according to some sources), and generally between 102:100 and 108:100.
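For comparison with the excess fraction analysed in this post, those ratios translate directly (simple arithmetic, added here for illustration):

# a 105:100 ratio corresponds to an excess fraction of 0.05;
# the 102:100 to 108:100 range corresponds to roughly 0.02 to 0.08
(105 - 100)/100
c((102 - 100)/100, (108 - 100)/100)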

Our time series covers most of the following eras of English history (ref. [11]):

* Early Stuart era: 1603-1660
* Later Stuart era 1660-1714

The English Civil War occurred during the Early Stuart era and consisted of a series of armed conflicts and political machinations that took place between Parliamentarians (known as Roundheads) and Royalists (known as Cavaliers) between 1642 and 1651. The English conflict left some 34,000 Parliamentarians and 50,000 Royalists dead, while at least 100,000 men and women died from war-related diseases, bringing the total death toll caused by the three civil wars in England to almost 200,000 (see ref. [12]).

If we step back to the first plot showing the abhutondot dataset, we can observe a sharp drop in births (both boys and girls) between 1642 and 1651, testifying to a period of scarce resources and famine, which we can infer was due to the civil war. Let us run a quick analysis on the total number of births.

abhutondot.ts <- ts(abhutondot$boys + abhutondot$girls, frequency = 1, start = abhutondot$year[1])
autoplot(abhutondot.ts)

Gives this plot.

I then verify whether any structural breaks are present, using a constant level as regressor.

summary(lm(abhutondot.ts ~ 1)) Call: lm(formula = abhutondot.ts ~ 1) Residuals: Min 1Q Median 3Q Max -5829.7 -2243.0 371.3 3281.3 4703.3 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 11442 358 31.96 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3242 on 81 degrees of freedom

The regression is reported as significant. Let us go on with the search for structural breaks, if any.

(break_point <- breakpoints(abhutondot.ts ~ 1)) Optimal 4-segment partition: Call: breakpoints.formula(formula = abhutondot.ts ~ 1) Breakpoints at observation number: 15 33 52 Corresponding to breakdates: 1643 1661 1680 plot(break_point)

Gives this plot.

summary(break_point) Optimal (m+1)-segment partition: Call: breakpoints.formula(formula = abhutondot.ts ~ 1) Breakpoints at observation number: m = 1 39 m = 2 33 52 m = 3 15 33 52 m = 4 15 33 52 68 m = 5 15 32 44 56 68 Corresponding to breakdates: m = 1 1667 m = 2 1661 1680 m = 3 1643 1661 1680 m = 4 1643 1661 1680 1696 m = 5 1643 1660 1672 1684 1696 Fit: m 0 1 2 3 4 5 RSS 851165494 211346686 139782582 63564154 59593922 62188019 BIC 1566 1461 1436 1380 1383 1396

The BIC minimum value is reached with m = 3. Let us plot the original time series against its structural breaks and their confidence intervals.

plot(abhutondot.ts)
fitted.ts <- fitted(break_point, breaks = 3)
lines(fitted.ts, col = 4)
lines(confint(break_point, breaks = 3))

Gives this plot.

The fitted levels and the breakpoints dates are as follows.

unique(as.integer(fitted.ts))
[1]  9843  6791 11645 14902
breakdates(break_point, breaks = 3)
[1] 1648 1663 1676

So it is confirmed that the total number of births went through a sequence of level shifts due to exogenous shocks. The civil war very likely determined the first break, and its end the second one. The drop in total births from the 9843 level down to 6791, occurring around year 1648, is remarkable, as is the upward level shift in year 1663, after the end of the civil war. The excess sex ratio structural break in year 1670 occurred approximately in between the second and third breaks of the total births series.
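To put a rough number on that first drop (simple arithmetic based on the fitted levels reported above):

# relative size of the first drop in total births, roughly a 31% decrease
(9843 - 6791)/9843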

The fitted multiple level shifts (as determined by the structural breaks analysis) can be used as an intervention variable to fit an ARIMA model, as shown below.

fitted.ts <- fitted(break_point, breaks = 3)
autoplot(fitted.ts)

Gives this plot.

abhutondot_xreg <- Arima(abhutondot.ts, order = c(0,1,1), xreg = fitted.ts, include.mean = TRUE) summary(abhutondot_xreg) Series: abhutondot.ts Regression with ARIMA(0,1,1) errors Coefficients: ma1 fitted.ts -0.5481 0.5580 s.e. 0.1507 0.1501 sigma^2 estimated as 743937: log likelihood=-661.65 AIC=1329.3 AICc=1329.61 BIC=1336.48 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 71.60873 846.5933 622.1021 0.0132448 5.92764 1.002253 0.08043183 coeftest(abhutondot_xreg) z test of coefficients: Estimate Std. Error z value Pr(>|z|) ma1 -0.54809 0.15071 -3.6368 0.0002761 *** fitted.ts 0.55799 0.15011 3.7173 0.0002013 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 checkresiduals(abhutondot_xreg)

Gives this plot.

LjungBoxTest(residuals(abhutondot_xreg), k=1, lag.max=20) m Qm pvalue 1 1.39 0.2387836 2 2.26 0.1325458 3 2.31 0.3156867 4 2.69 0.4416912 5 2.93 0.5701546 6 3.32 0.6503402 7 4.14 0.6580172 8 4.53 0.7170882 9 4.54 0.8054321 10 7.86 0.5479118 11 8.51 0.5791178 12 8.54 0.6644773 13 8.69 0.7292904 14 9.31 0.7491884 15 11.16 0.6734453 16 11.50 0.7163115 17 12.58 0.7035068 18 12.60 0.7627357 19 13.01 0.7906889 20 14.83 0.7334703 sarima(abhutondot.ts, p=0, d=1, q=1, xreg = fitted.ts)

Gives this plot.

The residuals are white noise.

Conclusions

Spotting seasonality and level shifts was an important aspect of our ARIMA model analysis. The seasonal component allowed us to define models with white noise residuals. The structural breaks analysis allowed us to define models with a better AIC metric by introducing the identified level shifts as a regressor. We also had the chance to go through some remarks on the history of England and to learn about studies on the sex ratio at birth.

If you have any questions, please feel free to comment below.

References


    Governance, Engagement, and Resistance in the Open Science Movement: A Comparative Study

    Fri, 10/06/2017 - 09:00

    (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

    A growing community of scientists from a variety of disciplines is moving the norms of scientific research toward open practices. Supporters of open science hope to increase the quality and efficiency of research by enabling the widespread sharing of datasets, research software source code, publications, and other processes and products of research. The speed at which the open science community seems to be growing mirrors the rapid development of technological capabilities, including robust open source scientific software, new services for data sharing and publication, and novel data science techniques for working with massive datasets. Organizations like rOpenSci harness such capabilities and deploy various combinations of these research tools, or what I refer to here as open science infrastructures, to facilitate open science.

As studies of other digital infrastructures have pointed out, developing and deploying the technological capabilities that support innovative work within a community of practitioners constitutes just one part of making innovation happen. As quickly as the technical solutions to improving scientific research may be developing, a host of organizational and social issues are lagging behind and hampering the open science community’s ability to inscribe open practices in the culture of scientific research. Remedying organizational and social issues requires paying attention to open science infrastructures’ human components, such as designers, administrators, and users, as well as the policies, practices, and organizational structures that contribute to the smooth functioning of these systems [1,2]. These elements of infrastructure development have, in the past, proven to be equal to or more important than technical capabilities in determining the trajectory the infrastructure takes (e.g., whether it “succeeds” or “fails”) [3,4].

    As a postdoc with rOpenSci, I have begun a qualitative, ethnographic project to explore the organizational and social processes involved in making open science the norm in two disciplines: astronomy and ecology. I focus on these two disciplines to narrow, isolate, and compare the set of contextual factors (e.g., disciplinary histories, research norms, and the like) that might influence perceptions of open science. Specifically, I aim to answer the following research questions (RQ):

    RQ1a: What are the primary motivations of scientists who actively engage with open science infrastructures?

    RQ1b: What are the factors that influence resistance to open science among some scientists?

    RQ2: What strategies do open science infrastructure leaders use to encourage participation, govern contributions, and overcome resistance to open science infrastructure use?
    a. To what extent do governance strategies balance standardization and flexibility, centralization and decentralization, and voluntary and mandatory contributions?
    b. By what mechanisms are open science policies and practices enforced?
    c. What are the commonalities and differences in the rationale behind choices of governance strategies?

    Below, I describe how I am systematically investigating these questions in two parts. In Part 1, I am identifying the issues raised by scientists who engage with or resist the open science movement. In Part 2, I am studying the governance strategies open science leaders and decision-makers use to elicit engagement with open science infrastructures in these disciplines.

    Part 1: Engagement with and Resistance to Open Science

    I am firmly rooted in a research tradition which emphasizes that studying the uptake of a new technology or technological approach, no matter the type of work or profession, begins with capturing how the people charged with changing their activities respond to the change “on the ground.” In this vein, Part 1 of the study aims to lend empirical support or opposition to arguments for and against open science that are commonly found in opinion pieces, on social media, and in organizational mission statements. A holistic reading of such documents reveals several commonalities in the positions for and against open science. Supporters of open science often cite increased transparency, reproducibility, and collaboration as the overwhelming benefits of making scientific research processes and products openly available. Detractors highlight concerns over “scooping,” ownership, and the time costs of curating and publishing code and data.

I am seeking to verify and test these claims by interviewing and surveying astronomers and ecologists or, more broadly, earth and environmental scientists who fall on various parts of the open science engagement-to-resistance spectrum. I am conducting interviews using a semi-structured interview protocol [5] across all interviewees. I will then use a qualitative data analysis approach based on the grounded theory method [6] to extract themes from the responses, focusing on the factors that promote engagement (e.g., making data available, spending time developing research software, or making publications openly accessible) or resistance (e.g., unwillingness to share code used in a study or protecting access to research data). Similar questions will be asked at scale via a survey.

    Armed with themes from the responses, I will clarify and refine the claims often made in the public sphere about the benefits and drawbacks of open science. I hope to develop this part of the study into actionable recommendations for promoting open science, governing contributions to open science repositories, and addressing the concerns of scientists who are hesitant about engagement.

    Part 2: Focusing on Governance

    Even with interviews and surveys of scientists on the ground, it is difficult to systematically trace and analyze the totality of social and political processes that support open science infrastructure development because the processes occur across geographic, disciplinary, and other boundaries.

However, as others have pointed out [7], the organizational and social elements of digital infrastructure development often become visible and amenable to study through infrastructure governance. Governance refers to the combination of “executive and management roles, program oversight functions organized into structures, and policies that define management principles and decision making” [8]. Effective governance provides projects with the direction and oversight necessary to achieve desired outcomes of infrastructure development while allowing room for creativity and innovation [2,9,10]. Studying a project’s governance surfaces the negotiation processes that occur among stakeholders—users, managers, organizations, policymakers, and the like—throughout the development process. Outcomes include agreements about the types of technologies used, standards defining the best practices for technology use, and other policies to ensure that a robust, sustainable infrastructure evolves [9,11].

Despite the scientific research community’s increasing reliance on open science infrastructures, few studies compare different infrastructure governance strategies [2] and even fewer develop new or revised strategies for governing infrastructure development and use [12]. The primary goal of this part of the project is to address this gap in our understanding of the governance strategies used to create, maintain, and grow open science infrastructures.

    I am administering this part of the study by conducting in-depth, semi-structured interviews with leaders of various open science infrastructure projects supporting work in astronomy and ecology. I define “leaders” in this context as individuals or small groups of individuals who make decisions about the management of open science infrastructures and their component parts. This set of leaders includes founders and administrators of widely-used scientific software packages and collections of packages, of open data repositories, of open access publication and preprint services, and various combinations of open science tools. Furthermore, I intend to interview the leaders of organizations with which the open science community interacts—top publication editors, for example—to gauge how open science practices and processes are being governed outside of active open science organizations.

    I will conduct qualitative coding as described above to develop themes from the responses of open science leaders. I will then ground these themes in the literature on digital infrastructure governance—which emphasizes gradual, decentralized, and voluntary development—and look for avenues to improve governance strategies.

    Alongside the interview and survey methods, I am actively observing and retaining primary documents from the ongoing discourse around open science in popular scientific communication publications (e.g., Nature and Science), conferences and meetings (e.g., OpenCon and discipline-specific hackweeks), and in the popular media/social media (e.g., The New York Times and Twitter threads).

    Preliminary Themes

    I entered this project with a very basic understanding of how open science “works”—the technical and social mechanisms by which scientists make processes and outputs publicly available. In learning about the open science movement, in general and in particular instantiations, I’ve begun to see the intricacies involved in efforts to change scientific research and its modes of communication: research data publication, citation, and access; journal publication availability; and research software development and software citation standards. Within the community trying to sustain these changes are participants and leaders who are facing and tackling several important issues head-on. I list some of the most common engagement, resistance, and governance challenges appearing in interview and observation transcripts below.

    Participation challenges
    • Overcoming the fear of sharing code and data, specifically the fear of sharing “messy” code and the fear of being shamed for research errors.
    • Defending the time and financial costs of participation in open science—particularly open source software development—to supervisors, collaborators, or tenure and promotion panels who are not engaged with open science.
    • Finding time to make code and data usable for others (e.g., through good documentation or complete metadata) and, subsequently, finding a home where code and data can easily be searched and found.
    Governance challenges
    • Navigating the issue of convincing researchers that software development and data publication/archiving “count” as research products, even though existing funding, publication, and tenure and promotion models may not yet value those contributions.
    • Developing guidelines and processes for conducting peer review on research publication, software, and data contributions, especially the tensions involved in “open review.”
    • Deciding whose responsibility it is to enforce code and data publication standards or policies, both within open science organizations and in traditional outlets like academic journals.

    The points raised in this post and the questions guiding my project might seem like discussions you’ve had too many times over coffee during a hackathon break or over beers after a conference session. If so, I’d love to hear from you, even if you are not an astronomer, an ecologist, or an active leader of an open science infrastructure. I am always looking for new ideas, both confirming and disconfirming, to refine my approach to this project.

    1. Braa, J., Hanseth, O., Heywood, A., Mohammed, W., Shaw, V. 2007. Developing health information systems in developing countries: The flexible standards strategy. MIS Quarterly, 31(2), 381-402. https://doi.org/10.2307/25148796 

    2. Tilson, D., Lyytinen, K., Sørensen, C. 2010 Digital infrastructures: The missing IS research agenda. Information Systems Research, 21(4), 748-759. https://doi.org/10.1287/isre.1100.0318 

    3. Borgman, C. L. 2010. Scholarship in the digital age: Information, infrastructure, and the Internet. MIT Press, Cambridge, MA. ISBN: 9780262250863 

    4. Vaast, E., Walsham, G. 2009. Trans-situated learning: Supporting a network of practice with an information infrastructure. Information Systems Research, 20(4), 547-564. https://doi.org/10.1287/isre.1080.0228 

    5. Spradley, J. P. (2016). The ethnographic interview. Longegrove, IL: Waveland Press. ISBN: 0030444969 

    6. Corbin, J., Strauss, A., 1990. Grounded theory research: Procedures, canons, and evaluative criteria. Qualitative Sociology, 13(1), 3-21. https://doi.org/10.1007/BF00988593 

    7. Barrett, M., Davidson, E., Prabhu, J., Vargo, S. L. 2015. Service innovation in the digital age: Key contributions and future directions. MIS Quarterly, 39(1) 135-154. DOI: 10.25300/MISQ/2015/39:1.03 

    8. Hanford, M. 2005. Defining program governance and structure. IBM developerWorks. Available at: https://www.ibm.com/developerworks/rational/library/apr05/hanford/

    9. Star, S. L., Ruhleder, K. 1996. Steps toward an ecology of infrastructure: Design and access for large information spaces. Information Systems Research, 7(1), 111-134. https://doi.org/10.1287/isre.7.1.111 

    10. Edwards, P. N., Jackson, S. J., Bowker, G. C., Knobel, C. P. 2007. Understanding infrastructure: Dynamics, tensions, and design. Final report for Workshop on History and Theory of Infrastructure: Lessons for New Scientific Cyberinfrastructures. NSF Report. Available at: https://deepblue.lib.umich.edu/handle/2027.42/49353

    11. Hanseth, O., Jacucci, E., Grisot, M., Aanestad, M. 2006. Reflexive standardization: Side effects and complexity in standard making. MIS Quarterly, 30(1), 563-581. https://doi.org/10.2307/25148773 

    12. Hanseth, O., & Lyytinen, K. (2010). Design theory for dynamic complexity in information infrastructures: the case of building internet. Journal of Information Technology, 25(1), 1-19. 


    To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    R-Ladies global tour

    Fri, 10/06/2017 - 02:00

    (This article was first published on Maëlle, and kindly contributed to R-bloggers)

    It was recently brought to my attention by Hannah Frick that there are now sooo many R-Ladies chapters around the world! R-Ladies is a world-wide organization to promote gender diversity in the R community, and I’m very grateful to be part of this community through which I met so many awesome ladies! Since we’re all connected, it has now happened quite a few times that R-Ladies gave talks at chapters outside of their hometowns. An R-Lady from Taiwan giving a talk in Madrid while on a trip in Europe and another one doing the same in Lisbon, an R-Lady from San Francisco presenting at the London and Barcelona chapters thanks to a conference on the continent, an R-Lady from Uruguay sharing her experience for the New York City and San Francisco chapters… It’s like rockstars tours!

    Therefore we R-Ladies often joke about doing an exhaustive global tour. Hannah made me think about this tour again… If someone were to really visit all of the chapters, what would be the shortest itinerary? And could we do a cool gif with the results? These are the problems we solve here.

    Getting the chapters

    To find all chapters, I’ll use Meetup information about meetups whose topics include “r-ladies”, although it means forgetting a few chapters that maybe haven’t updated their topics yet. Thus, I’ll scrape this webpage because I’m too impatient to wait for the cool meetupr package to include the Meetup API topic endpoint and because I’m too lazy to include it myself. I did open an issue though. Besides, I was allowed to scrape the page:

robotstxt::paths_allowed("https://www.meetup.com/topics/")
## [1] TRUE

    Yesss. So let’s scrape!

    library("rvest") link <- "https://www.meetup.com/topics/r-ladies/all/" page_content <- read_html(link) css <- 'span[class="text--secondary text--small chunk"]' chapters <- html_nodes(page_content, css) %>% html_text(trim = TRUE) chapters <- stringr::str_replace(chapters, ".*\\|", "") chapters <- trimws(chapters) head(chapters) ## [1] "London, United Kingdom" "San Francisco, CA" ## [3] "Istanbul, Turkey" "Melbourne, Australia" ## [5] "New York, NY" "Madrid, Spain" # Montenegro chapters[chapters == "HN\\, Montenegro"] <- "Herceg Novi, Montenegro" Geolocating the chapters

Here I decided to use a nifty package for the awesome OpenCage API. OK, this is my own package. But hey, it's really a good geocoding API. And the package was reviewed for rOpenSci by Julia Silge! In the docs of the package you'll see how to save your API key in order not to have to input it as a function parameter every time.

    Given that there are many chapters but not that many (41 to be exact), I could inspect the results and check them.

    geolocate_chapter <- function(chapter){ # query the API results <- opencage::opencage_forward(chapter)$results # deal with Strasbourg if(chapter == "Strasbourg, France"){ results <- dplyr::filter(results, components.city == "Strasbourg") } # get a CITY results <- dplyr::filter(results, components._type == "city") # sort the results by confidence score results <- dplyr::arrange(results, desc(confidence)) # choose the first line among those with highest confidence score results <- results[1,] # return only long and lat tibble::tibble(long = results$geometry.lng, lat = results$geometry.lat, chapter = chapter, formatted = results$formatted) } chapters_df <- purrr::map_df(chapters, geolocate_chapter) # add an index variable chapters_df <- dplyr::mutate(chapters_df, id = 1:nrow(chapters_df)) knitr::kable(chapters_df[1:10,]) long lat chapter formatted id -0.1276473 51.50732 London, United Kingdom London, United Kingdom 1 -122.4192362 37.77928 San Francisco, CA San Francisco, San Francisco City and County, California, United States of America 2 28.9651646 41.00963 Istanbul, Turkey Istanbul, Fatih, Turkey 3 144.9631608 -37.81422 Melbourne, Australia Melbourne VIC 3000, Australia 4 -73.9865811 40.73060 New York, NY New York City, United States of America 5 -3.7652699 40.32819 Madrid, Spain Leganés, Community of Madrid, Spain 6 -77.0366455 38.89495 Washington, DC Washington, District of Columbia, United States of America 7 -83.0007064 39.96226 Columbus, OH Columbus, Franklin County, Ohio, United States of America 8 -71.0595677 42.36048 Boston, MA Boston, Suffolk, Massachusetts, United States of America 9 -79.0232050 35.85030 Durham, NC Durham, NC, United States of America 10 Planning the trip

    I wanted to use the ompr package inspired by this fantastic use case, “Boris Johnson’s fully global itinerary of apology” – be careful, the code of this use case is slightly outdated but is up-to-date in the traveling salesperson vignette. The ompr package supports modeling and solving Mixed Integer Linear Programs. I got a not so bad notion of what this means by looking at this collection of use cases. Sadly, the traveling salesperson problem is a complicated problem and its solving time exponentially increases with the number of stops… in that case, it became really too long for plain mixed integer linear programming, as in “more than 24 hours later not done” too long.

Therefore, I decided to use a specific R package for traveling salesperson problems, TSP. Dirk, ompr’s maintainer, actually used it once as seen in this gist and then in this newspaper piece about how to go to all 78 Berlin museums during the night of the museums. Quite cool!

    We first need to compute the distance between chapters. In kilometers and rounded since it’s enough precision.

convert_to_km <- function(x){
  round(x/1000)
}
distance <- geosphere::distm(as.matrix(dplyr::select(chapters_df, long, lat)),
                             fun = geosphere::distGeo) %>%
  convert_to_km()

    I used methods that do not find the optimal tour. This means that probably my solution isn’t the best one, but let’s say it’s ok for this time. Otherwise, the best thing is to ask Concorde’s maintainer if one can use their algorithm which is the best one out there, see its terms of use here.

    library("TSP") set.seed(42) result0 <- solve_TSP(TSP(distance), method = "nearest_insertion") result <- solve_TSP(TSP(distance), method = "two_opt", control = list(tour = result0))

    And here is how to link the solution to our initial chapters data.frame.

paths <- tibble::tibble(from = chapters_df$chapter[as.integer(result)],
                        to = chapters_df$chapter[c(as.integer(result)[2:41], as.integer(result)[1])],
                        trip_id = 1:41)
paths <- tidyr::gather(paths, "property", "chapter", 1:2)
paths <- dplyr::left_join(paths, chapters_df, by = "chapter")
knitr::kable(paths[1:3,])

trip_id  property  chapter              long        lat       formatted                                                     id
1        from      Charlottesville, VA  -78.55676   38.08766  Charlottesville, VA, United States of America                 38
2        from      Washington, DC       -77.03665   38.89495  Washington, District of Columbia, United States of America    7
3        from      New York, NY         -73.98658   40.73060  New York City, United States of America                       5

Plotting the tour, boring version

    I’ll start by plotting the trips as it is done in the vignette, i.e. in a static way. Note: I used Dirk’s code in the Boris Johnson use case for the map, and had to use a particular branch of ggalt to get coord_proj working.

    library("ggplot2") library("ggalt") library("ggthemes") library("ggmap") world <- map_data("world") %>% dplyr::filter(region != "Antarctica") ggplot(data = paths, aes(long, lat)) + geom_map(data = world, map = world, aes(long, lat, map_id = region), fill = "white", color = "darkgrey", alpha = 0.8, size = 0.2) + geom_path(aes(group = trip_id), color = "#88398A") + geom_point(data = chapters_df, color = "#88398A", size = 0.8) + theme_map(base_size =20) + coord_proj("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") + ggtitle("R-Ladies global tour", subtitle = paste0(tour_length(result), " km"))

    Dirk told me the map would look better with great circles instead of straight lines so I googled a bit around, asked for help on Twitter before finding this post.

    library("geosphere") # find points on great circles between chapters gc_routes <- gcIntermediate(paths[1:length(chapters), c("long", "lat")], paths[(length(chapters)+1):(2*length(chapters)), c("long", "lat")], n = 360, addStartEnd = TRUE, sp = TRUE, breakAtDateLine = TRUE) gc_routes <- SpatialLinesDataFrame(gc_routes, data.frame(id = paths$id, stringsAsFactors = FALSE)) gc_routes_df <- fortify(gc_routes) p <- ggplot() + geom_map(data = world, map = world, aes(long, lat, map_id = region), fill = "white", color = "darkgrey", alpha = 0.8, size = 0.2) + geom_path(data = gc_routes_df, aes(long, lat, group = group), alpha = 0.5, color = "#88398A") + geom_point(data = chapters_df, color = "#88398A", size = 0.8, aes(long, lat)) + theme_map(base_size =20)+ coord_proj("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") p + ggtitle("R-Ladies global tour", subtitle = paste0(tour_length(result), " km"))

    Ok this is nicer, it was worth the search.

    Plotting the tour, magical version

    And now I’ll use magick because I want to add a small star flying around the world. By the way if this global tour were to happen I reckon that one would need to donate a lot of money to rainforest charities or the like, because it’d have a huge carbon footprint! Too bad really, I don’t want my gif to promote planet destroying behaviours.

    To make the gif I used code similar to the one shared in this post but in a better version thanks to Jeroen who told me to read the vignette again. Not saving PNGs saves time!

    I first wanted to really show the emoji flying along the route and even created data for that, with a number of rows between chapters proportional to the distance between them. It’d have looked nice and smooth. But making a gif with hundreds of frames ended up being too long for me at the moment. So I came up with another idea, I’ll have to hope you like it!

    library("emojifont") load.emojifont('OpenSansEmoji.ttf') library("magick") plot_one_moment <- function(chapter, size, p, chapters_df){ print(p + ggtitle(paste0("R-Ladies global tour, ", chapters_df[chapters_df$chapter == chapter,]$chapter), subtitle = paste0(tour_length(result), " km"))+ geom_text(data = chapters_df[chapters_df$chapter == chapter,], aes(x = long, y = lat, label = emoji("star2")), family="OpenSansEmoji", size = size)) } img <- image_graph(1000, 800, res = 96) out <- purrr::walk2(rep(chapters[as.integer(result)], each = 2), rep(c(5, 10), length = length(chapters)*2), p = p, plot_one_moment, chapters_df = chapters_df) dev.off() ## png ## 2 image_animate(img, fps=1) %>% image_write("rladiesglobal.gif")

    At least I made a twinkling star. I hope Hannah will be happy with the gif, because now I’d like to just dream of potential future trips! Or learn a bit of geography by looking at the gif.


    To leave a comment for the author, please follow the link and comment on their blog: Maëlle. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Checking residual distributions for non-normal GLMs

    Fri, 10/06/2017 - 02:00

    (This article was first published on Bluecology blog, and kindly contributed to R-bloggers)

Checking residual distributions for non-normal GLMs

Quantile-quantile plots

    If you are fitting a linear regression with Gaussian (normally
    distributed) errors, then one of the standard checks is to make sure the
    residuals are approximately normally distributed.

    It is a good idea to do these checks for non-normal GLMs too, to make
    sure your residuals approximate the model’s assumption.

Here I explain how to create quantile-quantile plots for non-normal
data, using an example of fitting a GLM using Student-t distributed
errors. Such models can be appropriate when the residuals are
overdispersed.

    First let’s create some data. We will make a linear predictor (ie the
    true regression line) eta and then simulate some data by adding
    residuals. We will simulate two data-sets that have the same linear
    predictor, but the first will have normally distributed errors and the
    second will have t distributed errors:

n <- 100
phi <- 0.85
mu <- 0.5
set.seed(23)
x <- rnorm(n)
eta <- mu + phi * x
nu <- 2.5
tau <- 3
y_tdist <- eta + (rt(n, df=nu)/sqrt(tau))
y_normdist <- eta + rnorm(n, sd = 1/sqrt(tau))
plot(x, y_tdist)
points(x, y_normdist, col = "red", pch = 16, cex = 0.8)
legend("topleft", legend = c("t-distributed errors", "normally distributed errors"),
       pch = c(1,16), col = c("black", "red"))

Notice how the t-distributed data are more spread out. The df
parameter, here nu = 2.5, controls how dispersed the data are:
lower values give data that are more dispersed, while large values
approach a normal distribution.
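A small numerical sketch of why lower df means more spread (my own illustration, using the scaling from the simulation above): the variance of a t distribution with df = nu is nu/(nu - 2), so after dividing by sqrt(tau) the residual variance becomes nu/((nu - 2) * tau):

nu_values <- c(2.5, 5, 30)
# residual variance for t-distributed errors scaled by 1/sqrt(tau), with tau = 3
nu_values / ((nu_values - 2) * 3)
# compare with the variance of the normally distributed errors, 1/tau
1/3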

    Now let’s fit a Gaussian glm (just a linear regression really) to both
    these data-sets

m1_norm <- glm(y_normdist ~ x)
m1_t <- glm(y_tdist ~ x)

    We should check whether the two models meet the normal assumption, using
    the standard ‘qq’ (quantile-quantile) plot:

par(mfrow = c(1,2))
plot(m1_norm, 2)
plot(m1_t, 2)

    These plots compare the theoretical quantiles to the actual quantiles of
    the residuals. If the points fall on the straight line then the
    theoretical and realised are very similar, and the assumption is met.
    Clearly this is not the case for the second model, which is
    overdispersed.

We know it is overdispersed because the theoretical quantiles are much
smaller than the actual ones at the tails (notice how the points at the
ends bend down on the left and up on the right).

The p-values (or CIs if you use them) for m1_t are therefore likely
biased and too narrow, leading potentially to type I errors (us saying
that x affects y when in fact it does not). In this case we know we
haven’t made a type I error, because we made up the data. However, if
you were using real data you wouldn’t be so sure.

    Doing our own quantile-quantile plot

    To better understand the QQ plot it helps to generate it yourself,
    rather than using R’s automatic checks.

    First we calculate the model residuals (in plot(m1_t) R did this
    internally):

    m1_t_resid <- y_tdist - predict(m1_t)

    Then we can plot the quantiles for the residuals against theoretical
    quantiles generated using qnorm. Below we also plot the original QQ
    plot from above, so you can see that our version is the same as R’s
    automatic one:

par(mfrow = c(1,2))
qqplot(qnorm(ppoints(n), sd = 1), m1_t_resid)
qqline(m1_t_resid, lty = 2)
plot(m1_t,2)

    I added the qqline for comparative purposes. It just puts a line
    through the 25th and 75th quantiles.
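
    If you are curious, you can reproduce that line yourself from the two
    quantiles, something like this:

    # line through the 25th and 75th quantiles of the residuals versus the
    # corresponding standard normal quantiles (this is what qqline does)
    probs <- c(0.25, 0.75)
    y_q <- quantile(m1_t_resid, probs)
    x_q <- qnorm(probs)
    slope <- diff(y_q) / diff(x_q)
    intercept <- y_q[1] - slope * x_q[1]

    qqplot(qnorm(ppoints(n)), m1_t_resid)
    abline(intercept, slope, col = "blue", lty = 3)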

    QQ plot for a non-normal GLM

    Now that we have learned how to write our own custom code for a QQ plot,
    we can use it to check other types of non-normal data.

    Here we will fit a GLM to the y_tdist data using Student-t distributed
    errors. I do this with the Bayesian package INLA.
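
    INLA is not on CRAN, so if you don’t already have it you need to install
    it from the project’s own repository (see www.r-inla.org for the current
    instructions); at the time of writing the install command looks
    something like this:

    # install INLA from its own repository (check www.r-inla.org if this URL changes)
    install.packages("INLA",
                     repos = c(getOption("repos"),
                               INLA = "https://inla.r-inla-download.org/R/stable"),
                     dep = TRUE)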

    library(INLA)

    data <- list(y = y_tdist, x = x)

    mod_tdist <- inla(y ~ x, family = "T", data = data,
                      control.predictor = list(compute = TRUE),
                      control.family = list(
                        hyper = list(
                          prec = list(prior = "loggamma", param = c(1, 0.5)),
                          dof = list(prior = "loggamma", param = c(1, 0.5))
                        )
                      ))

    The family = "T" argument tells INLA to use the t-distribution rather
    than the Normal distribution. Note also that I have specified the priors
    using the control.family argument. This is best practice. We need a
    prior for the precision (1/variance) and a prior for the dof (degrees of
    freedom, which has to be > 2 in INLA).

    It is sometimes helpful to visualise the priors, so we can check that
    they look sensible. Here we visualise the prior for the dof (which in
    INLA has a minimum value of 2):

    xgamma <- seq(0.01, 10, length.out = 100)
    plot(xgamma + 2, dgamma(xgamma, 1, 0.5), type = 'l',
         xlab = "Quantile", ylab = "Density")

    We don’t really expect values much greater than 10, so this prior makes
    sense. If we used an old-school prior that was flat from 2 to 1000 we
    might get issues with model fitting.

    Now enough about priors. Let’s look at the estimated coefficients:

    mod_tdist$summary.fixed

    ##                  mean         sd 0.025quant  0.5quant 0.975quant      mode
    ## (Intercept) 0.5324814 0.07927198  0.3773399 0.5321649  0.6891779 0.5315490
    ## x           0.7229362 0.08301006  0.5565746 0.7239544  0.8835630 0.7259817
    ##                      kld
    ## (Intercept) 3.067485e-12
    ## x           6.557565e-12

    Good: the CIs contain our true values (mu = 0.5 for the intercept and
    phi = 0.85 for the slope), and the posterior means are close to the true
    values too.

    What about the hyper-parameters (the precision and DF)? We need to get
    INLA to run some more calculations to get accurate estimates of these:

    h_tdist <- inla.hyperpar(mod_tdist)
    h_tdist$summary.hyperpar[, 3:5]

    ##                                           0.025quant  0.5quant 0.975quant
    ## precision for the student-t observations  0.2663364 0.6293265   1.163440
    ## degrees of freedom for student-t          2.2404966 2.7396391   4.459057

    The estimate for the DF may be a little way off the mark (we simulated
    with nu = 2.5). That is OK, and we expect it: you need a lot of really
    good data to get accurate estimates of hyper-parameters.

    Now, let’s use our skills in creating QQ plots to make a QQ plot using
    theoretical quantiles from the t distribution.

    The first step is to extract INLA’s predictions of the data, so we can
    calculate residuals:

    preds <- mod_tdist$summary.fitted.values
    resids <- y_tdist - preds[, 4]   # column 4 is the posterior median (0.5quant)

    The next step is to extract the marginal estimates of the DF and
    precision to use when generating our QQ plot (the quantiles will change
    with the DF):

    tau_est <- h_tdist$summary.hyperpar[1, 4]   # posterior median of the precision
    nu_est <- h_tdist$summary.hyperpar[2, 4]    # posterior median of the DF

    Now we can use qt() to generate the theoretical quantiles, with the
    residuals providing our realised quantiles:

    qqplot(qt(ppoints(n), df = nu_est), resids * sqrt(tau_est),
           xlab = "Theoretical quantile", ylab = "residuals")
    qqline(resids * sqrt(tau_est), lty = 2)

    Note that I multiply the residuals by the square root of the precision
    estimate. This is how INLA fits a t-distributed GLM. I do the same for
    the qqline.

    Our residuals are now falling much closer to the line. The model is
    doing a much better job of fitting the data. You could also calculate
    the WAIC for this model and a Gaussian one, to compare the fits. The
    t-distributed GLM should have a lower WAIC (better fit).
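
    One way to make that comparison is to refit both models with
    control.compute = list(waic = TRUE) and then compare the two WAIC
    values, something like this:

    # refit the Gaussian and t models, asking INLA to compute the WAIC;
    # the lower WAIC indicates the better-fitting model
    mod_gauss <- inla(y ~ x, family = "gaussian", data = data,
                      control.compute = list(waic = TRUE))
    mod_t_waic <- inla(y ~ x, family = "T", data = data,
                       control.compute = list(waic = TRUE),
                       control.family = list(
                         hyper = list(
                           prec = list(prior = "loggamma", param = c(1, 0.5)),
                           dof = list(prior = "loggamma", param = c(1, 0.5))
                         )
                       ))
    c(gaussian = mod_gauss$waic$waic, t = mod_t_waic$waic$waic)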

    We can now be confident that our CIs are accurate.



    The BayesianTools R package with general-purpose MCMC and SMC samplers for Bayesian statistics

    Thu, 10/05/2017 - 14:59

    (This article was first published on Submitted to R-bloggers – theoretical ecology, and kindly contributed to R-bloggers)

    This is a somewhat belated introduction to a package that we published on CRAN at the beginning of the year; I hadn’t found the time to blog about it earlier. In the R environment and beyond, a large number of packages exist that estimate posterior distributions via MCMC sampling, either for specific statistical models (e.g.…


