Celebrating 20 years of CRAN
(This article was first published on R – scottishsnow, and kindly contributed to Rbloggers)
In April I spoke for the Royal Statistical Society (Glasgow branch) at their event celebrating 20 years of CRAN. The other speakers were Charis Chanialidis and Colin Gillespie. Like most people I find watching myself speak really cringe worthy, but perhaps it’s not so bad! The three talks are here:
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – scottishsnow. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RStudio v1.1 Released
(This article was first published on RStudio Blog, and kindly contributed to Rbloggers)
We’re excited to announce the general availability of RStudio 1.1. Highlights include:
 A connections tab which makes it easy to connect to, explore, and view data in a variety of databases.
 A terminal tab which provides fluid shell integration with the IDE, xterm emulation, and even support for fullscreen terminal applications.
 An object explorer which can navigate deeply nested R data structures and objects.
 A new, modern dark theme and Retinaquality icons throughout.
 Dozens of other small improvements and bugfixes.
RStudio Server Pro 1.1 is also now available. Some of the new Pro features include:
 Support for floating licenses, which make it easy to run RStudio Server Pro in Docker containers, virtual machines, and cloud computing instances.
 Improved session management, which allows analysts to label, multiselect, force quit and otherwise selfmanage their R sessions.
 Tools for administrators, including the ability to send users notifications and automatically clean up unused sessions, freeing disk space and resources.
You can download RStudio v1.1 today, and let us know what you think on the RStudio IDE community forum.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
How to sample from multidimensional distributions using Gibbs sampling?
(This article was first published on Appsilon Data Science Blog, and kindly contributed to Rbloggers)
We will show how to perform multivariate random sampling using one of the Markov Chain Monte Carlo (MCMC) algorithms, called the Gibbs sampler.
Random sampling with rabbit on the bed plane
To start, what are MCMC algorithms and what are they based on?
Suppose we are interested in generating a random variable X with a distribution of \pi, over \mathcal{X}.
If we are not able to do this directly, we will be satisfied with generating a sequence of random variables X_0, X_1, \ldots, which in a sense tending to a distribution of \pi. The MCMC approach explains how to do so:
Build a Markov chain X_0, X_1, \ldots, for \mathcal{X}, whose stationary distribution is \pi.
If the structure is correct, we should expect random variables to converge to \pi.
In addition, we can expect that
for function f: \mathcal{X} \rightarrow \mathbb{R}, occurs:
\mathbb{E}(f) = \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=0}^{n1}f(X_{i})
with probability equals to 1.
In essence, we want the above equality to occur for any arbitrary random variable X_0.
One of the MCMC algorithms that guarantee the above properties is the socalled Gibbs sampler.
Gibbs Sampler – description of the algorithm.
Assumptions:
 \pi is defined on the product space \mathcal{X} = \Pi_{i=1}^{d}\mathcal{X_{i}}
 We are able to draw from the conditional distributions
\pi(X_{i}X_{i}),
where
X_{i} = (X_{1}, \ldots X_{i1}, X_{i+1}, \ldots X_{d})
Algorithm steps:
 Select the initial values X_{k}^{(0)}, k = 1, \ldots d
 For t=1,2,\ldots repeat:
For k = 1, \ldots d sample X_{k}^{(t)} from distribution \pi(X_{k}^{(t)}\rvert X_{1}^{(t)}, \ldots X_{k1}^{(t)}, X_{k+1}^{(t1)}, \ldots X_{d}^{(t1)})
 Repeat step 2 until the distribution of vector (X_{1}^{(t)}, \ldots X_{d}^{(t)}) stabilizes.
 The subsequent iterations of point 2 will provide a randomization of \pi.
How do we understand point 3?
We are most satisfied when average coordinates stabilize to some accuracy.
Intuition in twodimensional case:
Source: [3]
Gibbs sampling for randomization with a twodimensional normal distribution.
We will sample from the distribution of \theta \sim \mathcal{N}(0, [\sigma_{ij}]), where
\sigma_{11} = \sigma_{22} = 1 and \sigma_{12} = \sigma_{21} = \rho.
Knowing joint density of \theta, it’s easy to show, that:
\theta_{1}\theta_{2} \sim \mathcal{N}(\rho\theta_{2}, 1\rho^{2})
\theta_{2}\theta_{1} \sim \mathcal{N}(\rho\theta_{1}, 1\rho^{2})
R implementation:
gibbs_normal_sampling < function(n_iteration, init_point, ro) { # init point is some numeric vector of length equals to 2 theta_1 < c(init_point[1], numeric(n_iteration)) theta_2 < c(init_point[2], numeric(n_iteration)) for (i in 2:(n_iteration+1)) { theta_1[i] < rnorm(1, ro * theta_2[i1], sqrt(1  ro^2)) theta_2[i] < rnorm(1, ro * theta_1[i], sqrt(1  ro^2)) } list(theta_1, theta_2) }
Using the above function, let’s see how the 10 points were scored at \rho = 0.5:
And for 10000 iterations:
We leave you a comparison of how the stability of the parameters changes depending on the selected \rho parameter.
Let’s move on to use the Gibbs sampler to estimate the density parameters.
We will show the use of the Gibbs sampler and bayesian statistics to estimate the mean parameters in the mix of normal distributions.
Assumptions (simplified case):
 iid. sample Y=Y_{1},\ldots Y_{N} comes from a mixture of normal distributions (1\pi)\mathcal{N}(\mu_{1}, \sigma_{1}^{2})+\pi\mathcal{N}(\mu_{2}, \sigma_{2}^{2}), where \pi, \sigma_{1} i \sigma_{2} are known.
 For i=1,2 \mu_{i} \sim \mathcal{N}(0, 1) (a priori distributions) and \mu_{1} with \mu_{2} are independent.
 \Delta = (\Delta_{1}, \ldots, \Delta_{N}) – the classification vector (unobserved) from which Y is derived (when \Delta_{i} = 0, Y_{i} is drawn from \mathcal{N}(\mu_{1}, \sigma_{1}^{2}), when \Delta_{i} = 1 then drawn from \mathcal{N}(\mu_{2}, \sigma_{2}^{2})).
 \mathbb{P}(\Delta_{i} = 1) = \pi
With above assumptions it can be shown that:
 f(\Delta) = (1\pi)^{N\sum_{i=1}^{N}\Delta_{i}}\cdot\pi^{\sum_{i=1}^{N}\Delta_{i}}
 (\mu_{1}\rvert\Delta,Y) \sim \mathcal{N}(\frac{\sum_{i=1}^{N}(1\Delta_{i})y_{i}}{1 + \sum_{i=1}^{N}(1\Delta_{i})}, \frac{1}{1+\sum_{i=1}^{N}(1\Delta_{i})})
 (\mu_{2}\rvert\Delta,Y) \sim \mathcal{N}(\frac{\sum_{i=1}^{N}\Delta_{i}y_{i}}{1 + \sum_{i=1}^{N}\Delta_{i}}, \frac{1}{1+\sum_{i=1}^{N}\Delta_{i}})
 \mathbb{P} (\Delta_{i} = 1\rvert\mu_{1}, \mu_{2}, Y) = \frac{\pi \phi_{(\mu_{2}, \sigma_{2})}(y_{i})}{(1\pi) \phi_{(\mu_1, \sigma_{1})}(y_{i})+\pi \phi_{(\mu_2, \sigma_{2})}(y_{i})}
The form of the algorithm:
 Choose starting point for the mean (\mu_{1}^{0}, \mu_{2}^{0})
 In the kth iteration do:
 With the probability:
\frac{\pi \phi_{(\mu_{2}^{(k1)}, \sigma_{2})}(y_{i})}{(1\pi) \phi_{(\mu_{1}^{(k1)}, \sigma_{1})}(y_{i})+\pi \phi_{(\mu_{2}^{(k1)}, \sigma_{2})}(y_{i})} draw \Delta_{i}^{(k)} for i = 1, \ldots, N  Calculate:
\hat{\mu_{1}} = \frac{\sum_{i=1}^{N}(1\Delta_{i}^{(k)})y_{i}}{1 + \sum_{i=1}^{N}(1\Delta_{i}^{(k)})}
\hat{\mu_{2}} = \frac{\sum_{i=1}^{N}\Delta_{i}^{(k)}y_{i}}{1 + \sum_{i=1}^{N}\Delta_{i}^{(k)}}  Generate:
\mu_{1}^{(k)} \sim \mathcal{N}(\hat{\mu_{1}}, \frac{1}{1 + \sum_{i=1}^{N}(1\Delta_{i}^{(k)})})
\mu_{2}^{(k)} \sim \mathcal{N}(\hat{\mu_{2}}, \frac{1}{1 + \sum_{i=1}^{N}\Delta_{i}^{(k)}})
 With the probability:
 Perform step 2. until the distribution of vector (\Delta, \mu_{1}, \mu_{2}) stabilizes.
How to do this in R?
At start let’s generate random sample from mixture of normals with parameters (\pi, \mu_1, \mu_2, \sigma_1, \sigma_2) = (0.7, 10, 2, 1, 2).
set.seed(12345) mu_1 < 10 mu_2 < 2 sigma_1 < 1 sigma_2 < 2 pi_known < 0.7 n < 2000 pi_sampled < rbinom(n, 1, pi_known) y_1 < rnorm(n, mu_1, sigma_1) y_2 < rnorm(n, mu_2, sigma_2) y < (1  pi_sampled) * y_1 + pi_sampled * y_2Take a quick look at a histogram of our data:
The following task is to estimate \mu_1 and \mu_2 from above model.
mu_init < c(12, 0) n_iterations < 300 delta < vector("list", n_iterations) mu_1_vec < c(mu_init[1], numeric(n_iterations)) mu_2_vec < c(mu_init[2], numeric(n_iterations)) delta_probability < function(i, y, mu_1_vec, mu_2_vec, sigma_1, sigma_2, pi_known) { pi_known * dnorm(y, mu_2_vec[i  1], sigma_2) / ((1 pi_known) * dnorm(y, mu_1_vec[i  1], sigma_1) + pi_known * dnorm(y, mu_2_vec[i  1], sigma_2)) } mu_1_mean < function(delta, i, y) { sum((1  delta[[i  1]]) * y) / (1 + sum(1  delta[[i  1]])) } mu_2_mean < function(delta, i, y) { sum(delta[[i  1]] * y) / (1 + sum(delta[[i  1]])) } for (j in 2:(n_iterations + 1)) { delta[[j  1]] < y %>% map_int( ~ rbinom(1, 1, delta_probability(j, ., mu_1_vec, mu_2_vec, sigma_1, sigma_2, pi_known))) mu_1_vec[j] < rnorm(1, mu_1_mean(delta, j, y), sqrt(1 / (1 + sum(1  delta[[j  1]])))) mu_2_vec[j] < rnorm(1, mu_2_mean(delta, j, y), sqrt(1 / (1 + sum(delta[[j  1]])))) }Let’s see the relation of sampled data to known one:
The following plot presents the mean of the \Delta vector at each iteration:
Let’s check how mean of parameters \mu_1 and \mu_2 stabilize at used algorithm:
Note how little iteration was enough to stabilize the parameters.
Finally let’s see estimated density with our initial sample:
To those concerned about the topic, refer to [1] where you can find a generalization of normal distribution mixture by extending a priori distributions to other parameters.
It is also worth to compare the above algorithm with its deterministic counterpart, Expectation Maximization (EM) algorithm see [2].
Write your question and comments below. We’d love to hear what you think.
Resources:
[1] http://gauss.stat.su.se/rr/RR2006_1.pdf
[2] http://web.stanford.edu/~hastie/ElemStatLearn/
[3] http://vcla.stat.ucla.edu/old/MCMC/MCMC_tutorial.htm
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Appsilon Data Science Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
mea culpa!
(This article was first published on R – Xi'an's Og, and kindly contributed to Rbloggers)
An entry about our Bayesian Essentials book on X validated alerted me to a typo in the derivation of the Gaussian posterior..! When deriving the posterior (which was left as an exercise in the Bayesian Core), I just forgot the term expressing the divergence between the prior mean and the sample mean. Mea culpa!!!
Filed under: Books, Kids, R, Statistics, University life Tagged: Bayesian Analysis, Bayesian Core, Bayesian Essentials with R, Book, cross validated, Gaussian model, typo
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
A step change in managing your calendar, without social media
(This article was first published on DanielPocock.com  rproject, and kindly contributed to Rbloggers)
Have you been to an event recently involving free software or a related topic? How did you find it? Are you organizing an event and don’t want to fall into the trap of using Facebook or Meetup or other services that compete for a share of your community’s attention?
Are you keen to find events in foreign destinations related to your interest areas to coincide with other travel intentions?
Have you been concerned when your GSoC or Outreachy interns lost a week of their project going through the bureaucracy to get a visa for your community’s event? Would you like to make it easier for them to find the best events in the countries that welcome and respect visitors?
In many recent discussions about free software activism, people have struggled to break out of the illusion that social media is the way to cultivate new contacts. Wouldn’t it be great to make more meaningful contacts by attending more a more diverse range of events rather than losing time on social media?
Making it happenThere are already a number of tools (for example, Drupal plugins and WordPress plugins) for promoting your events on the web and in iCalendar format. There are also a number of sites like Agenda du Libre and GriCal who aggregate events from multiple communities where people can browse them.
How can we take these concepts further and make a convenient, compelling and global solution?
Can we harvest event data from a wide range of sources and compile it into a large database using something like PostgreSQL or a NoSQL solution or even a distributed solution like OpenDHT?
Can we use big data techniques to mine these datasources and help match people to events without compromising on privacy?
Why not build an automated iCalendar “todo” list of deadlines for events you want to be reminded about, so you never miss the deadlines for travel sponsorship or submitting a talk proposal?
I’ve started documenting an architecture for this on the Debian wiki and proposed it as an Outreachy project. It will also be offered as part of GSoC in 2018.
Ways to get involvedIf you would like to help this project, please consider introducing yourself on the debianoutreach mailing list and helping to mentor or refer interns for the project. You can also help contribute ideas for the specification through the mailing list or wiki.
Mini DebConf Prishtina 2017This weekend I’ve been at the MiniDebConf in Prishtina, Kosovo. It has been hosted by the amazing Prishtina hackerspace community.
Watch out for future events in Prishtina, the pizzas are huge, but that didn’t stop them disappearing before we finished the photos:
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: DanielPocock.com  rproject. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Spatial networks – case study St James centre, Edinburgh (1/3)
(This article was first published on R – scottishsnow, and kindly contributed to Rbloggers)
Last year I spent a bit of time learning about routing and network analysis. I created a map of distance from each GB postcode to the nearest railway station. At the time my local council were also looking at changing school catchments and building new schools. This also seemed like an excellent project for routing analysis. I made some progress on this, but never wrote it up.
One aspect I was really interested in was common routes. When you create a shortest path network from many points to a single destination you get a lot of lines which converge on the same place. Can we use this for anything else? Can we count how many lines share the same pathways? This would be particularly useful for prioritising investment in safe routes to school, or walking and cycling infrastructure.
I was spurred to look at this analysis again from the tweet below:
Have to question the merits of people coming from the west driving across the city centre to get to St James’s Ctr.
— Cllr Scott Arthur (@ProfScottThinks) October 6, 2017
For background, the St James’s Centre in Edinburgh is being demolished and a new shopping centre built to replace it. There’s been a lot of (justified) complaint surrounding the new road layout at Piccadilly Place, because it prioritises motorised traffic over more efficient and effective ways of transporting people in cities (foot, bike, bus and tram). On top of this, the new St James’ centre will have a greatly increased number of spaces for parked cars (1800 from 550. With 300 for residents).
Scott’s tweet made me think about catchment area for the St James’ centre. How far away are different parts of Edinburgh? Which roads would be most travelled to get there? If one increases the number of parking spaces at a location, surely the number of cars travelling to that location will increase. What will the impact on the road network be?
There are some great resources to help us with this. The data source I’m going to use is Ordnance Survey OpenData. In particular we need Open Roads and CodePoint (postcodes). Software I’ll use for this task are QGIS, GRASS GIS and R. These are open source and require a little practice to use, but the effort pays off!
After reading data into GRASS I calculated the distance network from all postcodes to the St James’ Centre:
# connect postcodes to streets as layer 2 v.net overwrite input=roads points=postcodes output=roads_net1 operation=connect thresh=400 arc_layer=1 node_layer=2 # connect st james to streets as layer 3 v.net overwrite input=roads_net1 points=stjames output=roads_net2 operation=connect thresh=400 arc_layer=1 node_layer=3 # shortest paths from postcodes (points in layer 2) to st james (points in layer 3) v.net.distance in=roads_net2 out=pc_2_stjames flayer=2 to_layer=3 # Join postcode and distance tables v.db.join map=postcodes column=cat other_table=pc_2_stjames other_column=cat # Make a km column v.db.addcolumn map=postcodes columns="dist_km double precision" v.db.update map=postcodes column=dist_km qcol="dist/1000" # Write to shp (for mapping in QGIS) v.out.ogr s input=postcodes output=pc_2_stjames format=ESRI_Shapefile output_layer=pc_2_stationWe can see the results of this below:
We can also zoom this map on the area inside the bypass and adjust the colouring to give us more resolution for shorter distances.
Now we can use R to investigate how many postcodes are within these distances. R can talk directly to GRASS, or we can use the exported shp file.
library(rgdal) # Read shp file # Note OGR wants the full path to the file, you can't use '~' postcodes = readOGR("/home/me/Projects/network/data/pc_2_stjames.shp") hist(postcodes$dist_km, breaks=50) # Rank the distances x = sort(postcodes$dist_km) png("~/Cloud/Michael/Projects/network/figures/stjames_postcodedistance.png", height=600, width=800) par(cex=1.5) plot(x, type="l", main="EH postcode: shortest road distances to St James centre", xlab="Number of postcodes", ylab="Distance (km)", lwd=3) points(max(which(x<2)), 2, pch=19, cex=2, col="purple4") points(max(which(x<5)), 5, pch=19, cex=2, col="darkorange") legend("topleft", c(paste(max(which(x<2)), "postcodes within 2 km (walking distance)"), paste(max(which(x<5)), "postcodes within 5 km (cycling distance)")), col=c("purple4", "darkorange"), pch=19, pt.cex=2) dev.off() # Get percentages round(100 * max(which(x<2)) / length(x)) round(100 * (max(which(x<5))  max(which(x<2))) / length(x))As we can see, 1868 postcodes are within walking distance of the St James’ Centre and 7044 within cycling distance. If we
 Ignore the other shopping centres around Edinburgh,
 Assume that population density is consistent between postcodes,
 Take the complete EH postcode locations as the catchment area,
then 8 % of total customers are within walking distance (2 km), a further 22 % are within cycling distance (25 km) and the remaining 70 % would need to travel by other means. I wonder if there will be 330 cycle parking spaces at the St James centre? This would be proportional to the number of car parking spaces. This assumes that those beyond cycling distance need to drive, but much of the population will be near a bus stop.
Additional work:
 Create an allocation network for all Edinburgh shopping centres with car parks. What is St James’ catchment from that?
 Use the census data to compare car ownership with the St James’ allocation network. Car ownership is probably quite low in the city centre!
 Consider travel time, not just distance.
 How many of those outside walking/cycling distance are near a bus stop?
To leave a comment for the author, please follow the link and comment on their blog: R – scottishsnow. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
It’s tibbletime v0.0.2: TimeAware Tibbles, New Functions, Weather Analysis and More
(This article was first published on businessscience.io  Articles, and kindly contributed to Rbloggers)
Today we are introducing tibbletime v0.0.2, and we’ve got a ton of new features in store for you. We have functions for converting to flexible time periods with the ~period formula~ and making/calculating custom rolling functions with rollify() (plus a bunch more new functionality!). We’ll take the new functionality for a spin with some weather data (from the weatherData package). However, the new tools make tibbletime useful in a number of broad applications such as forecasting, financial analysis, business analysis and more! We truly view tibbletime as the next phase of time series analysis in the tidyverse. If you like what we do, please connect with us on social media to stay up on the latest Business Science news, events and information!
IntroductionWe are excited to announce the release of tibbletime v0.0.2 on CRAN. Loads of new
functionality have been added, including:

Generic period support: Perform timebased calculations by a number
of supported periods using a new ~period formula~. 
Creating series: Use create_series() to quickly create a tbl_time
object initialized with a regular time series. 
Rolling calculations: Turn any function into a rolling version of itself with
rollify(). 
A number of smaller tweaks and helper functions to make life easier.
As we further develop tibbletime, it is becoming clearer that the package
is a tool that should be used in addition to the rest of the tidyverse.
The combination of the two makes time series analysis in the tidyverse much easier to do!
Today we will take a look at weather data for New York and San
Francisco from 2013. It will be an exploratory analysis
to show off some of the new features in tibbletime, but the package
itself has much broader application. As we will see, tibbletime’s timebased
functionality can be a valuable data manipulation tool to help with:

Product and sales forecasting

Financial analysis with custom rolling functions

Grouping data into time buckets to analyze change over time, which is great for any part of an organization including sales, marketing, manufacturing, and HR!
The datasets used are from a neat package called weatherData. While weatherData has functionality to pull weather data for a number of cities, we will use the builtin datasets. We encourage you to explore the weatherData API if you’re interested in collecting weather data.
To get started, load the following packages:
 tibbletime: Timeaware tibbles for the tidyverse
 tidyverse: Loads packages including dplyr, tidyr, purrr, and ggplot
 corrr: Tidy correlations
 weatherData: Slick package for getting weather data
Also, load the datasets from weatherData, “NewYork2013” and “SFO2013”.
# Load libraries library(tibbletime) # Make sure you have 0.0.2 from CRAN! library(tidyverse) library(corrr) library(weatherData) # Load weatherData datasets NYC < NewYork2013 SFO < SFO2013 Combine and convertTo tidy up, we first join our data sets together using bind_rows(). Passing
a named list of tibbles along with specifying the .id argument allows
bind_rows() to create a new City reference column for us.
Next, we will convert to tbl_time and group by our City variable. Note that we know this is a tbl_time object by Index: Time that gets printed along with the tibble.
# Convert to tbl_time class weather < weather %>% mutate(Time = as.POSIXct(Time)) %>% as_tbl_time(Time) %>% group_by(City) weather ## # A time tibble: 19,706 x 3 ## # Index: Time ## # Groups: City [2] ## City Time Temperature ## * ## 1 NYC 20130101 00:51:00 41.0 ## 2 NYC 20130101 01:51:00 39.9 ## 3 NYC 20130101 02:51:00 41.0 ## 4 NYC 20130101 03:51:00 41.0 ## 5 NYC 20130101 04:51:00 41.0 ## 6 NYC 20130101 05:51:00 39.9 ## 7 NYC 20130101 06:51:00 39.9 ## 8 NYC 20130101 07:51:00 39.9 ## 9 NYC 20130101 08:51:00 39.9 ## 10 NYC 20130101 09:51:00 39.9 ## # ... with 19,696 more rows Period conversionThe first new idea to introduce is the ~period formula~. This tells the tibbletime functions how you want to timegroup your data. It is specified
as multiple ~ period, with examples being 1~d for “every 1 day,” and
4~m for “every 4 months.”
In our original data, it looks like weather is an hourly dataset, with each new
data point coming in on the 51st minute of the hour for NYC and the 56th minute
for SFO. Unfortunately, a number of points don’t follow this. Check out the following rows:
What we want is 1 row per hour, and in this case we get two rows for NYC hour 12.
We can use as_period() to ensure that we only have 1 row for each hour
Now that we have our data in an hourly format, we probably don’t care about
which minute it came in on. We can floor the date column using a helper function,
time_floor(). Credit to Hadley Wickham because this is essentially a convenient
wrapper around lubridate::floor_date(). Setting the period to 1~h floors
each row to the beginning of the last hour.
Now that we have cleaned up a bit, let’s visualize the data.
# Yikes: Hourly is a bit too much data for the chart ggplot(weather, aes(x = Time, y = Temperature, color = City)) + geom_line() + theme_minimal()Seems like hourly data is a bit overwhelming for this kind of chart. Let’s
convert to daily and try again.
That’s better. It looks like NYC has a much wider range of temperatures than
SFO. Both seem to be hotter in summer months.
The dplyr::summarise() function is very useful for taking grouped summaries.
time_summarise() takes this a step further by allowing you to summarise by
period.
Below we take a look at the average and standard deviation of the temperatures
calculated at monthly and bimonthly intervals.
July seemed to be one of the hottest months for NYC, let’s take a closer look at it.
To just grab July dates, use time_filter(). If you haven’t seen this before, a time formula is used to specify the dates to filter for. The onesided formula below expands to include dates between, 20130701 00:00:00 ~ 20130731 23:59:59.
july < weather %>% time_filter(~20137) july ## # A time tibble: 1,486 x 3 ## # Index: Time ## # Groups: City [2] ## City Time Temperature ## * ## 1 NYC 20130701 00:00:00 73.4 ## 2 NYC 20130701 01:00:00 73.9 ## 3 NYC 20130701 02:00:00 73.4 ## 4 NYC 20130701 03:00:00 73.9 ## 5 NYC 20130701 04:00:00 73.9 ## 6 NYC 20130701 05:00:00 73.9 ## 7 NYC 20130701 06:00:00 75.9 ## 8 NYC 20130701 07:00:00 75.9 ## 9 NYC 20130701 08:00:00 75.2 ## 10 NYC 20130701 09:00:00 77.0 ## # ... with 1,476 more rowsTo visualize July’s weather, we will make a boxplot of the temperatures.
Specifically, we will slice July into intervals of 2 days, and create a series
of boxplots based on the data inside those intervals. To do this, we will
use time_collapse(), which collapses a column of dates into a column of the same
lenth, but where every row in a time interval shares the same date. You can use this resulting
column for further grouping or labeling operations.
Let’s visualize to see if we can gain any insights. Wow, San Fran maintained a constant cool average of 60 degrees in the hottest month
of the year!
Finally, we will look at the correlation of temperatures in our two cities in a few different ways.
First, let’s look at the overall correlation. The corrr package provides a nice way to accomplish this with data frames.
weather %>% spread(key = City, value = Temperature) %>% select(NYC, SFO) %>% corrr::correlate() ## # A tibble: 2 x 3 ## rowname NYC SFO ## ## 1 NYC NA 0.6510299 ## 2 SFO 0.6510299 NANext, let’s look at monthly correlations. The general idea will be
to nest each month into it’s own data frame, apply correlate() to each
nested data frame, and then unnest the results. We will use time_nest() to easily perform the monthly nesting.
For each month, calculate the correlation tibble and then focus() on the NYC column. Then unnest and floor the results.
monthly_nest %>% mutate(monthly_cor = map(data, ~corrr::correlate(.x) %>% corrr::focus(NYC)) ) %>% unnest(monthly_cor) %>% time_floor(1~d) ## # A time tibble: 12 x 4 ## # Index: Time ## Time data rowname NYC ## * ## 1 20130131 SFO 0.10281153 ## 2 20130228 SFO 0.38288119 ## 3 20130331 SFO 0.52432022 ## 4 20130430 SFO 0.34258085 ## 5 20130531 SFO 0.07814153 ## 6 20130630 SFO 0.52024900 ## 7 20130731 SFO 0.29163801 ## 8 20130831 SFO 0.45479643 ## 9 20130930 SFO 0.48056194 ## 10 20131031 SFO 0.59429495 ## 11 20131130 SFO 0.35513490 ## 12 20131231 SFO 0.17559596It seems that summer and fall months tend to have higher correlation than colder months.
And finally we will calculate the rolling correlation of NYC and SFO temperatures. The “width” of our roll will be monthly, or 360 hours since we are in hourly format.
There are a number of ways to do this, but for this example
we introduce rollify(), which takes any function that you give it and creates a rolling version of it. The first argument to rollify() is the function that you want to modify, and it is passed to rollify() in the same way that you would pass a function to purrr::map(). The second argument is the window size. Call the rolling function just as you would call a nonrolling version
of cor() from inside mutate().
It looks like the correlation is definitely not stable throughout the year,
so that initial correlation value of .65 definitely has to be taken
with a grain of salt!
There are a number of ways to do rolling functions, and we recommend based on your needs. If you are interested in:

Flexibility: Use rollify(). You can literally turn any function into a “tidy” rolling function. Think everything from rolling statistics to rolling regressions. Whatever you can dream up, it can do. The speed is fast, but not quite as fast as other Rcpp based libraries.

Performance: Use the roll package, which uses RcppParallel as its backend making it the fastest option available. The only downside is flexibility since you cannot create custom rolling functions and are bound to those available.
We’ve touched on a few of the new features in tibbletime v0.0.2. Notably:

rollify() for rolling functions

as_period() with generic periods

time_collapse() for collapsing date columns
A full change log can be found in the NEWS file on Github or CRAN.
We are always open to new ideas and encourage you to submit an issue on our
Github repo here.
Have fun with tibbletime!
Final thoughtsMind you this is only v0.0.2. We have a lot of work to do, but we couldn’t
wait any longer to share this. Feel free to kick the tires on tibbletime, and let us know your thoughts. Please submit any comments, issues or bug reports to us on GitHub here. Enjoy!
Business Science takes the headache out of data science. We specialize in applying machine learning and data science in business applications. We help businesses that seek to build out this capability but may not have the resources currently to implement predictive analytics. Business Science works with clients as diverse as startups to Fortune 500 and seeks to guide organizations in expanding predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!
Connect with Business ScienceConnect, communicate and collaborate with us! The easiest way to do so is via social media. Connect with us out on:
 @bizScienc on twitter!
 Facebook!
 LinkedIn!
 Sign up for our insights blog to stay updated!
 If you like our software, star our GitHub packages!
To leave a comment for the author, please follow the link and comment on their blog: businessscience.io  Articles. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
NFL Series
(This article was first published on StatOfMind, and kindly contributed to Rbloggers)
If you have previously attempted to analyze NFL data, it is likely that you have tried to scrape ESPN or footballreference, which provides a wealth on statistics surrounding game data. However, if you ever wanted to obtain truly indepth data, then it is likely that you found yourself leveraging the API maintained by NFL.com. Unfortunately, it’s data is surfaced in a JSON format that leaves a lot to be desired (i.e. it’s a nightmare). Lucky for me, I was recently scrolling through my Twitter feed and came across an interesting mention of a new R package called nflscrapR. After some exploration, I came through this quote from the author of the package:
NFL fans and sports enthusiastic alike, I would like to introduce the nflscrapR package, an American football data aggregator that will scrape, clean, and parse playbyplay data across games, seasons, and careers.The nflscrapR essentiallys surfaces all playbyplay data for the last 7 seasons, and this has motivated me to start a deep dive on NFL data. I plan to have two main topics, one that focusses on players at specific positions, and another focussing on team dynamics and patterns. To kick this series off, we will begin with an exploration into the performance of NFL’s best running backs.
1. PrerequisitesIn order to reproduce the figures below, wou will need to have R (v3.2.3 or above) installed on your machine. There are also a couple of additional libraries that will be required. Details on how to install those are shown in the commands below.
# install.packages('devtools') library(devtools) #> Skipping install for github remote, the SHA1 (05815ef8) has not changed since last install. #> Use `force = TRUE` to force installation devtools::install_github(repo = "maksimhorowitz/nflscrapR", force=TRUE) 2 Collecting the DataWith the nflscrapR library now installed, you are now ready to collect playbyplay data for the 20162017 NFL season. Start by loading the library and collect the data using the command below:
# Load the package library(nflscrapR) library(ggplot2) library(dplyr) library(pheatmap) # Collect data for 2016 NFL season pbp_2016 < season_play_by_play(2016)Overall the pbp_2016 dataframe contains 100 data points for 45,737 plays, but for the purpose of this post, we will be focussing exclusively on fields related to running backs (In future posts, we will explore data relevant to other positions on the football field). In addition, we’ll focus primarily on frequently used running backs, which we empirically define as any player that has had at least 200 rushes over the course of the 20162017 season.
# Get all players with at least 200 rushes during the season min_rush_cnt < 200 rush_cnt < pbp_2016 %>% filter(PlayType == 'Run') %>% group_by(Rusher) %>% summarise(rush_cnt = n(), total_yards = sum(Yards.Gained), mean_yards = round(mean(Yards.Gained), 2)) %>% filter(rush_cnt >= min_rush_cnt) %>% arrange(desc(rush_cnt)) # Get all rushing data for eligible players rushing_stats < pbp_2016 %>% filter(PlayType == 'Run' & Rusher %in% rush_cnt$Rusher & Yards.Gained <=50) %>% filter(down!=4 & !is.na(down)) %>% filter(!is.na(RunLocation))Altogether, we find that a total of 19 players rushed over 200 times during the 20162017 season. A short summary of their performance is show below.
When talking about the overall performance of running backs, it is common for people to report the total number of yards that were rushed for, or the average yards per run. While these are perfectly acceptable numbers to share, I’ve always felt like they did not tell the whole story. For example, a player could have a high average yards per run, only for us to realize that he actually often loses yards on a run but makes up for it with a few very long runs. Therefore, I started by looking at the overall distribution of number of yards gained/lost for each play, with the hope that this would reveal whether some players were more consistent on a playbyplay basis than others. We can use the ggplot2 library to generate a density plot of yards gained per play for each of our eligible players:
# Compare distribution of rushes for eligible players ggplot(rushing_stats, aes(x = Yards.Gained, y = Rusher, fill=Rusher)) + geom_joy(scale = 3) + theme_joy() + scale_fill_manual(values=rep(c('gray', 'lightblue'), length(rushing_stats$Rusher)/2)) + scale_y_discrete(expand = c(0.01, 0)) + scale_x_continuous(expand = c(0, 0)) + theme(legend.position="none") + labs(x="Yards gained per play" ,y="")Overall, we see that most running backs have a similar distribution of yards gained by run. However, we can see that LeSean McCoy (7th distribution from the top) has a much “flatter” distribution (i.e. more variance in the distribution of yards gained per run), meaning his performance can be a lot more variable/unpredictable for any given run.
4. When are running backs used?Another statement that is also commonly reported is that running backs are primarily used in early downs. To verify whether this is generall true, we can compute the total amount of runs that each player made across different downs, and go even further by breaking this down by quarter too. The code chunk below counts the number of runs that each rushing back made during pairs of downs and quarters.
# Compare when rushers are used usage_stats < pbp_2016 %>% filter(!is.na(down) & Rusher %in% rush_cnt$Rusher & qtr!=5) %>% group_by(Rusher, down, qtr) %>% summarise(cnt = n()) %>% mutate(qtr_down = paste("Q", qtr, " Down: ", down, sep=""))We can then leverage the d3heatmap to quickly generate a simple heatmap of how often running backs are used during specific downs and quarters.
library(d3heatmap) # pivot dataframe usage < usage_stats %>% dcast(Rusher ~ qtr_down, value.var = "cnt") # clean data row.names(usage) < usage$Rusher usage < usage %>% select(Rusher) usage[is.na(usage)] < 0 # normalize data usage_norm < t(apply(usage, 1, function(x) x/sum(x))) # Plot heatmap of proportions of rushes by different field locations and gaps p < d3heatmap(usage_norm, colors="Blues", Colv=FALSE, show_grid=3) saveWidget(p, file="rusher_usage_down_quarter.html") Proportion of rushes per quarter and downs for NFLs best running backs
In the plot above, we are essentially plotting the usage of each running back as a function of what stage of the game we are in. As we can see, it is abundantly clear that running backs are primarily used in the first two downs, and rarely during the third and fourth downs. Overall, there does not appear to be significant differences between how players are used. However, it does not answer whether some running backs perform better on certains downs, which is what we will address now.
5. Are some running backs better on certain downs?Another question we can ask ourselves is whether some running backs perform better on later downs. To visualize this data, we can again generate a density plot of yards gained per play for each of our eligible players, while also facetting the data by downs.
# Compare distribution of rushes by downs ggplot(rushing_stats, aes(x = Yards.Gained, y = down)) + geom_joy(scale=1, rel_min_height=.03, fill='black') + scale_y_discrete(expand = c(0.01, 0)) + xlab('Value') + facet_wrap(~Rusher, scales='free', ncol=3) + theme_joy() + theme(axis.title.y = element_blank())+ labs(x="Yards gained per play" ,y="Down")Again, we do not see any striking differences between players and the distribution of yards gained by down. However, it is interesting to note that most “long runs” (10 yards or above) tend to occur on the first down. When we look closely, we can also see that some rushers such as DeMarco Murray do exhibit visual differences between yards gained by downs. In this case, the “mass” of yards gained on the third down is much closer to zero than when compared to the “mass” for the first and second downs, which suggests that he may struggle during this down (this could be attributable to many factors: stamina, weaker offensive line on 3rd downs, etc…)
6. Where do the best running backs run?It is fairly well accepted that the performance of a running back will be heavily influenced by the strength of the offensive line in front of them. With that in mind, let’s start by looking at the field location in which different running backs prefer to run. The plot below shows the number of yards gained by each running back based on which side of the field they ran towards (left, middle or right).
ggplot(data=rushing_stats, aes(x=RunLocation, y=Yards.Gained, color=RunLocation)) + geom_jitter(position=position_jitter(0.2)) + stat_summary(fun.data=mean_sdl, mult=1, geom="pointrange", color="black") + scale_color_brewer(palette="Dark2") + theme_minimal() + facet_wrap(~Rusher)We can take this further by looking at the field location in which different running backs prefer to run. This can be achieved by generating a matrix that contains the proportion of rushes by field location for each player.
# Get proportions of rushes on different field locations rush_locations < rushing_stats %>% filter(PlayType=='Run') %>% filter(!is.na(RunLocation)) %>% group_by(Rusher, RunLocation) %>% summarise(rush_cnt = n()) %>% mutate(freq = rush_cnt / sum(rush_cnt)) loc_mat < rush_locations %>% dcast(Rusher ~ RunLocation, value.var = "freq") row.names(loc_mat) < loc_mat$Rusher loc_mat < loc_mat %>% select(Rusher)The content of the loc_mat matrix contains the preferred rush locations of each running back, and can be plotted as a clustered heatmaps using the pheatmap library in R.
# Plot heatmap of proportions of rushes by different field locations pheatmap(loc_mat, border="white", color = brewer.pal(9,"Blues"), cluster_cols=FALSE)The plot above highlights which running back are most similar in their run locations. Rushers such as J. Ajayi, E. Elliot and M. Ingram, clearly avoid running in the middle of the field. On the flipside, rushers such as M. Gordon D. Johnson, S.ware and F. Gore clearly prefer running down the middle rather than to the sides. These patterns could be attributed to the running styles of each rushers (speed, mobility, strength etc…), but also the strength of the offensive line at particular positions.
7. Who creates the gaps for the running backs?We can also explore the number of yards gained by each running back based on the offensive line positions that created space for them.
ggplot(data=rushing_stats %>% filter(!is.na(RunGap)), aes(x=RunGap, y=Yards.Gained, color=RunGap)) + geom_jitter(position=position_jitter(0.2)) + stat_summary(fun.data=mean_sdl, mult=1, geom="pointrange", color="black") + scale_color_brewer(palette="Dark2") + theme_minimal() + facet_wrap(~Rusher)The proportions of run opportunities that was enabled by each offensive line position can also be summarized in a matrix using the command below.
# Get proportions of gaps created by different offensive line positions rush_gaps < rushing_stats %>% filter(!is.na(RunGap)) %>% filter(!is.na(RunGap)) %>% group_by(Rusher, RunGap) %>% summarise(rush_cnt = n()) %>% mutate(freq = rush_cnt / sum(rush_cnt)) gap_mat < rush_gaps %>% dcast(Rusher ~ RunGap, value.var = "freq") row.names(gap_mat) < gap_mat$Rusher gap_mat < gap_mat %>% select(Rusher) # Plot heatmap of proportions of rushes by different field gaps pheatmap(gap_mat, border="white", color = brewer.pal(9,"Blues"), cluster_cols=FALSE)Again, we see many differences among the leagues top running backs. Unsurprisingly, a number of players have the most run opportunities created by the guard position, but a few players such as F. Gore, L. McCoy and D. Johnson run in gaps created by the tackle position. Finally, S. Ware from the Kansas City Chiefs often runs through gaps created by tight ends, which may be more representative of the team’s formation.
ConclusionIn this introductory post, we have explored the performance and patterns of some of NFL’s best running backs. Overall, it was a fairly superficial analysis, as it never considered interactions with other components of the team, or temporal patterns, but it does show the depth and power of this data. In the next series, I will dive into the performance and behavior of wide receivers, so stay tuned!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: StatOfMind. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Linear / Logistic Regression in R: Dealing With Unknown Factor Levels in Test Data
(This article was first published on R – Discovering Python & R, and kindly contributed to Rbloggers)
Let’s say you have data containing a categorical variable with 50 levels. When you divide the data into train and test sets, chances are you don’t have all 50 levels featuring in your training set.
This often happens when you divide the data set into train and test sets according to the distribution of the outcome variable. In doing so, chances are that our explanatory categorical variable might not be distributed exactly the same way in train and test sets – so much so that certain levels of this categorical variable are missing from the training set. The more levels there are to a categorical variable, it gets difficult for that variable to be similarly represented upon splitting the data.
Take for instance this example data set (train.csv + test.csv) which contains a categorical variable var_b that takes 349 unique levels. Our train data has 334 of these levels – on which the model is built – and hence 15 levels are excluded from our trained model. If you try making predictions on the test set with this model in R, it throws an error:
factor var_b has new levels 16060, 17300, 17980, 19060, 21420, 21820,
25220, 29340, 30300, 33260, 34100, 38340, 39660, 44300, 45460
If you’ve used R to model generalized linear class of models such as linear, logit or probit models, then chances are you’ve come across this problem – especially when you’re validating your trained model on test data.
The workaround to this problem is in the form of a function, remove_missing_levels that I found here written by pats. You need magrittr library installed and it can only work on lm, glm and glmmPQL objects.
Once you’ve sourced the above function in R, you can seamlessly proceed with using your trained model to make predictions on the test set. The code below demonstrates this for the data set shared above. You can find these codes in one of my github repos and try it out yourself.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Discovering Python & R. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Data Visualization Course for FirstYear Students
(This article was first published on R Project – CitizenStatistician, and kindly contributed to Rbloggers)
A little over a year ago, we decided to propose a data visualization course at the firstyear level. We had been thinking about this for awhile, but never had the time to teach it given the scheduling constraints we had. When one of the other departments on campus was shut down and the faculty merged in with other departments, we felt that the time was ripe to make this proposal.
In putting together the proposal, we knew that:
 The course would be primarily composed of social science students. My department, Educational Psychology, attracts students from the College of Education and Human Development (e.g., Child Psychology, Social Work, Family Social Science).
 To attract students, it would be helpful if the course would fulfill the University’s Liberal Education (LE) requirement for Mathematical Thinking.
This led to several challenges and long discussions about the curriculum for this course. For example:
 Should the class focus on producing data visualizations (very exciting for the students) or on understanding/interpreting existing visualizations (useful for most social science students)?
 If we were going to produce data visualizations, which software tool would we use? Could this level of student handle R?
 In order to meet the LE requirement, the curriculum for the course would need to show a rigorous treatment of students actually “doing” mathematics. How could we do this?
 Which types of visualizations would we include in the course?
 Would we use a textbook? How might this inform the content of the course?
After several conversations among the teaching team, with stakeholder departments, and with colleagues teaching data visualization courses at other universities, we eventually proposed that the course:
 Focus both on students’ being able to read and understand existing visualizations and produce a subset of these visualizations, and
 Use R (primary tool) and RAWGraphs for the production of these plots.
The choice to use R was not an immediate one. We initially looked at using Tableau, but the default choices made by the software (e.g., to immediately plot summaries rather than raw data) and the cost for students after matriculating from the course eventually sealed its fate (we don’t use it). We contemplated using Excel for a minute (gasp!), but we vetoed that even quicker than Tableau. The RAWGraphs website, we felt, held a lot of promise as a software tool for the course. It had an intuitive draganddrop interface, and could be used to create many of the plots we wanted students to produce. Unfortunately, we were not able to get the bar graph widget to produce sidebyside bar plots easily (actually at all). The other drawback was that the draganddrop interactions made it a harder sell to the LE committee as a method of building students’ computational and mathematical thinking if we used it as the primary tool.
Once we settled on using R, we had to decide between using the suite of base plots, or ggplot2 (lattice was not in the running). We decided that ggplot made the most sense in terms of thinking about extensibility. Its syntax was based on a theoretical foundation for creating and thinking about plots, which also made it a natural choice for a data visualization course. The idea of mapping variables to aesthetics was also consistent with the language used in RAWGraphs, so it helped reenforce core ideas across the tools. Lastly, we felt that using the ggplot syntax would also help students transition to other tools (such as ggviz or plotly) more easily.
One thing that the teaching team completely agreed on (and was mentioned by almost everyone who we talked to who taught data visualization) was that we wanted students to be producing graphs very early in the course; giving them a sense of power and the reenforcement that they could be successful. We felt this might be difficult for students with the ggplot syntax. To ameliorate this, we wrote a coursespecific R package (epsy1261; available on github) that allows students to create a few simple plots interactively by employing functionality from the manipulate package. (We could have also done this via Shiny, but I am not as wellversed in Shiny and only had a few hours to devote to this over the summer given other responsibilities.)
Course ContentWe decided on a threepronged approach to the course content. The first prong would be based on the production of common statistical plots: bar charts, scatterplots, and maps, and some variations of these (e.g., donut plots, treemaps, bubble charts). The second prong was focused on reading more complex plots (e.g., networks, alluvial plots), but not producing them, except maybe by hand. The third prong was a group project. This would give students a chance to use what they had learned, and also, perhaps, access other plots we had not covered. In addition, we wanted students to consider narrative in the presentation of these plots—to tell a datadriven story.
Along with this, we had hoped to introduce students to computational skills such as data summarization, tidying, and joining data sets. We also wanted to introduce concepts such as smoothing (especially for helping describe trend in scatterplots), color choice, and projection and coordinate systems (in maps). Other things we thought about were using R Markdown and data scraping.
RealityThe reality, as we are finding now that we are over a third of the way through the course, is that this amount of content was overambitious. We grossly underestimated the amount of practice time these students would need, especially working with R. Two things play a role in this:
 The course attracted way more students than we expected for the first offering (our class size is 44) and there is a lot of heterogeneity of students’ experiences and academic background. For example, we have graduate students from the School of Design, some first years, and mostly sophomores and juniors. We also have a variety of majors including, design, the social sciences, and computer science.
 We hypothesize that students are not practicing much outside of class. This means they are essentially only using R twice a week for 75 minutes when they are in class. This amount of practice is too infrequent for students to really learn the syntax.
Most of the students’ computational experiences are minimal prior to taking this course. They are very competent at using pointandclick software (e.g., Google Docs), but have an abundance of trouble when forced to use syntax. The precision of casesensitivity, commas, and parentheses is outside their wheelhouse.
I would go so far as to say that several of these students are intimidated by the computation, and completely panic on facing an error message. This has led to us having to really think through and spend time discussing computational workflows and dealing with how to “debug” syntax to find errors. All of this has added more time than we anticipated on the actual computing. (While this may add time, it is still educationally useful for these students.)
The teaching team meets weekly for 90 minutes to discuss and reflect on what happened in the course. We also plan what will happen in the upcoming week based on what we observed and what we see in students’ homework. As of now, we clearly see that students need more practice, and we have begun giving students the end result of a plot and asking them to recreate these.
I am still hoping to get to scatterplots and maps in the course. However, some of the other computational ideas (scraping, joining) may have to be relegated to conceptual ideas in a reading. We are also considering scrapping the project, at least for this semester. At the very least, we will change it to a more structured set of plots they need to produce rather than letting them choose the data sets, etc. Live and learn. Next time we offer the course it will be better.
*Technology note: RAWGraphs can be adapted by designing additional chart types, so in theory, if one had time, we could write our own version to be more compatible with the course. We are also considering using the ggplotgui package, which is a Shiny dashboard for creating ggplot plots.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: R Project – CitizenStatistician. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
From Static to Numeric to SingleChoice Exercises in R/exams
(This article was first published on R/exams, and kindly contributed to Rbloggers)
IdeaOur colleagues over at the economics department became interested in using R/exams for
generating largescale exams in their introductory economics courses. However, they face the challenge that
so far they had been writing static exercises and modified them by hand if they wanted to reuse them in a
different exam in another semester. To let R/exams do this job it is illustrated how a static arithmetic
exercise can be turned into a dynamic exercise template either in num
format with a numeric solution or into schoice format with a singlechoice solution.
The idea for the exercise is a very basic price elasticity of demand task:
p(x)=a–b·x
for the price
p
given the demanded quantity
x
. What is the price elastiticy of demand at a price of
p
?
The natural candidates for “parameterizing” this exercise are the
price
p
and the parameters
a
and
b
of the inverse demand function.
Based on these the solution is simply:
x=D(p)=(a–p)/b
.
Then, at
p
the price elasticity of demand is
D‘(p)D(p)p=–1/bxp.
Overview
Below various incarnations of this exercise are provided in both R/Markdown Rmd and R/LaTeX Rnw format.
The following table gives a brief overview of all available versions along with a short description of the idea behind it.
More detailed explanations are provided in the subsequent sections.
elasticity1.Rnw No num Fixed parameters and numeric solution. 2 elasticity2.Rmd
elasticity2.Rnw No schoice As in #1 but with singlechoice solution (five answer alternatives). 3 elasticity3.Rmd
elasticity3.Rnw Yes num Randomly drawn parameters with dynamic computation of correct solution, based on #1. 4 elasticity4.Rmd
elasticity4.Rnw Yes schoice Randomly drawn parameters (as in #3) with dynamicallygenerated singlechoice solution (as in #2), computed by num_to_schoice(). 5 elasticity5.Rmd
elasticity5.Rnw Yes schoice As in #4 but with the last alternative: None of the above. Static numeric
The starting point is a completely static exercise as it had been used in a previous
introductory economics exam. The parameters had been set
to
p=5
,
a=50
,
and
b=0.5
.
This implies that
x=90
,
leading to an elasticity of
–0.111111
.
The corresponding R/exams templates simply hardcode these numbers into the question/solution
and wrap everything into R/Markdown (elasticity1.Rmd)
or R/LaTeX (elasticity1.Rnw).
Note that LaTeX is used in either case for the mathematical notation. In case you are unfamiliar
with the R/exams format, please check out the First Steps tutorial.
The metainformation simply sets extype to num, supplies the exsolution with a precision of three digits,
and allows an extol tolerance of 0.01. To see what the result looks like download the files linked above
and run exams2html() and/or exams2pdf(). (The examples below always use the R/Markdown version but the
R/LaTeX version can be used in exactly the same way.)
Singlechoice versions of exercises are often desired for use in written exams
because they can be conveniently scanned and automatically evaluated. Thus, we need to come up
with a number of incorrect alternative solutions (or “distractors”). If desired, these could include
typical wrong solutions or a None of the others alternative.
The R/exams templates
elasticity2.Rmd and
elasticity2.Rnw are
essentially copies of the static numeric exercise above but:
 Question/solution now contain an answerlist (with five alternatives).
 The extype has been changed to schoice.
 The exsolution now contains a binary coding of the correct solution.
 Furthermore, to obtain some basic randomization we have turned on shuffling by setting exshuffle
to TRUE. (Subsampling more than five alternatives would also be possible and would add some further randomization.)
As above exams2html() and/or exams2pdf() can be used to display the exercise interactively in R/exams.
Dynamic numericNext, the static exercise from above is made dynamic by drawing the parameters from a suitable
datagenerating process. In this case, the following works well:
Note that in order to obtain “nice” numbers a common scaling factor fctr is used for both x and b.
Also, while the examinees are presented with parameters a and b and have to compute x,
the datagenerating process actually draws x and b and computes a from that. Again, this makes it
easier to obtain “nice” numbers.
The R/exams templates
elasticity3.Rmd and
elasticity3.Rnw include
the datagenerating process above as a code chunk either in R/Markdown or R/LaTeX format.
The parameters are then inserted into question/solution/metainformation using `r a` (R/Markdown)
or \Sexpr{a} (R/LaTeX). Sometimes the fmt() function is used for formatting the parameters
with a desired number of digits. (See ?fmt for more details.)
As before exams2html() and/or exams2pdf() can be used to display the random draws from
the exercise templates. For checking that the metainformation is included correctly, it is often
helpful to run
Furthermore, some tweaking is usually required when calibrating the parameter ranges in the
datagenerating process. The stresstest_exercise() function draws a large number of random replications
and thus can help to spot errors that occur or to find solutions that are “extreme” (e.g., much larger or smaller than usual). To clean up your
global environment and run the function use something like this:
The latter command plots the correct solution against the (scalar) parameters that were
generated in the exercise. This might show patterns for which parameters the solution becomes
too large or too small etc. See ?stresstest_exercise for further features/details.
To go from the dynamic numeric exercise to a dynamic singlechoice exercise we need to
extend the datagenerating process to also produce a number of wrong alternatives.
The function num_to_schoice() helps with this by providing different sampling
mechanisms. It allows to set a range in which the alternatives have to be, a minimum
distance between all alternatives, possibly include typical wrong solutions, etc.
It also shuffles the resulting alternatives and tries to make sure that the correct
solution is not a certain order statistic (e.g., almost always the largest or
smallest alternative).
The R/exams templates
elasticity4.Rmd and
elasticity4.Rnw first
wrap the datagenerating process into a while loop with while(sol > 0.11). This makes sure
that there is enough “space” for four wrong alternatives between 0.11 and 0, using a minimum
distance of 0.017. Subsequently, the four wrong alternatives are generated by:
This suggests a number of typical wrong solutions err (provided that they are not too small
or too large) and makes sure that the range rng is large enough. With these arguments
num_to_schoice() is run, see ?num_to_schoice for the details. The resulting sc list
contains suitable $questions and $solutions that can be easily embedded in an answerlist()
and in the metainformation. Sometimes it is useful to wrap num_to_schoice() into another
while() loop to make sure a valid result is found. See the deriv2
template for illustraion.
As before exams2html() and/or exams2pdf() can be used to display the random draws from
this exercise template. And some stresstesting could be carried out by:
As a final variation we could include None of the above as the last alternative. This is
very easy because we can simply replace the fifth element of the question list with the corresponding
string:
No matter whether this was actually the correct alternative or one of the incorrect alternatives, the
corresponding solution will stay the same. The R/exams templates
elasticity5.Rmd and
elasticity5.Rnw incorporate
this additional line of code in the datagenerating process.
To leave a comment for the author, please follow the link and comment on their blog: R/exams. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
getSymbols and Alpha Vantage
(This article was first published on FOSS Trading, and kindly contributed to Rbloggers)
Thanks to Paul Teetor, getSymbols() can now import data from Alpha Vantage! This feature is part of the quantmod 0.411 release, and provides another another data source to avoid any Yahoo Finance API changes*.
Alpha Vantage is a free web service that provides realtime and historical equity data. They provide daily, weekly, and monthly history for both domestic and international markets, with up to 20 years of history. Dividend and split adjusted close prices are available for daily data. They also provide near realtime price bars at a resolution of 1 minute or more, for up to 10 recent days. All you need to get started is a onetime registration for an API key. Alpha Vantage has clean, documented, public API that returns either JSONencoded data or a CSV file. The arguments to getSymbols.av() closely follow the native API, so be sure to use their documentation! To get started, install the latest quantmod from CRAN. Then you call: getSymbols(“MSFT”, src = “av”, api.key = “[your key]”) Where you replace “[your key”] with the API key you receive after registration. You can use setDefaults() to set your API key one time, and use it for all getSymbols.av() calls. setDefaults(“getSymbols.av”, api.key = “[your key]”) *Speaking of API changes, this release also includes a fix for a Yahoo Finance change (#174). var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: FOSS Trading. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Stan Biweekly Roundup, 6 October 2017
(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to Rbloggers)
I missed last week and almost forgot to add this week’s.
 Jonah Gabry returned from teaching a oneweek course for a special EU research institute in Spain.
 Mitzi Morris has been knocking out bug fixes for the parser and some pull requests to refactor the underlying type inference to clear the way for tuples, sparse matrices, and higherorder functions.
 Michael Betancourt with help from Sean Talts spent last week teaching an intro course to physicists about Stan. Charles Margossian attended and said it went really well.
 Ben Goodrich, in addition to handling a slew of RStan issues has been diving into the math library to define derivatives for Bessel functions.
 Aki Vehtari has put us in touch with the MxNet developers at Amazon UK and we had our first conference call with them to talk about adding sparse matrix functionality to Stan (Neil Lawrence is working there now).
 Aki is also working on revising the EP as a way of life paper and finalizing other Stanrelated papers.
 Bob Carpenter and Andrew Gelman have recruited Advait Rajagopal to help us with the Coursera specialization we’re going to offer (contingent on coming to an agreement with Columbia). The plan’s to have four course: Intro to BDA (Andrew), Stan (Bob), MCMC (Bob), and Regression and other stories (Andrew).
 Ben Bales finished the revised pull request for vectorized RNGS. Turns out these things are much easier to write than they are to test thoroughly. Pesky problems with instantiations by integers and what not turn up.
 Daniel Lee is getting ready for ACoP, which Bill Gillespie and Charles Margossian will also be presenting at.
 Steven Bronder and Rok Češnovar, with some help from Daniel Lee, are going to merge the ViennaCL library for GPU matrix ops with their own specializations for derivatives in Stan into the math library. This is getting close to being real for users.
 Sean Talts when he wasn’t teaching or learning physics has been refactoring the Jenkins test facilities. As our tests get bigger and we get more developers, it’s getting harder and harder to maintain stable continuous integration testing.
 Breck Baldwin is taking over dealing with StanCon. Our goal is to get up to 150 registrations.
 Breck Baldwin has also been working with Andrew Gelman and Jonathan Auerbach on nonconventional statistics training (like at Maker Fairs)—they have the beginnings of a paper. Breck’s highly recommending the math musueum in NY to see how this kind of thing’s done.
 Bob Carpenter published a Wiki page on a Stan 3 model concept, which is probably what we’ll be going with going forward. It’s pretty much like what we have now with better const correctness and some better organized utility functions.
 Imad Ali went to the the New England Sports Stats conference. Expect to see more models of basketball using Stan soon.
 Ben Goodrich fixed the problem with exception handling in RStan on some platforms (always a pain because it happened on Macs and he’s not a Mac user).
 Advait Rajagopal has been working with Imad Ali on adding ARMA and ARIMA timeseries functions to rstanarm.
 Aki Vehtari enhanced the loo package with automated code for Kfold cross validation.
 Lizzie Wolkovich visited us for a meeting (she’s on our NumFOCUS leadership body), where she reported that she and a postdoc have been working on calibrating Stan models for phenology (look it up).
 Krzysztof Sakrejda has been working on proper standalone function generation for Rcpp. Turns out to be tricky with their namespace requirements, but I think we have it sorted out as of today.
 Michael Andreae has kicked off is metaanalysis and graphics project at Penn State with Jonah Gabry and Ben Goodrich chipping in.
 Ben Goodrich also fixed the infrastructure for RStan so that multiple models may be supported more easily, which should make it much easier for R package writers to incorporate Stan models.
 Yuling Yao gave us the rundown on where ADVI testing stands. It may falsely report convergence when it’s not at a maximum, it may converge to a local minimum, or it may converge but the Gaussian approximation may be terrible, either in terms of the posterior means or the variances. He and Andrew Gelman are looking at using Pareto smoothed importance sampling (a la the loo package) to try to sort out the quality of the approximation. Yuling thinks convergence is mostly scaling issues and preconditioning along with natural gradients may solve the problem. It’s nice to see grad students sink their teeth into a problem! It’d be great if we could come up a more robust ADVI implementation that had diagnostic warnings if the approximation wasn’t reliable.
The post Stan Biweekly Roundup, 6 October 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
A cRossword about R
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
The members of the R Ladies DC user group put together an Rthemed crossword for a recent networking event. It's a fun way to test out your R knowledge. (Click to enlarge, or download a printable version here.)
If you get stuck, you can find the answers here or at the link below. And if you'd like to join your local RLadies chapter, you can find a list of meetups here.
Github (rladies): meetuppresentations_dc / NetworkingCrosswordPuzzle2017
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Practical Machine Learning with R and Python – Part 1
(This article was first published on R – Giga thoughts …, and kindly contributed to Rbloggers)
IntroductionThis is the 1st part of a series of posts I intend to write on some common Machine Learning Algorithms in R and Python. In this first part I cover the following Machine Learning Algorithms
 Univariate Regression
 Multivariate Regression
 Polynomial Regression
 K Nearest Neighbors Regression
The code includes the implementation in both R and Python. This series of posts are based on the following 2 MOOC courses I did at Stanford Online and at Coursera
 Statistical Learning, Prof Trevor Hastie & Prof Robert Tibesherani, Online Stanford
 Applied Machine Learning in Python Prof KevynCollin Thomson, University Of Michigan, Coursera
I have used the data sets from UCI Machine Learning repository(Communities and Crime and Auto MPG). I also use the Boston data set from MASS package
While coding in R and Python I found that there were some aspects that were more convenient in one language and some in the other. For example, plotting the fit in R is straightforward in R, while computing the R squared, splitting as Train & Test sets etc. are already available in Python. In any case, these minor inconveniences can be easily be implemented in either language.
R squared computation in R is computed as follows
Note: You can download this R Markdown file and the associated data sets from Github at MachineLearningRandPython
Note 1: This post was created as an R Markdown file in RStudio which has a cool feature of including R and Python snippets. The plot of matplotlib needs a workaround but otherwise this is a real cool feature of RStudio!
Here a simple linear regression line is fitted between a single input feature and the target variable
# Source in the R function library source("RFunctions.R") # Read the Boston data file df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS  Statistical Learning # Split the data into training and test sets (75:25) train_idx < trainTestSplit(df,trainPercent=75,seed=5) train < df[train_idx, ] test < df[train_idx, ] # Fit a linear regression line between 'Median value of owner occupied homes' vs 'lower status of # population' fit=lm(medv~lstat,data=df) # Display details of fir summary(fit) ## ## Call: ## lm(formula = medv ~ lstat, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## 15.168 3.990 1.318 2.034 24.500 ## ## Coefficients: ## Estimate Std. Error t value Pr(>t) ## (Intercept) 34.55384 0.56263 61.41 <2e16 *** ## lstat 0.95005 0.03873 24.53 <2e16 *** ##  ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 6.216 on 504 degrees of freedom ## Multiple Rsquared: 0.5441, Adjusted Rsquared: 0.5432 ## Fstatistic: 601.6 on 1 and 504 DF, pvalue: < 2.2e16 # Display the confidence intervals confint(fit) ## 2.5 % 97.5 % ## (Intercept) 33.448457 35.6592247 ## lstat 1.026148 0.8739505 plot(df$lstat,df$medv, xlab="Lower status (%)",ylab="Median value of owned homes ($1000)", main="Median value of homes ($1000) vs Lowe status (%)") abline(fit) abline(fit,lwd=3) abline(fit,lwd=3,col="red") rsquared=Rsquared(fit,test,test$medv) sprintf("Rsquared for univariate regression (Boston.csv) is : %f", rsquared) ## [1] "Rsquared for univariate regression (Boston.csv) is : 0.556964" 1.1b Univariate Regression – Python code import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression #os.chdir("C:\\software\\machinelearning\\RandPython") # Read the CSV file df = pd.read_csv("Boston.csv",encoding = "ISO88591") # Select the feature variable X=df['lstat'] # Select the target y=df['medv'] # Split into train and test sets (75:25) X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0) X_train=X_train.values.reshape(1,1) X_test=X_test.values.reshape(1,1) # Fit a linear model linreg = LinearRegression().fit(X_train, y_train) # Print the training and test R squared score print('Rsquared score (training): {:.3f}'.format(linreg.score(X_train, y_train))) print('Rsquared score (test): {:.3f}'.format(linreg.score(X_test, y_test))) # Plot the linear regression line fig=plt.scatter(X_train,y_train) # Create a range of points. Compute yhat=coeff1*x + intercept and plot x=np.linspace(0,40,20) fig1=plt.plot(x, linreg.coef_ * x + linreg.intercept_, color='red') fig1=plt.title("Median value of homes ($1000) vs Lowe status (%)") fig1=plt.xlabel("Lower status (%)") fig1=plt.ylabel("Median value of owned homes ($1000)") fig.figure.savefig('foo.png', bbox_inches='tight') fig1.figure.savefig('foo1.png', bbox_inches='tight') print "Finished" ## Rsquared score (training): 0.571 ## Rsquared score (test): 0.458 ## Finished 1.2a Multivariate Regression – R code # Read crimes data crimesDF < read.csv("crimes.csv",stringsAsFactors = FALSE) # Remove the 1st 7 columns which do not impact output crimesDF1 < crimesDF[,7:length(crimesDF)] # Convert all to numeric crimesDF2 < sapply(crimesDF1,as.numeric) # Check for NAs a < is.na(crimesDF2) # Set to 0 as an imputation crimesDF2[a] <0 #Create as a dataframe crimesDF2 < as.data.frame(crimesDF2) #Create a train/test split train_idx < trainTestSplit(crimesDF2,trainPercent=75,seed=5) train < crimesDF2[train_idx, ] test < crimesDF2[train_idx, ] # Fit a multivariate regression model between crimesPerPop and all other features fit < lm(ViolentCrimesPerPop~.,data=train) # Compute and print R Squared rsquared=Rsquared(fit,test,test$ViolentCrimesPerPop) sprintf("Rsquared for multivariate regression (crimes.csv) is : %f", rsquared) ## [1] "Rsquared for multivariate regression (crimes.csv) is : 0.653940" 1.2b Multivariate Regression – Python code import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression # Read the data crimesDF =pd.read_csv("crimes.csv",encoding="ISO88591") #Remove the 1st 7 columns crimesDF1=crimesDF.iloc[:,7:crimesDF.shape[1]] # Convert to numeric crimesDF2 = crimesDF1.apply(pd.to_numeric, errors='coerce') # Impute NA to 0s crimesDF2.fillna(0, inplace=True) # Select the X (feature vatiables  all) X=crimesDF2.iloc[:,0:120] # Set the target y=crimesDF2.iloc[:,121] X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0) # Fit a multivariate regression model linreg = LinearRegression().fit(X_train, y_train) # compute and print the R Square print('Rsquared score (training): {:.3f}'.format(linreg.score(X_train, y_train))) print('Rsquared score (test): {:.3f}'.format(linreg.score(X_test, y_test))) ## Rsquared score (training): 0.699 ## Rsquared score (test): 0.677 1.3a Polynomial Regression – RFor Polynomial regression , polynomials of degree 1,2 & 3 are used and R squared is computed. It can be seen that the quadaratic model provides the best R squared score and hence the best fit
# Polynomial degree 1 df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI df1 < as.data.frame(sapply(df,as.numeric)) # Select key columns df2 < df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg) df3 < df2[complete.cases(df2),] # Split as train and test sets train_idx < trainTestSplit(df3,trainPercent=75,seed=5) train < df3[train_idx, ] test < df3[train_idx, ] # Fit a model of degree 1 fit < lm(mpg~. ,data=train) rsquared1 <Rsquared(fit,test,test$mpg) sprintf("Rsquared for Polynomial regression of degree 1 (auto_mpg.csv) is : %f", rsquared1) ## [1] "Rsquared for Polynomial regression of degree 1 (auto_mpg.csv) is : 0.763607" # Polynomial degree 2  Quadratic x = as.matrix(df3[1:6]) # Make a polynomial of degree 2 for feature variables before split df4=as.data.frame(poly(x,2,raw=TRUE)) df5 < cbind(df4,df3[7]) # Split into train and test set train_idx < trainTestSplit(df5,trainPercent=75,seed=5) train < df5[train_idx, ] test < df5[train_idx, ] # Fit the quadratic model fit < lm(mpg~. ,data=train) # Compute R squared rsquared2=Rsquared(fit,test,test$mpg) sprintf("Rsquared for Polynomial regression of degree 2 (auto_mpg.csv) is : %f", rsquared2) ## [1] "Rsquared for Polynomial regression of degree 2 (auto_mpg.csv) is : 0.831372" #Polynomial degree 3 x = as.matrix(df3[1:6]) # Make polynomial of degree 4 of feature variables before split df4=as.data.frame(poly(x,3,raw=TRUE)) df5 < cbind(df4,df3[7]) train_idx < trainTestSplit(df5,trainPercent=75,seed=5) train < df5[train_idx, ] test < df5[train_idx, ] # Fit a model of degree 3 fit < lm(mpg~. ,data=train) # Compute R squared rsquared3=Rsquared(fit,test,test$mpg) sprintf("Rsquared for Polynomial regression of degree 2 (auto_mpg.csv) is : %f", rsquared3) ## [1] "Rsquared for Polynomial regression of degree 2 (auto_mpg.csv) is : 0.773225" df=data.frame(degree=c(1,2,3),Rsquared=c(rsquared1,rsquared2,rsquared3)) # Make a plot of Rsquared and degree ggplot(df,aes(x=degree,y=Rsquared)) +geom_point() + geom_line(color="blue") + ggtitle("Polynomial regression  R squared vs Degree of polynomial") + xlab("Degree") + ylab("R squared") 1.3a Polynomial Regression – PythonFor Polynomial regression , polynomials of degree 1,2 & 3 are used and R squared is computed. It can be seen that the quadaratic model provides the best R squared score and hence the best fit
import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.preprocessing import PolynomialFeatures autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO88591") autoDF.shape autoDF.columns # Select key columns autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']] # Convert columns to numeric autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce') # Drop NAs autoDF3=autoDF2.dropna() autoDF3.shape X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']] y=autoDF3['mpg'] # Polynomial degree 1 X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0) linreg = LinearRegression().fit(X_train, y_train) print('Rsquared score  Polynomial degree 1 (training): {:.3f}'.format(linreg.score(X_train, y_train))) # Compute R squared rsquared1 =linreg.score(X_test, y_test) print('Rsquared score  Polynomial degree 1 (test): {:.3f}'.format(linreg.score(X_test, y_test))) # Polynomial degree 2 poly = PolynomialFeatures(degree=2) X_poly = poly.fit_transform(X) X_train, X_test, y_train, y_test = train_test_split(X_poly, y,random_state = 0) linreg = LinearRegression().fit(X_train, y_train) # Compute R squared print('Rsquared score  Polynomial degree 2 (training): {:.3f}'.format(linreg.score(X_train, y_train))) rsquared2 =linreg.score(X_test, y_test) print('Rsquared score  Polynomial degree 2 (test): {:.3f}\n'.format(linreg.score(X_test, y_test))) #Polynomial degree 3 poly = PolynomialFeatures(degree=3) X_poly = poly.fit_transform(X) X_train, X_test, y_train, y_test = train_test_split(X_poly, y,random_state = 0) linreg = LinearRegression().fit(X_train, y_train) print('(Rsquared score Polynomial degree 3 (training): {:.3f}' .format(linreg.score(X_train, y_train))) # Compute R squared rsquared3 =linreg.score(X_test, y_test) print('Rsquared score Polynomial degree 3 (test): {:.3f}\n'.format(linreg.score(X_test, y_test))) degree=[1,2,3] rsquared =[rsquared1,rsquared2,rsquared3] fig2=plt.plot(degree,rsquared) fig2=plt.title("Polynomial regression  R squared vs Degree of polynomial") fig2=plt.xlabel("Degree") fig2=plt.ylabel("R squared") fig2.figure.savefig('foo2.png', bbox_inches='tight') print "Finished plotting and saving" ## Rsquared score  Polynomial degree 1 (training): 0.811 ## Rsquared score  Polynomial degree 1 (test): 0.799 ## Rsquared score  Polynomial degree 2 (training): 0.861 ## Rsquared score  Polynomial degree 2 (test): 0.847 ## ## (Rsquared score Polynomial degree 3 (training): 0.933 ## Rsquared score Polynomial degree 3 (test): 0.710 ## ## Finished plotting and saving 1.4 K Nearest NeighborsThe code below implements KNN Regression both for R and Python. This is done for different neighbors. The R squared is computed in each case. This is repeated after performing feature scaling. It can be seen the model fit is much better after feature scaling. Normalization refers to
Another technique that is used is Standardization which is
1.4a K Nearest Neighbors Regression – R( Unnormalized)The R code below does not use feature scaling
# KNN regression requires the FNN package df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI df1 < as.data.frame(sapply(df,as.numeric)) df2 < df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg) df3 < df2[complete.cases(df2),] # Split train and test train_idx < trainTestSplit(df3,trainPercent=75,seed=5) train < df3[train_idx, ] test < df3[train_idx, ] # Select the feature variables train.X=train[,1:6] # Set the target for training train.Y=train[,7] # Do the same for test set test.X=test[,1:6] test.Y=test[,7] rsquared < NULL # Create a list of neighbors neighbors <c(1,2,4,8,10,14) for(i in seq_along(neighbors)){ # Perform a KNN regression fit knn=knn.reg(train.X,test.X,train.Y,k=neighbors[i]) # Compute R sqaured rsquared[i]=knnRSquared(knn$pred,test.Y) } # Make a dataframe for plotting df < data.frame(neighbors,Rsquared=rsquared) # Plot the number of neighors vs the R squared ggplot(df,aes(x=neighbors,y=Rsquared)) + geom_point() +geom_line(color="blue") + xlab("Number of neighbors") + ylab("R squared") + ggtitle("KNN regression  R squared vs Number of Neighors (Unnormalized)") 1.4b K Nearest Neighbors Regression – Python( Unnormalized)The Python code below does not use feature scaling
import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.preprocessing import PolynomialFeatures from sklearn.neighbors import KNeighborsRegressor autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO88591") autoDF.shape autoDF.columns autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']] autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce') autoDF3=autoDF2.dropna() autoDF3.shape X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']] y=autoDF3['mpg'] # Perform a train/test split X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) # Create a list of neighbors rsquared=[] neighbors=[1,2,4,8,10,14] for i in neighbors: # Fit a KNN model knnreg = KNeighborsRegressor(n_neighbors = i).fit(X_train, y_train) # Compute R squared rsquared.append(knnreg.score(X_test, y_test)) print('Rsquared test score: {:.3f}' .format(knnreg.score(X_test, y_test))) # Plot the number of neighors vs the R squared fig3=plt.plot(neighbors,rsquared) fig3=plt.title("KNN regression  R squared vs Number of neighbors(Unnormalized)") fig3=plt.xlabel("Neighbors") fig3=plt.ylabel("R squared") fig3.figure.savefig('foo3.png', bbox_inches='tight') print "Finished plotting and saving" ## Rsquared test score: 0.527 ## Rsquared test score: 0.678 ## Rsquared test score: 0.707 ## Rsquared test score: 0.684 ## Rsquared test score: 0.683 ## Rsquared test score: 0.670 ## Finished plotting and saving 1.4c K Nearest Neighbors Regression – R( Normalized)It can be seen that R squared improves when the features are normalized.
df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI df1 < as.data.frame(sapply(df,as.numeric)) df2 < df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg) df3 < df2[complete.cases(df2),] # Perform MinMaxScaling of feature variables train.X.scaled=MinMaxScaler(train.X) test.X.scaled=MinMaxScaler(test.X) # Create a list of neighbors rsquared < NULL neighbors <c(1,2,4,6,8,10,12,15,20,25,30) for(i in seq_along(neighbors)){ # Fit a KNN model knn=knn.reg(train.X.scaled,test.X.scaled,train.Y,k=i) # Compute R ssquared rsquared[i]=knnRSquared(knn$pred,test.Y) } df < data.frame(neighbors,Rsquared=rsquared) # Plot the number of neighors vs the R squared ggplot(df,aes(x=neighbors,y=Rsquared)) + geom_point() +geom_line(color="blue") + xlab("Number of neighbors") + ylab("R squared") + ggtitle("KNN regression  R squared vs Number of Neighors(Normalized)") 1.4d K Nearest Neighbors Regression – Python( Normalized)R squared improves when the features are normalized with MinMaxScaling
import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.preprocessing import PolynomialFeatures from sklearn.neighbors import KNeighborsRegressor from sklearn.preprocessing import MinMaxScaler autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO88591") autoDF.shape autoDF.columns autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']] autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce') autoDF3=autoDF2.dropna() autoDF3.shape X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']] y=autoDF3['mpg'] # Perform a train/ test split X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) # Use MinMaxScaling scaler = MinMaxScaler() X_train_scaled = scaler.fit_transform(X_train) # Apply scaling on test set X_test_scaled = scaler.transform(X_test) # Create a list of neighbors rsquared=[] neighbors=[1,2,4,6,8,10,12,15,20,25,30] for i in neighbors: # Fit a KNN model knnreg = KNeighborsRegressor(n_neighbors = i).fit(X_train_scaled, y_train) # Compute R squared rsquared.append(knnreg.score(X_test_scaled, y_test)) print('Rsquared test score: {:.3f}' .format(knnreg.score(X_test_scaled, y_test))) # Plot the number of neighors vs the R squared fig4=plt.plot(neighbors,rsquared) fig4=plt.title("KNN regression  R squared vs Number of neighbors(Normalized)") fig4=plt.xlabel("Neighbors") fig4=plt.ylabel("R squared") fig4.figure.savefig('foo4.png', bbox_inches='tight') print "Finished plotting and saving" ## Rsquared test score: 0.703 ## Rsquared test score: 0.810 ## Rsquared test score: 0.830 ## Rsquared test score: 0.838 ## Rsquared test score: 0.834 ## Rsquared test score: 0.828 ## Rsquared test score: 0.827 ## Rsquared test score: 0.826 ## Rsquared test score: 0.816 ## Rsquared test score: 0.815 ## Rsquared test score: 0.809 ## Finished plotting and saving ConclusionIn this initial post I cover the regression models when the output is continous. I intend to touch upon other Machine Learning algorithms.
Comments, suggestions and corrections are welcome.
Watch this this space!
To be continued….
You may like
1. Using Linear Programming (LP) for optimizing bowling change or batting lineup in T20 cricket
2. Neural Networks: The mechanics of backpropagation
3. More book, more cricket! 2nd edition of my books now on Amazon
4. Spicing up a IBM Bluemix cloud app with MongoDB and NodeExpress
5. Introducing cricket package yorkr:Part 4In the block hole!
To see all posts see Index of posts
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
ARIMA models and Intervention Analysis
In my previous tutorial Structural Changes in Global Warming I introduced the strucchange package and some basic examples to date structural breaks in time series. In the present tutorial, I am going to show how dating structural changes (if any) and then Intervention Analysis can help in finding better ARIMA models. Dating structural changes consists in determining if there are any structural breaks in the time series data generating process, and, if so, their dates. Intervention analysis estimates the effect of an external or exogenous intervention on a time series. As an example of intervention, a permanent level shift, as we will see in this tutorial. In our scenario, the external or exogenous intervention is not known in advance, (or supposed to be known), it is inferred from the structural break we will identify.
The dataset considered for the analysis is the Arbuthnot dataset containing information of male and female births in London from year 1639 to 1710. Based on that, a metric representing the fractional excess of boys births versus girls is defined as:
$$
\begin{equation}
\begin{aligned}
\dfrac{(Boys – Girls)}{Girls}
\end{aligned}
\end{equation}
$$
The time series so defined is analyzed to determine candidate ARIMA models. The present tutorial is so organized. First, a brief exploratory analysis is carried on. Then, six ARIMA models are defined, analyzed and compared. Forecast of the time series under analysis is computed. Finally, a short historical background digression is introduced describing what was happening in England on 17th century and citing studies on the matter of sex ratio at birth.
Packages suppressPackageStartupMessages(library(ggplot2)) suppressPackageStartupMessages(library(forecast)) suppressPackageStartupMessages(library(astsa)) suppressPackageStartupMessages(library(lmtest)) suppressPackageStartupMessages(library(fUnitRoots)) suppressPackageStartupMessages(library(FitARMA)) suppressPackageStartupMessages(library(strucchange)) suppressPackageStartupMessages(library(reshape)) suppressPackageStartupMessages(library(Rmisc)) suppressPackageStartupMessages(library(fBasics))Note:The fUnitRoots package is not any longer maintained by CRAN, however it can be installed from source available at the following link:
fUnitRoots version 3010.78 sources
Exploratory AnalysisLoading the Arbuthnot dataset and showing some basic metrics and plots.
url < "https://www.openintro.org/stat/data/arbuthnot.csv" abhutondot < read.csv(url, header=TRUE) nrow(abhutondot) [1] 82 head(abhutondot) year boys girls 1 1629 5218 4683 2 1630 4858 4457 3 1631 4422 4102 4 1632 4994 4590 5 1633 5158 4839 6 1634 5035 4820 abhutondot_rs < melt(abhutondot, id = c("year")) head(abhutondot_rs) year variable value 1 1629 boys 5218 2 1630 boys 4858 3 1631 boys 4422 4 1632 boys 4994 5 1633 boys 5158 6 1634 boys 5035 tail(abhutondot_rs) year variable value 159 1705 girls 7779 160 1706 girls 7417 161 1707 girls 7687 162 1708 girls 7623 163 1709 girls 7380 164 1710 girls 7288 ggplot(data = abhutondot_rs, aes(x = year)) + geom_line(aes(y = value, colour = variable)) + scale_colour_manual(values = c("blue", "red"))Boys births appear to be consistently greater than girls ones. Let us run a t.test to further verify if there is a true difference in the mean of the two groups as represented by boys and girls births.
t.test(value ~ variable, data = abhutondot_rs) Welch Two Sample ttest data: value by variable t = 1.4697, df = 161.77, pvalue = 0.1436 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 128.0016 872.9040 sample estimates: mean in group boys mean in group girls 5907.098 5534.646Based on the pvalue, we cannot reject the null hypothesis that the true difference in means is equal to zero.
basicStats(abhutondot[1]) boys girls nobs 8.200000e+01 8.200000e+01 NAs 0.000000e+00 0.000000e+00 Minimum 2.890000e+03 2.722000e+03 Maximum 8.426000e+03 7.779000e+03 1. Quartile 4.759250e+03 4.457000e+03 3. Quartile 7.576500e+03 7.150250e+03 Mean 5.907098e+03 5.534646e+03 Median 6.073000e+03 5.718000e+03 Sum 4.843820e+05 4.538410e+05 SE Mean 1.825161e+02 1.758222e+02 LCL Mean 5.543948e+03 5.184815e+03 UCL Mean 6.270247e+03 5.884477e+03 Variance 2.731595e+06 2.534902e+06 Stdev 1.652754e+03 1.592137e+03 Skewness 2.139250e01 2.204720e01 Kurtosis 1.221721e+00 1.250893e+00 p1 < ggplot(data = abhutondot_rs, aes(x = variable, y = value)) + geom_boxplot() p2 < ggplot(data = abhutondot, aes(boys)) + geom_density() p3 < ggplot(data = abhutondot, aes(girls)) + geom_density() multiplot(p1, p2, p3, cols = 3)Gives this plot.
Let us define the time series to be analysed with frequency = 1 as data is collected yearly, see ref. [5] for details.
excess_frac < (abhutondot$boys  abhutondot$girls)/abhutondot$girls excess_ts < ts(excess_frac, frequency = 1, start = abhutondot$year[1]) autoplot(excess_ts)Basic statistics metrics are reported.
basicStats(excess_frac) excess_frac nobs 82.000000 NAs 0.000000 Minimum 0.010673 Maximum 0.156075 1. Quartile 0.048469 3. Quartile 0.087510 Mean 0.070748 Median 0.064704 Sum 5.801343 SE Mean 0.003451 LCL Mean 0.063881 UCL Mean 0.077615 Variance 0.000977 Stdev 0.031254 Skewness 0.680042 Kurtosis 0.073620Boys births were at least 1% higher than girls ones, reaching a top percentage excess equal to 15%.
Further, unit roots tests (run by urdfTest() within fUnitRoots package) show that we cannot reject the null hypothesis of unit root presence. See their test statistics compared with critical values below (see ref. [2] for details about the urdfTest() report).
urdftest_lag = floor(12*(length(excess_ts)/100)^0.25) urdfTest(excess_ts, type = "nc", lags = urdftest_lag, doplot = FALSE) Title: Augmented DickeyFuller Unit Root Test Test Results: Test regression none Call: lm(formula = z.diff ~ z.lag.1  1 + z.diff.lag) Residuals: Min 1Q Median 3Q Max 0.052739 0.018246 0.002899 0.019396 0.069349 Coefficients: Estimate Std. Error t value Pr(>t) z.lag.1 0.01465 0.05027 0.291 0.771802 z.diff.lag1 0.71838 0.13552 5.301 1.87e06 *** z.diff.lag2 0.66917 0.16431 4.073 0.000143 *** z.diff.lag3 0.58640 0.18567 3.158 0.002519 ** z.diff.lag4 0.56529 0.19688 2.871 0.005700 ** z.diff.lag5 0.58489 0.20248 2.889 0.005432 ** z.diff.lag6 0.60278 0.20075 3.003 0.003944 ** z.diff.lag7 0.43509 0.20012 2.174 0.033786 * z.diff.lag8 0.28608 0.19283 1.484 0.143335 z.diff.lag9 0.14212 0.18150 0.783 0.436769 z.diff.lag10 0.13232 0.15903 0.832 0.408801 z.diff.lag11 0.07234 0.12774 0.566 0.573346  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.03026 on 58 degrees of freedom Multiple Rsquared: 0.4638, Adjusted Rsquared: 0.3529 Fstatistic: 4.181 on 12 and 58 DF, pvalue: 0.0001034 Value of teststatistic is: 0.2914 Critical values for test statistics: 1pct 5pct 10pct tau1 2.6 1.95 1.61 urdfTest(excess_ts, type = "c", lags = urdftest_lag, doplot = FALSE) Title: Augmented DickeyFuller Unit Root Test Test Results: Test regression drift Call: lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag) Residuals: Min 1Q Median 3Q Max 0.051868 0.018870 0.005227 0.019503 0.067936 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 0.02239 0.01789 1.251 0.2159 z.lag.1 0.31889 0.24824 1.285 0.2041 z.diff.lag1 0.44287 0.25820 1.715 0.0917 . z.diff.lag2 0.40952 0.26418 1.550 0.1266 z.diff.lag3 0.34933 0.26464 1.320 0.1921 z.diff.lag4 0.35207 0.25966 1.356 0.1805 z.diff.lag5 0.39863 0.25053 1.591 0.1171 z.diff.lag6 0.44797 0.23498 1.906 0.0616 . z.diff.lag7 0.31103 0.22246 1.398 0.1675 z.diff.lag8 0.19044 0.20656 0.922 0.3604 z.diff.lag9 0.07128 0.18928 0.377 0.7079 z.diff.lag10 0.18023 0.16283 1.107 0.2730 z.diff.lag11 0.04154 0.12948 0.321 0.7495  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.03012 on 57 degrees of freedom Multiple Rsquared: 0.4781, Adjusted Rsquared: 0.3683 Fstatistic: 4.352 on 12 and 57 DF, pvalue: 6.962e05 Value of teststatistic is: 1.2846 0.8257 Critical values for test statistics: 1pct 5pct 10pct tau2 3.51 2.89 2.58 phi1 6.70 4.71 3.86ACF and PACF plots are given.
par(mfrow=c(1,2)) acf(excess_ts) pacf(excess_ts)We observe the autocorrelation spike at lag = 10 beyond confidence region. That suggests the presence of a seasonal component with period = 10. Structural changes are now investigated. First let us verify if regression against a constant is significative for our time series.
summary(lm(excess_ts ~ 1)) Call: lm(formula = excess_ts ~ 1) Residuals: Min 1Q Median 3Q Max 0.060075 0.022279 0.006044 0.016762 0.085327 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 0.070748 0.003451 20.5 <2e16 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.03125 on 81 degrees of freedomThe intercept is reported as significative. Let us see if there are any structural breaks.
(break_point < breakpoints(excess_ts ~ 1)) Optimal 2segment partition: Call: breakpoints.formula(formula = excess_ts ~ 1) Breakpoints at observation number: 42 Corresponding to breakdates: 1670 plot(break_point) summary(break_point) Optimal (m+1)segment partition: Call: breakpoints.formula(formula = excess_ts ~ 1) Breakpoints at observation number: m = 1 42 m = 2 20 42 m = 3 20 35 48 m = 4 20 35 50 66 m = 5 17 30 42 56 69 Corresponding to breakdates: m = 1 1670 m = 2 1648 1670 m = 3 1648 1663 1676 m = 4 1648 1663 1678 1694 m = 5 1645 1658 1670 1684 1697 Fit: m 0 1 2 3 4 5 RSS 0.07912 0.06840 0.06210 0.06022 0.05826 0.05894 BIC 327.84807 330.97945 330.08081 323.78985 317.68933 307.92410The BIC minimum value is reached when m = 1, hence just one break point is determined corresponding to year 1670. Let us plot the original time series against its structural break and its confidence interval.
plot(excess_ts) lines(fitted(break_point, breaks = 1), col = 4) lines(confint(break_point, breaks = 1))Boys vs girls sex ratio at birth changed from:
fitted(break_point)[1] 0.08190905to:
fitted(breakpoint)[length(excess_ts)] 0.05902908Running a t.test() to verify further the difference in mean is significative across the two time windows identified by the breakpoint date, year 1670.
break_date < breakdates(break_point) win_1 < window(excess_ts, end = break_date) win_2 < window(excess_ts, start = break_date + 1) t.test(win_1, win_2) Welch Two Sample ttest data: win_1 and win_2 t = 3.5773, df = 70.961, pvalue = 0.0006305 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.01012671 0.03563322 sample estimates: mean of x mean of y 0.08190905 0.05902908Based on reported pvalue, we reject the null hypothesis that the true difference is equal to zero.
ARIMA ModelsI am going to compare the following six ARIMA models (represented with the usual (p,d,q)(P,D,Q)[S] notation):
1. non seasonal (1,1,1), as determined by auto.arima() within forecast package 2. seasonal (1,0,0)(0,0,1)[10] 3. seasonal (1,0,0)(1,0,0)[10] 4. seasonal (0,0,0)(0,0,1)[10] with level shift regressor as intervention variable 5. seasonal (1,0,0)(0,0,1)[10] with level shift regressor as intervention variable 6. non seasonal (1,0,0) with level shift regressor as intervention variableHere we go.
Model #1The first model is determined by the auto.arima() function within the forecast package, using the options:
a. stepwise = FALSE, which allows for a more indepth search of potential models
b. trace = TRUE, which allows to get a list of all the investigated models
Further, as default input to auto.arima() :
c. stationary = FALSE, so that models search is not restricted to stationary models
d. seasonal = TRUE, so that models search is not restricted to non seasonal models
As we can see, all investigated models have d=1, which is congruent with the augmented DickeyFuller test results.
summary(model_1) Series: excess_ts ARIMA(1,1,1) Coefficients: ar1 ma1 0.2224 0.9258 s.e. 0.1318 0.0683 sigma^2 estimated as 0.0009316: log likelihood=167.94 AIC=329.88 AICc=329.57 BIC=322.7 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 0.002931698 0.02995934 0.02405062 27.05674 46.53832 0.8292635 0.01005689The significance of our (1,1,1) model coefficients is further investigated.
coeftest(model_1) z test of coefficients: Estimate Std. Error z value Pr(>z) ar1 0.222363 0.131778 1.6874 0.09153 . ma1 0.925836 0.068276 13.5603 < 2e16 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Model #2A spike at lag = 1 in the ACF plot suggests the presence of an autoregressive component. Our second model candidate takes into account the seasonality observed at lag = 10. As a result, the candidate model (1,0,0)(0,0,1)[10] is investigated.
model_2 < Arima(excess_ts, order = c(1,0,0), seasonal = list(order = c(0,0,1), period = 10), include.mean = TRUE) summary(model_2) Series: excess_ts ARIMA(1,0,0)(0,0,1)[10] with nonzero mean Coefficients: ar1 sma1 mean 0.2373 0.3441 0.0708 s.e. 0.1104 0.1111 0.0053 sigma^2 estimated as 0.0008129: log likelihood=176.23 AIC=344.46 AICc=343.94 BIC=334.83 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 0.0002212383 0.02798445 0.02274858 21.47547 42.96717 0.7843692 0.004420048 coeftest(model_2) z test of coefficients: Estimate Std. Error z value Pr(>z) ar1 0.2372975 0.1104323 2.1488 0.031650 * sma1 0.3440811 0.1110791 3.0976 0.001951 ** intercept 0.0707836 0.0052663 13.4409 < 2.2e16 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Model #3In the third model, I introduce a seasonal autoregressive component in place of the moving average one as present into model #2.
model_3 < Arima(excess_ts, order = c(1,0,0), seasonal = list(order = c(1,0,0), period = 10), include.mean = TRUE) summary(model_3) Series: excess_ts ARIMA(1,0,0)(1,0,0)[10] with nonzero mean Coefficients: ar1 sar1 mean 0.2637 0.3185 0.0705 s.e. 0.1069 0.1028 0.0058 sigma^2 estimated as 0.0008173: log likelihood=176.1 AIC=344.19 AICc=343.67 BIC=334.56 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 0.0002070952 0.02806013 0.02273145 21.42509 42.85735 0.7837788 0.005665764 coeftest(model_3) z test of coefficients: Estimate Std. Error z value Pr(>z) ar1 0.2636602 0.1069472 2.4653 0.013689 * sar1 0.3185397 0.1027903 3.0989 0.001942 ** intercept 0.0704636 0.0058313 12.0836 < 2.2e16 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Model #4This model takes into account the level shift highlighted by the structural change analysis and the seasonal component at lag = 10 observed in the ACF plot. To represent the structural change as level shift, a regressor variable named as level is defined as equal to zero for the timeline preceeding the breakpoint date and as equal to one afterwards such date. In other words, it is a step function which represents a permanent level shift. Such variable is input to the Arima() function as xreg argument. That is one of the most simple representation of an intervention variable, for a more complete overview see ref. [6].
level < c(rep(0, break_point$breakpoints), rep(1, length(excess_ts)  break_point$breakpoints)) model_4 < Arima(excess_ts, order = c(0,0,0), seasonal = list(order = c(0,0,1), period = 10), xreg = level, include.mean = TRUE) summary(model_4) Series: excess_ts Regression with ARIMA(0,0,0)(0,0,1)[10] errors Coefficients: sma1 intercept level 0.3437 0.0817 0.0225 s.e. 0.1135 0.0053 0.0072 sigma^2 estimated as 0.0007706: log likelihood=178.45 AIC=348.9 AICc=348.38 BIC=339.27 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 0.0001703111 0.02724729 0.02184016 19.82639 41.28977 0.7530469 0.1608774 coeftest(model_4) z test of coefficients: Estimate Std. Error z value Pr(>z) sma1 0.3437109 0.1135387 3.0273 0.002468 ** intercept 0.0817445 0.0052825 15.4745 < 2.2e16 *** level 0.0225294 0.0072468 3.1089 0.001878 **  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1The ‘level’ regressor coefficient indicates an average 2.2% decrease in the boys vs. girls excess birth ratio, afterwards year 1670.
Model #5The present model adds to model #4 an autoregressive component.
model_5 < Arima(excess_ts, order = c(1,0,0), seasonal = list(order = c(0,0,1), period=10), xreg = level, include.mean = TRUE) summary(model_5) Series: excess_ts Regression with ARIMA(1,0,0)(0,0,1)[10] errors Coefficients: ar1 sma1 intercept level 0.1649 0.3188 0.0820 0.0230 s.e. 0.1099 0.1112 0.0061 0.0084 sigma^2 estimated as 0.0007612: log likelihood=179.56 AIC=349.11 AICc=348.32 BIC=337.08 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 8.225255e05 0.02690781 0.02174375 19.42485 40.90147 0.7497229 0.007234682 coeftest(model_5) z test of coefficients: Estimate Std. Error z value Pr(>z) ar1 0.1648878 0.1099118 1.5002 0.133567 sma1 0.3187896 0.1112301 2.8660 0.004156 ** intercept 0.0819981 0.0061227 13.3926 < 2.2e16 *** level 0.0230176 0.0083874 2.7443 0.006064 **  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1The autoregressive coefficient ar1 is reported as not significative, hence this model is not taken into account further, even though it would provide a (very) slight improvement in the AIC metric.
Model #6This model consideres the structural change as in model #4 without the seasonal component. That because I want to evaluate if any benefits comes from including a seasonal component.
model_6 < Arima(excess_ts, order = c(1,0,0), xreg = level, include.mean = TRUE) summary(model_6) Series: excess_ts Regression with ARIMA(1,0,0) errors Coefficients: ar1 intercept level 0.1820 0.0821 0.0232 s.e. 0.1088 0.0053 0.0076 sigma^2 estimated as 0.0008369: log likelihood=175.68 AIC=343.35 AICc=342.83 BIC=333.73 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 7.777648e05 0.02839508 0.02258574 21.93086 43.2444 0.7787544 0.0003897161 coeftest(model_6) z test of coefficients: Estimate Std. Error z value Pr(>z) ar1 0.1820054 0.1088357 1.6723 0.094466 . intercept 0.0821470 0.0053294 15.4139 < 2.2e16 *** level 0.0232481 0.0076044 3.0572 0.002234 **  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Models Residuals AnalysisIt is relevant to verify if our models residuals are white noise or not. If not, the model under analysis has not captured all the structure of our original time series. For this purpose, I will take advantage of the checkresiduals() function within forecast package and the sarima() within the astsa package. What I like of the sarima() function is the residuals qqplot() with confidence region and the LjungBox plot to check significance of residuals correlations. If you like to further verify the LjungBox test results, I would suggest to take advantage of the LjungBoxTest() function within FitARMA package.
Notes:
1. Model #5 was taken out of the prosecution of the analysis, hence its residuals will not be checked.
2. checkresiduals() and sarima() gives textual outputs which are omitted as equivalent information is already included elsewhere. The checkresiduals() reports a short LjungBox test result. The sarima() function reports a textual model summary showing coefficients and metrics similar to already shown summaries.
Checking ARIMA model (1,1,1) residuals.
checkresiduals(model_1)It is important to verify the significance of residuals autocorrelation, in order to see if the spike at lag = 16 is as such. In fact, the purpose of running the LjungBox test is to verify if any autocorrelation beyond the confidence region (as seen in the ACF plot) is a true positive (pvalue < 0.05) or is false positive (pvalue >= 0.05).
LjungBoxTest(residuals(model_1), k = 2, lag.max = 20) m Qm pvalue 1 0.01 0.92611002 2 0.01 0.91139925 3 0.17 0.68276539 4 0.82 0.66216496 5 1.35 0.71745835 6 1.99 0.73828536 7 3.32 0.65017878 8 3.98 0.67886254 9 5.16 0.64076272 10 13.34 0.10075806 11 15.32 0.08260244 12 15.32 0.12066369 13 15.35 0.16692082 14 15.39 0.22084407 15 15.40 0.28310047 16 23.69 0.04989871 17 25.63 0.04204503 18 27.65 0.03480954 19 30.06 0.02590249 20 30.07 0.03680262The pvalue at lag = 16 is < 0.05 hence the autocorrelation spike at lag = 16 out of the confidence interval is significative. Our model #1 has not captured all the structure of the original time series. Also autocorrelations at lags = {17, 18, 19, 20} have pvalue < 0.05, however they stand within the confidence inteval.
Now doing the same with sarima() function.
sarima(excess_ts, p = 1, d = 1, q = 1)The sarima() output plot shows results congruent with previous comments.
Model #2 Residuals DiagnosticChecking ARIMA (1,0,0)(0,0,1)[10] model residuals.
checkresiduals(model_2)We observe how the model #2 does not show autocorrelation spikes above the confidence interval, and that is a confirmation of the presence of seasonality with period = 10.
LjungBoxTest(residuals(model_2), k = 2, lag.max = 20) m Qm pvalue 1 0.00 0.9674875 2 0.00 0.9545816 3 0.30 0.5815774 4 0.69 0.7096699 5 0.82 0.8442508 6 0.98 0.9121811 7 2.01 0.8481327 8 4.24 0.6442302 9 8.56 0.2861290 10 8.63 0.3742209 11 10.06 0.3459882 12 10.13 0.4290298 13 10.15 0.5172343 14 10.20 0.5985932 15 10.44 0.6577499 16 14.30 0.4275627 17 17.14 0.3104476 18 18.86 0.2759461 19 22.35 0.1715052 20 22.41 0.2142307No pvalue > 0.05 are shown by the LjungBox test report. Now showing sarima() plots.
sarima(excess_ts, p = 1, d = 0, q = 0, P = 0, D = 0, Q = 1, S = 10)The sarima() output plot shows results congruent with previous comments. As a conclusion, model #2 has white noise residuals.
Model #3 Residuals DiagnosticChecking ARIMA (1,0,0)(1,0,0)[10] model residuals.
checkresiduals(model_3) LjungBoxTest(residuals(model_3), k = 2, lag.max = 20) m Qm pvalue 1 0.00 0.9583318 2 0.00 0.9553365 3 0.18 0.6719971 4 0.37 0.8313599 5 0.46 0.9285781 6 0.63 0.9600113 7 1.63 0.8975737 8 3.90 0.6901431 9 8.23 0.3126132 10 8.34 0.4005182 11 10.36 0.3222430 12 10.36 0.4091634 13 10.39 0.4960029 14 10.52 0.5708185 15 10.78 0.6290166 16 14.81 0.3914043 17 17.91 0.2675070 18 19.69 0.2343648 19 23.37 0.1374412 20 23.70 0.1651785 sarima(excess_ts, p = 1, d = 0, q = 0, P = 1, D = 0, Q = 0, S = 10)Model#3 has white noise residuals.
Model #4 Residuals DiagnosticChecking residuals of the ARIMA (0,0,0)(0,0,1)[10] model with level shift.
checkresiduals(model_4) LjungBoxTest(residuals(model_4), k = 1, lag.max = 20) m Qm pvalue 1 2.20 0.1379312 2 2.23 0.1349922 3 2.24 0.3262675 4 3.68 0.2977682 5 4.99 0.2884494 6 5.23 0.3884858 7 5.52 0.4787801 8 7.45 0.3837810 9 10.78 0.2142605 10 10.79 0.2905934 11 12.61 0.2465522 12 12.61 0.3198632 13 12.61 0.3979718 14 12.63 0.4769589 15 12.65 0.5538915 16 16.37 0.3578806 17 16.77 0.4005631 18 19.73 0.2882971 19 25.25 0.1182396 20 25.31 0.1504763 sarima(excess_ts, p = 0, d = 0, q = 0, P = 0, D = 0, Q = 1, S = 10, xreg = level)As a conclusion, model #4 has white noise residuals.
Model #6 Residuals DiagnosticChecking residuals of the ARIMA (1,0,0) model with level shift.
checkresiduals(model_6)We can observe ACF spikes above the confidence region for lags = {10, 16}.
LjungBoxTest(residuals(model_6), k = 1, lag.max = 20) m Qm pvalue 1 0.00 0.99713258 2 0.00 0.96036600 3 0.07 0.96612792 4 1.51 0.67947886 5 2.70 0.60998895 6 4.47 0.48361437 7 5.20 0.51895133 8 5.49 0.59997164 9 6.51 0.58979614 10 14.72 0.09890580 11 17.09 0.07239050 12 17.21 0.10175761 13 17.21 0.14179063 14 17.24 0.18844158 15 17.34 0.23872998 16 25.31 0.04596230 17 27.53 0.03591335 18 29.50 0.03021450 19 31.47 0.02537517 20 31.71 0.03370585The LjungBox test reports the residuals autocorrelation at lag = 10 as not signficative (pvalue > 0.05). The lag = 16 residuals autocorrelation is significative (pvalue < 0.05). Let us do same verifications with sarima().
sarima(excess_ts, p = 1, d = 0, q = 0, xreg = level)The sarima() plots confirm the presence of significative autocorrelations in the residuals at lag = 16. As a conclusion, this model does not capture all the structure of our original time series.
As overall conclusion, only the seasonal models have white noise residuals.
Models SummaryFinally, it is time to gather the overall AIC, AICc and BIC metrics of our five candidate models (please remember that model #5 has been cut off from the prosecution of the analysis) and choose the final model.
df < data.frame(col_1_res = c(model_1$aic, model_2$aic, model_3$aic, model_4$aic, model_6$aic), col_2_res = c(model_1$aicc, model_2$aicc, model_3$aicc, model_4$aicc, model_6$aicc), col_3_res = c(model_1$bic, model_2$bic, model_3$bic, model_4$bic, model_6$bic)) colnames(df) < c("AIC", "AICc", "BIC") rownames(df) < c("ARIMA(1,1,1)", "ARIMA(1,0,0)(0,0,1)[10]", "ARIMA(1,0,0)(1,0,0)[10]", "ARIMA(0,0,0)(0,0,1)[10] with level shift", "ARIMA(1,0,0) with level shift") df AIC AICc BIC ARIMA(1,1,1) 329.8844 329.5727 322.7010 ARIMA(1,0,0)(0,0,1)[10] 344.4575 343.9380 334.8306 ARIMA(1,0,0)(1,0,0)[10] 344.1906 343.6711 334.5637 ARIMA(0,0,0)(0,0,1)[10] with level shift 348.8963 348.3768 339.2694 ARIMA(1,0,0) with level shift 343.3529 342.8334 333.7260The model which provides the best AIC, AICc and BIC metrics at the same time is model #4, ARIMA(0,0,0)(0,0,1)[10] with level shift.
Note: In case of structural changes, the (augmented) DickeyFuller test is biased toward the non rejection of the null, as ref. [2] explains. This may justify why the null hypothesis of unit root presence was not rejected by such test and model #1 performs worse than the other ones in terms of AIC metric. Further, you may verify that with d=1 in models from #2 up to #6, the AIC metric does not improve.
ForecastI take advantage of the forecast() function provided within forecast package using model #4 as input. The xreg variable is determined as a constant level equal to one congruently with the structural analysis results.
h_fut < 20 plot(forecast(model_4, h = h_fut, xreg = rep(1, h_fut))) Historical BackgroundSo we observed a level shift equal approximately to 2.25% in the boys vs girls births excess ratio occurring on year 1670. Two questions arises:
1. What could have influenced the sexratio at birth ?
2. What was happening in London during the 17th century ?
There are studies showing results in support of the fact that sexratio at birth is influenced by war periods. Studies such as ref. [7], [8] and [9] suggest an increase of boys births during and/or after war time. General justification is for an adaptive response to external conditions and stresses. Differently, studies as ref. [10] state that there is no statistical evidence of war time influence on sexratio at births. Under normal circumstances, the boys vs girls sex ratio at birth is approximately 105:100 on average, (according to some sources), and generally between 102:100 and 108:100.
Our time series covers most of the following eras of the England history (ref. [11]):
* Early Stuart era: 16031660
* Later Stuart era 16601714
The English Civil War occured during the Early Stuart era and consisted of a series of armed conflicts and political machinations that took place between Parliamentarians (known as Roundheads) and Royalists (known as Cavaliers) between 1642 and 1651. The English conflict left some 34,000 Parliamentarians and 50,000 Royalists dead, while at least 100,000 men and women died from warrelated diseases, bringing the total death toll caused by the three civil wars in England to almost 200,000 (see ref. [12]).
If we step back to view the first plot showing the abhutondot dataset, we can observe a sharp drop on births (both boys and girls) between 1642 and 1651, as testify a period of scarce resources and famine, and we can infer it was due to the civil war. Let us run a quick analysis on the total number of births.
abhutondot.ts < ts(abhutondot$boys + abhutondot$girls, frequency = 1 , start = abhutondot$year[1]) autoplot(abhutondot.ts)I then verify if any structural breaks are there with respect a constant level as regressor.
summary(lm(abhutondot.ts ~ 1)) Call: lm(formula = abhutondot.ts ~ 1) Residuals: Min 1Q Median 3Q Max 5829.7 2243.0 371.3 3281.3 4703.3 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 11442 358 31.96 <2e16 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3242 on 81 degrees of freedomThe regression is reported as significative. Going on with the search for structural breaks, if any,
(break_point < breakpoints(abhutondot.ts ~ 1)) Optimal 4segment partition: Call: breakpoints.formula(formula = abhutondot.ts ~ 1) Breakpoints at observation number: 15 33 52 Corresponding to breakdates: 1643 1661 1680 plot(break_point) summary(break_point) Optimal (m+1)segment partition: Call: breakpoints.formula(formula = abhutondot.ts ~ 1) Breakpoints at observation number: m = 1 39 m = 2 33 52 m = 3 15 33 52 m = 4 15 33 52 68 m = 5 15 32 44 56 68 Corresponding to breakdates: m = 1 1667 m = 2 1661 1680 m = 3 1643 1661 1680 m = 4 1643 1661 1680 1696 m = 5 1643 1660 1672 1684 1696 Fit: m 0 1 2 3 4 5 RSS 851165494 211346686 139782582 63564154 59593922 62188019 BIC 1566 1461 1436 1380 1383 1396The BIC minimum value is reached with m = 3. Let us plot the original time series against its structural breaks and their confidence intervals.
plot(abhutondot.ts) fitted.ts < fitted(break_point, breaks = 3) lines(fitted.ts, col = 4) lines(confint(break_point, breaks = 3))The fitted levels and the breakpoints dates are as follows.
unique(as.integer(fitted.ts)) [1] 9843 6791 11645 14902 breakdates(break_point, breaks = 3) [1] 1648 1663 1676So it is confirmed that the total number of births went through a sequence of level shift due to exogeneous shocks. The civil war is very likely determining the first break and its end the second one. It is remarkable the total births decrease from the 9843 level down to 6791 occurring around year 1648. As well remarkable, the level up shift happened on year 1663, afterwards the civil war end. The excess sex ratio structural break on year 1670 occurred at a time approximately in between total births second and third breaks.
The fitted multiple level shifts (as determined by the structural breaks analysis) can be used as intervention variable to fit an ARIMA model, as shown below.
fitted.ts < fitted(break_point, breaks = 3) autoplot(fitted.ts) abhutondot_xreg < Arima(abhutondot.ts, order = c(0,1,1), xreg = fitted.ts, include.mean = TRUE) summary(abhutondot_xreg) Series: abhutondot.ts Regression with ARIMA(0,1,1) errors Coefficients: ma1 fitted.ts 0.5481 0.5580 s.e. 0.1507 0.1501 sigma^2 estimated as 743937: log likelihood=661.65 AIC=1329.3 AICc=1329.61 BIC=1336.48 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 71.60873 846.5933 622.1021 0.0132448 5.92764 1.002253 0.08043183 coeftest(abhutondot_xreg) z test of coefficients: Estimate Std. Error z value Pr(>z) ma1 0.54809 0.15071 3.6368 0.0002761 *** fitted.ts 0.55799 0.15011 3.7173 0.0002013 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 checkresiduals(abhutondot_xreg) LjungBoxTest(residuals(abhutondot_xreg), k=1, lag.max=20) m Qm pvalue 1 1.39 0.2387836 2 2.26 0.1325458 3 2.31 0.3156867 4 2.69 0.4416912 5 2.93 0.5701546 6 3.32 0.6503402 7 4.14 0.6580172 8 4.53 0.7170882 9 4.54 0.8054321 10 7.86 0.5479118 11 8.51 0.5791178 12 8.54 0.6644773 13 8.69 0.7292904 14 9.31 0.7491884 15 11.16 0.6734453 16 11.50 0.7163115 17 12.58 0.7035068 18 12.60 0.7627357 19 13.01 0.7906889 20 14.83 0.7334703 sarima(abhutondot.ts, p=0, d=1, q=1, xreg = fitted.ts)The residuals are white noise.
ConclusionsTo have spot seasonality and level shifts were important aspects in our ARIMA models analysis. The seasonal component allowed to define models with white noise residuals. The structural breaks analysis allowed to define models with better AIC metric introducing as regressor the identified level shifts. We also had the chance to go through some historical remarks of the history of England and get to know about sex ratio at birth studies.
If you have any questions, please feel free to comment below.
References
 [1] Applied Econometrics with R, Achim Zeileis, Christian Kleiber – Springer Ed.
[2] Applied Econometrics Time Series, Walter Enders – Wiley Ed.
[3] Intervention Analysis – Penn State, Eberly College of Science – STAT510
[4] Forecast Package
[5] Seasonal Periods – Rob J. Hyndman
[6] Time Series Analysis With Applications in R – Jonathan D. Cryer, Kung.Sik Chan – Springer Ed.,
[7] Why are more boys born during war ? – Dirk Bethmann, Michael Kvasnicka
[8] Evolutionary ecology of human birth sex ratio – Samuli Helle, Samuli Helama, Kalle Lertola
[9] The Sex Ratio at Birth Following Periods of Conflict – Patrick Clarkin
[10] The ratio of male to female births as affected by wars – E. R. Shaw – American Statistical Association
[11] Early Modern Britain – Wikipedia
[12] English Civil Wars
Related Post
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
Governance, Engagement, and Resistance in the Open Science Movement: A Comparative Study
(This article was first published on rOpenSci Blog, and kindly contributed to Rbloggers)
A growing community of scientists from a variety of disciplines is moving the norms of scientific research toward open practices. Supporters of open science hope to increase the quality and efficiency of research by enabling the widespread sharing of datasets, research software source code, publications, and other processes and products of research. The speed at which the open science community seems to be growing mirrors the rapid development of technological capabilities, including robust open source scientific software, new services for data sharing and publication, and novel data science techniques for working with massive datasets. Organizations like rOpenSci harness such capabilities and deploy various combinations of these research tools, or what I refer to here as open science infrastructures, to facilitate open science.
As studies of other digital infrastructures have pointed out, developing and deploying the technological capabilities that support innovative work within a community of practitioners constitutes just one part of making innovation happen. As quickly as the technical solutions to improving scientific research may be developing, a host of organizational and social issues are lagging behind and hampering the open science community’s ability to inscribe open practices in the culture of scientific research. Remedying organizational and social issues requires paying attention to open science infrastructures’ human components, such as designers, administrators, and users, as well as the policies, practices, and organizational structures that contribute to the smooth functioning of these systems.12 These elements of infrastructure development have, in the past, proven to be equal to or more important than technical capabilities in determining the trajectory the infrastructure takes (e.g., whether it “succeeds” or “fails”).34.
As a postdoc with rOpenSci, I have begun a qualitative, ethnographic project to explore the organizational and social processes involved in making open science the norm in two disciplines: astronomy and ecology. I focus on these two disciplines to narrow, isolate, and compare the set of contextual factors (e.g., disciplinary histories, research norms, and the like) that might influence perceptions of open science. Specifically, I aim to answer the following research questions (RQ):
RQ1a: What are the primary motivations of scientists who actively engage with open science infrastructures?
RQ1b: What are the factors that influence resistance to open science among some scientists?
RQ2: What strategies do open science infrastructure leaders use to encourage participation, govern contributions, and overcome resistance to open science infrastructure use?
a. To what extent do governance strategies balance standardization and flexibility, centralization and decentralization, and voluntary and mandatory contributions?
b. By what mechanisms are open science policies and practices enforced?
c. What are the commonalities and differences in the rationale behind choices of governance strategies?
Below, I describe how I am systematically investigating these questions in two parts. In Part 1, I am identifying the issues raised by scientists who engage with or resist the open science movement. In Part 2, I am studying the governance strategies open science leaders and decisionmakers use to elicit engagement with open science infrastructures in these disciplines.
Part 1: Engagement with and Resistance to Open ScienceI am firmly rooted in a research tradition which emphasizes that studying the uptake of a new technology or technological approach, no matter the type of work or profession, begins with capturing how the people charged with changing their activities respond to the change “on the ground.” In this vein, Part 1 of the study aims to lend empirical support or opposition to arguments for and against open science that are commonly found in opinion pieces, on social media, and in organizational mission statements. A holistic reading of such documents reveals several commonalities in the positions for and against open science. Supporters of open science often cite increased transparency, reproducibility, and collaboration as the overwhelming benefits of making scientific research processes and products openly available. Detractors highlight concerns over “scooping,” ownership, and the time costs of curating and publishing code and data.
I am seeking to verify and test these claims by interviewing and surveying astronomers and ecologists or, more broadly, earth and environmental scientists who fall on various parts of the open science engagementtoresistance spectrum. I am conducting interviews using a semistructured interview protocol5 across all interviewees. I will then use a qualitative data analysis approach based on the grounded theory method6 to extract themes from the responses, focusing on the factors that promote engagement (e.g., making data available, spending time developing research software, or making publications openly accessible) or resistance (e.g., unwillingness to share code used in a study or protecting access to research data). Similar questions will be asked at scale via a survey.
Armed with themes from the responses, I will clarify and refine the claims often made in the public sphere about the benefits and drawbacks of open science. I hope to develop this part of the study into actionable recommendations for promoting open science, governing contributions to open science repositories, and addressing the concerns of scientists who are hesitant about engagement.
Part 2: Focusing on GovernanceEven with interviews and surveys of scientists on the ground, it is difficult to systematically trace and analyze the totality of social and political processes that support open science infrastructure development because the processes occur across geographic, disciplinary, and other boundaries.
However, as others have pointed out,7 the organizational and social elements of digital infrastructure development often become visible and amenable to study through infrastructure governance. Governance refers to the combination of “executive and management roles, program oversight functions organized into structures, and policies that define management principles and decision making.”8 Effective governance provides projects with the direction and oversight necessary to achieve desired outcomes of infrastructure development while allowing room for creativity and innovation.2910 Studying a project’s governance surfaces the negotiation processes that occur among stakeholders—users, managers, organizations, policymakers, and the like—throughout the development process. Outcomes include agreements about the types of technologies used, standards defining the best practices for technology use, and other policies to ensure that a robust, sustainable infrastructure evolves.911
Despite the scientific research community’s increasing reliance on open science infrastructures, few studies compare different infrastructure governance strategies2 and even fewer develop new or revised strategies for governing infrastructure development and use.12 The primary goal of this part of the project is to address this gap in our understanding of the governance strategies used to create, maintain, and grow open science infrastructures.
I am administering this part of the study by conducting indepth, semistructured interviews with leaders of various open science infrastructure projects supporting work in astronomy and ecology. I define “leaders” in this context as individuals or small groups of individuals who make decisions about the management of open science infrastructures and their component parts. This set of leaders includes founders and administrators of widelyused scientific software packages and collections of packages, of open data repositories, of open access publication and preprint services, and various combinations of open science tools. Furthermore, I intend to interview the leaders of organizations with which the open science community interacts—top publication editors, for example—to gauge how open science practices and processes are being governed outside of active open science organizations.
I will conduct qualitative coding as described above to develop themes from the responses of open science leaders. I will then ground these themes in the literature on digital infrastructure governance—which emphasizes gradual, decentralized, and voluntary development—and look for avenues to improve governance strategies.
Alongside the interview and survey methods, I am actively observing and retaining primary documents from the ongoing discourse around open science in popular scientific communication publications (e.g., Nature and Science), conferences and meetings (e.g., OpenCon and disciplinespecific hackweeks), and in the popular media/social media (e.g., The New York Times and Twitter threads).
Preliminary ThemesI entered this project with a very basic understanding of how open science “works”—the technical and social mechanisms by which scientists make processes and outputs publicly available. In learning about the open science movement, in general and in particular instantiations, I’ve begun to see the intricacies involved in efforts to change scientific research and its modes of communication: research data publication, citation, and access; journal publication availability; and research software development and software citation standards. Within the community trying to sustain these changes are participants and leaders who are facing and tackling several important issues headon. I list some of the most common engagement, resistance, and governance challenges appearing in interview and observation transcripts below.
Participation challenges Overcoming the fear of sharing code and data, specifically the fear of sharing “messy” code and the fear of being shamed for research errors.
 Defending the time and financial costs of participation in open science—particularly open source software development—to supervisors, collaborators, or tenure and promotion panels who are not engaged with open science.
 Finding time to make code and data usable for others (e.g., through good documentation or complete metadata) and, subsequently, finding a home where code and data can easily be searched and found.
 Navigating the issue of convincing researchers that software development and data publication/archiving “count” as research products, even though existing funding, publication, and tenure and promotion models may not yet value those contributions.
 Developing guidelines and processes for conducting peer review on research publication, software, and data contributions, especially the tensions involved in “open review.”
 Deciding whose responsibility it is to enforce code and data publication standards or policies, both within open science organizations and in traditional outlets like academic journals.
The points raised in this post and the questions guiding my project might seem like discussions you’ve had too many times over coffee during a hackathon break or over beers after a conference session. If so, I’d love to hear from you, even if you are not an astronomer, an ecologist, or an active leader of an open science infrastructure. I am always looking for new ideas, both confirming and disconfirming, to refine my approach to this project.

Braa, J., Hanseth, O., Heywood, A., Mohammed, W., Shaw, V. 2007. Developing health information systems in developing countries: The flexible standards strategy. MIS Quarterly, 31(2), 381402. https://doi.org/10.2307/25148796 ↩

Tilson, D., Lyytinen, K., Sørensen, C. 2010 Digital infrastructures: The missing IS research agenda. Information Systems Research, 21(4), 748759. https://doi.org/10.1287/isre.1100.0318 ↩

Borgman, C. L. 2010. Scholarship in the digital age: Information, infrastructure, and the Internet. MIT Press, Cambridge, MA. ISBN: 9780262250863 ↩

Vaast, E., Walsham, G. 2009. Transsituated learning: Supporting a network of practice with an information infrastructure. Information Systems Research, 20(4), 547564. https://doi.org/10.1287/isre.1080.0228 ↩

Spradley, J. P. (2016). The ethnographic interview. Longegrove, IL: Waveland Press. ISBN: 0030444969 ↩

Corbin, J., Strauss, A., 1990. Grounded theory research: Procedures, canons, and evaluative criteria. Qualitative Sociology, 13(1), 321. https://doi.org/10.1007/BF00988593 ↩

Barrett, M., Davidson, E., Prabhu, J., Vargo, S. L. 2015. Service innovation in the digital age: Key contributions and future directions. MIS Quarterly, 39(1) 135154. DOI: 10.25300/MISQ/2015/39:1.03 ↩

Hanford, M. 2005. Defining program governance and structure. IBM developerWorks. Available at: https://www.ibm.com/developerworks/rational/library/apr05/hanford/. ↩

Star, S. L., Ruhleder, K. 1996. Steps toward an ecology of infrastructure: Design and access for large information spaces. Information Systems Research, 7(1), 111134. https://doi.org/10.1287/isre.7.1.111 ↩

Edwards, P. N., Jackson, S. J., Bowker, G. C., Knobel, C. P. 2007. Understanding infrastructure: Dynamics, tensions, and design. Final report for Workshop on History and Theory of Infrastructure: Lessons for New Scientific Cyberinfrastructures. NSF Report. Available at: https://deepblue.lib.umich.edu/handle/2027.42/49353. ↩

Hanseth, O., Jacucci, E., Grisot, M., Aanestad, M. 2006. Reflexive standardization: Side effects and complexity in standard making. MIS Quarterly, 30(1), 563581. https://doi.org/10.2307/25148773 ↩

Hanseth, O., & Lyytinen, K. (2010). Design theory for dynamic complexity in information infrastructures: the case of building internet. Journal of Information Technology, 25(1), 119. ↩
To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RLadies global tour
(This article was first published on Maëlle, and kindly contributed to Rbloggers)
It was recently brought to my attention by Hannah Frick that there are now sooo many RLadies chapters around the world! RLadies is a worldwide organization to promote gender diversity in the R community, and I’m very grateful to be part of this community through which I met so many awesome ladies! Since we’re all connected, it has now happened quite a few times that RLadies gave talks at chapters outside of their hometowns. An RLady from Taiwan giving a talk in Madrid while on a trip in Europe and another one doing the same in Lisbon, an RLady from San Francisco presenting at the London and Barcelona chapters thanks to a conference on the continent, an RLady from Uruguay sharing her experience for the New York City and San Francisco chapters… It’s like rockstars tours!
Therefore we RLadies often joke about doing an exhaustive global tour. Hannah made me think about this tour again… If someone were to really visit all of the chapters, what would be the shortest itinerary? And could we do a cool gif with the results? These are the problems we solve here.
Getting the chaptersTo find all chapters, I’ll use Meetup information about meetups whose topics include “rladies”, although it means forgetting a few chapters that maybe haven’t updated their topics yet. Thus, I’ll scrape this webpage because I’m too impatient to wait for the cool meetupr package to include the Meetup API topic endpoint and because I’m too lazy to include it myself. I did open an issue though. Besides, I was allowed to scrape the page:
robotstxt::paths_allowed("https://www.meetup.com/topics/") ## [1] TRUEYesss. So let’s scrape!
library("rvest") link < "https://www.meetup.com/topics/rladies/all/" page_content < read_html(link) css < 'span[class="textsecondary textsmall chunk"]' chapters < html_nodes(page_content, css) %>% html_text(trim = TRUE) chapters < stringr::str_replace(chapters, ".*\\", "") chapters < trimws(chapters) head(chapters) ## [1] "London, United Kingdom" "San Francisco, CA" ## [3] "Istanbul, Turkey" "Melbourne, Australia" ## [5] "New York, NY" "Madrid, Spain" # Montenegro chapters[chapters == "HN\\, Montenegro"] < "Herceg Novi, Montenegro" Geolocating the chaptersHere I decided to use a nifty package to the awesome OpenCage API. Ok, this is my own package. But hey it’s really a good geocoding API. And the package was reviewed for rOpenSci by Julia Silge! In the docs of the package you’ll see how to save your API key in order not to have to input it as a function parameter every time.
Given that there are many chapters but not that many (41 to be exact), I could inspect the results and check them.
geolocate_chapter < function(chapter){ # query the API results < opencage::opencage_forward(chapter)$results # deal with Strasbourg if(chapter == "Strasbourg, France"){ results < dplyr::filter(results, components.city == "Strasbourg") } # get a CITY results < dplyr::filter(results, components._type == "city") # sort the results by confidence score results < dplyr::arrange(results, desc(confidence)) # choose the first line among those with highest confidence score results < results[1,] # return only long and lat tibble::tibble(long = results$geometry.lng, lat = results$geometry.lat, chapter = chapter, formatted = results$formatted) } chapters_df < purrr::map_df(chapters, geolocate_chapter) # add an index variable chapters_df < dplyr::mutate(chapters_df, id = 1:nrow(chapters_df)) knitr::kable(chapters_df[1:10,]) long lat chapter formatted id 0.1276473 51.50732 London, United Kingdom London, United Kingdom 1 122.4192362 37.77928 San Francisco, CA San Francisco, San Francisco City and County, California, United States of America 2 28.9651646 41.00963 Istanbul, Turkey Istanbul, Fatih, Turkey 3 144.9631608 37.81422 Melbourne, Australia Melbourne VIC 3000, Australia 4 73.9865811 40.73060 New York, NY New York City, United States of America 5 3.7652699 40.32819 Madrid, Spain Leganés, Community of Madrid, Spain 6 77.0366455 38.89495 Washington, DC Washington, District of Columbia, United States of America 7 83.0007064 39.96226 Columbus, OH Columbus, Franklin County, Ohio, United States of America 8 71.0595677 42.36048 Boston, MA Boston, Suffolk, Massachusetts, United States of America 9 79.0232050 35.85030 Durham, NC Durham, NC, United States of America 10 Planning the tripI wanted to use the ompr package inspired by this fantastic use case, “Boris Johnson’s fully global itinerary of apology” – be careful, the code of this use case is slightly outdated but is uptodate in the traveling salesperson vignette. The ompr package supports modeling and solving Mixed Integer Linear Programs. I got a not so bad notion of what this means by looking at this collection of use cases. Sadly, the traveling salesperson problem is a complicated problem and its solving time exponentially increases with the number of stops… in that case, it became really too long for plain mixed integer linear programming, as in “more than 24 hours later not done” too long.
Therefore, I decided to use a specific R package for traveling salesperson problems TSP. Dirk, ompr’s maintainer, actually used it once as seen in this gist and then in this newspaper piece about how to go to all 78 Berlin museums during the night of the museums. Quite cool!
We first need to compute the distance between chapters. In kilometers and rounded since it’s enough precision.
convert_to_km < function(x){ round(x/1000) } distance < geosphere::distm(as.matrix(dplyr::select(chapters_df, long, lat)), fun = geosphere::distGeo) %>% convert_to_km()I used methods that do not find the optimal tour. This means that probably my solution isn’t the best one, but let’s say it’s ok for this time. Otherwise, the best thing is to ask Concorde’s maintainer if one can use their algorithm which is the best one out there, see its terms of use here.
library("TSP") set.seed(42) result0 < solve_TSP(TSP(distance), method = "nearest_insertion") result < solve_TSP(TSP(distance), method = "two_opt", control = list(tour = result0))And here is how to link the solution to our initial chapters data.frame.
paths < tibble::tibble(from = chapters_df$chapter[as.integer(result)], to = chapters_df$chapter[c(as.integer(result)[2:41], as.integer(result)[1])], trip_id = 1:41) paths < tidyr::gather(paths, "property", "chapter", 1:2) paths < dplyr::left_join(paths, chapters_df, by = "chapter") knitr::kable(paths[1:3,]) trip_id property chapter long lat formatted id 1 from Charlottesville, VA 78.55676 38.08766 Charlottesville, VA, United States of America 38 2 from Washington, DC 77.03665 38.89495 Washington, District of Columbia, United States of America 7 3 from New York, NY 73.98658 40.73060 New York City, United States of America 5 Plotting the tour, boring versionI’ll start by plotting the trips as it is done in the vignette, i.e. in a static way. Note: I used Dirk’s code in the Boris Johnson use case for the map, and had to use a particular branch of ggalt to get coord_proj working.
library("ggplot2") library("ggalt") library("ggthemes") library("ggmap") world < map_data("world") %>% dplyr::filter(region != "Antarctica") ggplot(data = paths, aes(long, lat)) + geom_map(data = world, map = world, aes(long, lat, map_id = region), fill = "white", color = "darkgrey", alpha = 0.8, size = 0.2) + geom_path(aes(group = trip_id), color = "#88398A") + geom_point(data = chapters_df, color = "#88398A", size = 0.8) + theme_map(base_size =20) + coord_proj("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") + ggtitle("RLadies global tour", subtitle = paste0(tour_length(result), " km"))Dirk told me the map would look better with great circles instead of straight lines so I googled a bit around, asked for help on Twitter before finding this post.
library("geosphere") # find points on great circles between chapters gc_routes < gcIntermediate(paths[1:length(chapters), c("long", "lat")], paths[(length(chapters)+1):(2*length(chapters)), c("long", "lat")], n = 360, addStartEnd = TRUE, sp = TRUE, breakAtDateLine = TRUE) gc_routes < SpatialLinesDataFrame(gc_routes, data.frame(id = paths$id, stringsAsFactors = FALSE)) gc_routes_df < fortify(gc_routes) p < ggplot() + geom_map(data = world, map = world, aes(long, lat, map_id = region), fill = "white", color = "darkgrey", alpha = 0.8, size = 0.2) + geom_path(data = gc_routes_df, aes(long, lat, group = group), alpha = 0.5, color = "#88398A") + geom_point(data = chapters_df, color = "#88398A", size = 0.8, aes(long, lat)) + theme_map(base_size =20)+ coord_proj("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") p + ggtitle("RLadies global tour", subtitle = paste0(tour_length(result), " km"))Ok this is nicer, it was worth the search.
Plotting the tour, magical versionAnd now I’ll use magick because I want to add a small star flying around the world. By the way if this global tour were to happen I reckon that one would need to donate a lot of money to rainforest charities or the like, because it’d have a huge carbon footprint! Too bad really, I don’t want my gif to promote planet destroying behaviours.
To make the gif I used code similar to the one shared in this post but in a better version thanks to Jeroen who told me to read the vignette again. Not saving PNGs saves time!
I first wanted to really show the emoji flying along the route and even created data for that, with a number of rows between chapters proportional to the distance between them. It’d have looked nice and smooth. But making a gif with hundreds of frames ended up being too long for me at the moment. So I came up with another idea, I’ll have to hope you like it!
library("emojifont") load.emojifont('OpenSansEmoji.ttf') library("magick") plot_one_moment < function(chapter, size, p, chapters_df){ print(p + ggtitle(paste0("RLadies global tour, ", chapters_df[chapters_df$chapter == chapter,]$chapter), subtitle = paste0(tour_length(result), " km"))+ geom_text(data = chapters_df[chapters_df$chapter == chapter,], aes(x = long, y = lat, label = emoji("star2")), family="OpenSansEmoji", size = size)) } img < image_graph(1000, 800, res = 96) out < purrr::walk2(rep(chapters[as.integer(result)], each = 2), rep(c(5, 10), length = length(chapters)*2), p = p, plot_one_moment, chapters_df = chapters_df) dev.off() ## png ## 2 image_animate(img, fps=1) %>% image_write("rladiesglobal.gif")At least I made a twinkling star. I hope Hannah will be happy with the gif, because now I’d like to just dream of potential future trips! Or learn a bit of geography by looking at the gif.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Maëlle. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Checking residual distributions for nonnormal GLMs
(This article was first published on Bluecology blog, and kindly contributed to Rbloggers)
Checking residual distributions for nonnormal GLMs Quantilequantile plotsIf you are fitting a linear regression with Gaussian (normally
distributed) errors, then one of the standard checks is to make sure the
residuals are approximately normally distributed.
It is a good idea to do these checks for nonnormal GLMs too, to make
sure your residuals approximate the model’s assumption.
Here I explain how to create quantilequantile plots for nonnormal
data, using an example of fitting a GLM using Studentt distributed
errors. Such models can be appropriate when the residuals are
overdispersed.
First let’s create some data. We will make a linear predictor (ie the
true regression line) eta and then simulate some data by adding
residuals. We will simulate two datasets that have the same linear
predictor, but the first will have normally distributed errors and the
second will have t distributed errors:
Notice how the tdistributed data are more spread out. The df
parameter, here named nu=2.5 controls how dispersed the data are.
Lower values will give data that are more dispersed, large values
approach a normal distribution.
Now let’s fit a Gaussian glm (just a linear regression really) to both
these datasets
We should check whether the two models meet the normal assumption, using
the standard ‘qq’ (quantilequantile) plot:
These plots compare the theoretical quantiles to the actual quantiles of
the residuals. If the points fall on the straight line then the
theoretical and realised are very similar, and the assumption is met.
Clearly this is not the case for the second model, which is
overdispersed.
We know it is overdispersed because the theoretical quantiles are much
smaller than the actual at the tails (notice how the ends down then up).
The pvalues (or CIs if you use them) for m1_t are therefore likely
biased and too narrow, leading potentially to type I errors (us saying
that x affects y, which in fact it does not). In this case we know we
haven’t made a type I error, because we made up the data. However, if
you were using real data you wouldn’t be so sure.
To better understand the QQ plot it helps to generate it yourself,
rather than using R’s automatic checks.
First we calculate the model residuals (in plot(m1_t) R did this
internally):
Then we can plot the quantiles for the residuals against theoretical
quantiles generated using qnorm. Below we also plot the original QQ
plot from above, so you can see that our version is the same as R’s
automatic one:
I added the qqline for comparative purposes. It just puts a line
through the 25th and 75th quantiles.
Now we have learned how to write our own custom for a QQ plot, we can
use it to check other types of nonnormal data.
Here we will fit a GLM to the y_tdist data using studentt distributed
errors. I do this using the Bayesian package INLA.
The family ="T" command tells INLA to use the tdistribution, rather
than the Normal distribution. Note also I have specified the priors
using the control.family command. This is best practice. We need a
prior for the precision (1/variance) and a prior for the dof (=
degrees of freedom, which has to be >2 in INLA).
It is sometimes help to visualise the priors, so we can check too see
they look sensible. Here we visualise the prior for the dof, (which in
INLA has a min value of 2):
We don’t really expect values much greater than 10, so this prior makes
sense. If we used an oldschool prior that was flat in 21000 we might
get issues with model fitting.
Now enough about priors. Let’s look at the estimated coefficients:
mod_tdist$summary.fixed ## mean sd 0.025quant 0.5quant 0.975quant mode ## (Intercept) 0.5324814 0.07927198 0.3773399 0.5321649 0.6891779 0.5315490 ## x 0.7229362 0.08301006 0.5565746 0.7239544 0.8835630 0.7259817 ## kld ## (Intercept) 3.067485e12 ## x 6.557565e12Good the CIs contain our true values, and the mean is close to our true
value too.
What about the hyperparameters (the precision and DF)? We need to get
INLA to run some more calucations to get accurate estimates of these:
The estimate for the DF might be a ways off the mark. That is ok, we
expect that, you need lots of really good data to get accurate estimates
of hyperparameters.
Now, let’s use our skills in creating QQ plots to make QQ plot using
theoretical quantiles from the t distribution.
First step is to extract INLA’s predictions of the data, so we can
calculate residuals
Next step is to extract the marginal estimates of the DF and precision
to use when generating our QQ plot (the quantiles will change with the
DF):
Now we can use qt() to generate theoretical quantiles and the
residuals for our realised quantiles:
Note that I multiply the residuals by the sqrt of the precision
estimate. This is how INLA fits a tdistributed
GLM.
I do the same for the qqline.
Our residuals are now falling much closer to the line. The model is
doing a much better job of fitting the data. You could also calculate
the WAIC for this model and a Gaussian one, to compare the fits. The
tdistributed GLM should have a lower WAIC (better fit).
We can now be confident that our CIs are accurate.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
The BayesianTools R package with generalpurpose MCMC and SMC samplers for Bayesian statistics
(This article was first published on Submitted to Rbloggers – theoretical ecology, and kindly contributed to Rbloggers)
This is a somewhat belated introduction of a package that we published on CRAN at the beginning of the year already, but I hadn’t found the time to blog about this earlier. In the R environment and beyond, a large number of packages exist that estimate posterior distributions via MCMC sampling, either for specific statistical models (e.g.…
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Submitted to Rbloggers – theoretical ecology. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...