R-bloggers
R news and tutorials contributed by hundreds of R bloggers

NFL Series

Sun, 10/08/2017 - 02:00

(This article was first published on StatOfMind, and kindly contributed to R-bloggers)

If you have previously attempted to analyze NFL data, it is likely that you have tried to scrape ESPN or football-reference, both of which provide a wealth of statistics surrounding game data. However, if you ever wanted truly in-depth data, then you probably found yourself leveraging the API maintained by NFL.com. Unfortunately, its data is surfaced in a JSON format that leaves a lot to be desired (i.e. it's a nightmare). Luckily for me, I was recently scrolling through my Twitter feed and came across an interesting mention of a new R package called nflscrapR. After some exploration, I came across this quote from the author of the package:

NFL fans and sports enthusiasts alike, I would like to introduce the nflscrapR package, an American football data aggregator that will scrape, clean, and parse play-by-play data across games, seasons, and careers.

The nflscrapR package essentially surfaces all play-by-play data for the last 7 seasons, and this has motivated me to start a deep dive on NFL data. I plan to cover two main topics: one that focuses on players at specific positions, and another focusing on team dynamics and patterns. To kick this series off, we will begin with an exploration into the performance of the NFL's best running backs.

1. Prerequisites

In order to reproduce the figures below, you will need to have R (v3.2.3 or above) installed on your machine. A couple of additional libraries are also required; details on how to install them are shown in the commands below.

# install.packages('devtools')
library(devtools)
#> Skipping install for github remote, the SHA1 (05815ef8) has not changed since last install.
#> Use `force = TRUE` to force installation
devtools::install_github(repo = "maksimhorowitz/nflscrapR", force=TRUE)

2. Collecting the Data

With the nflscrapR library now installed, you are ready to collect play-by-play data for the 2016-2017 NFL season. Start by loading the required libraries and collecting the data with the commands below:

# Load the package
library(nflscrapR)
library(ggplot2)
library(dplyr)
library(pheatmap)

# Collect data for 2016 NFL season
pbp_2016 <- season_play_by_play(2016)

Overall, the pbp_2016 data frame contains 100 fields for each of its 45,737 plays, but for the purpose of this post we will focus exclusively on fields related to running backs (in future posts, we will explore data relevant to other positions on the football field). In addition, we'll focus primarily on frequently used running backs, which we empirically define as any player with at least 200 rushes over the course of the 2016-2017 season.

# Get all players with at least 200 rushes during the season
min_rush_cnt <- 200
rush_cnt <- pbp_2016 %>%
  filter(PlayType == 'Run') %>%
  group_by(Rusher) %>%
  summarise(rush_cnt = n(),
            total_yards = sum(Yards.Gained),
            mean_yards = round(mean(Yards.Gained), 2)) %>%
  filter(rush_cnt >= min_rush_cnt) %>%
  arrange(desc(rush_cnt))

# Get all rushing data for eligible players
rushing_stats <- pbp_2016 %>%
  filter(PlayType == 'Run' & Rusher %in% rush_cnt$Rusher & Yards.Gained <= 50) %>%
  filter(down != 4 & !is.na(down)) %>%
  filter(!is.na(RunLocation))

Altogether, we find that a total of 19 players rushed at least 200 times during the 2016-2017 season. A short summary of their performance is shown below.
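The summary table itself does not come through in this feed, but a quick way to print it from the rush_cnt data frame built above (column names as defined in that chunk) is:

# Print the per-rusher summary built above
as.data.frame(rush_cnt)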



3. Who are the most consistent and productive running backs?

When talking about the overall performance of running backs, it is common for people to report the total number of yards rushed for, or the average yards per run. While these are perfectly acceptable numbers to share, I've always felt that they do not tell the whole story. For example, a player could have a high average yards per run, only for us to realize that he actually often loses yards on a run but makes up for it with a few very long runs. Therefore, I started by looking at the overall distribution of the number of yards gained/lost on each play, with the hope that this would reveal whether some players are more consistent on a play-by-play basis than others. We can use the ggplot2 library (together with the ggjoy extension) to generate a density plot of yards gained per play for each of our eligible players:

# Compare distribution of rushes for eligible players
library(ggjoy)  # provides geom_joy() and theme_joy()
ggplot(rushing_stats, aes(x = Yards.Gained, y = Rusher, fill = Rusher)) +
  geom_joy(scale = 3) +
  theme_joy() +
  scale_fill_manual(values = rep(c('gray', 'lightblue'), length(rushing_stats$Rusher)/2)) +
  scale_y_discrete(expand = c(0.01, 0)) +
  scale_x_continuous(expand = c(0, 0)) +
  theme(legend.position = "none") +
  labs(x = "Yards gained per play", y = "")

Overall, we see that most running backs have a similar distribution of yards gained per run. However, LeSean McCoy (7th distribution from the top) has a much "flatter" distribution (i.e. more variance in the yards gained per run), meaning his performance can be a lot more variable/unpredictable on any given run.
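One simple way to quantify this spread, rather than eyeballing the density plots, is to compute the standard deviation of yards gained per carry for each eligible rusher; a minimal sketch using the rushing_stats data frame from above:

# Rank rushers by the variability of their per-carry yardage
rushing_stats %>%
  group_by(Rusher) %>%
  summarise(mean_yards = mean(Yards.Gained),
            sd_yards = sd(Yards.Gained)) %>%
  arrange(desc(sd_yards))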

4. When are running backs used?

Another statement that is commonly made is that running backs are primarily used on early downs. To verify whether this is generally true, we can compute the total number of runs that each player made across different downs, and go even further by breaking this down by quarter too. The code chunk below counts the number of runs that each running back made for each pair of down and quarter.

# Compare when rushers are used
usage_stats <- pbp_2016 %>%
  filter(!is.na(down) & Rusher %in% rush_cnt$Rusher & qtr != 5) %>%
  group_by(Rusher, down, qtr) %>%
  summarise(cnt = n()) %>%
  mutate(qtr_down = paste("Q", qtr, "- Down: ", down, sep = ""))

We can then leverage the d3heatmap package to quickly generate a simple heatmap of how often running backs are used during specific downs and quarters.

library(d3heatmap)
library(reshape2)     # for dcast()
library(htmlwidgets)  # for saveWidget()

# pivot dataframe
usage <- usage_stats %>% dcast(Rusher ~ qtr_down, value.var = "cnt")

# clean data
row.names(usage) <- usage$Rusher
usage <- usage %>% select(-Rusher)
usage[is.na(usage)] <- 0

# normalize data
usage_norm <- t(apply(usage, 1, function(x) x/sum(x)))

# Plot heatmap of proportions of rushes by quarter and down
p <- d3heatmap(usage_norm, colors = "Blues", Colv = FALSE, show_grid = 3)
saveWidget(p, file = "rusher_usage_down_quarter.html")

Proportion of rushes per quarter and down for the NFL's best running backs


In the plot above, we are essentially plotting the usage of each running back as a function of the stage of the game. It is abundantly clear that running backs are primarily used on the first two downs, and rarely on third and fourth downs. Overall, there do not appear to be significant differences in how players are used. However, this does not tell us whether some running backs perform better on certain downs, which is what we will address next.

5. Are some running backs better on certain downs?

Another question we can ask ourselves is whether some running backs perform better on later downs. To visualize this, we can again generate a density plot of yards gained per play for each of our eligible players, while also faceting the data by down.

# Compare distribution of rushes by downs
ggplot(rushing_stats, aes(x = Yards.Gained, y = down)) +
  geom_joy(scale = 1, rel_min_height = .03, fill = 'black') +
  scale_y_discrete(expand = c(0.01, 0)) +
  xlab('Value') +
  facet_wrap(~Rusher, scales = 'free', ncol = 3) +
  theme_joy() +
  theme(axis.title.y = element_blank()) +
  labs(x = "Yards gained per play", y = "Down")

Again, we do not see any striking differences between players in the distribution of yards gained by down. However, it is interesting to note that most "long runs" (10 yards or more) tend to occur on first down. When we look closely, we can also see that some rushers, such as DeMarco Murray, do exhibit visual differences in yards gained across downs. In his case, the "mass" of yards gained on third down is much closer to zero than the "mass" for first and second downs, which suggests that he may struggle on that down (this could be attributable to many factors: stamina, a weaker offensive line on 3rd downs, etc.).
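As a rough numerical check of this pattern, the mean yards gained per down can be tabulated for each eligible rusher (this reuses the dcast() helper from the reshape2 package loaded earlier):

# Mean yards gained per down for each eligible rusher (one column per down)
rushing_stats %>%
  group_by(Rusher, down) %>%
  summarise(mean_yards = round(mean(Yards.Gained), 2)) %>%
  dcast(Rusher ~ down, value.var = "mean_yards")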

6. Where do the best running backs run?

It is fairly well accepted that the performance of a running back will be heavily influenced by the strength of the offensive line in front of them. With that in mind, let’s start by looking at the field location in which different running backs prefer to run. The plot below shows the number of yards gained by each running back based on which side of the field they ran towards (left, middle or right).

ggplot(data = rushing_stats, aes(x = RunLocation, y = Yards.Gained, color = RunLocation)) +
  geom_jitter(position = position_jitter(0.2)) +
  stat_summary(fun.data = mean_sdl, mult = 1, geom = "pointrange", color = "black") +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal() +
  facet_wrap(~Rusher)

We can take this one step further by computing, for each player, the proportion of rushes toward each field location. This can be achieved by generating a matrix that contains the proportion of rushes by field location for each player.

# Get proportions of rushes on different field locations
rush_locations <- rushing_stats %>%
  filter(PlayType == 'Run') %>%
  filter(!is.na(RunLocation)) %>%
  group_by(Rusher, RunLocation) %>%
  summarise(rush_cnt = n()) %>%
  mutate(freq = rush_cnt / sum(rush_cnt))

loc_mat <- rush_locations %>% dcast(Rusher ~ RunLocation, value.var = "freq")
row.names(loc_mat) <- loc_mat$Rusher
loc_mat <- loc_mat %>% select(-Rusher)

The loc_mat matrix contains the preferred rush locations of each running back, and can be plotted as a clustered heatmap using the pheatmap library in R.

# Plot heatmap of proportions of rushes by different field locations
library(RColorBrewer)  # for brewer.pal()
pheatmap(loc_mat, border = "white", color = brewer.pal(9, "Blues"), cluster_cols = FALSE)

The plot above highlights which running backs are most similar in their run locations. Rushers such as J. Ajayi, E. Elliot and M. Ingram clearly avoid running up the middle of the field. On the flip side, rushers such as M. Gordon, D. Johnson, S. Ware and F. Gore clearly prefer running down the middle rather than to the sides. These patterns could be attributed to the running style of each rusher (speed, mobility, strength, etc.), but also to the strength of the offensive line at particular positions.

7. Who creates the gaps for the running backs?

We can also explore the number of yards gained by each running back based on the offensive line positions that created space for them.

ggplot(data = rushing_stats %>% filter(!is.na(RunGap)),
       aes(x = RunGap, y = Yards.Gained, color = RunGap)) +
  geom_jitter(position = position_jitter(0.2)) +
  stat_summary(fun.data = mean_sdl, mult = 1, geom = "pointrange", color = "black") +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal() +
  facet_wrap(~Rusher)

The proportion of run opportunities that were enabled by each offensive line position can also be summarized in a matrix using the commands below.

# Get proportions of gaps created by different offensive line positions
rush_gaps <- rushing_stats %>%
  filter(!is.na(RunGap)) %>%
  group_by(Rusher, RunGap) %>%
  summarise(rush_cnt = n()) %>%
  mutate(freq = rush_cnt / sum(rush_cnt))

gap_mat <- rush_gaps %>% dcast(Rusher ~ RunGap, value.var = "freq")
row.names(gap_mat) <- gap_mat$Rusher
gap_mat <- gap_mat %>% select(-Rusher)

# Plot heatmap of proportions of rushes by different field gaps
pheatmap(gap_mat, border = "white", color = brewer.pal(9, "Blues"), cluster_cols = FALSE)

Again, we see many differences among the league's top running backs. Unsurprisingly, a number of players have most of their run opportunities created by the guard position, but a few players such as F. Gore, L. McCoy and D. Johnson run through gaps created by the tackle position. Finally, S. Ware of the Kansas City Chiefs often runs through gaps created by tight ends, which may be more representative of the team's formations.

Conclusion

In this introductory post, we have explored the performance and patterns of some of the NFL's best running backs. Overall, it was a fairly superficial analysis, as it never considered interactions with other components of the team, or temporal patterns, but it does show the depth and power of this data. In the next post of this series, I will dive into the performance and behavior of wide receivers, so stay tuned!



Linear / Logistic Regression in R: Dealing With Unknown Factor Levels in Test Data

Sat, 10/07/2017 - 21:11

(This article was first published on R – Discovering Python & R, and kindly contributed to R-bloggers)

Let’s say you have data containing a categorical variable with 50 levels. When you divide the data into train and test sets, chances are you don’t have all 50 levels featuring in your training set.

This often happens when you divide the data set into train and test sets according to the distribution of the outcome variable. In doing so, chances are that your explanatory categorical variable is not distributed exactly the same way in the train and test sets – so much so that certain levels of this categorical variable are missing from the training set. The more levels a categorical variable has, the harder it becomes for that variable to be similarly represented when splitting the data.

Take for instance this example data set (train.csv + test.csv) which contains a categorical variable var_b that takes 349 unique levels. Our train data has 334 of these levels – on which the model is built – and hence 15 levels are excluded from our trained model. If you try making predictions on the test set with this model in R, it throws an error:
factor var_b has new levels 16060, 17300, 17980, 19060, 21420, 21820,
25220, 29340, 30300, 33260, 34100, 38340, 39660, 44300, 45460

If you’ve used R to model generalized linear class of models such as linear, logit or probit models, then chances are you’ve come across this problem – especially when you’re validating your trained model on test data.

The workaround to this problem comes in the form of a function, remove_missing_levels, that I found here, written by pat-s. You need the magrittr library installed, and it only works on lm, glm and glmmPQL objects.

Once you’ve sourced the above function in R, you can seamlessly proceed with using your trained model to make predictions on the test set. The code below demonstrates this for the data set shared above. You can find these codes in one of my github repos and try it out yourself.




Data Visualization Course for First-Year Students

Sat, 10/07/2017 - 17:26

(This article was first published on R Project – Citizen-Statistician, and kindly contributed to R-bloggers)

A little over a year ago, we decided to propose a data visualization course at the first-year level. We had been thinking about this for a while, but never had the time to teach it given our scheduling constraints. When one of the other departments on campus was shut down and its faculty merged into other departments, we felt that the time was ripe to make this proposal.

Course description of the EPsy 1261 data visualization course

In putting together the proposal, we knew that:

  • The course would be primarily composed of social science students. My department, Educational Psychology, attracts students from the College of Education and Human Development (e.g., Child Psychology, Social Work, Family Social Science).
  • To attract students, it would be helpful if the course would fulfill the University’s Liberal Education (LE) requirement for Mathematical Thinking.

This led to several challenges and long discussions about the curriculum for this course. For example:

  • Should the class focus on producing data visualizations (very exciting for the students) or on understanding/interpreting existing visualizations (useful for most social science students)?
  • If we were going to produce data visualizations, which software tool would we use? Could this level of student handle R?
  • In order to meet the LE requirement, the curriculum for the course would need to show a rigorous treatment of students actually “doing” mathematics. How could we do this?
  • Which types of visualizations would we include in the course?
  • Would we use a textbook? How might this inform the content of the course?
Software and Content

After several conversations among the teaching team, with stakeholder departments, and with colleagues teaching data visualization courses at other universities, we eventually proposed that the course:

  • Focus both on students’ being able to read and understand existing visualizations and produce a subset of these visualizations, and
  • Use R (primary tool) and RAWGraphs for the production of these plots.
Software: Use ggplot2 in R

The choice to use R was not an immediate one. We initially looked at using Tableau, but the default choices made by the software (e.g., to immediately plot summaries rather than raw data) and the cost for students after matriculating from the course eventually sealed its fate (we don’t use it). We contemplated using Excel for a minute (gasp!), but we vetoed that even quicker than Tableau. The RAWGraphs website, we felt, held a lot of promise as a software tool for the course. It had an intuitive drag-and-drop interface, and could be used to create many of the plots we wanted students to produce. Unfortunately, we were not able to get the bar graph widget to produce side-by-side bar plots easily (actually at all). The other drawback was that the drag-and-drop interactions made it a harder sell to the LE committee as a method of building students’ computational and mathematical thinking if we used it as the primary tool.

Once we settled on using R, we had to decide between the suite of base plots and ggplot2 (lattice was not in the running). We decided that ggplot2 made the most sense in terms of extensibility. Its syntax is based on a theoretical foundation for creating and thinking about plots, which also made it a natural choice for a data visualization course. The idea of mapping variables to aesthetics was also consistent with the language used in RAWGraphs, so it helped reinforce core ideas across the tools. Lastly, we felt that using the ggplot syntax would also help students transition to other tools (such as ggvis or plotly) more easily.
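To make the "mapping variables to aesthetics" idea concrete, here is a minimal ggplot2 example (the data frame is hypothetical, just for illustration):

library(ggplot2)

# Hypothetical enrollment counts by college
enroll <- data.frame(college = c("CEHD", "CLA", "CSE"),
                     students = c(18, 15, 11))

# Map college to the x position and students to bar height
ggplot(data = enroll, mapping = aes(x = college, y = students)) +
  geom_col()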

One thing the teaching team completely agreed on (and which was mentioned by almost everyone we talked to who taught data visualization) was that we wanted students to be producing graphs very early in the course, giving them a sense of power and the reinforcement that they could be successful. We felt this might be difficult for students with the ggplot syntax. To ameliorate this, we wrote a course-specific R package (epsy1261; available on GitHub) that allows students to create a few simple plots interactively by employing functionality from the manipulate package. (We could have also done this via Shiny, but I am not as well-versed in Shiny and only had a few hours to devote to this over the summer given other responsibilities.)

Interactive creation of the bar chart using the epsy1261 package. This allows students to input minimal syntax, barchart(data), and then use interaction to create plots.

Course Content

We decided on a three-pronged approach to the course content. The first prong would be based on the production of common statistical plots: bar charts, scatterplots, and maps, and some variations of these (e.g., donut plots, treemaps, bubble charts). The second prong was focused on reading more complex plots (e.g., networks, alluvial plots), but not producing them, except maybe by hand. The third prong was a group project. This would give students a chance to use what they had learned, and also, perhaps, access other plots we had not covered. In addition, we wanted students to consider narrative in the presentation of these plots—to tell a data-driven story.

Along with this, we had hoped to introduce students to computational skills such as data summarization, tidying, and joining data sets. We also wanted to introduce concepts such as smoothing (especially for helping describe trend in scatterplots), color choice, and projection and coordinate systems (in maps). Other things we thought about were using R Markdown and data scraping.

Reality

The reality, as we are finding now that we are over a third of the way through the course, is that this amount of content was over-ambitious. We grossly under-estimated the amount of practice time these students would need, especially working with R. Two things play a role in this:

  1. The course attracted many more students than we expected for the first offering (our class size is 44), and there is a lot of heterogeneity in students' experiences and academic backgrounds. For example, we have graduate students from the School of Design, some first-years, and mostly sophomores and juniors. We also have a variety of majors including design, the social sciences, and computer science.
  2. We hypothesize that students are not practicing much outside of class. This means they are essentially only using R twice a week for 75 minutes when they are in class. This amount of practice is too infrequent for students to really learn the syntax.

Most of the students’ computational experiences are minimal prior to taking this course. They are very competent at using point-and-click software (e.g., Google Docs), but have an abundance of trouble when forced to use syntax. The precision of case-sensitivity, commas, and parentheses is outside their wheelhouse.

I would go so far as to say that several of these students are intimidated by the computation, and completely panic on facing an error message. This has led to us having to really think through and spend time discussing computational workflows and dealing with how to “de-bug” syntax to find errors. All of this has added more time than we anticipated on the actual computing. (While this may add time, it is still educationally useful for these students.)

The teaching team meets weekly for 90 minutes to discuss and reflect on what happened in the course. We also plan what will happen in the upcoming week based on what we observed and what we see in students’ homework. As of now, we clearly see that students need more practice, and we have begun giving students the end result of a plot and asking them to re-create these.

I am still hoping to get to scatterplots and maps in the course. However, some of the other computational ideas (scraping, joining) may have to be relegated to conceptual ideas in a reading. We are also considering scrapping the project, at least for this semester. At the very least, we will change it to a more structured set of plots they need to produce rather than letting them choose the data sets, etc. Live and learn. Next time we offer the course it will be better.

*Technology note: RAWGraphs can be adapted by designing additional chart types, so in theory, if one had time, we could write our own version to be more compatible with the course. We are also considering using the ggplotgui package, which is a Shiny dashboard for creating ggplot plots.

 

 



From Static to Numeric to Single-Choice Exercises in R/exams

Sat, 10/07/2017 - 00:00

(This article was first published on R/exams, and kindly contributed to R-bloggers)

Idea

Our colleagues over at the economics department became interested in using R/exams for generating large-scale exams in their introductory economics courses. However, they face the challenge that so far they had been writing static exercises and modifying them by hand if they wanted to reuse them in a different exam in another semester. To let R/exams do this job, we illustrate how a static arithmetic exercise can be turned into a dynamic exercise template, either in num format with a numeric solution or in schoice format with a single-choice solution. The idea for the exercise is a very basic price elasticity of demand task:

Consider the following inverse demand function: p(x) = a – b · x for the price p given the demanded quantity x. What is the price elasticity of demand at a price of p?

The natural candidates for "parameterizing" this exercise are the price p and the parameters a and b of the inverse demand function. Based on these, the solution is simply:

First, we obtain the demand function by inverting the inverse demand function: x = D(p) = (a – p)/b. Then, at p the price elasticity of demand is D'(p) · p / D(p) = –1/b · p/x.

Overview

Below various incarnations of this exercise are provided in both R/Markdown Rmd and R/LaTeX Rnw format.
The following table gives a brief overview of all available versions along with a short description of the idea behind it.
More detailed explanations are provided in the subsequent sections.

  1. elasticity1.Rmd / elasticity1.Rnw (static, num): Fixed parameters and numeric solution.
  2. elasticity2.Rmd / elasticity2.Rnw (static, schoice): As in #1 but with single-choice solution (five answer alternatives).
  3. elasticity3.Rmd / elasticity3.Rnw (dynamic, num): Randomly drawn parameters with dynamic computation of the correct solution, based on #1.
  4. elasticity4.Rmd / elasticity4.Rnw (dynamic, schoice): Randomly drawn parameters (as in #3) with a dynamically-generated single-choice solution (as in #2), computed by num_to_schoice().
  5. elasticity5.Rmd / elasticity5.Rnw (dynamic, schoice): As in #4 but with the last alternative: None of the above.

Static numeric

The starting point is a completely static exercise as it had been used in a previous introductory economics exam. The parameters had been set to p = 5, a = 50, and b = 0.5. This implies that x = 90, leading to an elasticity of –0.111111.
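As a quick sanity check of these numbers, the static solution can be computed directly in R:

a <- 50; b <- 0.5; p <- 5
x <- (a - p) / b         # demanded quantity: 90
sol <- -1/b * p/x        # price elasticity of demand
round(sol, 6)            # -0.111111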

The corresponding R/exams templates simply hard-code these numbers into the question/solution
and wrap everything into R/Markdown (elasticity1.Rmd)
or R/LaTeX (elasticity1.Rnw).
Note that LaTeX is used in either case for the mathematical notation. In case you are unfamiliar
with the R/exams format, please check out the First Steps tutorial.

The meta-information simply sets extype to num, supplies the exsolution with a precision of three digits,
and allows an extol tolerance of 0.01. To see what the result looks like download the files linked above
and run exams2html() and/or exams2pdf(). (The examples below always use the R/Markdown version but the
R/LaTeX version can be used in exactly the same way.)

library("exams") exams2html("elasticity1.Rmd") exams2pdf("elasticity1.Rmd") Static single-choice

Single-choice versions of exercises are often desired for use in written exams
because they can be conveniently scanned and automatically evaluated. Thus, we need to come up
with a number of incorrect alternative solutions (or “distractors”). If desired, these could include
typical wrong solutions or a None of the others alternative.

The R/exams templates
elasticity2.Rmd and
elasticity2.Rnw are
essentially copies of the static numeric exercise above but:

  1. Question/solution now contain an answerlist (with five alternatives).
  2. The extype has been changed to schoice.
  3. The exsolution now contains a binary coding of the correct solution.
  4. Furthermore, to obtain some basic randomization we have turned on shuffling by setting exshuffle
    to TRUE. (Subsampling more than five alternatives would also be possible and would add some further randomization.)

As above exams2html() and/or exams2pdf() can be used to display the exercise interactively in R/exams.
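For orientation, the meta-information block of such a single-choice exercise in the Rmd format looks roughly like the following sketch (the exsolution string here assumes that the first listed alternative is the correct one):

Meta-information
================
extype: schoice
exsolution: 10000
exshuffle: TRUE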

Dynamic numeric

Next, the static exercise from above is made dynamic by drawing the parameters from a suitable
data-generating process. In this case, the following works well:

## p = a - b * x
p <- sample(5:15, 1)
fctr <- sample(c(2, 4, 5, 10), 1)
x <- sample(5:15, 1) * fctr
b <- sample(1:5, 1) / fctr
a <- p + b * x
## elasticity
sol <- -1/b * p/x

Note that in order to obtain “nice” numbers a common scaling factor fctr is used for both x and b.
Also, while the examinees are presented with parameters a and b and have to compute x,
the data-generating process actually draws x and b and computes a from that. Again, this makes it
easier to obtain “nice” numbers.

The R/exams templates
elasticity3.Rmd and
elasticity3.Rnw include
the data-generating process above as a code chunk either in R/Markdown or R/LaTeX format.
The parameters are then inserted into question/solution/metainformation using `r a` (R/Markdown)
or \Sexpr{a} (R/LaTeX). Sometimes the fmt() function is used for formatting the parameters
with a desired number of digits. (See ?fmt for more details.)

As before exams2html() and/or exams2pdf() can be used to display the random draws from
the exercise templates. For checking that the meta-information is included correctly, it is often
helpful to run

exams_metainfo(exams2html("elasticity3.Rmd"))

Furthermore, some tweaking is usually required when calibrating the parameter ranges in the
data-generating process. The stresstest_exercise() function draws a large number of random replications
and thus can help to spot errors that occur or to find solutions that are “extreme” (e.g., much larger or smaller than usual). To clean up your
global environment and run the function use something like this:

rm(list = ls()) s <- stresstest_exercise("elasticity3.Rmd", n = 200) plot(s) plot(s, type = "solution")

The latter command plots the correct solution against the (scalar) parameters that were
generated in the exercise. This might show patterns for which parameters the solution becomes
too large or too small etc. See ?stresstest_exercise for further features/details.

Dynamic single-choice

To go from the dynamic numeric exercise to a dynamic single-choice exercise we need to
extend the data-generating process to also produce a number of wrong alternatives.
The function num_to_schoice() helps with this by providing different sampling mechanisms. It allows you to set a range within which the alternatives have to lie and a minimum distance between all alternatives, to possibly include typical wrong solutions, etc.
It also shuffles the resulting alternatives and tries to make sure that the correct
solution is not a certain order statistic (e.g., almost always the largest or
smallest alternative).

The R/exams templates
elasticity4.Rmd and
elasticity4.Rnw first
wrap the data-generating process into a while loop with while(sol > -0.11). This makes sure
that there is enough “space” for four wrong alternatives between -0.11 and 0, using a minimum
distance of 0.017. Subsequently, the four wrong alternatives are generated by:

## single-choice incl. typical errors
err <- c(1/sol, sol/p, p/sol)
err <- err[(err > -5) & (err < -0.2) & abs(err - sol) > 0.01]
rng <- c(min(1.5 * sol, -1), -0.01)
sc <- num_to_schoice(sol, wrong = err, range = rng, delta = 0.017, method = "delta", digits = 3)

This suggests a number of typical wrong solutions err (provided that they are not too small or too large) and makes sure that the range rng is large enough. With these arguments num_to_schoice() is run; see ?num_to_schoice for the details. The resulting sc list contains suitable $questions and $solutions that can easily be embedded in an answerlist() and in the meta-information. Sometimes it is useful to wrap num_to_schoice() in another while() loop to make sure a valid result is found. See the deriv2 template for illustration.

As before exams2html() and/or exams2pdf() can be used to display the random draws from
this exercise template. And some stress-testing could be carried out by:

rm(list = ls()) s <- stresstest_exercise("elasticity4.Rmd", n = 200) plot(s) plot(s, type = "ordering")

As a final variation we could include None of the above as the last alternative. This is
very easy because we can simply replace the fifth element of the question list with the corresponding
string:

sc$questions[5] <- "None of the above."

No matter whether this was actually the correct alternative or one of the incorrect alternatives, the
corresponding solution will stay the same. The R/exams templates
elasticity5.Rmd and
elasticity5.Rnw incorporate
this additional line of code in the data-generating process.



getSymbols and Alpha Vantage

Fri, 10/06/2017 - 23:12

(This article was first published on FOSS Trading, and kindly contributed to R-bloggers)

Thanks to Paul Teetor, getSymbols() can now import data from Alpha Vantage!  This feature is part of the quantmod 0.4-11 release, and provides another data source to help avoid any Yahoo Finance API changes*.

Alpha Vantage is a free web service that provides real-time and historical equity data.  They provide daily, weekly, and monthly history for both domestic and international markets, with up to 20 years of history. Dividend- and split-adjusted close prices are available for daily data. They also provide near real-time price bars at a resolution of 1 minute or more, for up to 10 recent days. All you need to get started is a one-time registration for an API key.  Alpha Vantage has a clean, documented, public API that returns either JSON-encoded data or a CSV file.  The arguments to getSymbols.av() closely follow the native API, so be sure to use their documentation! To get started, install the latest quantmod from CRAN.  Then you call:

getSymbols("MSFT", src = "av", api.key = "[your key]")

where you replace "[your key]" with the API key you receive after registration.  You can use setDefaults() to set your API key one time and use it for all getSymbols.av() calls:

setDefaults("getSymbols.av", api.key = "[your key]")

*Speaking of API changes, this release also includes a fix for a Yahoo Finance change (#174).


Stan Biweekly Roundup, 6 October 2017

Fri, 10/06/2017 - 22:40

(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

I missed last week and almost forgot to add this week’s.

  • Jonah Gabry returned from teaching a one-week course for a special EU research institute in Spain.

  • Mitzi Morris has been knocking out bug fixes for the parser and some pull requests to refactor the underlying type inference to clear the way for tuples, sparse matrices, and higher-order functions.

  • Michael Betancourt with help from Sean Talts spent last week teaching an intro course to physicists about Stan. Charles Margossian attended and said it went really well.

  • Ben Goodrich, in addition to handling a slew of RStan issues, has been diving into the math library to define derivatives for Bessel functions.

  • Aki Vehtari has put us in touch with the MxNet developers at Amazon UK and we had our first conference call with them to talk about adding sparse matrix functionality to Stan (Neil Lawrence is working there now).

  • Aki is also working on revising the EP as a way of life paper and finalizing other Stan-related papers.

  • Bob Carpenter and Andrew Gelman have recruited Advait Rajagopal to help us with the Coursera specialization we're going to offer (contingent on coming to an agreement with Columbia). The plan's to have four courses: Intro to BDA (Andrew), Stan (Bob), MCMC (Bob), and Regression and other stories (Andrew).

  • Ben Bales finished the revised pull request for vectorized RNGs. Turns out these things are much easier to write than they are to test thoroughly. Pesky problems with instantiations by integers and whatnot turn up.

  • Daniel Lee is getting ready for ACoP, which Bill Gillespie and Charles Margossian will also be presenting at.

  • Steven Bronder and Rok Češnovar, with some help from Daniel Lee, are going to merge the ViennaCL library for GPU matrix ops with their own specializations for derivatives in Stan into the math library. This is getting close to being real for users.

  • Sean Talts when he wasn’t teaching or learning physics has been refactoring the Jenkins test facilities. As our tests get bigger and we get more developers, it’s getting harder and harder to maintain stable continuous integration testing.

  • Breck Baldwin is taking over dealing with StanCon. Our goal is to get up to 150 registrations.

  • Breck Baldwin has also been working with Andrew Gelman and Jonathan Auerbach on non-conventional statistics training (like at Maker Faires); they have the beginnings of a paper. Breck highly recommends the math museum in NY to see how this kind of thing's done.

  • Bob Carpenter published a Wiki page on a Stan 3 model concept, which is probably what we’ll be going with going forward. It’s pretty much like what we have now with better const correctness and some better organized utility functions.

  • Imad Ali went to the New England Sports Stats conference. Expect to see more models of basketball using Stan soon.

  • Ben Goodrich fixed the problem with exception handling in RStan on some platforms (always a pain because it happened on Macs and he’s not a Mac user).

  • Advait Rajagopal has been working with Imad Ali on adding ARMA and ARIMA time-series functions to rstanarm.

  • Aki Vehtari enhanced the loo package with automated code for K-fold cross validation.

  • Lizzie Wolkovich visited us for a meeting (she’s on our NumFOCUS leadership body), where she reported that she and a postdoc have been working on calibrating Stan models for phenology (look it up).

  • Krzysztof Sakrejda has been working on proper standalone function generation for Rcpp. Turns out to be tricky with their namespace requirements, but I think we have it sorted out as of today.

  • Michael Andreae has kicked off his meta-analysis and graphics project at Penn State, with Jonah Gabry and Ben Goodrich chipping in.

  • Ben Goodrich also fixed the infrastructure for RStan so that multiple models may be supported more easily, which should make it much easier for R package writers to incorporate Stan models.

  • Yuling Yao gave us the rundown on where ADVI testing stands. It may falsely report convergence when it's not at a maximum, it may converge to a local minimum, or it may converge but the Gaussian approximation may be terrible, either in terms of the posterior means or the variances. He and Andrew Gelman are looking at using Pareto smoothed importance sampling (a la the loo package) to try to sort out the quality of the approximation. Yuling thinks convergence is mostly scaling issues and that preconditioning along with natural gradients may solve the problem. It's nice to see grad students sink their teeth into a problem! It'd be great if we could come up with a more robust ADVI implementation that had diagnostic warnings if the approximation wasn't reliable.

The post Stan Biweekly Roundup, 6 October 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.



A cRossword about R

Fri, 10/06/2017 - 18:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The members of the R Ladies DC user group put together an R-themed crossword for a recent networking event. It's a fun way to test out your R knowledge. (Click to enlarge, or download a printable version here.)

If you get stuck, you can find the answers here or at the link below. And if you'd like to join your local R-Ladies chapter, you can find a list of meetups here.

Github (rladies): meetup-presentations_dc / NetworkingCrosswordPuzzle-2017

 



Practical Machine Learning with R and Python – Part 1

Fri, 10/06/2017 - 16:07

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

Introduction

This is the 1st part of a series of posts I intend to write on some common Machine Learning Algorithms in R and Python. In this first part I cover the following Machine Learning Algorithms

  • Univariate Regression
  • Multivariate Regression
  • Polynomial Regression
  • K Nearest Neighbors Regression

The code includes implementations in both R and Python. This series of posts is based on the following 2 MOOC courses I did at Stanford Online and at Coursera

  1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Stanford Online
  2. Applied Machine Learning in Python, Prof Kevyn-Collin Thomson, University of Michigan, Coursera

I have used data sets from the UCI Machine Learning repository (Communities and Crime, and Auto MPG). I also use the Boston data set from the MASS package.

While coding in R and Python I found that some aspects were more convenient in one language and some in the other. For example, plotting the fit is straightforward in R, while computing R squared, splitting into train and test sets, etc. are readily available in Python. In any case, these minor inconveniences can easily be implemented in either language.

R squared in R is computed as follows.
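The Rsquared() helper itself comes from the RFunctions.R file and is not shown in the post; the snippet below is only a minimal sketch consistent with how it is called later, e.g. Rsquared(fit, test, test$medv):

# Sketch of an R-squared helper: 1 - SSE/SST on the test set
Rsquared <- function(fit, newdata, y) {
  yhat <- predict(fit, newdata = newdata)
  1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
}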


Note: You can download this R Markdown file and the associated data sets from Github at MachineLearning-RandPython
Note 1: This post was created as an R Markdown file in RStudio, which has a cool feature of including both R and Python snippets. Plotting with matplotlib needs a workaround, but otherwise this is a really cool feature of RStudio!

1.1a Univariate Regression – R code

Here a simple linear regression line is fitted between a single input feature and the target variable

# Source in the R function library
source("RFunctions.R")

# Read the Boston data file (data from MASS - Statistical Learning)
df=read.csv("Boston.csv",stringsAsFactors = FALSE)

# Split the data into training and test sets (75:25)
train_idx <- trainTestSplit(df,trainPercent=75,seed=5)
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Fit a linear regression line between 'Median value of owner occupied homes' vs
# 'lower status of population'
fit=lm(medv~lstat,data=df)

# Display details of fit
summary(fit)
##
## Call:
## lm(formula = medv ~ lstat, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -15.168  -3.990  -1.318   2.034  24.500
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

# Display the confidence intervals
confint(fit)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505

plot(df$lstat,df$medv, xlab="Lower status (%)",ylab="Median value of owned homes ($1000)",
     main="Median value of homes ($1000) vs Lower status (%)")
abline(fit)
abline(fit,lwd=3)
abline(fit,lwd=3,col="red")

rsquared=Rsquared(fit,test,test$medv)
sprintf("R-squared for uni-variate regression (Boston.csv) is : %f", rsquared)
## [1] "R-squared for uni-variate regression (Boston.csv) is : 0.556964"

1.1b Univariate Regression – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

#os.chdir("C:\\software\\machine-learning\\RandPython")
# Read the CSV file
df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
# Select the feature variable
X=df['lstat']
# Select the target
y=df['medv']
# Split into train and test sets (75:25)
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0)
X_train=X_train.values.reshape(-1,1)
X_test=X_test.values.reshape(-1,1)
# Fit a linear model
linreg = LinearRegression().fit(X_train, y_train)
# Print the training and test R squared score
print('R-squared score (training): {:.3f}'.format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(linreg.score(X_test, y_test)))
# Plot the linear regression line
fig=plt.scatter(X_train,y_train)
# Create a range of points. Compute yhat=coeff1*x + intercept and plot
x=np.linspace(0,40,20)
fig1=plt.plot(x, linreg.coef_ * x + linreg.intercept_, color='red')
fig1=plt.title("Median value of homes ($1000) vs Lower status (%)")
fig1=plt.xlabel("Lower status (%)")
fig1=plt.ylabel("Median value of owned homes ($1000)")
fig.figure.savefig('foo.png', bbox_inches='tight')
fig1.figure.savefig('foo1.png', bbox_inches='tight')
print "Finished"
## R-squared score (training): 0.571
## R-squared score (test): 0.458
## Finished

1.2a Multivariate Regression – R code

# Read crimes data
crimesDF <- read.csv("crimes.csv",stringsAsFactors = FALSE)
# Remove the 1st 7 columns which do not impact output
crimesDF1 <- crimesDF[,7:length(crimesDF)]
# Convert all to numeric
crimesDF2 <- sapply(crimesDF1,as.numeric)
# Check for NAs
a <- is.na(crimesDF2)
# Set to 0 as an imputation
crimesDF2[a] <-0
# Create as a dataframe
crimesDF2 <- as.data.frame(crimesDF2)
# Create a train/test split
train_idx <- trainTestSplit(crimesDF2,trainPercent=75,seed=5)
train <- crimesDF2[train_idx, ]
test <- crimesDF2[-train_idx, ]
# Fit a multivariate regression model between crimesPerPop and all other features
fit <- lm(ViolentCrimesPerPop~.,data=train)
# Compute and print R Squared
rsquared=Rsquared(fit,test,test$ViolentCrimesPerPop)
sprintf("R-squared for multi-variate regression (crimes.csv) is : %f", rsquared)
## [1] "R-squared for multi-variate regression (crimes.csv) is : 0.653940"

1.2b Multivariate Regression – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Read the data
crimesDF =pd.read_csv("crimes.csv",encoding="ISO-8859-1")
# Remove the 1st 7 columns
crimesDF1=crimesDF.iloc[:,7:crimesDF.shape[1]]
# Convert to numeric
crimesDF2 = crimesDF1.apply(pd.to_numeric, errors='coerce')
# Impute NA to 0s
crimesDF2.fillna(0, inplace=True)
# Select the X (feature variables - all)
X=crimesDF2.iloc[:,0:120]
# Set the target
y=crimesDF2.iloc[:,121]
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0)
# Fit a multivariate regression model
linreg = LinearRegression().fit(X_train, y_train)
# Compute and print the R Square
print('R-squared score (training): {:.3f}'.format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(linreg.score(X_test, y_test)))
## R-squared score (training): 0.699
## R-squared score (test): 0.677

1.3a Polynomial Regression – R

For polynomial regression, polynomials of degree 1, 2 & 3 are used and R squared is computed. It can be seen that the quadratic model provides the best R squared score and hence the best fit.

# Polynomial degree 1 df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI df1 <- as.data.frame(sapply(df,as.numeric)) # Select key columns df2 <- df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg) df3 <- df2[complete.cases(df2),] # Split as train and test sets train_idx <- trainTestSplit(df3,trainPercent=75,seed=5) train <- df3[train_idx, ] test <- df3[-train_idx, ] # Fit a model of degree 1 fit <- lm(mpg~. ,data=train) rsquared1 <-Rsquared(fit,test,test$mpg) sprintf("R-squared for Polynomial regression of degree 1 (auto_mpg.csv) is : %f", rsquared1) ## [1] "R-squared for Polynomial regression of degree 1 (auto_mpg.csv) is : 0.763607" # Polynomial degree 2 - Quadratic x = as.matrix(df3[1:6]) # Make a polynomial of degree 2 for feature variables before split df4=as.data.frame(poly(x,2,raw=TRUE)) df5 <- cbind(df4,df3[7]) # Split into train and test set train_idx <- trainTestSplit(df5,trainPercent=75,seed=5) train <- df5[train_idx, ] test <- df5[-train_idx, ] # Fit the quadratic model fit <- lm(mpg~. ,data=train) # Compute R squared rsquared2=Rsquared(fit,test,test$mpg) sprintf("R-squared for Polynomial regression of degree 2 (auto_mpg.csv) is : %f", rsquared2) ## [1] "R-squared for Polynomial regression of degree 2 (auto_mpg.csv) is : 0.831372" #Polynomial degree 3 x = as.matrix(df3[1:6]) # Make polynomial of degree 4 of feature variables before split df4=as.data.frame(poly(x,3,raw=TRUE)) df5 <- cbind(df4,df3[7]) train_idx <- trainTestSplit(df5,trainPercent=75,seed=5) train <- df5[train_idx, ] test <- df5[-train_idx, ] # Fit a model of degree 3 fit <- lm(mpg~. ,data=train) # Compute R squared rsquared3=Rsquared(fit,test,test$mpg) sprintf("R-squared for Polynomial regression of degree 2 (auto_mpg.csv) is : %f", rsquared3) ## [1] "R-squared for Polynomial regression of degree 2 (auto_mpg.csv) is : 0.773225" df=data.frame(degree=c(1,2,3),Rsquared=c(rsquared1,rsquared2,rsquared3)) # Make a plot of Rsquared and degree ggplot(df,aes(x=degree,y=Rsquared)) +geom_point() + geom_line(color="blue") + ggtitle("Polynomial regression - R squared vs Degree of polynomial") + xlab("Degree") + ylab("R squared")

1.3b Polynomial Regression – Python

For polynomial regression, polynomials of degree 1, 2 & 3 are used and R squared is computed. It can be seen that the quadratic model provides the best R squared score and hence the best fit.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
# Select key columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
# Convert columns to numeric
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
# Drop NAs
autoDF3=autoDF2.dropna()
autoDF3.shape
X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']]
y=autoDF3['mpg']

# Polynomial degree 1
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)
print('R-squared score - Polynomial degree 1 (training): {:.3f}'.format(linreg.score(X_train, y_train)))
# Compute R squared
rsquared1 =linreg.score(X_test, y_test)
print('R-squared score - Polynomial degree 1 (test): {:.3f}'.format(linreg.score(X_test, y_test)))

# Polynomial degree 2
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_poly, y,random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)
# Compute R squared
print('R-squared score - Polynomial degree 2 (training): {:.3f}'.format(linreg.score(X_train, y_train)))
rsquared2 =linreg.score(X_test, y_test)
print('R-squared score - Polynomial degree 2 (test): {:.3f}\n'.format(linreg.score(X_test, y_test)))

# Polynomial degree 3
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_poly, y,random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)
print('R-squared score - Polynomial degree 3 (training): {:.3f}'.format(linreg.score(X_train, y_train)))
# Compute R squared
rsquared3 =linreg.score(X_test, y_test)
print('R-squared score - Polynomial degree 3 (test): {:.3f}\n'.format(linreg.score(X_test, y_test)))

degree=[1,2,3]
rsquared =[rsquared1,rsquared2,rsquared3]
fig2=plt.plot(degree,rsquared)
fig2=plt.title("Polynomial regression - R squared vs Degree of polynomial")
fig2=plt.xlabel("Degree")
fig2=plt.ylabel("R squared")
fig2.figure.savefig('foo2.png', bbox_inches='tight')
print "Finished plotting and saving"
## R-squared score - Polynomial degree 1 (training): 0.811
## R-squared score - Polynomial degree 1 (test): 0.799
## R-squared score - Polynomial degree 2 (training): 0.861
## R-squared score - Polynomial degree 2 (test): 0.847
##
## R-squared score - Polynomial degree 3 (training): 0.933
## R-squared score - Polynomial degree 3 (test): 0.710
##
## Finished plotting and saving

1.4 K Nearest Neighbors

The code below implements KNN regression in both R and Python, for different numbers of neighbors, and the R squared is computed in each case. This is then repeated after performing feature scaling, and it can be seen that the model fit is much better after feature scaling. Normalization (min-max scaling) refers to x' = (x - min(x)) / (max(x) - min(x)).

Another technique that is used is Standardization, which rescales each feature to zero mean and unit variance, i.e. x' = (x - mean(x)) / sd(x).
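
As a rough illustration of these two transformations, here is a minimal base-R sketch. The function bodies below are my own; in particular, the MinMaxScaler() helper called in the normalized R example further below presumably comes from an earlier part of this post and may be implemented differently:

# Rescale each column of a data frame to the [0, 1] range (min-max normalization)
MinMaxScaler <- function(df) {
  as.data.frame(lapply(df, function(x) (x - min(x)) / (max(x) - min(x))))
}
# Center each column to zero mean and unit variance (standardization)
Standardize <- function(df) {
  as.data.frame(lapply(df, function(x) (x - mean(x)) / sd(x)))
}

Note that, strictly speaking, the scaling parameters (minima and maxima, or means and standard deviations) should be computed on the training set only and then reused to transform the test set, which is what the Python examples do via scaler.fit_transform() and scaler.transform().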

1.4a K Nearest Neighbors Regression – R (Unnormalized)

The R code below does not use feature scaling

# KNN regression requires the FNN package df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI df1 <- as.data.frame(sapply(df,as.numeric)) df2 <- df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg) df3 <- df2[complete.cases(df2),] # Split train and test train_idx <- trainTestSplit(df3,trainPercent=75,seed=5) train <- df3[train_idx, ] test <- df3[-train_idx, ] # Select the feature variables train.X=train[,1:6] # Set the target for training train.Y=train[,7] # Do the same for test set test.X=test[,1:6] test.Y=test[,7] rsquared <- NULL # Create a list of neighbors neighbors <-c(1,2,4,8,10,14) for(i in seq_along(neighbors)){ # Perform a KNN regression fit knn=knn.reg(train.X,test.X,train.Y,k=neighbors[i]) # Compute R squared rsquared[i]=knnRSquared(knn$pred,test.Y) } # Make a dataframe for plotting df <- data.frame(neighbors,Rsquared=rsquared) # Plot the number of neighbors vs the R squared ggplot(df,aes(x=neighbors,y=Rsquared)) + geom_point() +geom_line(color="blue") + xlab("Number of neighbors") + ylab("R squared") + ggtitle("KNN regression - R squared vs Number of Neighbors (Unnormalized)")

1.4b K Nearest Neighbors Regression – Python (Unnormalized)

The Python code below does not use feature scaling

import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.preprocessing import PolynomialFeatures from sklearn.neighbors import KNeighborsRegressor autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1") autoDF.shape autoDF.columns autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']] autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce') autoDF3=autoDF2.dropna() autoDF3.shape X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']] y=autoDF3['mpg'] # Perform a train/test split X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) # Create a list of neighbors rsquared=[] neighbors=[1,2,4,8,10,14] for i in neighbors: # Fit a KNN model knnreg = KNeighborsRegressor(n_neighbors = i).fit(X_train, y_train) # Compute R squared rsquared.append(knnreg.score(X_test, y_test)) print('R-squared test score: {:.3f}' .format(knnreg.score(X_test, y_test))) # Plot the number of neighors vs the R squared fig3=plt.plot(neighbors,rsquared) fig3=plt.title("KNN regression - R squared vs Number of neighbors(Unnormalized)") fig3=plt.xlabel("Neighbors") fig3=plt.ylabel("R squared") fig3.figure.savefig('foo3.png', bbox_inches='tight') print "Finished plotting and saving" ## R-squared test score: 0.527 ## R-squared test score: 0.678 ## R-squared test score: 0.707 ## R-squared test score: 0.684 ## R-squared test score: 0.683 ## R-squared test score: 0.670 ## Finished plotting and saving 1.4c K Nearest Neighbors Regression – R( Normalized)

It can be seen that R squared improves when the features are normalized.

df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI df1 <- as.data.frame(sapply(df,as.numeric)) df2 <- df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg) df3 <- df2[complete.cases(df2),] # Perform MinMaxScaling of feature variables train.X.scaled=MinMaxScaler(train.X) test.X.scaled=MinMaxScaler(test.X) # Create a list of neighbors rsquared <- NULL neighbors <-c(1,2,4,6,8,10,12,15,20,25,30) for(i in seq_along(neighbors)){ # Fit a KNN model knn=knn.reg(train.X.scaled,test.X.scaled,train.Y,k=i) # Compute R ssquared rsquared[i]=knnRSquared(knn$pred,test.Y) } df <- data.frame(neighbors,Rsquared=rsquared) # Plot the number of neighors vs the R squared ggplot(df,aes(x=neighbors,y=Rsquared)) + geom_point() +geom_line(color="blue") + xlab("Number of neighbors") + ylab("R squared") + ggtitle("KNN regression - R squared vs Number of Neighors(Normalized)")

1.4d K Nearest Neighbors Regression – Python (Normalized)

R squared improves when the features are normalized with MinMaxScaling

import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.preprocessing import PolynomialFeatures from sklearn.neighbors import KNeighborsRegressor from sklearn.preprocessing import MinMaxScaler autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1") autoDF.shape autoDF.columns autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']] autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce') autoDF3=autoDF2.dropna() autoDF3.shape X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']] y=autoDF3['mpg'] # Perform a train/ test split X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) # Use MinMaxScaling scaler = MinMaxScaler() X_train_scaled = scaler.fit_transform(X_train) # Apply scaling on test set X_test_scaled = scaler.transform(X_test) # Create a list of neighbors rsquared=[] neighbors=[1,2,4,6,8,10,12,15,20,25,30] for i in neighbors: # Fit a KNN model knnreg = KNeighborsRegressor(n_neighbors = i).fit(X_train_scaled, y_train) # Compute R squared rsquared.append(knnreg.score(X_test_scaled, y_test)) print('R-squared test score: {:.3f}' .format(knnreg.score(X_test_scaled, y_test))) # Plot the number of neighors vs the R squared fig4=plt.plot(neighbors,rsquared) fig4=plt.title("KNN regression - R squared vs Number of neighbors(Normalized)") fig4=plt.xlabel("Neighbors") fig4=plt.ylabel("R squared") fig4.figure.savefig('foo4.png', bbox_inches='tight') print "Finished plotting and saving" ## R-squared test score: 0.703 ## R-squared test score: 0.810 ## R-squared test score: 0.830 ## R-squared test score: 0.838 ## R-squared test score: 0.834 ## R-squared test score: 0.828 ## R-squared test score: 0.827 ## R-squared test score: 0.826 ## R-squared test score: 0.816 ## R-squared test score: 0.815 ## R-squared test score: 0.809 ## Finished plotting and saving

Conclusion

In this initial post I cover regression models for a continuous output. In subsequent posts I intend to touch upon other Machine Learning algorithms.
Comments, suggestions and corrections are welcome.

Watch this space!

To be continued….

You may like
1. Using Linear Programming (LP) for optimizing bowling change or batting lineup in T20 cricket
2. Neural Networks: The mechanics of backpropagation
3. More book, more cricket! 2nd edition of my books now on Amazon
4. Spicing up a IBM Bluemix cloud app with MongoDB and NodeExpress
5. Introducing cricket package yorkr:Part 4-In the block hole!

To see all posts see Index of posts



ARIMA models and Intervention Analysis

Fri, 10/06/2017 - 14:54

In my previous tutorial Structural Changes in Global Warming I introduced the strucchange package and some basic examples of dating structural breaks in time series. In the present tutorial, I am going to show how dating structural changes (if any) and then Intervention Analysis can help in finding better ARIMA models. Dating structural changes consists in determining whether there are any structural breaks in the time series data generating process and, if so, their dates. Intervention analysis estimates the effect of an external or exogenous intervention on a time series; an example of such an intervention is a permanent level shift, as we will see in this tutorial. In our scenario, the external or exogenous intervention is not known (or supposed to be known) in advance; it is inferred from the structural break we will identify.
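
As a minimal, generic sketch of the idea (a toy simulated series, not the dataset analyzed below; the numbers are purely illustrative), a permanent level shift can be encoded as a step dummy and passed to forecast::Arima() through its xreg argument:

library(forecast)
set.seed(123)
# Toy series of 80 observations with an artificial level shift of +0.5 starting at observation 41
y <- ts(rnorm(80) + c(rep(0, 40), rep(0.5, 40)))
# Step intervention regressor: 0 before the (known) break, 1 from the break onwards
step <- c(rep(0, 40), rep(1, 40))
# The coefficient estimated for 'step' measures the size of the level shift
fit <- Arima(y, order = c(1, 0, 0), xreg = step)
summary(fit)

In the analysis below the break date is not assumed to be known; it is first estimated with the strucchange package and only then encoded as such a regressor.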

The dataset considered for the analysis is the Arbuthnot dataset, containing counts of male and female births in London from year 1629 to 1710. Based on that, a metric representing the fractional excess of boys' births over girls' is defined as:

$$
\frac{\text{Boys} - \text{Girls}}{\text{Girls}}
$$

The time series so defined is analyzed to determine candidate ARIMA models. The present tutorial is organized as follows. First, a brief exploratory analysis is carried out. Then, six ARIMA models are defined, analyzed and compared. A forecast of the time series under analysis is computed. Finally, a short historical digression describes what was happening in England in the 17th century and cites studies on the matter of the sex ratio at birth.

Packages

suppressPackageStartupMessages(library(ggplot2)) suppressPackageStartupMessages(library(forecast)) suppressPackageStartupMessages(library(astsa)) suppressPackageStartupMessages(library(lmtest)) suppressPackageStartupMessages(library(fUnitRoots)) suppressPackageStartupMessages(library(FitARMA)) suppressPackageStartupMessages(library(strucchange)) suppressPackageStartupMessages(library(reshape)) suppressPackageStartupMessages(library(Rmisc)) suppressPackageStartupMessages(library(fBasics))

Note: The fUnitRoots package is no longer maintained on CRAN; however, it can be installed from source, available at the following link:

fUnitRoots version 3010.78 sources
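
For example, once the source tarball has been downloaded, it can be installed with a command along these lines (the exact file name is an assumption based on the version quoted above):

# Install a package from a local source tarball, bypassing CRAN
install.packages("fUnitRoots_3010.78.tar.gz", repos = NULL, type = "source")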

Exploratory Analysis

Loading the Arbuthnot dataset and showing some basic metrics and plots.

url <- "https://www.openintro.org/stat/data/arbuthnot.csv" abhutondot <- read.csv(url, header=TRUE) nrow(abhutondot) [1] 82 head(abhutondot) year boys girls 1 1629 5218 4683 2 1630 4858 4457 3 1631 4422 4102 4 1632 4994 4590 5 1633 5158 4839 6 1634 5035 4820 abhutondot_rs <- melt(abhutondot, id = c("year")) head(abhutondot_rs) year variable value 1 1629 boys 5218 2 1630 boys 4858 3 1631 boys 4422 4 1632 boys 4994 5 1633 boys 5158 6 1634 boys 5035 tail(abhutondot_rs) year variable value 159 1705 girls 7779 160 1706 girls 7417 161 1707 girls 7687 162 1708 girls 7623 163 1709 girls 7380 164 1710 girls 7288 ggplot(data = abhutondot_rs, aes(x = year)) + geom_line(aes(y = value, colour = variable)) + scale_colour_manual(values = c("blue", "red"))

Gives this plot.

Boys' births appear to be consistently greater than girls'. Let us run a t-test to further verify whether there is a true difference in the means of the two groups, represented by boys' and girls' births.

t.test(value ~ variable, data = abhutondot_rs) Welch Two Sample t-test data: value by variable t = 1.4697, df = 161.77, p-value = 0.1436 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -128.0016 872.9040 sample estimates: mean in group boys mean in group girls 5907.098 5534.646

Based on the p-value, we cannot reject the null hypothesis that the true difference in means is equal to zero.

basicStats(abhutondot[-1]) boys girls nobs 8.200000e+01 8.200000e+01 NAs 0.000000e+00 0.000000e+00 Minimum 2.890000e+03 2.722000e+03 Maximum 8.426000e+03 7.779000e+03 1. Quartile 4.759250e+03 4.457000e+03 3. Quartile 7.576500e+03 7.150250e+03 Mean 5.907098e+03 5.534646e+03 Median 6.073000e+03 5.718000e+03 Sum 4.843820e+05 4.538410e+05 SE Mean 1.825161e+02 1.758222e+02 LCL Mean 5.543948e+03 5.184815e+03 UCL Mean 6.270247e+03 5.884477e+03 Variance 2.731595e+06 2.534902e+06 Stdev 1.652754e+03 1.592137e+03 Skewness -2.139250e-01 -2.204720e-01 Kurtosis -1.221721e+00 -1.250893e+00 p1 <- ggplot(data = abhutondot_rs, aes(x = variable, y = value)) + geom_boxplot() p2 <- ggplot(data = abhutondot, aes(boys)) + geom_density() p3 <- ggplot(data = abhutondot, aes(girls)) + geom_density() multiplot(p1, p2, p3, cols = 3)

Gives this plot.

Let us define the time series to be analysed with frequency = 1 as data is collected yearly, see ref. [5] for details.

excess_frac <- (abhutondot$boys - abhutondot$girls)/abhutondot$girls excess_ts <- ts(excess_frac, frequency = 1, start = abhutondot$year[1]) autoplot(excess_ts)

Gives this plot.

Basic statistics metrics are reported.

basicStats(excess_frac) excess_frac nobs 82.000000 NAs 0.000000 Minimum 0.010673 Maximum 0.156075 1. Quartile 0.048469 3. Quartile 0.087510 Mean 0.070748 Median 0.064704 Sum 5.801343 SE Mean 0.003451 LCL Mean 0.063881 UCL Mean 0.077615 Variance 0.000977 Stdev 0.031254 Skewness 0.680042 Kurtosis 0.073620

Boys' births were at least 1% higher than girls', reaching a top excess of about 15%.

Further, unit root tests (run with urdfTest() from the fUnitRoots package) show that we cannot reject the null hypothesis that a unit root is present. See their test statistics compared with the critical values below (see ref. [2] for details about the urdfTest() report).

urdftest_lag = floor(12*(length(excess_ts)/100)^0.25) urdfTest(excess_ts, type = "nc", lags = urdftest_lag, doplot = FALSE) Title: Augmented Dickey-Fuller Unit Root Test Test Results: Test regression none Call: lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag) Residuals: Min 1Q Median 3Q Max -0.052739 -0.018246 -0.002899 0.019396 0.069349 Coefficients: Estimate Std. Error t value Pr(>|t|) z.lag.1 -0.01465 0.05027 -0.291 0.771802 z.diff.lag1 -0.71838 0.13552 -5.301 1.87e-06 *** z.diff.lag2 -0.66917 0.16431 -4.073 0.000143 *** z.diff.lag3 -0.58640 0.18567 -3.158 0.002519 ** z.diff.lag4 -0.56529 0.19688 -2.871 0.005700 ** z.diff.lag5 -0.58489 0.20248 -2.889 0.005432 ** z.diff.lag6 -0.60278 0.20075 -3.003 0.003944 ** z.diff.lag7 -0.43509 0.20012 -2.174 0.033786 * z.diff.lag8 -0.28608 0.19283 -1.484 0.143335 z.diff.lag9 -0.14212 0.18150 -0.783 0.436769 z.diff.lag10 0.13232 0.15903 0.832 0.408801 z.diff.lag11 -0.07234 0.12774 -0.566 0.573346 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.03026 on 58 degrees of freedom Multiple R-squared: 0.4638, Adjusted R-squared: 0.3529 F-statistic: 4.181 on 12 and 58 DF, p-value: 0.0001034 Value of test-statistic is: -0.2914 Critical values for test statistics: 1pct 5pct 10pct tau1 -2.6 -1.95 -1.61 urdfTest(excess_ts, type = "c", lags = urdftest_lag, doplot = FALSE) Title: Augmented Dickey-Fuller Unit Root Test Test Results: Test regression drift Call: lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag) Residuals: Min 1Q Median 3Q Max -0.051868 -0.018870 -0.005227 0.019503 0.067936 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.02239 0.01789 1.251 0.2159 z.lag.1 -0.31889 0.24824 -1.285 0.2041 z.diff.lag1 -0.44287 0.25820 -1.715 0.0917 . z.diff.lag2 -0.40952 0.26418 -1.550 0.1266 z.diff.lag3 -0.34933 0.26464 -1.320 0.1921 z.diff.lag4 -0.35207 0.25966 -1.356 0.1805 z.diff.lag5 -0.39863 0.25053 -1.591 0.1171 z.diff.lag6 -0.44797 0.23498 -1.906 0.0616 . z.diff.lag7 -0.31103 0.22246 -1.398 0.1675 z.diff.lag8 -0.19044 0.20656 -0.922 0.3604 z.diff.lag9 -0.07128 0.18928 -0.377 0.7079 z.diff.lag10 0.18023 0.16283 1.107 0.2730 z.diff.lag11 -0.04154 0.12948 -0.321 0.7495 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.03012 on 57 degrees of freedom Multiple R-squared: 0.4781, Adjusted R-squared: 0.3683 F-statistic: 4.352 on 12 and 57 DF, p-value: 6.962e-05 Value of test-statistic is: -1.2846 0.8257 Critical values for test statistics: 1pct 5pct 10pct tau2 -3.51 -2.89 -2.58 phi1 6.70 4.71 3.86

ACF and PACF plots are given.

par(mfrow=c(1,2)) acf(excess_ts) pacf(excess_ts)

Gives this plot:

We observe an auto-correlation spike at lag = 10 beyond the confidence region. That suggests the presence of a seasonal component with period = 10. Structural changes are now investigated. First, let us verify whether regression against a constant is significant for our time series.

summary(lm(excess_ts ~ 1)) Call: lm(formula = excess_ts ~ 1) Residuals: Min 1Q Median 3Q Max -0.060075 -0.022279 -0.006044 0.016762 0.085327 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.070748 0.003451 20.5 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.03125 on 81 degrees of freedom

The intercept is reported as significant. Let us see if there are any structural breaks.

(break_point <- breakpoints(excess_ts ~ 1)) Optimal 2-segment partition: Call: breakpoints.formula(formula = excess_ts ~ 1) Breakpoints at observation number: 42 Corresponding to breakdates: 1670 plot(break_point)

Gives this plot.

summary(break_point) Optimal (m+1)-segment partition: Call: breakpoints.formula(formula = excess_ts ~ 1) Breakpoints at observation number: m = 1 42 m = 2 20 42 m = 3 20 35 48 m = 4 20 35 50 66 m = 5 17 30 42 56 69 Corresponding to breakdates: m = 1 1670 m = 2 1648 1670 m = 3 1648 1663 1676 m = 4 1648 1663 1678 1694 m = 5 1645 1658 1670 1684 1697 Fit: m 0 1 2 3 4 5 RSS 0.07912 0.06840 0.06210 0.06022 0.05826 0.05894 BIC -327.84807 -330.97945 -330.08081 -323.78985 -317.68933 -307.92410

The BIC minimum value is reached when m = 1, hence just one break point is determined corresponding to year 1670. Let us plot the original time series against its structural break and its confidence interval.

plot(excess_ts) lines(fitted(break_point, breaks = 1), col = 4) lines(confint(break_point, breaks = 1))

Gives this plot.

The boys vs. girls excess birth fraction changed from:

fitted(break_point)[1] 0.08190905

to:

fitted(break_point)[length(excess_ts)] 0.05902908

Let us run a t.test() to further verify that the difference in means is significant across the two time windows identified by the breakpoint date, year 1670.

break_date <- breakdates(break_point) win_1 <- window(excess_ts, end = break_date) win_2 <- window(excess_ts, start = break_date + 1) t.test(win_1, win_2) Welch Two Sample t-test data: win_1 and win_2 t = 3.5773, df = 70.961, p-value = 0.0006305 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.01012671 0.03563322 sample estimates: mean of x mean of y 0.08190905 0.05902908

Based on the reported p-value, we reject the null hypothesis that the true difference in means is equal to zero.

ARIMA Models

I am going to compare the following six ARIMA models (represented with the usual (p,d,q)(P,D,Q)[S] notation):

1. non seasonal (1,1,1), as determined by auto.arima() within forecast package
2. seasonal (1,0,0)(0,0,1)[10]
3. seasonal (1,0,0)(1,0,0)[10]
4. seasonal (0,0,0)(0,0,1)[10] with level shift regressor as intervention variable
5. seasonal (1,0,0)(0,0,1)[10] with level shift regressor as intervention variable
6. non seasonal (1,0,0) with level shift regressor as intervention variable

Here we go.

Model #1

The first model is determined by the auto.arima() function within the forecast package, using the options:

a. stepwise = FALSE, which allows for a more in-depth search of potential models
b. trace = TRUE, which prints the list of all the investigated models

Further, the following auto.arima() defaults are kept:
c. stationary = FALSE, so that the model search is not restricted to stationary models
d. seasonal = TRUE, so that the model search is not restricted to non-seasonal models

(model_1 <- auto.arima(excess_ts, stepwise = FALSE, trace = TRUE)) ARIMA(0,1,0) : -301.4365 ARIMA(0,1,0) with drift : -299.3722 ARIMA(0,1,1) : -328.9381 ARIMA(0,1,1) with drift : -326.9276 ARIMA(0,1,2) : -329.4061 ARIMA(0,1,2) with drift : Inf ARIMA(0,1,3) : -327.2841 ARIMA(0,1,3) with drift : Inf ARIMA(0,1,4) : -325.7604 ARIMA(0,1,4) with drift : Inf ARIMA(0,1,5) : -323.4805 ARIMA(0,1,5) with drift : Inf ARIMA(1,1,0) : -312.8106 ARIMA(1,1,0) with drift : -310.7155 ARIMA(1,1,1) : -329.5727 ARIMA(1,1,1) with drift : Inf ARIMA(1,1,2) : -327.3821 ARIMA(1,1,2) with drift : Inf ARIMA(1,1,3) : -325.1085 ARIMA(1,1,3) with drift : Inf ARIMA(1,1,4) : -323.446 ARIMA(1,1,4) with drift : Inf ARIMA(2,1,0) : -317.1234 ARIMA(2,1,0) with drift : -314.9816 ARIMA(2,1,1) : -327.3795 ARIMA(2,1,1) with drift : Inf ARIMA(2,1,2) : -325.0859 ARIMA(2,1,2) with drift : Inf ARIMA(2,1,3) : -322.9714 ARIMA(2,1,3) with drift : Inf ARIMA(3,1,0) : -315.9114 ARIMA(3,1,0) with drift : -313.7128 ARIMA(3,1,1) : -325.1497 ARIMA(3,1,1) with drift : Inf ARIMA(3,1,2) : -323.1363 ARIMA(3,1,2) with drift : Inf ARIMA(4,1,0) : -315.3018 ARIMA(4,1,0) with drift : -313.0426 ARIMA(4,1,1) : -324.2463 ARIMA(4,1,1) with drift : -322.0252 ARIMA(5,1,0) : -315.1075 ARIMA(5,1,0) with drift : -312.7776 Series: excess_ts ARIMA(1,1,1) Coefficients: ar1 ma1 0.2224 -0.9258 s.e. 0.1318 0.0683 sigma^2 estimated as 0.0009316: log likelihood=167.94 AIC=-329.88 AICc=-329.57 BIC=-322.7

As we can see, all investigated models have d=1, which is congruent with the augmented Dickey-Fuller test results.

summary(model_1) Series: excess_ts ARIMA(1,1,1) Coefficients: ar1 ma1 0.2224 -0.9258 s.e. 0.1318 0.0683 sigma^2 estimated as 0.0009316: log likelihood=167.94 AIC=-329.88 AICc=-329.57 BIC=-322.7 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set -0.002931698 0.02995934 0.02405062 -27.05674 46.53832 0.8292635 -0.01005689

The significance of our (1,1,1) model coefficients is further investigated.

coeftest(model_1) z test of coefficients: Estimate Std. Error z value Pr(>|z|) ar1 0.222363 0.131778 1.6874 0.09153 . ma1 -0.925836 0.068276 -13.5603 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Model #2

A spike at lag = 1 in the ACF plot suggests the presence of an auto-regressive component. Our second model candidate takes into account the seasonality observed at lag = 10. As a result, the candidate model (1,0,0)(0,0,1)[10] is investigated.

model_2 <- Arima(excess_ts, order = c(1,0,0), seasonal = list(order = c(0,0,1), period = 10), include.mean = TRUE) summary(model_2) Series: excess_ts ARIMA(1,0,0)(0,0,1)[10] with non-zero mean Coefficients: ar1 sma1 mean 0.2373 0.3441 0.0708 s.e. 0.1104 0.1111 0.0053 sigma^2 estimated as 0.0008129: log likelihood=176.23 AIC=-344.46 AICc=-343.94 BIC=-334.83 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set -0.0002212383 0.02798445 0.02274858 -21.47547 42.96717 0.7843692 -0.004420048 coeftest(model_2) z test of coefficients: Estimate Std. Error z value Pr(>|z|) ar1 0.2372975 0.1104323 2.1488 0.031650 * sma1 0.3440811 0.1110791 3.0976 0.001951 ** intercept 0.0707836 0.0052663 13.4409 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Model #3

In the third model, I introduce a seasonal auto-regressive component in place of the seasonal moving average one present in model #2.

model_3 <- Arima(excess_ts, order = c(1,0,0), seasonal = list(order = c(1,0,0), period = 10), include.mean = TRUE) summary(model_3) Series: excess_ts ARIMA(1,0,0)(1,0,0)[10] with non-zero mean Coefficients: ar1 sar1 mean 0.2637 0.3185 0.0705 s.e. 0.1069 0.1028 0.0058 sigma^2 estimated as 0.0008173: log likelihood=176.1 AIC=-344.19 AICc=-343.67 BIC=-334.56 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set -0.0002070952 0.02806013 0.02273145 -21.42509 42.85735 0.7837788 -0.005665764 coeftest(model_3) z test of coefficients: Estimate Std. Error z value Pr(>|z|) ar1 0.2636602 0.1069472 2.4653 0.013689 * sar1 0.3185397 0.1027903 3.0989 0.001942 ** intercept 0.0704636 0.0058313 12.0836 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Model #4

This model takes into account the level shift highlighted by the structural change analysis and the seasonal component at lag = 10 observed in the ACF plot. To represent the structural change as a level shift, a regressor variable named level is defined as equal to zero for the timeline preceding the breakpoint date and equal to one after that date. In other words, it is a step function representing a permanent level shift. This variable is passed to the Arima() function as the xreg argument. That is one of the simplest representations of an intervention variable; for a more complete overview see ref. [6].

level <- c(rep(0, break_point$breakpoints), rep(1, length(excess_ts) - break_point$breakpoints)) model_4 <- Arima(excess_ts, order = c(0,0,0), seasonal = list(order = c(0,0,1), period = 10), xreg = level, include.mean = TRUE) summary(model_4) Series: excess_ts Regression with ARIMA(0,0,0)(0,0,1)[10] errors Coefficients: sma1 intercept level 0.3437 0.0817 -0.0225 s.e. 0.1135 0.0053 0.0072 sigma^2 estimated as 0.0007706: log likelihood=178.45 AIC=-348.9 AICc=-348.38 BIC=-339.27 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 0.0001703111 0.02724729 0.02184016 -19.82639 41.28977 0.7530469 0.1608774 coeftest(model_4) z test of coefficients: Estimate Std. Error z value Pr(>|z|) sma1 0.3437109 0.1135387 3.0273 0.002468 ** intercept 0.0817445 0.0052825 15.4745 < 2.2e-16 *** level -0.0225294 0.0072468 -3.1089 0.001878 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The ‘level’ regressor coefficient indicates an average decrease of approximately 2.25% in the boys vs. girls excess birth fraction after year 1670.

Model #5

The present model adds to model #4 an auto-regressive component.

model_5 <- Arima(excess_ts, order = c(1,0,0), seasonal = list(order = c(0,0,1), period=10), xreg = level, include.mean = TRUE) summary(model_5) Series: excess_ts Regression with ARIMA(1,0,0)(0,0,1)[10] errors Coefficients: ar1 sma1 intercept level 0.1649 0.3188 0.0820 -0.0230 s.e. 0.1099 0.1112 0.0061 0.0084 sigma^2 estimated as 0.0007612: log likelihood=179.56 AIC=-349.11 AICc=-348.32 BIC=-337.08 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 8.225255e-05 0.02690781 0.02174375 -19.42485 40.90147 0.7497229 0.007234682 coeftest(model_5) z test of coefficients: Estimate Std. Error z value Pr(>|z|) ar1 0.1648878 0.1099118 1.5002 0.133567 sma1 0.3187896 0.1112301 2.8660 0.004156 ** intercept 0.0819981 0.0061227 13.3926 < 2.2e-16 *** level -0.0230176 0.0083874 -2.7443 0.006064 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The auto-regressive coefficient ar1 is reported as not significant, hence this model is not considered further, even though it would provide a (very) slight improvement in the AIC metric.

Model #6

This model considers the structural change as in model #4, but without the seasonal component. That is because I want to evaluate whether any benefit comes from including a seasonal component.

model_6 <- Arima(excess_ts, order = c(1,0,0), xreg = level, include.mean = TRUE) summary(model_6) Series: excess_ts Regression with ARIMA(1,0,0) errors Coefficients: ar1 intercept level 0.1820 0.0821 -0.0232 s.e. 0.1088 0.0053 0.0076 sigma^2 estimated as 0.0008369: log likelihood=175.68 AIC=-343.35 AICc=-342.83 BIC=-333.73 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set -7.777648e-05 0.02839508 0.02258574 -21.93086 43.2444 0.7787544 0.0003897161 coeftest(model_6) z test of coefficients: Estimate Std. Error z value Pr(>|z|) ar1 0.1820054 0.1088357 1.6723 0.094466 . intercept 0.0821470 0.0053294 15.4139 < 2.2e-16 *** level -0.0232481 0.0076044 -3.0572 0.002234 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Models Residuals Analysis

It is relevant to verify whether our models' residuals are white noise or not. If not, the model under analysis has not captured all the structure of our original time series. For this purpose, I will take advantage of the checkresiduals() function within the forecast package and of sarima() within the astsa package. What I like about the sarima() function is the residuals qqplot() with confidence region and the Ljung-Box plot to check the significance of residual correlations. If you would like to further verify the Ljung-Box test results, I suggest taking advantage of the LjungBoxTest() function within the FitARMA package.

Notes:
1. Model #5 was taken out of the rest of the analysis, hence its residuals will not be checked.
2. checkresiduals() and sarima() give textual outputs, which are omitted as equivalent information is already included elsewhere. checkresiduals() reports a short Ljung-Box test result. The sarima() function reports a textual model summary showing coefficients and metrics similar to the summaries already shown.

Model #1 Residuals Diagnostic

Checking ARIMA model (1,1,1) residuals.

checkresiduals(model_1)

Gives this plot.

It is important to verify the significance of the residuals auto-correlation, in order to see whether the spike at lag = 16 is significant. In fact, the purpose of running the Ljung-Box test is to verify whether any auto-correlation beyond the confidence region (as seen in the ACF plot) is a true positive (p-value < 0.05) or a false positive (p-value >= 0.05).

LjungBoxTest(residuals(model_1), k = 2, lag.max = 20) m Qm pvalue 1 0.01 0.92611002 2 0.01 0.91139925 3 0.17 0.68276539 4 0.82 0.66216496 5 1.35 0.71745835 6 1.99 0.73828536 7 3.32 0.65017878 8 3.98 0.67886254 9 5.16 0.64076272 10 13.34 0.10075806 11 15.32 0.08260244 12 15.32 0.12066369 13 15.35 0.16692082 14 15.39 0.22084407 15 15.40 0.28310047 16 23.69 0.04989871 17 25.63 0.04204503 18 27.65 0.03480954 19 30.06 0.02590249 20 30.07 0.03680262

The p-value at lag = 16 is < 0.05, hence the auto-correlation spike at lag = 16 outside the confidence interval is significant. Our model #1 has not captured all the structure of the original time series. Auto-correlations at lags = {17, 18, 19, 20} also have p-values < 0.05; however, they stand within the confidence interval.

Now doing the same with sarima() function.

sarima(excess_ts, p = 1, d = 1, q = 1)

Gives this plot.

The sarima() output plot shows results congruent with previous comments.

Model #2 Residuals Diagnostic

Checking ARIMA (1,0,0)(0,0,1)[10] model residuals.

checkresiduals(model_2)

Gives this plot.

We observe that model #2 does not show auto-correlation spikes above the confidence interval, which confirms the presence of seasonality with period = 10.

LjungBoxTest(residuals(model_2), k = 2, lag.max = 20) m Qm pvalue 1 0.00 0.9674875 2 0.00 0.9545816 3 0.30 0.5815774 4 0.69 0.7096699 5 0.82 0.8442508 6 0.98 0.9121811 7 2.01 0.8481327 8 4.24 0.6442302 9 8.56 0.2861290 10 8.63 0.3742209 11 10.06 0.3459882 12 10.13 0.4290298 13 10.15 0.5172343 14 10.20 0.5985932 15 10.44 0.6577499 16 14.30 0.4275627 17 17.14 0.3104476 18 18.86 0.2759461 19 22.35 0.1715052 20 22.41 0.2142307

No p-values below 0.05 appear in the Ljung-Box test report. Now showing the sarima() plots.

sarima(excess_ts, p = 1, d = 0, q = 0, P = 0, D = 0, Q = 1, S = 10)

Gives this plot.

The sarima() output plot shows results congruent with previous comments. As a conclusion, model #2 has white noise residuals.

Model #3 Residuals Diagnostic

Checking ARIMA (1,0,0)(1,0,0)[10] model residuals.

checkresiduals(model_3)

Gives this plot.

LjungBoxTest(residuals(model_3), k = 2, lag.max = 20) m Qm pvalue 1 0.00 0.9583318 2 0.00 0.9553365 3 0.18 0.6719971 4 0.37 0.8313599 5 0.46 0.9285781 6 0.63 0.9600113 7 1.63 0.8975737 8 3.90 0.6901431 9 8.23 0.3126132 10 8.34 0.4005182 11 10.36 0.3222430 12 10.36 0.4091634 13 10.39 0.4960029 14 10.52 0.5708185 15 10.78 0.6290166 16 14.81 0.3914043 17 17.91 0.2675070 18 19.69 0.2343648 19 23.37 0.1374412 20 23.70 0.1651785 sarima(excess_ts, p = 1, d = 0, q = 0, P = 1, D = 0, Q = 0, S = 10)

Gives this plot.

Model #3 has white noise residuals.

Model #4 Residuals Diagnostic

Checking residuals of the ARIMA (0,0,0)(0,0,1)[10] model with level shift.

checkresiduals(model_4)

Gives this plot.

LjungBoxTest(residuals(model_4), k = 1, lag.max = 20) m Qm pvalue 1 2.20 0.1379312 2 2.23 0.1349922 3 2.24 0.3262675 4 3.68 0.2977682 5 4.99 0.2884494 6 5.23 0.3884858 7 5.52 0.4787801 8 7.45 0.3837810 9 10.78 0.2142605 10 10.79 0.2905934 11 12.61 0.2465522 12 12.61 0.3198632 13 12.61 0.3979718 14 12.63 0.4769589 15 12.65 0.5538915 16 16.37 0.3578806 17 16.77 0.4005631 18 19.73 0.2882971 19 25.25 0.1182396 20 25.31 0.1504763 sarima(excess_ts, p = 0, d = 0, q = 0, P = 0, D = 0, Q = 1, S = 10, xreg = level)

Gives this plot.

As a conclusion, model #4 has white noise residuals.

Model #6 Residuals Diagnostic

Checking residuals of the ARIMA (1,0,0) model with level shift.

checkresiduals(model_6)

Gives this plot.

We can observe ACF spikes above the confidence region for lags = {10, 16}.

LjungBoxTest(residuals(model_6), k = 1, lag.max = 20) m Qm pvalue 1 0.00 0.99713258 2 0.00 0.96036600 3 0.07 0.96612792 4 1.51 0.67947886 5 2.70 0.60998895 6 4.47 0.48361437 7 5.20 0.51895133 8 5.49 0.59997164 9 6.51 0.58979614 10 14.72 0.09890580 11 17.09 0.07239050 12 17.21 0.10175761 13 17.21 0.14179063 14 17.24 0.18844158 15 17.34 0.23872998 16 25.31 0.04596230 17 27.53 0.03591335 18 29.50 0.03021450 19 31.47 0.02537517 20 31.71 0.03370585

The Ljung-Box test reports the residuals auto-correlation at lag = 10 as not significant (p-value > 0.05). The lag = 16 residuals auto-correlation is significant (p-value < 0.05). Let us do the same verification with sarima().

sarima(excess_ts, p = 1, d = 0, q = 0, xreg = level)

Gives this plot.

The sarima() plots confirm the presence of significant auto-correlations in the residuals at lag = 16. As a conclusion, this model does not capture all the structure of our original time series.

As an overall conclusion, only the seasonal models have white noise residuals.

Models Summary

Finally, it is time to gather the overall AIC, AICc and BIC metrics of our five candidate models (please remember that model #5 has been excluded from the rest of the analysis) and choose the final model.

df <- data.frame(col_1_res = c(model_1$aic, model_2$aic, model_3$aic, model_4$aic, model_6$aic), col_2_res = c(model_1$aicc, model_2$aicc, model_3$aicc, model_4$aicc, model_6$aicc), col_3_res = c(model_1$bic, model_2$bic, model_3$bic, model_4$bic, model_6$bic)) colnames(df) <- c("AIC", "AICc", "BIC") rownames(df) <- c("ARIMA(1,1,1)", "ARIMA(1,0,0)(0,0,1)[10]", "ARIMA(1,0,0)(1,0,0)[10]", "ARIMA(0,0,0)(0,0,1)[10] with level shift", "ARIMA(1,0,0) with level shift") df AIC AICc BIC ARIMA(1,1,1) -329.8844 -329.5727 -322.7010 ARIMA(1,0,0)(0,0,1)[10] -344.4575 -343.9380 -334.8306 ARIMA(1,0,0)(1,0,0)[10] -344.1906 -343.6711 -334.5637 ARIMA(0,0,0)(0,0,1)[10] with level shift -348.8963 -348.3768 -339.2694 ARIMA(1,0,0) with level shift -343.3529 -342.8334 -333.7260

The model which provides the best AIC, AICc and BIC metrics at the same time is model #4, ARIMA(0,0,0)(0,0,1)[10] with level shift.

Note: In the presence of structural changes, the (augmented) Dickey-Fuller test is biased toward non-rejection of the null, as ref. [2] explains. This may justify why the null hypothesis of unit root presence was not rejected by this test and why model #1 performs worse than the other models in terms of the AIC metric. Further, you may verify that setting d = 1 in models #2 through #6 does not improve the AIC metric.
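
As a quick, illustrative way to check that last statement (this snippet is my own sketch, not part of the original analysis), one can refit, say, model #4 with d = 1 and compare the AIC values:

# Differenced counterpart of model #4, keeping the seasonal MA term and the level regressor
model_4_d1 <- Arima(excess_ts, order = c(0, 1, 0), seasonal = list(order = c(0, 0, 1), period = 10), xreg = level)
# Compare the AIC of model #4 with that of its differenced counterpart
c(AIC_model_4 = model_4$aic, AIC_model_4_d1 = model_4_d1$aic)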

Forecast

I take advantage of the forecast() function provided within the forecast package, using model #4 as input. The xreg variable is set to a constant level equal to one, consistent with the structural break analysis results.

h_fut <- 20 plot(forecast(model_4, h = h_fut, xreg = rep(1, h_fut)))

Gives this plot.

Historical Background

So we observed a level shift approximately equal to 2.25% in the boys vs. girls excess birth fraction, occurring in year 1670. Two questions arise:

1. What could have influenced the sex ratio at birth?
2. What was happening in London during the 17th century?

There are studies showing results in support of the claim that the sex ratio at birth is influenced by war periods. Studies such as refs. [7], [8] and [9] suggest an increase in boys' births during and/or after wartime. The general justification is an adaptive response to external conditions and stresses. Conversely, studies such as ref. [10] state that there is no statistical evidence of wartime influence on the sex ratio at birth. Under normal circumstances, the boys vs. girls sex ratio at birth is approximately 105:100 on average (according to some sources), and generally between 102:100 and 108:100.
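
As a quick sanity check connecting these ratios to the excess fraction used throughout this post, a 105:100 ratio corresponds to

$$
\frac{\text{Boys} - \text{Girls}}{\text{Girls}} = \frac{105 - 100}{100} = 0.05
$$

while the observed mean excess of about 0.07 corresponds to a ratio of roughly 107:100, still within the 102:100 to 108:100 range quoted above.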

Our time series covers most of the following eras of English history (ref. [11]):

* Early Stuart era: 1603-1660
* Later Stuart era: 1660-1714

The English Civil War occurred during the Early Stuart era and consisted of a series of armed conflicts and political machinations that took place between Parliamentarians (known as Roundheads) and Royalists (known as Cavaliers) between 1642 and 1651. The conflict left some 34,000 Parliamentarians and 50,000 Royalists dead, while at least 100,000 men and women died from war-related diseases, bringing the total death toll caused by the three civil wars in England to almost 200,000 (see ref. [12]).

If we step back to the first plot showing the abhutondot dataset, we can observe a sharp drop in births (both boys and girls) between 1642 and 1651, testifying to a period of scarce resources and famine, which we can infer was due to the civil war. Let us run a quick analysis on the total number of births.

abhutondot.ts <- ts(abhutondot$boys + abhutondot$girls, frequency = 1 , start = abhutondot$year[1]) autoplot(abhutondot.ts)

Gives this plot.

I then verify whether any structural breaks are present with respect to a constant level as regressor.

summary(lm(abhutondot.ts ~ 1)) Call: lm(formula = abhutondot.ts ~ 1) Residuals: Min 1Q Median 3Q Max -5829.7 -2243.0 371.3 3281.3 4703.3 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 11442 358 31.96 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3242 on 81 degrees of freedom

The regression is reported as significant. Going on with the search for structural breaks, if any:

(break_point <- breakpoints(abhutondot.ts ~ 1)) Optimal 4-segment partition: Call: breakpoints.formula(formula = abhutondot.ts ~ 1) Breakpoints at observation number: 15 33 52 Corresponding to breakdates: 1643 1661 1680 plot(break_point)

Gives this plot.

summary(break_point) Optimal (m+1)-segment partition: Call: breakpoints.formula(formula = abhutondot.ts ~ 1) Breakpoints at observation number: m = 1 39 m = 2 33 52 m = 3 15 33 52 m = 4 15 33 52 68 m = 5 15 32 44 56 68 Corresponding to breakdates: m = 1 1667 m = 2 1661 1680 m = 3 1643 1661 1680 m = 4 1643 1661 1680 1696 m = 5 1643 1660 1672 1684 1696 Fit: m 0 1 2 3 4 5 RSS 851165494 211346686 139782582 63564154 59593922 62188019 BIC 1566 1461 1436 1380 1383 1396

The BIC minimum value is reached with m = 3. Let us plot the original time series against its structural breaks and their confidence intervals.

plot(abhutondot.ts) fitted.ts <- fitted(break_point, breaks = 3) lines(fitted.ts, col = 4) lines(confint(break_point, breaks = 3))

Gives this plot.

The fitted levels and the breakpoint dates are as follows.

unique(as.integer(fitted.ts)) [1] 9843 6791 11645 14902 breakdates(break_point, breaks = 3) [1] 1648 1663 1676

So it is confirmed that the total number of births went through a sequence of level shifts due to exogenous shocks. The civil war very likely determined the first break, and its end the second one. The decrease in total births from the 9843 level down to 6791, occurring around year 1648, is remarkable. Equally remarkable, the upward level shift happened in year 1663, after the end of the civil war. The excess sex ratio structural break in year 1670 occurred approximately in between the second and third breaks of the total births series.

The fitted multiple level shifts (as determined by the structural breaks analysis) can be used as an intervention variable to fit an ARIMA model, as shown below.

fitted.ts <- fitted(break_point, breaks = 3) autoplot(fitted.ts)

Gives this plot.

abhutondot_xreg <- Arima(abhutondot.ts, order = c(0,1,1), xreg = fitted.ts, include.mean = TRUE) summary(abhutondot_xreg) Series: abhutondot.ts Regression with ARIMA(0,1,1) errors Coefficients: ma1 fitted.ts -0.5481 0.5580 s.e. 0.1507 0.1501 sigma^2 estimated as 743937: log likelihood=-661.65 AIC=1329.3 AICc=1329.61 BIC=1336.48 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 71.60873 846.5933 622.1021 0.0132448 5.92764 1.002253 0.08043183 coeftest(abhutondot_xreg) z test of coefficients: Estimate Std. Error z value Pr(>|z|) ma1 -0.54809 0.15071 -3.6368 0.0002761 *** fitted.ts 0.55799 0.15011 3.7173 0.0002013 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 checkresiduals(abhutondot_xreg)

Gives this plot.

LjungBoxTest(residuals(abhutondot_xreg), k=1, lag.max=20) m Qm pvalue 1 1.39 0.2387836 2 2.26 0.1325458 3 2.31 0.3156867 4 2.69 0.4416912 5 2.93 0.5701546 6 3.32 0.6503402 7 4.14 0.6580172 8 4.53 0.7170882 9 4.54 0.8054321 10 7.86 0.5479118 11 8.51 0.5791178 12 8.54 0.6644773 13 8.69 0.7292904 14 9.31 0.7491884 15 11.16 0.6734453 16 11.50 0.7163115 17 12.58 0.7035068 18 12.60 0.7627357 19 13.01 0.7906889 20 14.83 0.7334703 sarima(abhutondot.ts, p=0, d=1, q=1, xreg = fitted.ts)

Gives this plot.

The residuals are white noise.

Conclusions

Spotting seasonality and level shifts was an important aspect of our ARIMA model analysis. The seasonal component allowed us to define models with white noise residuals. The structural breaks analysis allowed us to define models with a better AIC metric by introducing the identified level shifts as a regressor. We also had the chance to go through some remarks on the history of England and to learn about studies of the sex ratio at birth.

If you have any questions, please feel free to comment below.

References

    Related Post

    1. A Gentle Introduction on Market Basket Analysis — Association Rules
    2. Building A Book Recommender System – The Basics, kNN and Matrix Factorization
    3. Explaining complex machine learning models with LIME
    4. Text Message Classification
    5. Analyzing Google Trends Data in R

    Governance, Engagement, and Resistance in the Open Science Movement: A Comparative Study

    Fri, 10/06/2017 - 09:00

    (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

    A growing community of scientists from a variety of disciplines is moving the norms of scientific research toward open practices. Supporters of open science hope to increase the quality and efficiency of research by enabling the widespread sharing of datasets, research software source code, publications, and other processes and products of research. The speed at which the open science community seems to be growing mirrors the rapid development of technological capabilities, including robust open source scientific software, new services for data sharing and publication, and novel data science techniques for working with massive datasets. Organizations like rOpenSci harness such capabilities and deploy various combinations of these research tools, or what I refer to here as open science infrastructures, to facilitate open science.

    As studies of other digital infrastructures have pointed out, developing and deploying the technological capabilities that support innovative work within a community of practitioners constitutes just one part of making innovation happen. As quickly as the technical solutions to improving scientific research may be developing, a host of organizational and social issues are lagging behind and hampering the open science community’s ability to inscribe open practices in the culture of scientific research. Remedying organizational and social issues requires paying attention to open science infrastructures’ human components, such as designers, administrators, and users, as well as the policies, practices, and organizational structures that contribute to the smooth functioning of these systems [1, 2]. These elements of infrastructure development have, in the past, proven to be equal to or more important than technical capabilities in determining the trajectory the infrastructure takes (e.g., whether it “succeeds” or “fails”) [3, 4].

    As a postdoc with rOpenSci, I have begun a qualitative, ethnographic project to explore the organizational and social processes involved in making open science the norm in two disciplines: astronomy and ecology. I focus on these two disciplines to narrow, isolate, and compare the set of contextual factors (e.g., disciplinary histories, research norms, and the like) that might influence perceptions of open science. Specifically, I aim to answer the following research questions (RQ):

    RQ1a: What are the primary motivations of scientists who actively engage with open science infrastructures?

    RQ1b: What are the factors that influence resistance to open science among some scientists?

    RQ2: What strategies do open science infrastructure leaders use to encourage participation, govern contributions, and overcome resistance to open science infrastructure use?
    a. To what extent do governance strategies balance standardization and flexibility, centralization and decentralization, and voluntary and mandatory contributions?
    b. By what mechanisms are open science policies and practices enforced?
    c. What are the commonalities and differences in the rationale behind choices of governance strategies?

    Below, I describe how I am systematically investigating these questions in two parts. In Part 1, I am identifying the issues raised by scientists who engage with or resist the open science movement. In Part 2, I am studying the governance strategies open science leaders and decision-makers use to elicit engagement with open science infrastructures in these disciplines.

    Part 1: Engagement with and Resistance to Open Science

    I am firmly rooted in a research tradition which emphasizes that studying the uptake of a new technology or technological approach, no matter the type of work or profession, begins with capturing how the people charged with changing their activities respond to the change “on the ground.” In this vein, Part 1 of the study aims to lend empirical support or opposition to arguments for and against open science that are commonly found in opinion pieces, on social media, and in organizational mission statements. A holistic reading of such documents reveals several commonalities in the positions for and against open science. Supporters of open science often cite increased transparency, reproducibility, and collaboration as the overwhelming benefits of making scientific research processes and products openly available. Detractors highlight concerns over “scooping,” ownership, and the time costs of curating and publishing code and data.

    I am seeking to verify and test these claims by interviewing and surveying astronomers and ecologists or, more broadly, earth and environmental scientists who fall on various parts of the open science engagement-to-resistance spectrum. I am conducting interviews using a semi-structured interview protocol [5] across all interviewees. I will then use a qualitative data analysis approach based on the grounded theory method [6] to extract themes from the responses, focusing on the factors that promote engagement (e.g., making data available, spending time developing research software, or making publications openly accessible) or resistance (e.g., unwillingness to share code used in a study or protecting access to research data). Similar questions will be asked at scale via a survey.

    Armed with themes from the responses, I will clarify and refine the claims often made in the public sphere about the benefits and drawbacks of open science. I hope to develop this part of the study into actionable recommendations for promoting open science, governing contributions to open science repositories, and addressing the concerns of scientists who are hesitant about engagement.

    Part 2: Focusing on Governance

    Even with interviews and surveys of scientists on the ground, it is difficult to systematically trace and analyze the totality of social and political processes that support open science infrastructure development because the processes occur across geographic, disciplinary, and other boundaries.

    However, as others have pointed out [7], the organizational and social elements of digital infrastructure development often become visible and amenable to study through infrastructure governance. Governance refers to the combination of “executive and management roles, program oversight functions organized into structures, and policies that define management principles and decision making” [8]. Effective governance provides projects with the direction and oversight necessary to achieve desired outcomes of infrastructure development while allowing room for creativity and innovation [2, 9, 10]. Studying a project’s governance surfaces the negotiation processes that occur among stakeholders—users, managers, organizations, policymakers, and the like—throughout the development process. Outcomes include agreements about the types of technologies used, standards defining the best practices for technology use, and other policies to ensure that a robust, sustainable infrastructure evolves [9, 11].

    Despite the scientific research community’s increasing reliance on open science infrastructures, few studies compare different infrastructure governance strategies [2] and even fewer develop new or revised strategies for governing infrastructure development and use [12]. The primary goal of this part of the project is to address this gap in our understanding of the governance strategies used to create, maintain, and grow open science infrastructures.

    I am administering this part of the study by conducting in-depth, semi-structured interviews with leaders of various open science infrastructure projects supporting work in astronomy and ecology. I define “leaders” in this context as individuals or small groups of individuals who make decisions about the management of open science infrastructures and their component parts. This set of leaders includes founders and administrators of widely-used scientific software packages and collections of packages, of open data repositories, of open access publication and preprint services, and various combinations of open science tools. Furthermore, I intend to interview the leaders of organizations with which the open science community interacts—top publication editors, for example—to gauge how open science practices and processes are being governed outside of active open science organizations.

    I will conduct qualitative coding as described above to develop themes from the responses of open science leaders. I will then ground these themes in the literature on digital infrastructure governance—which emphasizes gradual, decentralized, and voluntary development—and look for avenues to improve governance strategies.

    Alongside the interview and survey methods, I am actively observing and retaining primary documents from the ongoing discourse around open science in popular scientific communication publications (e.g., Nature and Science), conferences and meetings (e.g., OpenCon and discipline-specific hackweeks), and in the popular media/social media (e.g., The New York Times and Twitter threads).

    Preliminary Themes

    I entered this project with a very basic understanding of how open science “works”—the technical and social mechanisms by which scientists make processes and outputs publicly available. In learning about the open science movement, in general and in particular instantiations, I’ve begun to see the intricacies involved in efforts to change scientific research and its modes of communication: research data publication, citation, and access; journal publication availability; and research software development and software citation standards. Within the community trying to sustain these changes are participants and leaders who are facing and tackling several important issues head-on. I list some of the most common engagement, resistance, and governance challenges appearing in interview and observation transcripts below.

    Participation challenges
    • Overcoming the fear of sharing code and data, specifically the fear of sharing “messy” code and the fear of being shamed for research errors.
    • Defending the time and financial costs of participation in open science—particularly open source software development—to supervisors, collaborators, or tenure and promotion panels who are not engaged with open science.
    • Finding time to make code and data usable for others (e.g., through good documentation or complete metadata) and, subsequently, finding a home where code and data can easily be searched and found.
    Governance challenges
    • Navigating the issue of convincing researchers that software development and data publication/archiving “count” as research products, even though existing funding, publication, and tenure and promotion models may not yet value those contributions.
    • Developing guidelines and processes for conducting peer review on research publication, software, and data contributions, especially the tensions involved in “open review.”
    • Deciding whose responsibility it is to enforce code and data publication standards or policies, both within open science organizations and in traditional outlets like academic journals.

    The points raised in this post and the questions guiding my project might seem like discussions you’ve had too many times over coffee during a hackathon break or over beers after a conference session. If so, I’d love to hear from you, even if you are not an astronomer, an ecologist, or an active leader of an open science infrastructure. I am always looking for new ideas, both confirming and disconfirming, to refine my approach to this project.

    1. Braa, J., Hanseth, O., Heywood, A., Mohammed, W., Shaw, V. 2007. Developing health information systems in developing countries: The flexible standards strategy. MIS Quarterly, 31(2), 381-402. https://doi.org/10.2307/25148796 

    2. Tilson, D., Lyytinen, K., Sørensen, C. 2010 Digital infrastructures: The missing IS research agenda. Information Systems Research, 21(4), 748-759. https://doi.org/10.1287/isre.1100.0318 

    3. Borgman, C. L. 2010. Scholarship in the digital age: Information, infrastructure, and the Internet. MIT Press, Cambridge, MA. ISBN: 9780262250863 

    4. Vaast, E., Walsham, G. 2009. Trans-situated learning: Supporting a network of practice with an information infrastructure. Information Systems Research, 20(4), 547-564. https://doi.org/10.1287/isre.1080.0228 

    5. Spradley, J. P. (2016). The ethnographic interview. Longegrove, IL: Waveland Press. ISBN: 0030444969 

    6. Corbin, J., Strauss, A., 1990. Grounded theory research: Procedures, canons, and evaluative criteria. Qualitative Sociology, 13(1), 3-21. https://doi.org/10.1007/BF00988593 

    7. Barrett, M., Davidson, E., Prabhu, J., Vargo, S. L. 2015. Service innovation in the digital age: Key contributions and future directions. MIS Quarterly, 39(1) 135-154. DOI: 10.25300/MISQ/2015/39:1.03 

    8. Hanford, M. 2005. Defining program governance and structure. IBM developerWorks. Available at: https://www.ibm.com/developerworks/rational/library/apr05/hanford/

    9. Star, S. L., Ruhleder, K. 1996. Steps toward an ecology of infrastructure: Design and access for large information spaces. Information Systems Research, 7(1), 111-134. https://doi.org/10.1287/isre.7.1.111 

    10. Edwards, P. N., Jackson, S. J., Bowker, G. C., Knobel, C. P. 2007. Understanding infrastructure: Dynamics, tensions, and design. Final report for Workshop on History and Theory of Infrastructure: Lessons for New Scientific Cyberinfrastructures. NSF Report. Available at: https://deepblue.lib.umich.edu/handle/2027.42/49353

    11. Hanseth, O., Jacucci, E., Grisot, M., Aanestad, M. 2006. Reflexive standardization: Side effects and complexity in standard making. MIS Quarterly, 30(1), 563-581. https://doi.org/10.2307/25148773 

    12. Hanseth, O., & Lyytinen, K. (2010). Design theory for dynamic complexity in information infrastructures: the case of building internet. Journal of Information Technology, 25(1), 1-19. 


    To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    R-Ladies global tour

    Fri, 10/06/2017 - 02:00

    (This article was first published on Maëlle, and kindly contributed to R-bloggers)

It was recently brought to my attention by Hannah Frick that there are now sooo many R-Ladies chapters around the world! R-Ladies is a world-wide organization to promote gender diversity in the R community, and I’m very grateful to be part of this community through which I met so many awesome ladies! Since we’re all connected, it has now happened quite a few times that R-Ladies gave talks at chapters outside of their hometowns. An R-Lady from Taiwan giving a talk in Madrid while on a trip in Europe and another one doing the same in Lisbon, an R-Lady from San Francisco presenting at the London and Barcelona chapters thanks to a conference on the continent, an R-Lady from Uruguay sharing her experience for the New York City and San Francisco chapters… It’s like rock stars’ tours!

    Therefore we R-Ladies often joke about doing an exhaustive global tour. Hannah made me think about this tour again… If someone were to really visit all of the chapters, what would be the shortest itinerary? And could we do a cool gif with the results? These are the problems we solve here.

    Getting the chapters

    To find all chapters, I’ll use Meetup information about meetups whose topics include “r-ladies”, although it means forgetting a few chapters that maybe haven’t updated their topics yet. Thus, I’ll scrape this webpage because I’m too impatient to wait for the cool meetupr package to include the Meetup API topic endpoint and because I’m too lazy to include it myself. I did open an issue though. Besides, I was allowed to scrape the page:

    robotstxt::paths_allowed("https://www.meetup.com/topics/") ## [1] TRUE

    Yesss. So let’s scrape!

library("rvest")
link <- "https://www.meetup.com/topics/r-ladies/all/"
page_content <- read_html(link)
css <- 'span[class="text--secondary text--small chunk"]'
chapters <- html_nodes(page_content, css) %>% html_text(trim = TRUE)
chapters <- stringr::str_replace(chapters, ".*\\|", "")
chapters <- trimws(chapters)
head(chapters)
## [1] "London, United Kingdom" "San Francisco, CA"
## [3] "Istanbul, Turkey"       "Melbourne, Australia"
## [5] "New York, NY"           "Madrid, Spain"
# Montenegro
chapters[chapters == "HN\\, Montenegro"] <- "Herceg Novi, Montenegro"

Geolocating the chapters

Here I decided to use a nifty package for the awesome OpenCage API. Ok, this is my own package. But hey, it’s really a good geocoding API. And the package was reviewed for rOpenSci by Julia Silge! In the docs of the package you’ll see how to save your API key so that you don’t have to input it as a function parameter every time.
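For example, here is a minimal sketch (my own addition, not from the original post; the OPENCAGE_KEY variable name is what the opencage documentation suggests) of storing the key as an environment variable so it never needs to appear in your scripts:

# store the OpenCage key for the current session
Sys.setenv(OPENCAGE_KEY = "your-opencage-api-key")
# or add the line OPENCAGE_KEY=your-opencage-api-key to your ~/.Renviron
# so it is picked up automatically in every new session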

    Given that there are many chapters but not that many (41 to be exact), I could inspect the results and check them.

geolocate_chapter <- function(chapter){
  # query the API
  results <- opencage::opencage_forward(chapter)$results
  # deal with Strasbourg
  if(chapter == "Strasbourg, France"){
    results <- dplyr::filter(results, components.city == "Strasbourg")
  }
  # get a CITY
  results <- dplyr::filter(results, components._type == "city")
  # sort the results by confidence score
  results <- dplyr::arrange(results, desc(confidence))
  # choose the first line among those with highest confidence score
  results <- results[1,]
  # return only long and lat
  tibble::tibble(long = results$geometry.lng,
                 lat = results$geometry.lat,
                 chapter = chapter,
                 formatted = results$formatted)
}
chapters_df <- purrr::map_df(chapters, geolocate_chapter)
# add an index variable
chapters_df <- dplyr::mutate(chapters_df, id = 1:nrow(chapters_df))
knitr::kable(chapters_df[1:10,])

long | lat | chapter | formatted | id
-0.1276473 | 51.50732 | London, United Kingdom | London, United Kingdom | 1
-122.4192362 | 37.77928 | San Francisco, CA | San Francisco, San Francisco City and County, California, United States of America | 2
28.9651646 | 41.00963 | Istanbul, Turkey | Istanbul, Fatih, Turkey | 3
144.9631608 | -37.81422 | Melbourne, Australia | Melbourne VIC 3000, Australia | 4
-73.9865811 | 40.73060 | New York, NY | New York City, United States of America | 5
-3.7652699 | 40.32819 | Madrid, Spain | Leganés, Community of Madrid, Spain | 6
-77.0366455 | 38.89495 | Washington, DC | Washington, District of Columbia, United States of America | 7
-83.0007064 | 39.96226 | Columbus, OH | Columbus, Franklin County, Ohio, United States of America | 8
-71.0595677 | 42.36048 | Boston, MA | Boston, Suffolk, Massachusetts, United States of America | 9
-79.0232050 | 35.85030 | Durham, NC | Durham, NC, United States of America | 10

Planning the trip

I wanted to use the ompr package inspired by this fantastic use case, “Boris Johnson’s fully global itinerary of apology” – be careful, the code of this use case is slightly outdated but is up-to-date in the traveling salesperson vignette. The ompr package supports modeling and solving Mixed Integer Linear Programs. I got a decent notion of what this means by looking at this collection of use cases. Sadly, the traveling salesperson problem is a complicated problem whose solving time increases exponentially with the number of stops… in this case, it became really too long for plain mixed integer linear programming, as in “more than 24 hours later, still not done” too long.

Therefore, I decided to use a specific R package for traveling salesperson problems, TSP. Dirk, ompr’s maintainer, actually used it once, as seen in this gist and then in this newspaper piece about how to visit all 78 Berlin museums during the night of the museums. Quite cool!

We first need to compute the distances between chapters, in kilometers and rounded, since that is enough precision.

    convert_to_km <- function(x){ round(x/1000) } distance <- geosphere::distm(as.matrix(dplyr::select(chapters_df, long, lat)), fun = geosphere::distGeo) %>% convert_to_km()

I used methods that do not find the optimal tour. This means that my solution probably isn’t the best one, but let’s say it’s ok for this time. Otherwise, the best thing would be to ask Concorde’s maintainers whether one can use their algorithm, which is the best one out there; see its terms of use here.

    library("TSP") set.seed(42) result0 <- solve_TSP(TSP(distance), method = "nearest_insertion") result <- solve_TSP(TSP(distance), method = "two_opt", control = list(tour = result0))
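To get a feel for how much the two_opt refinement improves on the initial nearest-insertion tour, you can compare the tour lengths (a quick check of my own using the objects created above; tour_length() is the helper from the TSP package):

# length of the initial construction tour versus the refined tour, in km
tour_length(result0)
tour_length(result)
# the refined tour should be no longer than the initial one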

    And here is how to link the solution to our initial chapters data.frame.

paths <- tibble::tibble(from = chapters_df$chapter[as.integer(result)],
                        to = chapters_df$chapter[c(as.integer(result)[2:41], as.integer(result)[1])],
                        trip_id = 1:41)
paths <- tidyr::gather(paths, "property", "chapter", 1:2)
paths <- dplyr::left_join(paths, chapters_df, by = "chapter")
knitr::kable(paths[1:3,])

trip_id | property | chapter | long | lat | formatted | id
1 | from | Charlottesville, VA | -78.55676 | 38.08766 | Charlottesville, VA, United States of America | 38
2 | from | Washington, DC | -77.03665 | 38.89495 | Washington, District of Columbia, United States of America | 7
3 | from | New York, NY | -73.98658 | 40.73060 | New York City, United States of America | 5

Plotting the tour, boring version

    I’ll start by plotting the trips as it is done in the vignette, i.e. in a static way. Note: I used Dirk’s code in the Boris Johnson use case for the map, and had to use a particular branch of ggalt to get coord_proj working.

    library("ggplot2") library("ggalt") library("ggthemes") library("ggmap") world <- map_data("world") %>% dplyr::filter(region != "Antarctica") ggplot(data = paths, aes(long, lat)) + geom_map(data = world, map = world, aes(long, lat, map_id = region), fill = "white", color = "darkgrey", alpha = 0.8, size = 0.2) + geom_path(aes(group = trip_id), color = "#88398A") + geom_point(data = chapters_df, color = "#88398A", size = 0.8) + theme_map(base_size =20) + coord_proj("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") + ggtitle("R-Ladies global tour", subtitle = paste0(tour_length(result), " km"))

Dirk told me the map would look better with great circles instead of straight lines, so I googled around a bit and asked for help on Twitter before finding this post.

    library("geosphere") # find points on great circles between chapters gc_routes <- gcIntermediate(paths[1:length(chapters), c("long", "lat")], paths[(length(chapters)+1):(2*length(chapters)), c("long", "lat")], n = 360, addStartEnd = TRUE, sp = TRUE, breakAtDateLine = TRUE) gc_routes <- SpatialLinesDataFrame(gc_routes, data.frame(id = paths$id, stringsAsFactors = FALSE)) gc_routes_df <- fortify(gc_routes) p <- ggplot() + geom_map(data = world, map = world, aes(long, lat, map_id = region), fill = "white", color = "darkgrey", alpha = 0.8, size = 0.2) + geom_path(data = gc_routes_df, aes(long, lat, group = group), alpha = 0.5, color = "#88398A") + geom_point(data = chapters_df, color = "#88398A", size = 0.8, aes(long, lat)) + theme_map(base_size =20)+ coord_proj("+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") p + ggtitle("R-Ladies global tour", subtitle = paste0(tour_length(result), " km"))

    Ok this is nicer, it was worth the search.

    Plotting the tour, magical version

    And now I’ll use magick because I want to add a small star flying around the world. By the way if this global tour were to happen I reckon that one would need to donate a lot of money to rainforest charities or the like, because it’d have a huge carbon footprint! Too bad really, I don’t want my gif to promote planet destroying behaviours.

    To make the gif I used code similar to the one shared in this post but in a better version thanks to Jeroen who told me to read the vignette again. Not saving PNGs saves time!

    I first wanted to really show the emoji flying along the route and even created data for that, with a number of rows between chapters proportional to the distance between them. It’d have looked nice and smooth. But making a gif with hundreds of frames ended up being too long for me at the moment. So I came up with another idea, I’ll have to hope you like it!

    library("emojifont") load.emojifont('OpenSansEmoji.ttf') library("magick") plot_one_moment <- function(chapter, size, p, chapters_df){ print(p + ggtitle(paste0("R-Ladies global tour, ", chapters_df[chapters_df$chapter == chapter,]$chapter), subtitle = paste0(tour_length(result), " km"))+ geom_text(data = chapters_df[chapters_df$chapter == chapter,], aes(x = long, y = lat, label = emoji("star2")), family="OpenSansEmoji", size = size)) } img <- image_graph(1000, 800, res = 96) out <- purrr::walk2(rep(chapters[as.integer(result)], each = 2), rep(c(5, 10), length = length(chapters)*2), p = p, plot_one_moment, chapters_df = chapters_df) dev.off() ## png ## 2 image_animate(img, fps=1) %>% image_write("rladiesglobal.gif")

    At least I made a twinkling star. I hope Hannah will be happy with the gif, because now I’d like to just dream of potential future trips! Or learn a bit of geography by looking at the gif.


    Checking residual distributions for non-normal GLMs

    Fri, 10/06/2017 - 02:00

    (This article was first published on Bluecology blog, and kindly contributed to R-bloggers)

Quantile-quantile plots

    If you are fitting a linear regression with Gaussian (normally
    distributed) errors, then one of the standard checks is to make sure the
    residuals are approximately normally distributed.

It is a good idea to do these checks for non-normal GLMs too, to make
sure your residuals approximate the distribution the model assumes.

Here I explain how to create quantile-quantile plots for non-normal
data, using an example of fitting a GLM using Student-t distributed
errors. Such models can be appropriate when the residuals are
overdispersed.

    First let’s create some data. We will make a linear predictor (ie the
    true regression line) eta and then simulate some data by adding
    residuals. We will simulate two data-sets that have the same linear
    predictor, but the first will have normally distributed errors and the
    second will have t distributed errors:

    n <- 100 phi <- 0.85 mu <- 0.5 set.seed(23) x <- rnorm(n) eta <- mu + phi * x nu <- 2.5 tau <- 3 y_tdist <- eta + (rt(n, df=nu)/sqrt(tau)) y_normdist <- eta + rnorm(n, sd = 1/sqrt(tau)) plot(x, y_tdist) points(x, y_normdist, col = "red", pch = 16, cex = 0.8) legend("topleft", legend = c("t-distributed errors", "normally distributed errors"), pch = c(1,16), col = c("black", "red"))

Notice how the t-distributed data are more spread out. The df
parameter, here nu = 2.5, controls how dispersed the data are: lower
values give data that are more dispersed, while larger values approach
a normal distribution.
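As a quick illustration of that point (my own check, not part of the original post), you can compare the spread of t draws for a few values of the df parameter:

# empirical standard deviation of t draws for increasing degrees of freedom
set.seed(1)
sapply(c(2.5, 5, 30), function(df) sd(rt(1e4, df = df)))
# the theoretical sd is sqrt(df / (df - 2)), so roughly 2.24, 1.29 and 1.04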

    Now let’s fit a Gaussian glm (just a linear regression really) to both
    these data-sets

    m1_norm <- glm(y_normdist ~ x) m1_t <- glm(y_tdist ~ x)

    We should check whether the two models meet the normal assumption, using
    the standard ‘qq’ (quantile-quantile) plot:

    par(mfrow = c(1,2)) plot(m1_norm, 2) plot(m1_t, 2)

    These plots compare the theoretical quantiles to the actual quantiles of
    the residuals. If the points fall on the straight line then the
    theoretical and realised are very similar, and the assumption is met.
    Clearly this is not the case for the second model, which is
    overdispersed.

We know it is overdispersed because the theoretical quantiles are much
smaller than the actual quantiles at the tails (notice how the points
at the ends bend down and then up).

The p-values (or CIs if you use them) for m1_t are therefore likely
biased and too narrow, potentially leading to type I errors (us saying
that x affects y when in fact it does not). In this case we know we
haven’t made a type I error, because we made up the data. However, if
you were using real data you wouldn’t be so sure.
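If you want to see what the two Gaussian fits actually report, you can pull out the coefficient tables and compare the standard errors side by side (a quick inspection of my own, using the models fitted above):

# estimates, standard errors and p-values from both Gaussian fits
summary(m1_norm)$coefficients
summary(m1_t)$coefficients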

    Doing our own quantile-quantile plot

    To better understand the QQ plot it helps to generate it yourself,
    rather than using R’s automatic checks.

    First we calculate the model residuals (in plot(m1_t) R did this
    internally):

    m1_t_resid <- y_tdist - predict(m1_t)

    Then we can plot the quantiles for the residuals against theoretical
    quantiles generated using qnorm. Below we also plot the original QQ
    plot from above, so you can see that our version is the same as R’s
    automatic one:

    par(mfrow = c(1,2)) qqplot(qnorm(ppoints(n), sd = 1), m1_t_resid) qqline(m1_t_resid, lty = 2) plot(m1_t,2)

I added the qqline for comparative purposes. It just puts a line
through the first and third quartiles (the 25th and 75th percentiles).
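If you want to see exactly what qqline is doing, you can reproduce the line by hand (a small sketch of my own, using the residuals computed above):

# slope and intercept of the line through the first and third quartiles
probs <- c(0.25, 0.75)
y_q <- quantile(m1_t_resid, probs, names = FALSE)
x_q <- qnorm(probs)
slope <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]
# adding abline(intercept, slope, col = "blue", lty = 3) to the QQ plot
# should lie on top of the qqline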

    QQ plot for a non-normal GLM

Now that we have learned how to write our own custom QQ plot, we can
use it to check other types of non-normal data.

    Here we will fit a GLM to the y_tdist data using student-t distributed
    errors. I do this using the Bayesian package INLA.

    library(INLA) data <- list(y=y_tdist, x = x) mod_tdist <- inla(y ~ x, family="T", data=data, control.predictor = list(compute = TRUE), control.family = list( hyper = list(prec = list(prior="loggamma",param=c(1,0.5)), dof = list(prior="loggamma",param=c(1,0.5)) ) ) )

    The family ="T" command tells INLA to use the t-distribution, rather
    than the Normal distribution. Note also I have specified the priors
    using the control.family command. This is best practice. We need a
    prior for the precision (1/variance) and a prior for the dof (=
    degrees of freedom, which has to be >2 in INLA).

It is sometimes helpful to visualise the priors, so we can check
whether they look sensible. Here we visualise the prior for the dof
(which in INLA has a minimum value of 2):

    xgamma <- seq(0.01, 10, length.out = 100) plot(xgamma+2, dgamma(xgamma, 1, 0.5), type = 'l', xlab = "Quantile", ylab = "Density")

We don’t really expect values much greater than 10, so this prior makes
sense. If we used an old-school prior that was flat from 2 to 1000 we
might get issues with model fitting.

    Now enough about priors. Let’s look at the estimated coefficients:

mod_tdist$summary.fixed
##                  mean         sd 0.025quant  0.5quant 0.975quant      mode          kld
## (Intercept) 0.5324814 0.07927198  0.3773399 0.5321649  0.6891779 0.5315490 3.067485e-12
## x           0.7229362 0.08301006  0.5565746 0.7239544  0.8835630 0.7259817 6.557565e-12

Good: the CIs contain our true values, and the mean is close to our
true value too.
What about the hyper-parameters (the precision and DF)? We need to get
INLA to run some more calculations to get accurate estimates of these:

    h_tdist <- inla.hyperpar(mod_tdist) h_tdist$summary.hyperpar[,3:5] ## 0.025quant 0.5quant 0.975quant ## precision for the student-t observations 0.2663364 0.6293265 1.163440 ## degrees of freedom for student-t 2.2404966 2.7396391 4.459057

The estimate for the DF might be somewhat off the mark. That is OK; we
expect that, because you need lots of really good data to get accurate
estimates of hyper-parameters.

Now, let’s use our skills in creating QQ plots to make a QQ plot using
theoretical quantiles from the t distribution.

The first step is to extract INLA’s predictions of the data, so we can
calculate the residuals:

    preds <- mod_tdist$summary.fitted.values resids <- y_tdist - preds[,4]

The next step is to extract the marginal estimates of the DF and
precision to use when generating our QQ plot (the quantiles will change
with the DF):

    tau_est <- h_tdist$summary.hyperpar[1,4] nu_est <- h_tdist$summary.hyperpar[2,4]

    Now we can use qt() to generate theoretical quantiles and the
    residuals for our realised quantiles:

    qqplot(qt(ppoints(n), df = nu_est), resids * sqrt(tau_est), xlab = "Theoretical quantile", ylab = "residuals") qqline(resids * sqrt(tau_est), lty = 2)

Note that I multiply the residuals by the sqrt of the precision
estimate. This is how INLA fits a t-distributed GLM. I do the same for
the qqline.

    Our residuals are now falling much closer to the line. The model is
    doing a much better job of fitting the data. You could also calculate
    the WAIC for this model and a Gaussian one, to compare the fits. The
    t-distributed GLM should have a lower WAIC (better fit).
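As a sketch of that comparison (not part of the original post; the control.compute argument and the waic element are taken from the INLA documentation, so treat the details as an assumption to verify against your INLA version), you could refit both models asking INLA to compute the WAIC:

# refit both models requesting WAIC; a smaller WAIC indicates a better fit
mod_t_waic <- inla(y ~ x, family = "T", data = data,
                   control.compute = list(waic = TRUE),
                   control.family = list(hyper = list(
                     prec = list(prior = "loggamma", param = c(1, 0.5)),
                     dof  = list(prior = "loggamma", param = c(1, 0.5)))))
mod_gauss_waic <- inla(y ~ x, family = "gaussian", data = data,
                       control.compute = list(waic = TRUE))
c(t_model = mod_t_waic$waic$waic, gaussian_model = mod_gauss_waic$waic$waic)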

    We can now be confident that our CIs are accurate.


    The BayesianTools R package with general-purpose MCMC and SMC samplers for Bayesian statistics

    Thu, 10/05/2017 - 14:59

    (This article was first published on Submitted to R-bloggers – theoretical ecology, and kindly contributed to R-bloggers)

    This is a somewhat belated introduction of a package that we published on CRAN at the beginning of the year already, but I hadn’t found the time to blog about this earlier. In the R environment and beyond, a large number of packages exist that estimate posterior distributions via MCMC sampling, either for specific statistical models (e.g.…


    The [R] Kenntnis-Tage 2017: A special event

    Thu, 10/05/2017 - 11:52

    (This article was first published on eoda english R news, and kindly contributed to R-bloggers)

Days are getting shorter and the leaves begin to change: autumn is here, and this means only one more month until the [R] Kenntnis-Tage 2017 on November 8 and 9. A diverse agenda, cross-industry networking and a clear business context – many aspects make the [R] Kenntnis-Tage a special event and are reasons for you to participate:

Inspiring Insights

    From “googleVis” author and finance expert Markus Gesmann to “Algorithm Watch” co-founder Prof. Dr. Katharina Anna Zweig – the [R] Kenntnis-Tage 2017 offer a stage for people with a strong passion for data science and R. Whether in the sessions or the networking: participants will gain detailed insights into the various fields of application of the free programming language as well as inspiration for the daily business.

    Clear Business Context

    Unlike many other events related to R, the [R] Kenntnis-Tage have a distinct business focus. They serve as a platform for the German-speaking R community to exchange challenges and solutions in the daily work with R, making it an event which is unique in this form. In this context, eoda will present an innovation which takes the work with data in the daily business to a new level.

    Topical Diversity

    Forecasting models for the energy sector, R in the automotive industry or in data journalism – the topics of the [R] Kenntnis-Tage 2017 are as diverse as R itself. Compared to events with a focus on specific industries, eoda promotes the concept of interdisciplinarity at their event. Cross-industry transfer is one of the cornerstones of continuous development and enables participants to learn about new approaches and solutions. Very often data science presents itself as one thing in particular: a journey of discovery.

    Insightful Tutorials

    For the third time, the [R] Kenntnis-Tage combine data science success stories with practical R tutorials. Experienced eoda trainers will share their methodical know-how with the participants in intense tutorial sessions. Topics include the building of a Shiny app, the data mining workflow with “caret” and the data management with data.table among others. These tutorials offer an opportunity to learn new approaches and re-evaluate established ones.

    Enthusiastic Participants

    In the past two years participants from all industries and companies were more than satisfied with the event’s concept, the exchange with other participants and the insights they could gather at the event. The [R] Kenntnis-Tage have become a firmly established event for R users with a business background.

    Flair of the documenta City Kassel

The documenta, the most significant exhibition of contemporary art, has just come to an end, but its special flair is still perceptible in the city. The event location of the [R] Kenntnis-Tage 2017 in the heart of the city as well as the dinner location in the UNESCO World Heritage Site Bergpark Wilhelmshöhe provide an appropriate setting for a successful event.

Are you convinced? Then register for the [R] Kenntnis-Tage 2017 and make use of the price advantage – available only until October 6.

    Further information is available here.


    Working with R

    Thu, 10/05/2017 - 11:29

    (This article was first published on R on Locke Data Blog, and kindly contributed to R-bloggers)

I’ve been pretty quiet on the blog front recently. That’s because I overhauled my site, migrating it to Hugo (the foundation of blogdown). And as if doing one extra thing on top of my usual workload weren’t enough, I also did another thing: I wrote a book!
    I’m a big book fan, and I’m especially a Kindle Unlimited fan (all the books you can read for £8 per month, heck yeah!) so I wanted to make books that I could publish and see on Kindle Unlimited.


    Is udpipe your new NLP processor for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing

    Thu, 10/05/2017 - 10:45

    (This article was first published on bnosac :: open analytical helpers, and kindly contributed to R-bloggers)

If you work on natural language processing in a day-to-day setting that involves statistical engineering, at a certain point you need to process your text with a number of text mining procedures. The following steps are ones you must do before you can get useful information about your text:

    • Tokenisation (splitting your full text into words/terms)
    • Parts of Speech (POS) tagging (assigning each word a syntactic tag, such as whether the word is a verb/noun/adverb/number/…)
    • Lemmatisation (replacing each term by its lemma, so that for example "are" is mapped to the verb "be"; more information: https://en.wikipedia.org/wiki/Lemmatisation)
    • Dependency Parsing (finding relationships between words, namely between "head" words and the words which modify those heads, allowing you to look at words which may be far away from each other in the raw text but still influence each other)

If you do this in R, there aren’t many tools available. In fact there are none which

    1. do this for multiple languages,
    2. do not depend on external software dependencies (Java/Python), and
    3. also allow you to train your own parsing & tagging models.

Except the R package udpipe (https://github.com/bnosac/udpipe, https://CRAN.R-project.org/package=udpipe), which satisfies all 3 criteria.

If you are interested in doing the annotation, pre-trained models are available for 50 languages (see ?udpipe_download_model for details). Let’s show how this works on some Dutch text and what you get out of it.

    library(udpipe)
    dl <- udpipe_download_model(language = "dutch")
    dl

    language                                                                      file_model
       dutch C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dutch-ud-2.0-170801.udpipe

    udmodel_dutch <- udpipe_load_model(file = "dutch-ud-2.0-170801.udpipe")
    x <- udpipe_annotate(udmodel_dutch,
                         x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.")
    x <- as.data.frame(x)
    x

The result of this is a dataset where the text has been split into paragraphs, sentences and words; words are replaced by their lemma (ging > ga, nam > neem); you get the universal parts of speech tags and the detailed parts of speech tags; you get features of each word; and with the head_token_id we see which words are influencing other words in the text, as well as the dependency relationship between these words.
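For example (a small sketch of my own; the column names follow the CoNLL-U layout that udpipe returns, so check names(x) on your own output), you can keep just the columns needed to inspect lemmas and dependencies:

# look at the lemma, part of speech and dependency structure per token
x[, c("token_id", "token", "lemma", "upos", "head_token_id", "dep_rel")]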

To go from that dataset to meaningful visualisations like this one is then just a matter of a few lines of code. The following visualisation shows the co-occurrence of nouns in customer feedback on Airbnb apartment stays in Brussels (open data available at http://insideairbnb.com/get-the-data.html).
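One possible way to get such noun co-occurrences (a hedged sketch of the idea, not the author's exact code, and assuming the doc_id, sentence_id and upos columns mentioned above) is to pair up the nouns that appear within the same sentence and count the pairs:

library(dplyr)
# keep only nouns, then count how often two lemmas share a sentence
nouns <- x %>%
  filter(upos == "NOUN") %>%
  select(doc_id, sentence_id, lemma)
noun_pairs <- nouns %>%
  inner_join(nouns, by = c("doc_id", "sentence_id"), suffix = c("_1", "_2")) %>%
  filter(lemma_1 < lemma_2) %>%
  count(lemma_1, lemma_2, sort = TRUE)
head(noun_pairs)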

    In a next post, we’ll show how to train your own tagging models.

    If you like this type of analysis or if you are interested in text mining with R, we have 3 upcoming courses planned on text mining. Feel free to register at the following links.

      • 18-19/10/2017: Statistical machine learning with R. Leuven (Belgium). Subscribe here
      • 08+10/11/2017: Text mining with R. Leuven (Belgium). Subscribe here
      • 27-28/11/2017: Text mining with R. Brussels (Belgium). http://di-academy.com/bootcamp + send mail to training@di-academy.com
      • 19-20/12/2017: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
      • 20-21/02/2018: Advanced R programming. Leuven (Belgium). Subscribe here
      • 08-09/03/2018: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
      • 22-23/03/2018: Text Mining with R. Leuven (Belgium). Subscribe here

    For business questions on text mining, feel free to contact BNOSAC by sending us a mail here.


    Writing academic articles using R Markdown and LaTeX

    Thu, 10/05/2017 - 01:00

    (This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

    One of my favourite activities in R is using Markdown to create business reports. Most of my work I export to MS Word to communicate analytical results with my colleagues. For my academic work and eBooks, I prefer LaTeX to produce great typography. This article explains how to write academic articles and essays combining R Markdown and LaTeX. The article is formatted in accordance with the APA (American Psychological Association) requirements.

    To illustrate the principles of using R Markdown and LaTeX, I recycled an essay about problems with body image that I wrote for a psychology course many years ago. You can find the completed paper and all necessary files on my GitHub repository.

    Body Image

    Body image describes the way we feel about the shape of our body. The literature on this topic demonstrates that many people, especially young women, struggle with their body image. A negative body image has been strongly associated with eating disorders. Psychologists measure body image using a special scale, shown in the image below.

    My paper measures the current and ideal body shape of the subject and the body shape of the most attractive other sex. The results confirm previous research which found that body dissatisfaction for females is significantly higher than for men. The research also found a mild positive correlation between age and ideal body shape for women and between age and the female body shape found most attractive by men. You can read the full paper on my personal website.

    Body shape measurement scale.

    R Markdown and LaTeX

    The R Markdown file for this essay uses the Sweave package to integrate R code with LaTeX. The first two code chunks create a table to summarise the respondents using the xtable package. This package creates LaTeX or HTML tables from data generated by R code.

The first lines of the code read and prepare the data, while the second set of lines creates a table in LaTeX code. The code chunk uses results=tex to ensure the output is interpreted as LaTeX. This approach is used in most of the other chunks. The image is created within the document, saved as a pdf file and then integrated back into the document as an image with an appropriate label and caption.

<<>>=
body <- read.csv("body_image.csv")
# Respondent characteristics
body$Cohort <- cut(body$Age, c(0, 15, 30, 50, 99),
                   labels = c("<16", "16--30", "31--50", ">50"))
body$Date <- as.Date(body$Date)
body$Current_Ideal <- body$Current - body$Ideal
library(xtable)
respondents <- addmargins(table(body$Gender, body$Cohort))
xtable(respondents,
       caption = "Age profile of survey participants",
       label = "gender-age", digits = 0)
@

Configuration

I created this file in R Studio, using the Sweave and knitr functionality. To knit the R Markdown file for this paper you will need to install the apa6 and ccicons packages in your LaTeX distribution. The apa6 package provides macros to format papers in accordance with the requirements of the American Psychological Association.

    The post Writing academic articles using R Markdown and LaTeX appeared first on The Devil is in the Data.


    Introducing the Deep Learning Virtual Machine on Azure

    Wed, 10/04/2017 - 21:33

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

A new member has just joined the family of Data Science Virtual Machines on Azure: The Deep Learning Virtual Machine. Like other DSVMs in the family, the Deep Learning VM is a pre-configured environment with all the tools you need for data science and AI development pre-installed. The Deep Learning VM is designed specifically for GPU-enabled instances, and comes with a complete suite of deep learning frameworks including Tensorflow, PyTorch, MXNet, Caffe2 and CNTK. It also comes with example scripts and data sets to get you started on deep learning and AI problems, including:

    The DLVM along with all the DSVMs also provides a complete suite of data science tools including R, Python, Spark, and much more:

    There have also been some updates and additions to the tools provided in the entire DSVM family, including:

    All Data Science Virtual Machines, including the Deep Learning Virtual Machine, are available as Windows and Ubuntu Linux instances, and are free of any software charges: pay only for the infrastructure charge according to the power and size of the instance you choose. An Azure account is required, but you can get started with $200 in free Azure credits here.

    Microsoft Azure: Data Science Virtual Machines

     


    Time Series Analysis in R Part 3: Getting Data from Quandl

    Wed, 10/04/2017 - 15:00

    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    This is part 3 of a multi-part guide on working with time series data in R. You can find the previous parts here: Part 1, Part 2.

Generated data like that used in Parts 1 and 2 is great for the sake of example, but not very interesting to work with. So let’s get some real-world data that we can work with for the rest of this tutorial. There are countless sources of time series data that we can use, including some that are already included in R and some of its packages. We’ll use some of this data in examples. But I’d like to expand our horizons a bit.

    Quandl has a great warehouse of financial and economic data, some of which is free. We can use the Quandl R package to obtain data using the API. If you do not have the package installed in R, you can do so using:

    install.packages('Quandl')

    You can browse the site for a series of interest and get its API code. Below is an example of using the Quandl R package to get housing price index data. This data originally comes from the Yale Department of Economics and is featured in Robert Shiller’s book “Irrational Exuberance”. We use the Quandl function and pass it the code of the series we want. We also specify “ts” for the type argument so that the data is imported as an R ts object. We can also specify start and end dates for the series. This particular data series goes all the way back to 1890. That is far more than we need so I specify that I want data starting in January of 1990. I do not supply a value for the end_date argument because I want the most recent data available. You can find this data on the web here.

    library(Quandl) hpidata <- Quandl("YALE/NHPI", type="ts", start_date="1990-01-01") plot.ts(hpidata, main = "Robert Shiller's Nominal Home Price Index")

    Gives this plot:

    While we are here, let’s grab some additional data series for later use. Below, I get data on US GDP and US personal income, and the University of Michigan Consumer Survey on selling conditions for houses. Again I obtained the relevant codes by browsing the Quandl website. The data are located on the web here, here, and here.

    gdpdata <- Quandl("FRED/GDP", type="ts", start_date="1990-01-01") pidata <- Quandl("FRED/PINCOME", type="ts", start_date="1990-01-01") umdata <- Quandl("UMICH/SOC43", type="ts")[, 1] plot.ts(cbind(gdpdata, pidata), main="US GPD and Personal Income, billions $")

    Gives this plot:

    plot.ts(umdata, main = "University of Michigan Consumer Survey, Selling Conditions for Houses")

    Gives this plot:

The Quandl API also has some basic options for data preprocessing. The US GDP data is in quarterly frequency, but assume we want annual data. We can use the collapse argument to collapse the data to a lower frequency. Here we convert the data to annual as we import it.

    gdpdata_ann <- Quandl("FRED/GDP", type="ts", start_date="1990-01-01", collapse="annual") frequency(gdpdata_ann) [1] 1

We can also transform our data on the fly as it is imported. The Quandl function has an argument, transform, that allows us to specify the type of data transformation we want to perform. There are five options – "diff", "rdiff", "normalize", "cumul", "rdiff_from". Specifying the transform argument as "diff" returns the simple difference, "rdiff" yields the percentage change, "normalize" gives an index where each value is that value divided by the first period value and multiplied by 100, "cumul" gives the cumulative sum, and "rdiff_from" gives each value as the percent difference between itself and the last value in the series. For more details on these transformations, check the API documentation here.

    For example, here we get the data in percent change form:

    gdpdata_pc <- Quandl("FRED/GDP", type="ts", start_date="1990-01-01", transform="rdiff") plot.ts(gdpdata_pc * 100, ylab= "% change", main="US Gross Domestic Product, % change")

    Gives this plot:
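Similarly (a short extra example of my own, based on the transform options listed above), "normalize" turns the series into an index relative to the first observation:

gdpdata_idx <- Quandl("FRED/GDP", type = "ts", start_date = "1990-01-01", transform = "normalize")
plot.ts(gdpdata_idx, ylab = "Index (first period = 100)", main = "US Gross Domestic Product, indexed")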

    You can find additional documentation on using the Quandl R package here. I’d also encourage you to check out the vast amount of free data that is available on the site. The API allows a maximum of 50 calls per day from anonymous users. You can sign up for an account and get your own API key, which will allow you to make as many calls to the API as you like (within reason of course).
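If you do sign up, the key only needs to be registered once per session. The function name below is taken from the Quandl package documentation (older versions used Quandl.auth(), so check which applies to your installed version):

library(Quandl)
# register your personal key so subsequent Quandl() calls are authenticated
Quandl.api_key("your-api-key-here")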

    In Part 4, we will discuss visualization of time series data. We’ll go beyond the base R plotting functions we’ve used up until now and learn to create better-looking and more functional plots.

      Related Post

      1. Pulling Data Out of Census Spreadsheets Using R
      2. Extracting Tables from PDFs in R using the Tabulizer Package
      3. Extract Twitter Data Automatically using Scheduler R package
      4. An Introduction to Time Series with JSON Data
      5. Get Your Data into R: Import Data from SPSS, Stata, SAS, CSV or TXT

      nzelect 0.4.0 on CRAN with results from 2002 to 2014 and polls up to September 2017 by @ellis2013nz

      Wed, 10/04/2017 - 13:00

      (This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

      More nzelect New Zealand election data on CRAN

      Version 0.4.0 of my nzelect R package is now on CRAN. The key changes from version 0.3.0 are:

      • election results by voting place are now available back to 2002 (was just 2014)
      • polling place locations, presented as consistent values of latitude and longitude, now available back to 2008 (was just 2014)
      • voting intention polls are now complete up to the 2017 general election (previously stopped about six months ago on CRAN, although the GitHub version was always kept up to date)
      • a few minor bug fixes eg allocate_seats now takes integer arguments, not just numeric.

      The definitive source of New Zealand election statistics is the Electoral Commission. If there are any discrepancies between their results and those in nzelect, it’s a bug, and please file an issue on GitHub. The voting intention polls come from Wikipedia.
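      Before diving into the example below, here is a minimal look at the main results data frame (a quick sketch of my own, using only columns that also appear in the code later in this post):

      library(nzelect)
      library(dplyr)
      # total party votes recorded per election year, 2002 to 2014
      nzge %>%
        filter(voting_type == "Party") %>%
        group_by(election_year) %>%
        summarise(total_votes = sum(votes))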

      Example – special and early votes

      Currently, while we wait for the counting of the “special” votes from the 2017 election, there’s renewed interest in the differences between special votes and those counted on election night. Special votes are those made by anyone who votes outside their electorate, who enrolled in the month before the election, or who is in one of a few other categories. Most importantly, it’s people who are on the move or who are very recently enrolled, and obviously such people differ from the run of the mill voter.

      Here’s a graphic created with the nzelect R package that shows how the “special votes” in the past have been disproportionately important for Greens and Labour:

      Here’s the code to create that graphic:

      # get the latest version from CRAN: install.packages("nzelect") library(nzelect) library(tidyverse) library(ggplot2) library(scales) library(ggrepel) library(forcats) # palette of colours for the next couple of charts: palette <- c(parties_v, Other = "pink2", `Informal Votes` = "grey") # special votes: nzge %>% filter(voting_type == "Party") %>% mutate(party = fct_lump(party, 5)) %>% mutate(dummy = grepl("special", voting_place, ignore.case = TRUE)) %>% group_by(electorate, party, election_year) %>% summarise(prop_before = sum(votes[dummy]) / sum(votes), total_votes = sum(votes)) %>% ungroup() %>% mutate(party = gsub(" Party", "", party), party = gsub("ACT New Zealand", "ACT", party), party = gsub("New Zealand First", "NZ First", party)) %>% mutate(party = fct_reorder(party, prop_before)) %>% ggplot(aes(x = prop_before, y = party, size = total_votes, colour = party)) + facet_wrap(~election_year) + geom_point(alpha = 0.1) + ggtitle("'Special' votes proportion by party, electorate and year", "Each point represents the proportion of a party's vote in each electorate that came from special votes") + labs(caption = "Source: www.electionresults.govt.nz, collated in the nzelect R package", y = "") + scale_size_area("Total party votes", label = comma) + scale_x_continuous("\nPercentage of party's votes that were 'special'", label = percent) + scale_colour_manual(values = palette, guide = FALSE)

      Special votes are sometimes confused with advance voting in general. While many special votes are advance votes, the relationship is far from one to one. We see this particularly acutely by comparing the previous graphic to one that is identical except that it identifies all advance votes (those with the phrase “BEFORE” in the Electoral Commission’s description of polling place):

      While New Zealand First are the party that gains least proportionately from special votes, they gain the most from advance votes, although the difference between parties is fairly marginal. New Zealand First voters are noticeably more likely to be in an older age bracket than the voters for other parties. My speculation on their disproportionate share of advance voting is that it is related to that, although I’m not an expert in that area and am interested in alternative views.

      This second graphic also shows nicely just how much advance voting is becoming a feature of the electoral landscape. Unlike the proportion of votes that are “special”, which has been fairly stable, the proportion of votes that are cast in advance has increased very substantially over the past decade, and increased further in the 2017 election (for which final results come out on Saturday).

      Here’s the code for the second graphic; it’s basically the same as the previous chunk of code, except filtering on a different character string in the voting place name:

      nzge %>%
        filter(voting_type == "Party") %>%
        mutate(party = fct_lump(party, 5)) %>%
        mutate(dummy = grepl("before", voting_place, ignore.case = TRUE)) %>%
        group_by(electorate, party, election_year) %>%
        summarise(prop_before = sum(votes[dummy]) / sum(votes),
                  total_votes = sum(votes)) %>%
        ungroup() %>%
        mutate(party = gsub(" Party", "", party),
               party = gsub("ACT New Zealand", "ACT", party),
               party = gsub("New Zealand First", "NZ First", party)) %>%
        mutate(party = fct_reorder(party, prop_before)) %>%
        ggplot(aes(x = prop_before, y = party, size = total_votes, colour = party)) +
        facet_wrap(~election_year) +
        geom_point(alpha = 0.1) +
        ggtitle("'Before' votes proportion by party, electorate and year",
                "Each point represents the proportion of a party's vote in each electorate that were cast before election day") +
        labs(caption = "Source: www.electionresults.govt.nz, collated in the nzelect R package",
             y = "") +
        scale_size_area("Total party votes", label = comma) +
        scale_x_continuous("\nPercentage of party's votes that were before election day", label = percent) +
        scale_colour_manual(values = palette, guide = FALSE)

      ‘Party’ compared to ‘Candidate’ vote

      Looking for something else to showcase, I thought it might be interesting to pool all five elections for which I have the detailed results and compare the party vote (ie the proportional representation choice out of the two votes New Zealanders get) to the candidate vote (ie the representative member choice). Here’s a graphic that does just that:

      We see that New Zealand First and the Greens are the two parties that are most noticeably above the diagonal line indicating equality between party and candidate votes. This isn’t a surprise – these are minority parties that appeal to (different) demographic and issues-based communities that are dispersed across the country, and generally have little chance of winning individual electorates. Hence the practice of voters is often to split their votes. This is all perfectly fine and is exactly how mixed-member proportional voting systems are meant to work.

      Here’s the code that produced the scatter plot:

      nzge %>%
        group_by(voting_type, party) %>%
        summarise(votes = sum(votes)) %>%
        spread(voting_type, votes) %>%
        ggplot(aes(x = Candidate, y = Party, label = party)) +
        geom_abline(intercept = 0, slope = 1, colour = "grey50") +
        geom_point() +
        geom_text_repel(colour = "steelblue") +
        scale_x_log10("Total 'candidate' votes", label = comma, breaks = c(1, 10, 100, 1000) * 1000) +
        scale_y_log10("Total 'party' votes", label = comma, breaks = c(1, 10, 100, 1000) * 1000) +
        ggtitle("Lots of political parties: total votes from 2002 to 2014",
                "New Zealand general elections") +
        labs(caption = "Source: www.electionresults.govt.nz, collated in the nzelect R package") +
        coord_equal()

      What next?

      Obviously, the next big thing for nzelect is to get the 2017 results in once they are announced on Saturday. I should be able to do this for the GitHub version by early next week. I would have delayed the CRAN release until the 2017 results were available, but unfortunately I had a bug in some of the examples in my help files that stopped them working after 23 September 2017, so I had to rush a fix and the latest enhancements to CRAN to avoid problems with the CRAN maintainers (whom I fully endorse and thank, by the way).

      My other plans for nzelect over the next months to a year include:

      • reliable point locations for voting places back to 2002
      • identify consistent/duplicate voting places over time to make it easier to analyse comparative change by micro location
      • add detailed election results for the 1999 and 1996 elections (these are saved under a different naming convention to those from 2002 onwards, which is why they need a bit more work)
      • add high level election results for prior to 1994

      The source code for cleaning the election data and packaging it into nzelect is on GitHub. The package itself is on CRAN and installable in the usual way.

