R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Stepwise Bayesian Optimization with mlrMBO

Wed, 01/10/2018 - 01:00

(This article was first published on mlr-org, and kindly contributed to R-bloggers)

With the release of the new version of mlrMBO we made some minor fixes and added a practical feature called Human-in-the-loop MBO.
It enables you to sequentially

  • visualize the state of the surrogate model,
  • obtain the suggested parameter configuration for the next iteration and
  • update the surrogate model with arbitrary evaluations.

In the following we will demonstrate this feature on a simple example.

First we need an objective function we want to optimize.
For this post a simple function will suffice, but note that it could also be an external process: in this mode mlrMBO does not need to access the objective function itself, as you only have to pass its results to mlrMBO.

library(mlrMBO)
library(ggplot2)
set.seed(1)

fun = function(x) {
  x^2 + sin(2 * pi * x) * cos(0.3 * pi * x)
}

However, we still need to define our search space.
In this case we look for a single real-valued parameter between -3 and 3.
For more hints about how to define ParamSets you can look here or in the help of ParamHelpers.

ps = makeParamSet(
  makeNumericParam("x", lower = -3, upper = 3)
)
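
The search space can of course be richer than a single numeric parameter. As a rough sketch (the parameter names here are made up purely for illustration), a mixed ParamSet combining numeric, integer, and discrete parameters could look like this:

ps.mixed = makeParamSet(
  makeNumericParam("sigma", lower = -3, upper = 3),
  makeIntegerParam("k", lower = 1L, upper = 10L),
  makeDiscreteParam("kernel", values = c("rbfdot", "polydot"))
)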

We also need some initial evaluations to start the optimization.
The design has to be passed as a data.frame with one column for each dimension of the search space and one column y for the outcomes of the objective function.

des = generateDesign(n = 3, par.set = ps)
des$y = apply(des, 1, fun)
des
##            x         y
## 1 -1.1835844 0.9988801
## 2 -0.5966361 0.8386779
## 3  2.7967794 8.6592973

With these values we can initialize our sequential MBO object.

ctrl = makeMBOControl()
ctrl = setMBOControlInfill(ctrl, crit = crit.ei)

opt.state = initSMBO(
  par.set = ps,
  design = des,
  control = ctrl,
  minimize = TRUE,
  noisy = FALSE)

The opt.state now contains all necessary information for the optimization.
We can even plot it to see how the Gaussian process models the objective function.

plot(opt.state)

In the first panel the expected improvement ($EI = E(y_{min}-\hat{y})$) (see Jones et al.) is plotted over the search space.
The maximum of the EI indicates the point that we should evaluate next.
The second panel shows the mean prediction of the surrogate model, which in this example is a Gaussian process regression model, also known as Kriging.
The third panel shows the uncertainty prediction of the surrogate.
We can see that the EI is high at points where the mean prediction is low and/or the uncertainty is high.

To obtain the specific configuration suggested by mlrMBO for the next evaluation of the objective we can run:

prop = proposePoints(opt.state)
prop
## $prop.points
##             x
## 969 -2.999979
##
## $propose.time
## [1] 0.273
##
## $prop.type
## [1] "infill_ei"
##
## $crit.vals
##            [,1]
## [1,] -0.3733677
##
## $crit.components
##       se     mean
## 1 2.8899 3.031364
##
## $errors.model
## [1] NA
##
## attr(,"class")
## [1] "Proposal" "list"

We will execute our objective function with the suggested value for x and feed it back to mlrMBO:

y = fun(prop$prop.points$x)
y
## [1] 8.999752

updateSMBO(opt.state, x = prop$prop.points, y = y)

The nice thing about the human-in-the-loop mode is that you don’t have to stick to the suggestion.
In other words, we can feed the model with values without receiving a proposal.
Let’s assume we have an expert who tells us to evaluate the values $x=-1$ and $x=1$; we can easily do so:

custom.prop = data.frame(x = c(-1,1))
ys = apply(custom.prop, 1, fun)
updateSMBO(opt.state, x = custom.prop, y = as.list(ys))
plot(opt.state, scale.panels = TRUE)

We can also automate the process easily:

replicate(3, {
  prop = proposePoints(opt.state)
  y = fun(prop$prop.points$x)
  updateSMBO(opt.state, x = prop$prop.points, y = y)
})

Note: If you are only automating the loop like this, we suggest using the normal mlrMBO interface, which offers more advanced logging, termination criteria, error handling, etc.
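
For reference, a minimal sketch of that standard interface, reusing fun, ps, des, and the infill settings from above and assuming the smoof package for wrapping the objective, might look roughly like this:

library(smoof)

obj.fun = makeSingleObjectiveFunction(
  name = "example",  # arbitrary label
  fn = fun,
  par.set = ps
)

ctrl2 = makeMBOControl()
ctrl2 = setMBOControlInfill(ctrl2, crit = crit.ei)
ctrl2 = setMBOControlTermination(ctrl2, iters = 5)  # stop after 5 sequential evaluations

res = mbo(obj.fun, design = des, control = ctrl2, show.info = FALSE)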

Let’s see how the surrogate models the true objective function after having seen seven configurations:

plot(opt.state, scale.panels = TRUE)

You can convert the opt.state object from this run to a normal mlrMBO result object like this:

res = finalizeSMBO(opt.state)
res
## Recommended parameters:
## x=-0.22
## Objective: y = -0.913
##
## Optimization path
## 3 + 6 entries in total, displaying last 10 (or less):
## x y dob eol error.message exec.time ei
## 1 -1.1835844 0.9988801 0 NA NA NA
## 2 -0.5966361 0.8386779 0 NA NA NA
## 3 2.7967794 8.6592973 0 NA NA NA
## 4 -2.9999793 8.9997519 4 NA NA -0.3733677
## 5 -1.0000000 1.0000000 5 NA NA -0.3136111
## 6 1.0000000 1.0000000 6 NA NA -0.1366630
## 7 0.3010828 1.0016337 7 NA NA -0.7750916
## 8 -0.2197165 -0.9126980 8 NA NA -0.1569065
## 9 -0.1090728 -0.6176863 9 NA NA -0.1064289
## error.model train.time prop.type propose.time se mean
## 1 NA initdesign NA NA NA
## 2 NA initdesign NA NA NA
## 3 NA initdesign NA NA NA
## 4 0 manual NA 2.8899005 3.0313640
## 5 0 manual NA 0.5709559 0.6836938
## 6 NA NA 3.3577897 5.3791930
## 7 0 manual NA 1.2337881 0.3493416
## 8 0 manual NA 0.4513106 0.8870228
## 9 0 manual NA 0.3621550 -0.8288961

Note: You can always run the human-in-the-loop MBO on res$final.opt.state.

For the curious, let’s see what our original function actually looks like and which points we evaluated during our optimization:

plot(fun, -3, 3)
points(x = getOptPathX(res$opt.path)$x, y = getOptPathY(res$opt.path))

We can see that we got pretty close to the global optimum and that the surrogate in the previous plot models the objective quite accurately.

For more in-depth information look at the Vignette for Human-in-the-loop MBO and check out the other topics of our mlrMBO page.



Looking beyond accuracy to improve trust in machine learning

Wed, 01/10/2018 - 01:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

I have written another blogpost about Looking beyond accuracy to improve trust in machine learning at my company codecentric’s blog:

Traditional machine learning workflows focus heavily on model training and optimization; the best model is usually chosen via performance measures like accuracy or error and we tend to assume that a model is good enough for deployment if it passes certain thresholds of these performance criteria. Why a model makes the predictions it makes, however, is generally neglected. But being able to understand and interpret such models can be immensely important for improving model quality, increasing trust and transparency and for reducing bias. Because complex machine learning models are essentially black boxes and too complicated to understand, we need to use approximations to get a better sense of how they work. One such approach is LIME, which stands for Local Interpretable Model-agnostic Explanations and is a tool that helps understand and explain the decisions made by complex machine learning models.

Continue reading at https://blog.codecentric.de/en/2018/01/look-beyond-accuracy-improve-trust-machine-learning/
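
For R users who want to try the approach described above, the lime package implements it. Here is a minimal, hedged sketch; the objects train_x, test_x, and model are hypothetical placeholders (e.g. a classifier trained with caret):

library(lime)

# build an explainer from the training data and the fitted model
explainer <- lime(train_x, model)

# explain a handful of new observations
explanation <- explain(test_x, explainer,
                       n_labels = 1,    # explain the top predicted class
                       n_features = 4)  # show the 4 most influential features

plot_features(explanation)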

Links to the entire example code and more info are given at the end of the blog post.

The blog post is also available in German.



Three new domain-specific (embedded) languages with a Stan backend

Tue, 01/09/2018 - 21:00

(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

One is an accident. Two is a coincidence. Three is a pattern.

Perhaps it’s no coincidence that there are three new interfaces that use Stan’s C++ implementation of adaptive Hamiltonian Monte Carlo (currently an updated version of the no-U-turn sampler).

  • ScalaStan embeds a Stan-like language in Scala. It’s a Scala package largely (if not entirely) written by Joe Wingbermuehle.
    [GitHub link]

  • tmbstan lets you fit TMB models with Stan. It’s an R package listing Kasper Kristensen as author (a short usage sketch follows this list).
    [CRAN link]

  • SlicStan is a “blockless” and self-optimizing version of Stan. It’s a standalone language coded in F# written by Maria Gorinova.
    [pdf language spec]
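
Of the three, tmbstan is the one an R user can try most directly. A minimal, hedged sketch of the workflow; the model file mymodel.cpp and its data/parameter lists are hypothetical placeholders:

library(TMB)
library(tmbstan)

# compile and load a TMB model written in C++ (hypothetical file name)
compile("mymodel.cpp")
dyn.load(dynlib("mymodel"))

# build the TMB objective object from data and starting parameter values
obj <- MakeADFun(data = data_list, parameters = parameter_list, DLL = "mymodel")

# hand the TMB object to Stan's NUTS sampler
fit <- tmbstan(obj, chains = 4)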

These are in contrast with systems that entirely reimplement a version of the no-U-turn sampler, such as PyMC3, ADMB, and NONMEM.

The post Three new domain-specific (embedded) languages with a Stan backend appeared first on Statistical Modeling, Causal Inference, and Social Science.



rquery: Fast Data Manipulation in R

Tue, 01/09/2018 - 20:19

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Win-Vector LLC recently announced the rquery R package, an operator based query generator.

In this note I want to share some exciting and favorable initial rquery benchmark timings.

Let’s take a look at rquery’s new “ad hoc” mode (made convenient through wrapr‘s new “wrapr_applicable” feature). This is where rquery works on in-memory data.frame data by sending it to a database, processing on the database, and then pulling the data back. We concede this is a strange way to process data, and not rquery’s primary purpose (the primary purpose being generation of safe high performance SQL for big data engines such as Spark and PostgreSQL). However, our experiments show that it is in fact a competitive technique.
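
rquery automates that round trip for you; purely for intuition, the bare DBI version of pushing a data.frame to PostgreSQL, running a query there, and pulling the result back looks roughly like this (connection details and table/object names are hypothetical):

library(DBI)

con <- dbConnect(RPostgres::Postgres(), dbname = "mydb")     # hypothetical connection
dbWriteTable(con, "d_local", d_local, overwrite = TRUE)      # push the in-memory data.frame
res <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM d_local")  # the work happens in the database
dbDisconnect(con)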

We’ve summarized the results of several experiments (experiment details here) in the following graph (graphing code here). The benchmark task was hand implementing logistic regression scoring. This is an example query we have been using for some time.

The graph above shows the distribution of repeated run times for:

  • data.table in memory: How long it takes data.table to complete the task starting from an in-memory data.frame. We will take data.table’s 256 ms average runtime as the definition of “fast”, as data.table is commonly seen as the fastest system for in-memory data manipulation in R.
  • rquery in memory: How long it takes rquery to complete the task starting from an in-memory data.frame and returning to an in-memory data.frame. The actual implementation is performed by moving data to a PostgreSQL database for processing and then collecting the result back. The observed 352 ms average runtime is 38% slower than data.table. However, as we will see, that is actually a good result.
  • dplyr tbl in memory: A standard dplyr in-memory pipeline working on a data.frame that is first converted into a tbl structure (to see if this is a significant speedup).
  • dplyr in memory no grouped filter: A re-structured dplyr pipeline avoiding filtering inside a grouped context.
  • dplyr from memory to db and back: Move the data to a database, perform the dplyr steps there, and then copy the data back. The runtime average is 602 ms, making it much faster than in-memory dplyr, but still 135% slower than data.table and 71% slower than rquery.

As is detailed in the experiment backing materials, the task is processing 40,000 records through either of the following non-trivial pipelines (the first one for dplyr, and the second one for rquery):

dplyr_pipeline <- . %>%
  group_by(subjectID) %>%
  mutate(probability =
           exp(assessmentTotal * scale)/
           sum(exp(assessmentTotal * scale), na.rm = TRUE)) %>%
  arrange(probability, surveyCategory) %>%
  filter(row_number() == n()) %>%
  ungroup() %>%
  rename(diagnosis = surveyCategory) %>%
  select(subjectID, diagnosis, probability) %>%
  arrange(subjectID)

rquery_pipeline <- . := {
  extend_nse(.,
             probability :=
               exp(assessmentTotal * scale)/
               sum(exp(assessmentTotal * scale)),
             count := count(1),
             partitionby = 'subjectID') %.>%
  extend_nse(.,
             rank := rank(),
             partitionby = 'subjectID',
             orderby = c('probability', 'surveyCategory')) %.>%
  rename_columns(., 'diagnosis' := 'surveyCategory') %.>%
  select_rows_nse(., rank == count) %.>%
  select_columns(., c('subjectID', 'diagnosis', 'probability')) %.>%
  orderby(., 'subjectID')
}

The primary purpose of rquery is pure in-database processing; in-memory processing is only a convenience. So let’s look at database only timings. In these tasks the data starts in the database (as is typical for large data projects) and is either consumed (by a row count called “count”) or saved as a database table (called “land”). In neither case does data move in or out of the database (skipping those overheads).

And, as we demonstrated earlier, rquery’s query generator has additional features (not shown here) that can yield additional speedup in production environments (and also increase query safety and query power).

I think results this good this early in the project are very promising. They are the product of some design decisions that I think will continue to pay off and some intentional scope restrictions (including that rquery only targets SQL engines, and initially only a few including Spark, PostgreSQL, and with some caveats SQLite). More work needs to be done to add all the SQL translation features needed for general production use, but we think we have demonstrated a good foundation to build on.



In case you missed it: December 2017 roundup

Tue, 01/09/2018 - 18:55

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from December of particular interest to R users.

Hadley Wickham's Shiny app for making eggnog.

Using R to analyze the vocal range of pop singers.

A video tour of the data.table package from its creator, Matt Dowle.

The European R Users Meeting (eRum) will be held in Budapest, May 14-18.

Winners of the ASA Police Data Challenge student visualization contest.

An introduction to seplyr, a re-skinning of the dplyr package to a standard R evaluation interface.

How to run R in the Windows Subsystem for Linux, along with the rest of the Linux ecosystem.

A chart of Bechdel scores, showing representation of women in movies over time.

The British Ecological Society's Guide to Reproducible Science advocates the use of R and Rmarkdown.

Eight modules from the Microsoft AI School cover Microsoft R and SQL Server ML Services.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.



R/Finance 2018: Call for Papers

Tue, 01/09/2018 - 18:32

(This article was first published on FOSS Trading, and kindly contributed to R-bloggers)

R/Finance 2018: Applied Finance with R
June 1 and 2, 2018
University of Illinois at Chicago

Call For Papers

The tenth annual R/Finance conference for applied finance using R will be held June 1 and 2, 2018 in Chicago, IL, USA at the University of Illinois at Chicago. The conference will cover topics including portfolio management, time series analysis, advanced risk tools, high-performance computing, market microstructure, and econometrics. All will be discussed within the context of using R as a primary tool for financial risk management, portfolio construction, and trading.

Over the past nine years, R/Finance has included attendees from around the world. It has featured presentations from prominent academics and practitioners, and we anticipate another exciting line-up for 2018.

We invite you to submit complete papers in pdf format for consideration. We will also consider one-page abstracts (in txt or pdf format) although more complete papers are preferred. We welcome submissions for both full talks and abbreviated lightning talks. Both academic and practitioner proposals related to R are encouraged.

All slides will be made publicly available at conference time. Presenters are strongly encouraged to provide working R code to accompany the slides. Data sets should also be made public for the purposes of reproducibility (though we realize this may be limited due to contracts with data vendors). Preference may be given to presenters who have released R packages.

Please submit proposals online at http://go.uic.edu/rfinsubmit. Submissions will be reviewed and accepted on a rolling basis with a final submission deadline of February 2, 2018. Submitters will be notified via email by March 2, 2018 of acceptance, presentation length, and financial assistance (if requested).

Financial assistance for travel and accommodation may be available to presenters. Requests for financial assistance do not affect acceptance decisions. Requests should be made at the time of submission. Requests made after submission are much less likely to be fulfilled. Assistance will be granted at the discretion of the conference committee.

Additional details will be announced via the conference website as they become available. Information on previous years’ presenters and their presentations are also at the conference website. We will make a separate announcement when registration opens.

For the program committee:
Gib Bassett, Peter Carl, Dirk Eddelbuettel, Brian Peterson, Dale Rosenthal, Jeffrey Ryan, Joshua Ulrich


What’s the difference between data science, machine learning, and artificial intelligence?

Tue, 01/09/2018 - 18:00

(This article was first published on Variance Explained, and kindly contributed to R-bloggers)

When I introduce myself as a data scientist, I often get questions like “What’s the difference between that and machine learning?” or “Does that mean you work on artificial intelligence?” I’ve responded enough times that my answer easily qualifies for my “rule of three”:

When you’ve written the same code 3 times, write a function

When you’ve given the same in-person advice 3 times, write a blog post

— David Robinson (@drob) November 9, 2017

The fields do have a great deal of overlap, and there’s enough hype around each of them that the choice can feel like a matter of marketing. But they’re not interchangeable: most professionals in these fields have an intuitive understanding of how particular work could be classified as data science, machine learning, or artificial intelligence, even if it’s difficult to put into words.

So in this post, I’m proposing an oversimplified definition of the difference between the three fields:

  • Data science produces insights
  • Machine learning produces predictions
  • Artificial intelligence produces actions

To be clear, this isn’t a sufficient qualification: not everything that fits each definition is a part of that field. (A fortune teller makes predictions, but we’d never say that they’re doing machine learning!) These also aren’t a good way of determining someone’s role or job title (“Am I a data scientist?”), which is a matter of focus and experience. (This is true of any job description: I write as part of my job but I’m not a professional writer).

But I think this definition is a useful way to distinguish the three types of work, and to avoid sounding silly when you’re talking about it. It’s worth noting that I’m taking a descriptivist rather than a prescriptivist approach: I’m not interested in what these terms “should mean”, but rather how people in the field typically use them.

Data science produces insights

Data science is distinguished from the other two fields because its goal is an especially human one: to gain insight and understanding. Jeff Leek has an excellent definition of the types of insights that data science can achieve, including descriptive (“the average client has a 70% chance of renewing”), exploratory (“different salespeople have different rates of renewal”), and causal (“a randomized experiment shows that customers assigned to Alice are more likely to renew than those assigned to Bob”).

Again, not everything that produces insights qualifies as data science (the classic definition of data science is that it involves a combination of statistics, software engineering, and domain expertise). But we can use this definition to distinguish it from ML and AI. The main distinction is that in data science there’s always a human in the loop: someone is understanding the insight, seeing the figure, or benefitting from the conclusion. It would make no sense to say “Our chess-playing algorithm uses data science to choose its next move,” or “Google Maps uses data science to recommend driving directions”.

This definition of data science thus emphasizes:

  • Statistical inference
  • Data visualization
  • Experiment design
  • Domain knowledge
  • Communication

Data scientists might use simple tools: they could report percentages and make line graphs based on SQL queries. They could also use very complex methods: they might work with distributed data stores to analyze trillions of records, develop cutting-edge statistical techniques, and build interactive visualizations. Whatever they use, the goal is to gain a better understanding of their data.

Machine learning produces predictions

I think of machine learning as the field of prediction: of “Given instance X with particular features, predict Y about it”. These predictions could be about the future (“predict whether this patient will go into sepsis”), but they also could be about qualities that aren’t immediately obvious to a computer (“predict whether this image has a bird in it”). Almost all Kaggle competitions qualify as machine learning problems: they offer some training data, and then see if competitors can make accurate predictions about new examples.

There’s plenty of overlap between data science and machine learning. For example, logistic regression can be used to draw insights about relationships (“the richer a user is the more likely they’ll buy our product, so we should change our marketing strategy”) and to make predictions (“this user has a 53% chance of buying our product, so we should suggest it to them”).
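
As a toy illustration of that dual use (simulated data, hypothetical variable names):

set.seed(42)

# simulate 1,000 users whose purchase probability rises with income
users <- data.frame(income = rnorm(1000, mean = 50, sd = 15))
users$bought <- rbinom(1000, 1, plogis(-4 + 0.08 * users$income))

fit <- glm(bought ~ income, data = users, family = binomial())

# data science: inspect the fitted relationship (insight)
summary(fit)$coefficients

# machine learning: score a new user (prediction)
predict(fit, newdata = data.frame(income = 65), type = "response")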

Models like random forests have slightly less interpretability and are more likely to fit the “machine learning” description, and methods such as deep learning are notoriously challenging to explain. This could get in the way if your goal is to extract insights rather than make predictions. We could thus imagine a “spectrum” of data science and machine learning, with more interpretable models leaning towards the data science side and more “black box” models on the machine learning side.

Most practitioners will switch back and forth between the two tasks very comfortably. I use both machine learning and data science in my work: I might fit a model on Stack Overflow traffic data to determine which users are likely to be looking for a job (machine learning), but then construct summaries and visualizations that examine why the model works (data science). This is an important way to discover flaws in your model, and to combat algorithmic bias. This is one reason that data scientists are often responsible for developing machine learning components of a product.

Artificial intelligence produces actions

Artificial intelligence is by far the oldest and the most widely recognized of these three designations, and as a result it’s the most challenging to define. The term is surrounded by a great deal of hype, thanks to researchers, journalists, and startups who are looking for money or attention.

When you’re fundraising, it’s AI
When you’re hiring, it’s ML
When you’re implementing, it’s linear regression
When you’re debugging, it’s printf()

— Baron Schwartz (@xaprb) November 15, 2017

This has led to a backlash that strikes me as unfortunate, since it means some work that probably should be called AI isn’t described as such. Some researchers have even complained about the AI effect: “AI is whatever we can’t do yet”.1 So what work can we fairly describe as AI?

One common thread in definitions of “artificial intelligence” is that an autonomous agent executes or recommends actions (e.g. Poole, Mackworth and Goebel 1998; Russell and Norvig 2003). Some systems I think should be described as AI include:

  • Game-playing algorithms (Deep Blue, AlphaGo)
  • Robotics and control theory (motion planning, walking a bipedal robot)
  • Optimization (Google Maps choosing a route)
  • Natural language processing (bots2)
  • Reinforcement learning

Again, we can see a lot of overlap with the other fields. Deep learning is particularly interesting for straddling the fields of ML and AI. The typical use case is training on data and then producing predictions, but it has shown enormous success in game-playing algorithms like AlphaGo. (This is in contrast to earlier game-playing systems, like Deep Blue, which focused more on exploring and optimizing the future solution space.)

But there are also distinctions. If I analyze some sales data and discover that clients from particular industries renew more than others (extracting an insight), the output is some numbers and graphs, not a particular action. (Executives might use those conclusions to change our sales strategy, but that action isn’t autonomous.) This means I’d describe my work as data science: it would be cringeworthy to say that I’m “using AI to improve our sales.”

please

please

please do not write that someone who trained an algorithm has "harnessed the power of AI"

— Dave Gershgorn (@davegershgorn) September 18, 2017

The difference between artificial intelligence and machine learning is a bit more subtle, and historically ML has often been considered a subfield of AI (computer vision in particular was a classic AI problem). But I think the ML field has largely “broken off” from AI, partly because of the backlash described above: most people who work on problems of prediction don’t like to describe themselves as AI researchers. (It helped that many important ML breakthroughs came from statistics, which had less of a presence in the rest of the AI field.) This means that if you can describe a problem as “predict X from Y,” I’d recommend avoiding the term AI completely.

by today's definition, y=mx+b is an artificial intelligence bot that can tell you where a line is going

— Amy Hoy (@amyhoy) March 29, 2017

Case study: how would the three be used together?

Suppose we were building a self-driving car, and were working on the specific problem of stopping at stop signs. We would need skills drawn from all three of these fields.

  • Machine learning: The car has to recognize a stop sign using its cameras. We construct a dataset of millions of photos of streetside objects, and train an algorithm to predict which have stop signs in them.

  • Artificial intelligence: Once our car can recognize stop signs, it needs to decide when to take the action of applying the brakes. It’s dangerous to apply them too early or too late, and we need it to handle varying road conditions (for example, to recognize on a slippery road that it’s not slowing down quickly enough), which is a problem of control theory.

  • Data science: In street tests we find that the car’s performance isn’t good enough, with some false negatives in which it drives right by a stop sign. After analyzing the street test data, we gain the insight that the rate of false negatives depends on the time of day: it’s more likely to miss a stop sign before sunrise or after sunset. We realize that most of our training data included only objects in full daylight, so we construct a better dataset including nighttime images and go back to the machine learning step.

  1. It doesn’t help that AI is often conflated with general AI, capable of performing tasks across many different domains, or even superintelligent AI, which surpasses human intelligence. This sets unrealistic expectations for any system described as “AI”. 

  2. By “bots” here I’m referring to systems meant to interpret natural language and then respond in kind. This can be distinguished from text mining, where the goal is to extract insights (data science), or text classification, where the goal is to categorize documents (machine learning). 



OSM Nominatim with R: getting Location’s Geo-coordinates by its Address

Tue, 01/09/2018 - 15:00

When scraping data from the web you are quite likely to get address information but not geo-coordinates, which may be required for further analysis such as clustering. Geocoding is therefore often needed to obtain a location’s coordinates from its address.

There are several options, one of the most popular being the Google geocoding API. It can easily be used from R with the geocode function from the ggmap library. When used free of charge it is limited to 2,500 requests a day; see details here.

To increase the number of free geocoding requests, the OpenStreetMap (OSM) Nominatim API can be used. OSM allows up to 1 request per second (see the usage policy), which gives roughly 35 times more API calls per day than the Google geocoding API.

Here is one way to use the OSM Nominatim API from R:

## geocoding function using OSM Nominatim API
## details: http://wiki.openstreetmap.org/wiki/Nominatim
## made by: D.Kisler
nominatim_osm <- function(address = NULL) {
  if (suppressWarnings(is.null(address)))
    return(data.frame())
  tryCatch(
    d <- jsonlite::fromJSON(
      gsub('\\@addr\\@', gsub('\\s+', '\\%20', address),
           'http://nominatim.openstreetmap.org/search/@addr@?format=json&addressdetails=0&limit=1')
    ),
    error = function(c) return(data.frame())
  )
  if (length(d) == 0) return(data.frame())
  return(data.frame(lon = as.numeric(d$lon), lat = as.numeric(d$lat)))
}

The function requires the library jsonlite.

Function input: the location address as a string.
Function output: a data.frame with lon (longitude) and lat (latitude) of the input location, or an empty data.frame if no valid address was provided as input.

Let’s test the function.

# dplyr will be used to stack lists together into a data.frame
# and to get the pipe operator '%>%'
suppressPackageStartupMessages(library(dplyr))

# input addresses
addresses <- c("Baker Street 221b, London",
               "Brandenburger Tor, Berlin",
               "Platz der Deutschen Einheit 1, Hamburg",
               "Arc de Triomphe de l’Etoile, Paris",
               "Дворцовая пл., Санкт-Петербург, Россия")

d <- suppressWarnings(lapply(addresses, function(address) {
  # set the elapsed time counter to 0
  t <- Sys.time()
  # calling the Nominatim OSM API
  api_output <- nominatim_osm(address)
  # get the elapsed time
  t <- difftime(Sys.time(), t, 'secs')
  # return a data.frame with the input address, the API output and the elapsed time
  return(data.frame(address = address, api_output, elapsed_time = t))
}) %>%
  # stack the list output into a data.frame
  bind_rows() %>% data.frame())

# output the data.frame content into console
d
                                 address        lon      lat   elapsed_time
1              Baker Street 221b, London -0.1584945 51.52376 0.2216313 secs
2              Brandenburger Tor, Berlin 13.3777025 52.51628 0.1038268 secs
3 Platz der Deutschen Einheit 1, Hamburg  9.9842058 53.54129 0.1253307 secs
4     Arc de Triomphe de l’Etoile, Paris  2.2950372 48.87378 0.1097755 secs
5 Дворцовая пл., Санкт-Петербург, Россия 30.3151066 59.93952 0.1000750 secs
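
If you geocode more than a handful of addresses, remember the one-request-per-second limit from the usage policy. A minimal way to respect it is to pause between calls; an illustrative variation of the loop above:

d_throttled <- lapply(addresses, function(address) {
  res <- nominatim_osm(address)
  Sys.sleep(1)  # stay under the 1 request/second limit
  if (nrow(res) == 0) return(NULL)  # skip addresses Nominatim could not resolve
  data.frame(address = address, res)
}) %>%
  bind_rows()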

    Related Post

    1. Spark RDDs Vs DataFrames vs SparkSQL – Part 3 : Web Server Log Analysis
    2. Painlessly Merge Data into Actuarial Loss Development Triangles with R
    3. The Mcomp Package
    4. Building a Hacker News scraper with 8 lines of R code using rvest library
    5. Building a Telecom Dictionary scraping web using rvest in R

    Exploring Aviation Accidents from 1908 through the Present

    Tue, 01/09/2018 - 09:05

    (This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

    If you can walk away from a landing, it’s a good landing. If you use the airplane the next day, it’s an outstanding landing.
    -Chuck Yeager

Since the Wright brothers invented the first airplane in 1903, aviation accidents have been an inevitable tragedy as well as a compelling topic. In this project, I used data on the full history of airplane crashes throughout the world, from 1908 to the present, to analyze the time, airplane type, airlines, and accident summaries of aviation accidents.

     

    Data

The original data I used contain over 5,000 rows from Open Data by Socrata. Each row represents a single accident, and the data consist of the following features:
Date: date the accident happened
Time: time the accident happened
Location: where the accident happened
Operator: operator the crashed aircraft belonged to
Type: type of crashed aircraft
Aboard: # of people aboard the crashed aircraft
Fatalities: # of people who died (who were aboard the aircraft)
Ground: # of people who died (who were not aboard the aircraft)
Summary: brief accident description
Note that I removed all rows belonging to military operators in order to focus on commercial airplanes.

     

    Time


The first two plots show the air crash count in each year and the fatality count and death ratio in each ten-year period. We can see that after 1970, the total number of accidents and the number of people dying in accidents are both decreasing. Due to a lack of annual flight count data, we can’t simply say that the accident rate is decreasing. However, the death rate went down from over 90% to around 65% over this history, which means passengers are more likely to survive an air crash than before.
The third plot shows the air crash count and death ratio by time of day. The number of air crashes is split by day and night. Again, due to data limitations, we can’t conclude at what time the accident rate is higher, but the death rate during the night is higher than during the day (4 of the top 5 highest death rate time periods are within 12am to 5am).

     

    Aircraft Type

Then I analyzed the type of crashed aircraft. First, I wanted to see whether aircraft size affects the death rate. As we can see from the first plot, as aircraft size increases, the death rate generally decreases. We can conclude that passengers in larger aircraft have a higher survival rate than those in smaller aircraft.
The next two plots show the air crash count for different aircraft models and aircraft manufacturers. Selecting the whole time range (you can drag the bar in the Shiny app to change the year range), we find that the Douglas DC-3 has the most air crashes, far more than the second-placed DHC-6, but if we only consider accidents after 1964, the DHC-6 surpasses the DC-3 as the model with the most accidents. After 2000, the Cessna 208-B takes the lead.
In the second plot, it’s easy to compare the proportions of different manufacturers’ crashed aircraft in different eras, and we can select different manufacturers to compare in the Shiny app. For example, in the subplot we selected Boeing and Airbus. From the plot, it’s obvious that Boeing experienced more accidents than Airbus throughout history, but that is too rough a comparison to conclude which is safer. We would still need more data, such as the total number of aircraft in service per year, the total number of passengers delivered, etc.

     

    Airlines

     

The plots show the air crash count and death rate by airline. Throughout the whole history, Aeroflot has the most air crashes (179 accidents in total), far more than the second-placed Air France (67 accidents in total), but if we select only the years after 1991, private airplanes surpass Aeroflot as the category with the most accidents.

     

    Accident Summary

Lastly, I put all accident summaries together and generated a word cloud. Not surprisingly, Crashed is the most frequent word in the summaries. We can see that Landing is more frequent than Takeoff, which suggests that air crashes happened more often in the landing phase than during takeoff. Also, based on words such as Mountain, Ground, Runway, Engine, Fuel, Fire, etc., we can roughly estimate where an accident happened and what its cause was.

     

Thanks for reading! Please feel free to browse my Shiny app and GitHub via the following links:

    Link to Shiny App

Link to GitHub

    Email: tianyigu@nyu.edu



    Recode values with character subsetting

    Tue, 01/09/2018 - 07:00

    (This article was first published on Higher Order Functions, and kindly contributed to R-bloggers)

    Do you ever have to recode many values at once? It’s a frequent chore when
    preparing data. For example, suppose we had to replace state abbreviations
    with the full names:

    abbs <- c("AL", "AK", "AZ", "AZ", "WI", "WS")

    You could write several ifelse() statements.

    ifelse(abbs == "AL", "Alabama", ifelse(abbs == "AK", "Alaska", ifelse(abbs == "AZ", "Arizona",

    Actually, never mind! That gets out of hand very quickly.

    case_when() is nice, especially when the replacement rules are more complex
    than 1-to-1 matching.

dplyr::case_when(
  # Syntax: logical test ~ value to use when test is TRUE
  abbs == "AL" ~ "Alabama",
  abbs == "AK" ~ "Alaska",
  abbs == "AZ" ~ "Arizona",
  abbs == "WI" ~ "Wisconsin",
  # a fallback/default value
  TRUE ~ "No match"
)
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"

    We could also use one of my very favorite R tricks:
    Character subsetting.
    We create a named vector where the names are the data we have and the values are
    the data we want. I use the mnemonic old_value = new_value. In this case, we
    make a lookup table like so:

lookup <- c(
  # Syntax: name = value
  "AL" = "Alabama",
  "AK" = "Alaska",
  "AZ" = "Arizona",
  "WI" = "Wisconsin")

    For example, subsetting with the string "AL" will retrieve the value with the
    name "AL".

    lookup["AL"] #> AL #> "Alabama"

    With a vector of names, we can look up the values all at once.

lookup[abbs]
#>          AL          AK          AZ          AZ          WI        <NA>
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"          NA

    If the names and the replacement values are stored in vectors, we can construct
    the lookup table programmatically using setNames(). In our case, the datasets
    package provides vectors with state names and state abbreviations.

full_lookup <- setNames(datasets::state.name, datasets::state.abb)
head(full_lookup)
#>           AL           AK           AZ           AR           CA
#>    "Alabama"     "Alaska"    "Arizona"   "Arkansas" "California"
#>           CO
#>   "Colorado"

full_lookup[abbs]
#>          AL          AK          AZ          AZ          WI        <NA>
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"          NA

    One complication is that the character subsetting yields NA when the
    lookup table doesn’t have a matching name. That’s what’s happening above with
    the illegal abbreviation "WS". We can fix this by replacing the NA
    values with some default value.

matches <- full_lookup[abbs]
matches[is.na(matches)] <- "No match"

matches
#>          AL          AK          AZ          AZ          WI        <NA>
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"  "No match"

    Finally, to clean away any traces of the matching process, we can unname() the
    results.

unname(matches)
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"

Many-to-one lookup tables

    By the way, the lookup tables can be many-to-one. That is, different names can
    retrieve the same value. For example, we can handle this example that has
    synonymous names and differences in capitalization with many-to-one matching.

lookup <- c(
  "python" = "Python",
  "r" = "R",
  "node" = "Javascript",
  "js" = "Javascript",
  "javascript" = "Javascript")

languages <- c("JS", "js", "Node", "R", "Python", "r", "JAvascript")

# Use tolower() to normalize the language names so
# e.g., "R" and "r" can both match R
lookup[tolower(languages)]
#>           js           js         node            r       python
#> "Javascript" "Javascript" "Javascript"          "R"     "Python"
#>            r   javascript
#>          "R" "Javascript"

Character by character string replacement

    I’m motivated to write about character subsetting today because I used it in a
    Stack Overflow answer.
    Here is my paraphrasing of the problem.

    Let’s say I have a long character string, and I’d like to use
    stringr::str_replace_all to replace certain letters with others. According to
    the documentation, str_replace_all can take a named vector and replaces the
    name with the value. That works fine for 1 replacement, but for multiple, it
    seems to do the replacements iteratively, so that one replacement can replace
    another one.

library(tidyverse)
text_string = "developer"

# This works fine
text_string %>% str_replace_all(c(e = "X"))
#> [1] "dXvXlopXr"

# But this is not what I want
text_string %>% str_replace_all(c(e = "p", p = "e"))
#> [1] "develoeer"

# Desired result would be "dpvploepr"

    The iterative behavior here is that
    str_replace_all("developer", c(e ="p", p = "e")) first replaces e with p
    (yielding "dpvploppr") and then it applies the second rule on the output of
    the first rule, replacing p with e (yielding "develoeer").

    When I read this question, the replacement rules looked a lot like the lookup
    tables that I use in character subsetting so I presented a function that
    handles this problem by using character subsetting.

    Let’s work through the question’s example. First, let’s break the string into
    characters.

input <- "developer"
rules <- c(e = "p", p = "e")

chars <- unlist(strsplit(input, ""))
chars
#> [1] "d" "e" "v" "e" "l" "o" "p" "e" "r"

    To avoid the issue of NAs, we create default rules so that every character in
    the input is replaced by itself.

unique_chars <- unique(chars)
complete_rules <- setNames(unique_chars, unique_chars)
complete_rules
#>   d   e   v   l   o   p   r
#> "d" "e" "v" "l" "o" "p" "r"

    Now, we overwrite the default rules with the specific ones we are interested in.

# Find rules with the names as the real rules.
# Replace them with the real rules.
complete_rules[names(rules)] <- rules
complete_rules
#>   d   e   v   l   o   p   r
#> "d" "p" "v" "l" "o" "e" "r"

Then a lookup with character subsetting will effectively apply all the replacement
rules. We glue the characters back together again to finish the transformation.

replaced <- unname(complete_rules[chars])
paste0(replaced, collapse = "")
#> [1] "dpvploepr"

    Here is everything combined into a single function, with some additional steps
    needed to handle multiple strings at once.

str_replace_chars <- function(string, rules) {
  # Expand rules to replace characters with themselves
  # if those characters do not have a replacement rule
  chars <- unique(unlist(strsplit(string, "")))
  complete_rules <- setNames(chars, chars)
  complete_rules[names(rules)] <- rules

  # Split each string into characters, replace and unsplit
  for (string_i in seq_along(string)) {
    chars_i <- unlist(strsplit(string[string_i], ""))
    string[string_i] <- paste0(complete_rules[chars_i], collapse = "")
  }
  string
}

rules <- c(a = "X", p = "e", e = "p")
strings <- c("application", "developer")

str_replace_chars(strings, rules)
#> [1] "XeelicXtion" "dpvploepr"


    Classifying duplicate questions from Quora with Keras

    Tue, 01/09/2018 - 01:00

    (This article was first published on TensorFlow for R, and kindly contributed to R-bloggers)

    In this post we will use Keras to classify duplicated questions from Quora. The dataset first appeared in the Kaggle competition Quora Question Pairs and consists of approximately 400,000 pairs of questions along with a column indicating if the question pair is considered a duplicate.

Our implementation is inspired by the Siamese Recurrent Architecture, with modifications to the similarity measure and the embedding layers (the original paper uses pre-trained word vectors). Using this kind of architecture dates back to 2005 with Le Cun et al. and is useful for verification tasks. The idea is to learn a function that maps input patterns into a target space such that a similarity measure in the target space approximates the “semantic” distance in the input space.

    After the competition, Quora also described their approach to this problem in this blog post.

Downloading data

    Data can be downloaded from the Kaggle dataset webpage or from Quora’s release of the dataset:

library(keras)

quora_data <- get_file(
  "quora_duplicate_questions.tsv",
  "https://qim.ec.quoracdn.net/quora_duplicate_questions.tsv"
)

    We are using the Keras get_file() function so that the file download is cached.

    Reading and preprocessing

    We will first load data into R and do some preprocessing to make it easier to include in the model. After downloading the data, you can read it using the readr read_tsv() function.

library(readr)
df <- read_tsv(quora_data)

    We will create a Keras tokenizer to transform each word into an integer token. We will also specify a hyperparameter of our model: the vocabulary size. For now let’s use the 50,000 most common words (we’ll tune this parameter later). The tokenizer will be fit using all unique questions from the dataset.

tokenizer <- text_tokenizer(num_words = 50000)
tokenizer %>% fit_text_tokenizer(unique(c(df$question1, df$question2)))

    Let’s save the tokenizer to disk in order to use it for inference later.

    save_text_tokenizer(tokenizer, "tokenizer-question-pairs")

    We will now use the text tokenizer to transform each question into a list of integers.

question1 <- texts_to_sequences(tokenizer, df$question1)
question2 <- texts_to_sequences(tokenizer, df$question2)

Let’s take a look at the number of words in each question. This will help us decide the padding length, another hyperparameter of our model. Padding the sequences normalizes them to the same size so that we can feed them to the Keras model.

library(purrr)

questions_length <- c(
  map_int(question1, length),
  map_int(question2, length)
)

quantile(questions_length, c(0.8, 0.9, 0.95, 0.99))
80% 90% 95% 99%
 14  18  23  31

    We can see that 99% of questions have at most length 31 so we’ll choose a padding length between 15 and 30. Let’s start with 20 (we’ll also tune this parameter later). The default padding value is 0, but we are already using this value for words that don’t appear within the 50,000 most frequent, so we’ll use 50,001 instead.

question1_padded <- pad_sequences(question1, maxlen = 20, value = 50000 + 1)
question2_padded <- pad_sequences(question2, maxlen = 20, value = 50000 + 1)

    We have now finished the preprocessing steps. We will now run a simple benchmark model before moving on to the Keras model.

    Simple benchmark

Before creating a complicated model, let’s take a simple approach. Let’s create two predictors: the percentage of words from question1 that appear in question2 and vice versa. Then we will use a logistic regression to predict whether the questions are duplicates.

perc_words_question1 <- map2_dbl(question1, question2, ~mean(.x %in% .y))
perc_words_question2 <- map2_dbl(question2, question1, ~mean(.x %in% .y))

df_model <- data.frame(
  perc_words_question1 = perc_words_question1,
  perc_words_question2 = perc_words_question2,
  is_duplicate = df$is_duplicate
) %>%
  na.omit()

    Now that we have our predictors, let’s create the logistic model. We will take a small sample for validation.

val_sample <- sample.int(nrow(df_model), 0.1*nrow(df_model))

logistic_regression <- glm(
  is_duplicate ~ perc_words_question1 + perc_words_question2,
  family = "binomial",
  data = df_model[-val_sample,]
)

summary(logistic_regression)
Call:
glm(formula = is_duplicate ~ perc_words_question1 + perc_words_question2,
    family = "binomial", data = df_model[-val_sample, ])

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5938  -0.9097  -0.6106   1.1452   2.0292

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)          -2.259007   0.009668 -233.66   <2e-16 ***
perc_words_question1  1.517990   0.023038   65.89   <2e-16 ***
perc_words_question2  1.681410   0.022795   73.76   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 479158  on 363843  degrees of freedom
Residual deviance: 431627  on 363841  degrees of freedom
  (17 observations deleted due to missingness)
AIC: 431633

Number of Fisher Scoring iterations: 3

    Let’s calculate the accuracy on our validation set.

pred <- predict(logistic_regression, df_model[val_sample,], type = "response")
pred <- pred > mean(df_model$is_duplicate[-val_sample])

accuracy <- table(pred, df_model$is_duplicate[val_sample]) %>%
  prop.table() %>%
  diag() %>%
  sum()

accuracy
[1] 0.6573577

    We got an accuracy of 65.7%. Not all that much better than random guessing. Now let’s create our model in Keras.

    Model definition

    We will use a Siamese network to predict whether the pairs are duplicated or not. The idea is to create a model that can embed the questions (sequence of words) into a vector. Then we can compare the vectors for each question using a similarity measure and tell if the questions are duplicated or not.

    First let’s define the inputs for the model.

input1 <- layer_input(shape = c(20), name = "input_question1")
input2 <- layer_input(shape = c(20), name = "input_question2")

Then let’s define the part of the model that will embed the questions in a vector.

word_embedder <- layer_embedding(
  input_dim = 50000 + 2, # vocab size + UNK token + padding value
  output_dim = 128,      # hyperparameter - embedding size
  input_length = 20,     # padding size
  embeddings_regularizer = regularizer_l2(0.0001) # hyperparameter - regularization
)

seq_embedder <- layer_lstm(
  units = 128, # hyperparameter -- sequence embedding size
  kernel_regularizer = regularizer_l2(0.0001) # hyperparameter - regularization
)

Now we will define the relationship between the input vectors and the embedding layers. Note that we use the same layers and weights on both inputs. That’s why this is called a Siamese network. It makes sense, because we don’t want to have different outputs if question1 is switched with question2.

vector1 <- input1 %>%
  word_embedder() %>%
  seq_embedder()

vector2 <- input2 %>%
  word_embedder() %>%
  seq_embedder()

    We then define the similarity measure we want to optimize. We want duplicated questions to have higher values of similarity. In this example we’ll use the cosine similarity, but any similarity measure could be used. Remember that the cosine similarity is the normalized dot product of the vectors, but for training it’s not necessary to normalize the results.
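
For intuition only (plain R, separate from the Keras graph), the cosine similarity of two vectors can be written as:

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine_sim(c(1, 2, 3), c(2, 4, 6))  # vectors pointing in the same direction give 1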

    cosine_similarity <- layer_dot(list(vector1, vector2), axes = 1)
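Incidentally, the Dot layer can also L2-normalize the vectors for you, which gives the actual cosine similarity in [-1, 1]. A possible variation (not what this post trains):

# Variation on the layer above (uses vector1/vector2 defined earlier, keras already loaded):
# normalize = TRUE L2-normalizes each input before taking the dot product.
cosine_similarity <- layer_dot(list(vector1, vector2), axes = 1, normalize = TRUE)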

    Next, we define a final sigmoid layer to output the probability of both questions being duplicated.

    output <- cosine_similarity %>% layer_dense(units = 1, activation = "sigmoid")

Now let’s define the Keras model in terms of its inputs and outputs and compile it. In the compilation phase we define our loss function and optimizer. Like in the Kaggle challenge, we will minimize the logloss (equivalent to minimizing the binary crossentropy). We will use the Adam optimizer.

model <- keras_model(list(input1, input2), output)
model %>% compile(
  optimizer = "adam",
  metrics = list(acc = metric_binary_accuracy),
  loss = "binary_crossentropy"
)

We can then take a look at our model with the summary function.

summary(model)
_______________________________________________________________________________________
Layer (type)                  Output Shape       Param #    Connected to
=======================================================================================
input_question1 (InputLayer)  (None, 20)         0
_______________________________________________________________________________________
input_question2 (InputLayer)  (None, 20)         0
_______________________________________________________________________________________
embedding_1 (Embedding)       (None, 20, 128)    6400256    input_question1[0][0]
                                                            input_question2[0][0]
_______________________________________________________________________________________
lstm_1 (LSTM)                 (None, 128)        131584     embedding_1[0][0]
                                                            embedding_1[1][0]
_______________________________________________________________________________________
dot_1 (Dot)                   (None, 1)          0          lstm_1[0][0]
                                                            lstm_1[1][0]
_______________________________________________________________________________________
dense_1 (Dense)               (None, 1)          2          dot_1[0][0]
=======================================================================================
Total params: 6,531,842
Trainable params: 6,531,842
Non-trainable params: 0
_______________________________________________________________________________________

Model fitting

    Now we will fit and tune our model. However before proceeding let’s take a sample for validation.

set.seed(1817328)
val_sample <- sample.int(nrow(question1_padded), size = 0.1*nrow(question1_padded))

train_question1_padded <- question1_padded[-val_sample,]
train_question2_padded <- question2_padded[-val_sample,]
train_is_duplicate <- df$is_duplicate[-val_sample]

val_question1_padded <- question1_padded[val_sample,]
val_question2_padded <- question2_padded[val_sample,]
val_is_duplicate <- df$is_duplicate[val_sample]

    Now we use the fit() function to train the model:

model %>% fit(
  list(train_question1_padded, train_question2_padded),
  train_is_duplicate,
  batch_size = 64,
  epochs = 10,
  validation_data = list(
    list(val_question1_padded, val_question2_padded),
    val_is_duplicate
  )
)

Train on 363861 samples, validate on 40429 samples
Epoch 1/10
363861/363861 [==============================] - 89s 245us/step - loss: 0.5860 - acc: 0.7248 - val_loss: 0.5590 - val_acc: 0.7449
Epoch 2/10
363861/363861 [==============================] - 88s 243us/step - loss: 0.5528 - acc: 0.7461 - val_loss: 0.5472 - val_acc: 0.7510
Epoch 3/10
363861/363861 [==============================] - 88s 242us/step - loss: 0.5428 - acc: 0.7536 - val_loss: 0.5439 - val_acc: 0.7515
Epoch 4/10
363861/363861 [==============================] - 88s 242us/step - loss: 0.5353 - acc: 0.7595 - val_loss: 0.5358 - val_acc: 0.7590
Epoch 5/10
363861/363861 [==============================] - 88s 242us/step - loss: 0.5299 - acc: 0.7633 - val_loss: 0.5358 - val_acc: 0.7592
Epoch 6/10
363861/363861 [==============================] - 88s 242us/step - loss: 0.5256 - acc: 0.7662 - val_loss: 0.5309 - val_acc: 0.7631
Epoch 7/10
363861/363861 [==============================] - 88s 242us/step - loss: 0.5211 - acc: 0.7701 - val_loss: 0.5349 - val_acc: 0.7586
Epoch 8/10
363861/363861 [==============================] - 88s 242us/step - loss: 0.5173 - acc: 0.7733 - val_loss: 0.5278 - val_acc: 0.7667
Epoch 9/10
363861/363861 [==============================] - 88s 242us/step - loss: 0.5138 - acc: 0.7762 - val_loss: 0.5292 - val_acc: 0.7667
Epoch 10/10
363861/363861 [==============================] - 88s 242us/step - loss: 0.5092 - acc: 0.7794 - val_loss: 0.5313 - val_acc: 0.7654

    After training completes, we can save our model for inference with the save_model_hdf5() function.

save_model_hdf5(model, "model-question-pairs.hdf5")

Model tuning

Now that we have a reasonable model, let’s tune the hyperparameters using the tfruns package. We’ll begin by adding FLAGS declarations to our script for all hyperparameters we want to tune (FLAGS allow us to vary hyperparameters without changing our source code):

FLAGS <- flags(
  flag_integer("vocab_size", 50000),
  flag_integer("max_len_padding", 20),
  flag_integer("embedding_size", 256),
  flag_numeric("regularization", 0.0001),
  flag_integer("seq_embedding_size", 512)
)

    With this FLAGS definition we can now write our code in terms of the flags. For example:

input1 <- layer_input(shape = c(FLAGS$max_len_padding))
input2 <- layer_input(shape = c(FLAGS$max_len_padding))

embedding <- layer_embedding(
  input_dim = FLAGS$vocab_size + 2,
  output_dim = FLAGS$embedding_size,
  input_length = FLAGS$max_len_padding,
  embeddings_regularizer = regularizer_l2(l = FLAGS$regularization)
)

    The full source code of the script with FLAGS can be found here.

    We additionally added an early stopping callback in the training step in order to stop training if validation loss doesn’t decrease for 5 epochs in a row. This will hopefully reduce training time for bad models. We also added a learning rate reducer to reduce the learning rate by a factor of 10 when the loss doesn’t decrease for 3 epochs (this technique typically increases model accuracy).

model %>% fit(
  ...,
  callbacks = list(
    callback_early_stopping(patience = 5),
    callback_reduce_lr_on_plateau(patience = 3)
  )
)

    We can now execute a tuning run to probe for the optimal combination of hyperparameters. We call the tuning_run() function, passing a list with the possible values for each flag. The tuning_run() function will be responsible for executing the script for all combinations of hyperparameters. We also specify the sample parameter to train the model for only a random sample from all combinations (reducing training time significantly).

library(tfruns)

runs <- tuning_run(
  "question-pairs.R",
  flags = list(
    vocab_size = c(30000, 40000, 50000, 60000),
    max_len_padding = c(15, 20, 25),
    embedding_size = c(64, 128, 256),
    regularization = c(0.00001, 0.0001, 0.001),
    seq_embedding_size = c(128, 256, 512)
  ),
  runs_dir = "tuning",
  sample = 0.2
)

    The tuning run will return a data.frame with results for all runs. We found that the best run attained 84.9% accuracy using the combination of hyperparameters shown below, so we modify our training script to use these values as the defaults:

FLAGS <- flags(
  flag_integer("vocab_size", 50000),
  flag_integer("max_len_padding", 20),
  flag_integer("embedding_size", 256),
  flag_numeric("regularization", 1e-4),
  flag_integer("seq_embedding_size", 512)
)

Making predictions

    Now that we have trained and tuned our model we can start making predictions. At prediction time we will load both the text tokenizer and the model we saved to disk earlier.

library(keras)
model <- load_model_hdf5("model-question-pairs.hdf5", compile = FALSE)
tokenizer <- load_text_tokenizer("tokenizer-question-pairs")

    Since we won’t continue training the model, we specified the compile = FALSE argument.

Now let’s define a function to create predictions. In this function we preprocess the input data in the same way we preprocessed the training data:

predict_question_pairs <- function(model, tokenizer, q1, q2) {
  q1 <- texts_to_sequences(tokenizer, list(q1))
  q2 <- texts_to_sequences(tokenizer, list(q2))

  q1 <- pad_sequences(q1, 20)
  q2 <- pad_sequences(q2, 20)

  as.numeric(predict(model, list(q1, q2)))
}

    We can now call it with new pairs of questions, for example:

predict_question_pairs(
  model, tokenizer,
  "What's R programming?",
  "What's R in programming?"
)
[1] 0.9784008

    Prediction is quite fast (~40 milliseconds).
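If you want to check the timing on your own machine, something like this works (the numbers will of course vary with hardware):

system.time(
  predict_question_pairs(
    model, tokenizer,
    "What's R programming?",
    "What's R in programming?"
  )
)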

    Deploying the model

To demonstrate deployment of the trained model, we created a simple Shiny application, where you can paste 2 questions from Quora and find the probability of them being duplicated. Try changing the questions in the app, or entering two entirely different questions.


The Shiny application can be found at https://jjallaire.shinyapps.io/shiny-quora/ and its source code at https://github.com/dfalbel/shiny-quora-question-pairs.

    Note that when deploying a Keras model you only need to load the previously saved model file and tokenizer (no training data or model training steps are required).

    Wrapping up
    • We trained a Siamese LSTM that gives us reasonable accuracy (84%). Quora’s state of the art is 87%.
• We can improve our model by using pre-trained word embeddings on larger datasets (see the sketch just after this list). For example, try using what’s described in this example. Quora uses their own complete corpus to train the word embeddings.
    • After training we deployed our model as a Shiny application which given two Quora questions calculates the probability of their being duplicates.
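On the pre-trained embeddings point above, here is a minimal sketch of how they could be plugged in. It assumes you have already built an embedding_matrix (a hypothetical matrix of, say, GloVe vectors; building it is omitted here), so treat it as an illustration rather than the post's code:

# Sketch only: `embedding_matrix` is a hypothetical matrix of pre-trained vectors
# with (FLAGS$vocab_size + 2) rows and FLAGS$embedding_size columns.
word_embedder <- layer_embedding(
  input_dim = FLAGS$vocab_size + 2,
  output_dim = FLAGS$embedding_size,
  input_length = FLAGS$max_len_padding,
  weights = list(embedding_matrix),  # initialise with the pre-trained vectors
  trainable = FALSE                  # optionally freeze them during training
)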


    Hierarchical compartmental reserving models

    Tue, 01/09/2018 - 01:00

    (This article was first published on R on mages' blog, and kindly contributed to R-bloggers)

    Today, I will sketch out ideas from the Hierarchical Compartmental Models for Loss Reserving paper by Jake Morris, which was published in the summer of 2016 (Morris (2016)). Jake’s model is inspired by PK/PD models (pharmacokinetic/pharmacodynamic models) used in the pharmaceutical industry to describe the time course of effect intensity in response to administration of a drug dose.

    The hierarchical compartmental model fits outstanding and paid claims simultaneously, combining ideas of Clark (2003), Quarg and Mack (2004), Miranda, Nielsen, and Verrall (2012), Guszcza (2008) and Zhang, Dukic, and Guszcza (2012). You find the first two models implemented in the ChainLadder package, the third in the DCL package and the fourth one in an earlier blog post of mine and as part of the brms package vignette.

Jake uses OpenBUGS (Lunn et al. (2000)) for the Bayesian examples in his paper. Thanks to Paul-Christian Bürkner’s brms package (Bürkner (2017)) I can run a version of Jake’s model in Stan (Stan Development Team (2017)) easily, as I will show in this post.

    Note: Paul and Jake will be at the Stan workshop in London, following the Insurance Data Science conference, on 17 July 2018.

    Hierarchical Compartmental Models

    Similar to a PK/PD model, Jake describes the state-space process of claims developments with a set of ordinary differential equations (ODE), using \(EX=\) Exposure, \(OS=\) Outstanding claims, \(PD=\) Paid claims (these are the compartments):

\[
\begin{aligned}
dEX/dt & = -k_{er} \cdot EX \\
dOS/dt & = k_{er} \cdot RLR \cdot EX - k_p \cdot OS \\
dPD/dt & = k_{p} \cdot RRF \cdot OS
\end{aligned}
\]

    with initial conditions \(EX(0) = \Pi\) (ultimate earned premiums), \(OS(0)=0, PD(0)=0\).

    The parameters describe:

    • \(k_{er}\) the rate at which claim events occur and are subsequently reported to the insurer
    • \(RLR\) the reported loss ratio
    • \(RRF\) the reserve robustness factor, the proportion of outstanding claims that eventually are paid
    • \(k_p\) the rate of payment, i.e. the rate at which outstanding claims are paid

    Note, the product of \(RLR\) and \(RRF\) describes the ultimate loss ratio \(ULR =\) ultimate loss / premium, which is a key metric in insurance to assess performance and profitability.

    Setting the parameters \(k_{er}=1.7\), \(RLR=0.8\), \(k_p=0.5\), \(RRF=0.95\) produces the following output.

    The chart illustrates nicely the different compartments (processes) for a policy over time.
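The chart itself isn't reproduced here. A rough sketch of how it could be generated (deSolve appears in the session info at the end of the post, so an ODE solver seems a reasonable assumption; this is an illustration rather than the author's own code):

# Solve the compartmental ODEs numerically and plot EX, OS, PD over time.
# Premium Pi is normalised to 1 here.
library(deSolve)

compartments <- function(t, state, parms) {
  with(as.list(c(state, parms)), {
    dEX <- -ker * EX
    dOS <-  ker * RLR * EX - kp * OS
    dPD <-  kp * RRF * OS
    list(c(dEX, dOS, dPD))
  })
}

out <- ode(y = c(EX = 1, OS = 0, PD = 0),
           times = seq(0, 10, by = 0.1),
           func = compartments,
           parms = c(ker = 1.7, RLR = 0.8, kp = 0.5, RRF = 0.95))

matplot(out[, "time"], out[, c("EX", "OS", "PD")], type = "l", lty = 1,
        xlab = "Development time", ylab = "Share of premium", bty = "n")
legend("topright", legend = c("Exposure", "Outstanding", "Paid"),
       lty = 1, col = 1:3, bty = "n")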

    The autonomous system of ODEs above can be solved analytically:

\[
\begin{aligned}
EX(t) & = \Pi \cdot \exp(- k_{er} t) \\
OS(t) & = \frac{\Pi \cdot RLR \cdot k_{er}}{k_{er} - k_p} \cdot
\left(\exp(-k_p t) - \exp(-k_{er} t)\right)\\
PD(t) & = \frac{\Pi \cdot RLR \cdot RRF}{k_{er} - k_p}
\left( k_{er} \cdot (1 - \exp(-k_p t)) - k_p \cdot (1 - \exp(-k_{er} t)) \right)
\end{aligned}
\]

    In this post I will focus on the last two equations, as they describe the observable data of outstanding claims and cumulative paid claims over time. However, one could expand the model, e.g. to allow for different premium earning patterns.

    Get example data

    In his paper Jake uses data from the CAS Reserving Database, namely company 337 from the worker’s compensation file.

    The following function downloads the data and reshapes it into a more useful format, very similar to the function used in an earlier post for Commercial Auto data.

library(data.table)
CASdata <- fread("http://www.casact.org/research/reserve_data/wkcomp_pos.csv")

createLossData2 <- function(CASdata, company_code){
  compData <- CASdata[GRCODE==company_code,
                      c("EarnedPremDIR_D", "AccidentYear", "DevelopmentLag",
                        "IncurLoss_D", "CumPaidLoss_D", "BulkLoss_D")]
  setnames(compData, names(compData),
           c("premium", "accident_year", "dev",
             "incurred_loss", "paid_loss", "bulk_loss"))
  compData <- compData[, `:=`(
    origin = accident_year - min(accident_year) + 1)]
  compData0 <- compData[dev==1]
  compData0 <- compData0[, `:=`(dev = 0, incurred_loss = 0,
                                paid_loss = 0, bulk_loss = 0)]
  compData <- rbindlist(list(compData0, compData))
  compData <- compData[, cal := origin + dev - 1][order(origin, dev)]
  compData <- compData[, `:=`(
    paid_train = ifelse(cal <= max(origin), paid_loss, NA),
    paid_test = ifelse(cal > max(origin), paid_loss, NA),
    os_train = ifelse(cal <= max(origin), incurred_loss - paid_loss, NA),
    os_test = ifelse(cal > max(origin), incurred_loss - paid_loss, NA))]
  traintest <- rbindlist(list(compData[cal <= max(origin)],
                              compData[cal > max(origin)]))
  return(traintest)
}

lossData <- createLossData2(CASdata, company_code = 337)

    Let’s plot the data of outstanding and paid loss ratio against development years (\(t\)=dev) for the different accident years.

    The data looks similar to the example output from our model above. That’s a good start.

    Re-organising data

    In order to fit the two curves simultaneously I use a little trick. I stack the paid and outstanding claims into a single column and add another indicator column (delta), which is either \(0\) for outstanding claims or \(1\) for paid claims. This turns the multivariate data back into a univariate set.

    Additionally, I remove the artificial data for dev=0 again, add another column of the indicator variable as a factor (deltaf) and divide all measurement variables by the premium of the oldest accident year. The last step normalises the data and should make modelling a little easier as everything is in the ballpark of 1.

lossData0 <- rbindlist(
  list(lossData[, list(accident_year, dev, loss_train=os_train,
                       loss_test=os_test, delta=0, premium=premium)],
       lossData[, list(accident_year, dev, loss_train=paid_train,
                       loss_test=paid_test, delta=1, premium=premium)]
  ))[order(accident_year, dev)]

premind <- lossData0[accident_year==min(accident_year) & dev==min(dev) & delta==1, premium]

lossData0 <- lossData0[, `:=`(premium = premium/premind,
                              loss_train = loss_train/premind,
                              loss_test = loss_test/premind,
                              deltaf = factor(delta, labels = c("os", "paid")),
                              cal = accident_year + dev - 1)][dev>0]

Model fitting

Non-linear Least Squares

    Before I build a complex Bayesian model I start with a simple non-linear least squares model. Or in other words, I believe the data is generated from a Normal distribution, with the mean described by an analytical function \(\mu(t)=f(t,\dots)\) and constant variance \(\sigma^2\).

\[
\begin{aligned}
y(t) & \sim \mathcal{N}(\mu(t), \sigma^2) \\
\mu(t) & = f(t, \Pi, k_{er}, k_p, RLR, RRF, \delta)\\
& = \Pi \cdot [(1 - \delta) \frac{ RLR \cdot k_{er}}{k_{er} - k_p} \cdot
\left(\exp(-k_p t) - \exp(-k_{er} t) \right) + \\
& \delta \frac{ RLR \cdot RRF}{k_{er} - k_p}
\left( k_{er} \cdot (1 - \exp(-k_p t)) - k_p \cdot (1 - \exp(-k_{er} t)) \right)]\\
\delta & = \begin{cases}
0 \mbox{ if } y \mbox{ is an outstanding claim}\\
1 \mbox{ if } y \mbox{ is a paid claim}
\end{cases}
\end{aligned}
\]

To ensure all parameters stay positive I will use the same approach as Jake does in his paper; that is, I reparameterise using the logarithm of the parameters.

my.f <- function(t, premium, lk_er, lk_p, lRLR, lRRF, delta){
  k_er <- exp(lk_er)
  k_p <- exp(lk_p)
  RLR <- exp(lRLR)
  RRF <- exp(lRRF)
  # OS(t)/Pi and PD(t)/Pi from the analytical solution above
  os <- (RLR * k_er / (k_er - k_p) * (exp(-k_p * t) - exp(-k_er * t)))
  paid <- (RLR * RRF / (k_er - k_p) * (k_er * (1 - exp(-k_p * t)) - k_p * (1 - exp(-k_er * t))))
  # delta = 0 selects the outstanding curve, delta = 1 the paid curve
  return(premium * (os * (1 - delta) + paid * delta))
}

    Using the nls function in R the model can be fitted quickly.

n1 <- nls(loss_train ~ my.f(dev, premium,
                            lk_er=lker, lk_p=lkp,
                            lRLR=lRLR, lRRF=lRRF,
                            delta=delta),
          data = lossData0[cal<=max(accident_year)],
          start = c(lker = log(1.5), lRLR = log(1),
                    lkp = log(0.75), lRRF = log(0.75)))
n1
## Nonlinear regression model
##   model: loss_train ~ my.f(dev, premium, lk_er = lker, lk_p = lkp, lRLR = lRLR, lRRF = lRRF, delta = delta)
##    data: lossData0[cal <= max(accident_year)]
##    lker    lRLR     lkp    lRRF
##  0.8621 -0.1090 -0.8646 -0.4397
##  residual sum-of-squares: 0.3505
##
## Number of iterations to convergence: 5
## Achieved convergence tolerance: 2.844e-06

    Let’s bring the coefficients back to the original scale:

n1par <- data.table(summary(n1)$coefficients)[, exp(Estimate + 0.5*`Std. Error`^2)]
names(n1par) <- c("ker", "RLR", "kp", "RRF")
n1par
##       ker       RLR        kp       RRF
## 2.4375609 0.8985600 0.4231800 0.6466241

Assuming a constant variance across outstanding (delta=0) and paid claims (delta=1) doesn’t really make much sense, and neither does assuming that the parameters RLR and RRF are the same across all accident years, which implies the same ULR for every year: 57.8%.
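As a quick check (added here; the post quotes the number without showing the calculation), the 57.8% falls straight out of the nls coefficients:

# ULR = RLR * RRF, so on the log scale it is simply the sum of the two coefficients
exp(sum(coef(n1)[c("lRLR", "lRRF")]))
# roughly 0.578, i.e. the 57.8% quoted above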

    Hierarchical non-linear model

    To address the issues mentioned above, I allow for different variances depending on the compartment type (\(\sigma^2_{y[\delta]}\)) and allow the parameters RLR and RRF to vary across accident years.

    \[
    \begin{aligned}
    y(t) & \sim \mathcal{N}(f(t, \Pi, k_{er}, k_p, RLR_{[i]}, RRF_{[i]}), \sigma_{y[\delta]}^2) \\
    \begin{pmatrix} RLR_{[i]} \\ RRF_{[i]}\end{pmatrix} & \sim
    \mathcal{N} \left(
    \begin{pmatrix}
    \mu_{RLR} \\
    \mu_{RRF}
    \end{pmatrix},
    \begin{pmatrix}
    \sigma_{RLR}^2 & 0 \\
    0 & \sigma_{RRF}^2
    \end{pmatrix}
    \right)
    \end{aligned}
    \]

    The parameters \(\mu_{RLR}, \mu_{RRF}\) describe the underlying means across all years \(i\).

    With the preparation done, I can fit the hierarchical non-linear model using the nlme package, assuming that the parameters lRLR and lRRF vary across accident years (random effects), while I treat the parameters lker and lkp as fixed effects.

library(nlme)
m1 <- nlme(loss_train ~ my.f(dev, premium,
                             lk_er=lker, lk_p=lkp,
                             lRLR=lRLR, lRRF=lRRF,
                             delta=delta),
           data = lossData0[cal<=max(accident_year)],
           fixed = lker + lRLR + lkp + lRRF ~ 1,
           random = pdDiag(lRLR + lRRF ~ 1),
           groups = ~ accident_year,
           weights = varIdent(form = ~ 1 | deltaf),
           start = c(lker = log(1.5), lRLR = log(1),
                     lkp = log(0.75), lRRF = log(0.75)),
           control = list(msMaxIter=10000, pnlsMaxIter=10000, pnlsTol=0.4))
summary(m1)
## Nonlinear mixed-effects model fit by maximum likelihood
##   Model: loss_train ~ my.f(dev, premium, lk_er = lker, lk_p = lkp, lRLR = lRLR, lRRF = lRRF, delta = delta)
##  Data: lossData0[cal <= max(accident_year)]
##         AIC       BIC   logLik
##   -524.4347 -502.8309 270.2174
##
## Random effects:
##  Formula: list(lRLR ~ 1, lRRF ~ 1)
##  Level: accident_year
##  Structure: Diagonal
##             lRLR      lRRF   Residual
## StdDev: 0.186949 0.1318405 0.03337559
##
## Variance function:
##  Structure: Different standard deviations per stratum
##  Formula: ~1 | deltaf
##  Parameter estimates:
##        os      paid
## 1.0000000 0.1805809
## Fixed effects: lker + lRLR + lkp + lRRF ~ 1
##           Value  Std.Error DF    t-value p-value
## lker  0.4102733 0.05624434 97   7.294481  0.0000
## lRLR  0.0225969 0.06751438 97   0.334698  0.7386
## lkp  -0.7946096 0.03483365 97 -22.811551  0.0000
## lRRF -0.4049580 0.05558069 97  -7.285948  0.0000
##  Correlation:
##      lker   lRLR   lkp
## lRLR -0.375
## lkp  -0.928  0.388
## lRRF  0.522 -0.282 -0.572
##
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max
## -2.63652139 -0.52065601  0.03990089  0.61348869  2.01628548
##
## Number of Observations: 110
## Number of Groups: 10

    The parameters \(k_{er}\), \(\mu_{RLR}\), \(k_p\), \(\mu_{RRF}\) on the original scale are:

m1_fe <- data.table(summary(m1)$tTable)[, exp(Value + 0.5*`Std.Error`^2)]
names(m1_fe) <- c("ker", "RLR", "kp", "RRF")
m1_fe
##       ker       RLR        kp       RRF
## 1.5096155 1.0251880 0.4520317 0.6680359

    If you look at the p-Value of lRLR, we might as well assume \(\mu_{RLR}=1\).

    The estimated ULR across accident years is 68.5%, and the median by accident year:

RLRRRF <- coef(m1)[, c(2,4)]
names(RLRRRF) <- c("RLR", "RRF")
round(exp(cbind(RLRRRF, ULR = apply(RLRRRF, 1, sum))), 3)
##        RLR   RRF   ULR
## 1988 0.853 0.593 0.506
## 1989 0.925 0.577 0.534
## 1990 0.968 0.673 0.651
## 1991 0.910 0.798 0.727
## 1992 0.946 0.663 0.627
## 1993 0.899 0.562 0.505
## 1994 0.885 0.611 0.540
## 1995 1.232 0.724 0.892
## 1996 1.406 0.785 1.104
## 1997 1.382 0.734 1.014
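The overall 68.5% figure can likewise be recovered from the fixed effects (again a small check added here, not shown in the original post):

prod(m1_fe[c("RLR", "RRF")])
# roughly 0.685, the estimated ULR across accident years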

    The parameters \(\sigma_{y[\delta]}\) are:

(sig_delta <- sapply(split(resid(m1), lossData0[cal<=max(accident_year)]$deltaf), sd))
##          os        paid
## 0.032065014 0.005522208

    and \(\sigma_{RLR}\) and \(\sigma_{RRF}\) are:

VarCorr(m1)
## accident_year = pdDiag(list(lRLR ~ 1,lRRF ~ 1))
##          Variance   StdDev
## lRLR     0.03494992 0.18694896
## lRRF     0.01738192 0.13184052
## Residual 0.00111393 0.03337559

Let’s look at the predictions:

Looks good. This is a huge improvement over the simple non-linear least squares model. For all years but 1997 (for which we had only one data point) the predictions look pretty reasonable. The more data we have, the more credibility it is given, shifting the parameters RLR and RRF away from the population mean.
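The fitted-versus-observed chart itself isn't reproduced here; a rough sketch of how such a picture could be drawn from the fitted nlme model (the original plotting code isn't shown, so this is an assumption):

# Sketch only: predictions from m1 plotted against the observed loss ratios.
library(ggplot2)
library(data.table)

pred_dt <- copy(lossData0)
pred_dt[, fitted := predict(m1, newdata = pred_dt)]

ggplot(pred_dt, aes(x = dev)) +
  geom_line(aes(y = fitted, colour = deltaf)) +
  geom_point(aes(y = loss_train, colour = deltaf), size = 1) +
  facet_wrap(~ accident_year) +
  labs(x = "Development year", y = "Loss ratio (of base premium)",
       colour = "compartment")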

    Bayesian reserving models with Stan

We can fit the same model with Stan, which allows us to make different assumptions on the data generating distribution and specify the priors more flexibly. For example, I’d like to assume Gamma priors for my parameters; this ensures the parameters are all positive, so I don’t have to reparameterise them as I did above.

curve(dgamma(x, 4, 5), from = 0, 5, main="Gamma priors", xlab="", ylab="", bty="n")
curve(dgamma(x, 3, 2), add = TRUE, lty=3)
curve(dgamma(x, 3, 4), add = TRUE, lty=5)
text(x = c(0.1, 1.1, 2), y = c(0.8, 0.9, 0.4), labels = c("k_p", "RLR,\nRRF", "k_er"))

    Next I specify my model with brm, which will not only generate the Stan code, but also run the model and collect all the results. I continue to use a Gaussian distribution, as it will allow me to compare the output of brm with nlme.

library(rstan)
library(brms)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

fml <- loss_train ~ premium * (
  (RLR*ker/(ker - kp) * (exp(-kp*dev) - exp(-ker*dev))) * (1 - delta) +
  (RLR*RRF/(ker - kp) * (ker *(1 - exp(-kp*dev)) - kp*(1 - exp(-ker*dev)))) * delta
)

b1 <- brm(bf(fml,
             RLR ~ 1 + (1 | accident_year),
             RRF ~ 1 + (1 | accident_year),
             ker ~ 1,
             kp ~ 1,
             sigma ~ 0 + deltaf,
             nl = TRUE),
          data = lossData0[cal <= max(accident_year)],
          family = brmsfamily("gaussian", link_sigma = "log"),
          prior = c(prior(gamma(4, 5), nlpar = "RLR", lb=0),
                    prior(gamma(4, 5), nlpar = "RRF", lb=0),
                    prior(gamma(3, 2), nlpar = "ker", lb=0),
                    prior(gamma(3, 4), nlpar = "kp", lb=0)),
          control = list(adapt_delta = 0.999, max_treedepth=15),
          seed = 1234, iter = 1000)
b1
##  Family: gaussian
##   Links: mu = identity; sigma = log
## Formula: loss_train ~ premium * ((RLR * ker/(ker - kp) * (exp(-kp * dev) - exp(-ker * dev))) * (1 - delta) + (RLR * RRF/(ker - kp) * (ker * (1 - exp(-kp * dev)) - kp * (1 - exp(-ker * dev)))) * delta)
##          RLR ~ 1 + (1 | accident_year)
##          RRF ~ 1 + (1 | accident_year)
##          ker ~ 1
##          kp ~ 1
##          sigma ~ 0 + deltaf
##    Data: lossData0[cal <= max(accident_year)] (Number of observations: 110)
## Samples: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
##          total post-warmup samples = 2000
##     ICs: LOO = NA; WAIC = NA; R2 = NA
##
## Group-Level Effects:
## ~accident_year (Number of levels: 10)
##                   Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(RLR_Intercept)     0.25      0.08     0.14     0.47        709 1.00
## sd(RRF_Intercept)     0.11      0.04     0.06     0.20        838 1.01
##
## Population-Level Effects:
##                  Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## RLR_Intercept        1.03      0.09     0.85     1.20        602 1.01
## RRF_Intercept        0.67      0.05     0.58     0.76        701 1.00
## ker_Intercept        1.55      0.11     1.37     1.77       1111 1.00
## kp_Intercept         0.45      0.02     0.42     0.49       1181 1.00
## sigma_deltafos      -3.38      0.12    -3.60    -3.12       2000 1.00
## sigma_deltafpaid    -5.07      0.13    -5.31    -4.79       1326 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

    This looks similar to the output of nlme. Note, the link function for sigma was log, so to compare the output with nlme I have to transform it back to the original scale:

library(rstan)
X <- extract(b1$fit)
(sig_delta_brm <- apply(X$b_sigma, 2, function(x) exp(mean(x) + 0.5*var(x))))
## [1] 0.034170216 0.006316905

    Again, that’s pretty similar.

    The simulations are well behaved as well:

    plot(b1, N = 4, ask = FALSE)

    Next I want to compare the estimated accident year ULRs with the ‘population’ level ULR.

RLR <- sweep(X$r_1_RLR_1, 1, X$b_RLR, "+")
RRF <- sweep(X$r_2_RRF_1, 1, X$b_RRF, "+")
ULR <- t(apply(RLR * RRF, 2, quantile, c(0.025, 0.5, 0.975)))

matplot(unique(lossData0$accident_year), ULR*100, t="l", ylim=c(0, 150),
        lty=1, col=c(1,2,4), bty="n",
        ylab="Projected ULR (%)", xlab="Accident year")
legend("topleft", legend = c("2.5%ile","50%ile", "97.5%ile"),
       lty=1, col=c(1,2,4), bty="n")

baseULR <- X$b_RLR * X$b_RRF
abline(h=quantile(baseULR, 0.025)*100, col=1, lty=3)
abline(h=median(baseULR)*100, col=2, lty=3)
abline(h=quantile(baseULR, 0.975)*100, col=4, lty=3)

    That’s interesting, as the softening casualty market (increasing loss ratio) of the late 1990’s is very visible. From this plot you would assume 1996 was the worst year and perhaps an outlier. However, as the data shows, 1997 was even less profitable than 1996.

Perhaps the under-prediction for 1997 arises because historically RRF < 1, i.e. over-reserving. However, this switches to under-reserving in more recent accident years as the reserving cycle kicks in (for the latest accident year in particular). With only 2 data points from which to assess this, shrinkage drags the RRF parameter downwards. It would be an interesting exercise to try to incorporate a prior which captures the expectation of a deteriorating reserving cycle.

    Let’s plot the predictions of the model, with the plotDevBananas function from an earlier post.

predClaimsPred <- predict(b1, newdata = lossData0, method="predict")

plotDevBananas(`2.5%ile`/premium*100 + `97.5%ile`/premium*100 + Estimate/premium*100 +
                 loss_train/premium*100 + loss_test/premium*100 ~ dev | factor(accident_year),
               ylim=c(0, 100), data=cbind(lossData0, predClaimsPred)[delta==0],
               main="Outstanding claims developments",
               ylab="Outstanding loss ratio (%)")

plotDevBananas(`2.5%ile`/premium*100 + `97.5%ile`/premium*100 + Estimate/premium*100 +
                 loss_train/premium*100 + loss_test/premium*100 ~ dev | factor(accident_year),
               ylim=c(0, 120), data=cbind(lossData0, predClaimsPred)[delta==1],
               main="Paid claims developments",
               ylab="Paid loss ratio (%)")

The output looks very promising, apart from the 1997 accident year, which turned out even worse than 1996. Perhaps I shouldn’t assume a Gaussian data generating distribution, but one which is skewed to the right, such as a Gamma, Weibull or log-normal, all of which are doable with brm by changing the family argument.
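As a sketch of what that change might look like (not something run in the post), one could swap the Gaussian for a Gamma likelihood with an identity link; note that the Gamma family's auxiliary parameter is shape rather than sigma, so that line of the formula changes as well:

# Sketch only: same non-linear mean function, Gamma likelihood instead of Gaussian.
# The dispersion parameter is `shape` here, still modelled by compartment type.
b2 <- brm(bf(fml,
             RLR ~ 1 + (1 | accident_year),
             RRF ~ 1 + (1 | accident_year),
             ker ~ 1, kp ~ 1,
             shape ~ 0 + deltaf,
             nl = TRUE),
          data = lossData0[cal <= max(accident_year)],
          family = brmsfamily("Gamma", link = "identity", link_shape = "log"),
          prior = c(prior(gamma(4, 5), nlpar = "RLR", lb = 0),
                    prior(gamma(4, 5), nlpar = "RRF", lb = 0),
                    prior(gamma(3, 2), nlpar = "ker", lb = 0),
                    prior(gamma(3, 4), nlpar = "kp", lb = 0)),
          control = list(adapt_delta = 0.999, max_treedepth = 15),
          seed = 1234, iter = 1000)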

Additionally, I could test for auto-correlation, calendar year effects, etc. Jake’s paper has further details on this; he also shows how the model can be extended to cover time-dependent parameters and covariate sub-models.

If you’d like to learn more about brms and compartmental reserving models, attend the Stan workshop at Cass Business School London, following the Insurance Data Science conference, on 17 July 2018, where Paul and Jake will both be.

Session Info

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base
##
## other attached packages:
## [1] rstan_2.17.2        StanHeaders_2.17.1  ggplot2_2.2.1
## [4] nlme_3.1-131        latticeExtra_0.6-28 RColorBrewer_1.1-2
## [7] lattice_0.20-35     data.table_1.10.4-3 deSolve_1.20
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.14         mvtnorm_1.0-6        gtools_3.5.0
##  [4] zoo_1.8-1            assertthat_0.2.0     rprojroot_1.3-2
##  [7] digest_0.6.13        mime_0.5             R6_2.2.2
## [10] plyr_1.8.4           backports_1.1.2      stats4_3.4.3
## [13] evaluate_0.10.1      coda_0.19-1          colourpicker_1.0
## [16] blogdown_0.4         pillar_1.0.1         rlang_0.1.6
## [19] lazyeval_0.2.1       miniUI_0.1.1         Matrix_1.2-12
## [22] DT_0.2               rmarkdown_1.8        shinythemes_1.1.1
## [25] shinyjs_0.9.1        stringr_1.2.0        htmlwidgets_0.9
## [28] loo_1.1.0            igraph_1.1.2         munsell_0.4.3
## [31] shiny_1.0.5          compiler_3.4.3       httpuv_1.3.5
## [34] pkgconfig_2.0.1      base64enc_0.1-3      rstantools_1.4.0
## [37] htmltools_0.3.6      tibble_1.4.1         gridExtra_2.3
## [40] bookdown_0.5         threejs_0.3.1        matrixStats_0.52.2
## [43] dplyr_0.7.4          brms_2.0.1           grid_3.4.3
## [46] xtable_1.8-2         gtable_0.2.0         magrittr_1.5
## [49] scales_0.5.0         stringi_1.1.6        reshape2_1.4.3
## [52] bindrcpp_0.2         dygraphs_1.1.1.4     xts_0.10-1
## [55] tools_3.4.3          glue_1.2.0           markdown_0.8
## [58] shinystan_2.4.0      crosstalk_1.0.0      rsconnect_0.8.5
## [61] abind_1.4-5          parallel_3.4.3       yaml_2.1.16
## [64] inline_0.3.14        colorspace_1.3-2     bridgesampling_0.4-0
## [67] bayesplot_1.4.0      knitr_1.18           bindr_0.1
## [70] Brobdingnag_1.2-4

References

    Bürkner, Paul-Christian. 2017. “brms: An R Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software 80 (1): 1–28. doi:10.18637/jss.v080.i01.

    Clark, David R. 2003. LDF Curve-Fitting and Stochastic Reserving: A Maximum Likelihood Approach. Casualty Actuarial Society; http://www.casact.org/pubs/forum/03fforum/03ff041.pdf.

    Guszcza, James. 2008. “Hierarchical Growth Curve Models for Loss Reserving.” In, 146–73. CAS Fall Forum; https://www.casact.org/pubs/forum/08fforum/7Guszcza.pdf.

    Lunn, David J, Andrew Thomas, Nicky Best, and David Spiegelhalter. 2000. “WinBUGS-a Bayesian Modelling Framework: Concepts, Structure, and Extensibility.” Statistics and Computing 10 (4). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.471.604&rep=rep1&type=pdf; Springer: 325–37.

    Miranda, María Dolores Martínez, Jens Perch Nielsen, and Richard Verrall. 2012. “Double Chain Ladder.” ASTIN Bulletin: The Journal of the IAA 42 (1). https://www.cass.city.ac.uk/__data/assets/pdf_file/0011/364961/double-chain-ladder-cass-knowledge.pdf; Cambridge University Press: 59–76.

    Morris, Jake. 2016. “Hierarchical Compartmental Models for Loss Reserving.” In. Casualty Actuarial Society Summer E-Forum; https://www.casact.org/pubs/forum/16sforum/Morris.pdf.

    Quarg, Gerhard, and Thomas Mack. 2004. “Munich Chain Ladder.” Munich Re Group: http://www.variancejournal.org/issues/02-02/266.pdf.

    Stan Development Team. 2017. “RStan: The R Interface to Stan.” http://mc-stan.org/.

    Zhang, Yanwei, Vanja Dukic, and James Guszcza. 2012. “A Bayesian Non-Linear Model for Forecasting Insurance Loss Payments.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 175 (2). Wiley Online Library: 637–56.



    Learn your way around the R ecosystem

    Mon, 01/08/2018 - 23:04

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    One of the most powerful things about R is the ecosystem that has emerged around it. In addition to the R language itself and the many packages that extend it, you have a network of users, developers, governance bodies, software vendors and service providers that provide resources in technical information and support, companion applications, training and implementation services, and more.

    I gave the talk above at the useR! conference last year, but never posted the slides before because they weren't particularly useful without the associated talk track. Mark Sellors from Mango Solutions has thankfully filled the gap with his new Field Guide to the R Ecosystem. This concise and useful document (which Mark introduces here) provides an overview of R and its packages and APIs, developer tools, data connections, commercial vendors of R, and the user and developer community. The document assumes no background knowledge of R, and is useful for anyone thinking of getting into R, or for existing R users to learn about the resources available in the wider ecosystem. You can read Field Guide to the R Ecosystem at the link below.

    Mark Sellors: Field Guide to the R Ecosystem

     



    Bio7 2.7 Retina-Display Fix for MacOSX

    Mon, 01/08/2018 - 10:44

    (This article was first published on R – Bio7 Website, and kindly contributed to R-bloggers)

    08.01.2018

I recently discovered on a high dpi display (Retina-Display) enabled 13’’ notebook that the latest release of Bio7 for MacOSX is not displayed in high dpi but in a lower resolution, which makes text and icons blurry (scaled) and unnecessarily strains the eyes when editing R scripts with the R editor.

    So I uploaded a new, fixed version for MacOSX.

However, to fix an already downloaded version of Bio7 you have to open the ‘/Bio7.app/Contents/Info.plist’ file with a text editor (right-click on Bio7.app, execute ‘Show Package Contents’ and open ‘/Contents/Info.plist’) and add the following lines before the closing </dict> tag:

<key>NSHighResolutionCapable</key>
<true/>

In addition to updating the Info.plist file, for the Operating System to pick up the change you have to open the Terminal (in ‘Applications/Utilities’) and execute the following command (default install directory – the shell command is one line!):

    /System/Library/Frameworks/CoreServices.framework/Frameworks/LaunchServices.framework/Support/lsregister -v -f /Applications/Bio7.app

Here are some screenshots from before and after the fix was applied on a Retina-Display (zoom in for details):

    Before:

    After:

    An updated version with the fix already applied can be downloaded here

    See also:

    Release notes for Bio7 2.7

     



    How environments work in R and what is lazy evaluation

    Mon, 01/08/2018 - 01:00

Knowledge of how R evaluates expressions is crucial if you want to avoid hours of staring at the screen or hitting unexpected and difficult bugs.

We’ll start with an example of an issue I came across a few months ago when using the purrr::map function. To simplify, the issue I had:


    wat

makePrintFunction <- function(index) {
  function() {
    print(index)
  }
}

printFunctions <- lapply(1:3, function(i) makePrintFunction(i))
printFunctions[[2]]()
# 2

printFunctions <- purrr::map(1:3, function(i) makePrintFunction(i))
printFunctions[[2]]()
# 3

    Since I came across the issue, purrr::map has changed and this example no longer applies. To simulate it, let’s use a simplified implementation of map function. You should be able to just copy-paste the code in this article and run it:

map <- function(range, functionToApply) {
  result <- vector("list", length(range))
  for (i in range) {
    result[[i]] <- functionToApply(i)
  }
  return(result)
}

makePrintFunction <- function(index) {
  function() {
    print(index)
  }
}

printFunctions <- lapply(1:3, function(i) makePrintFunction(i))
printFunctions[[2]]()
# 2

printFunctions <- map(1:3, function(i) makePrintFunction(i))
printFunctions[[2]]()
# 3

    How to fix that?

If you don’t already know how to fix that issue, you’ll quickly find out. This is quite a common problem and the solution is to use the force function as follows:

makePrintFunction <- function(index) {
  force(index)
  function() {
    print(index)
  }
}

printFunctions <- lapply(1:3, function(i) makePrintFunction(i))
printFunctions[[2]]()
# 2

printFunctions <- map(1:3, function(i) makePrintFunction(i))
printFunctions[[2]]()
# 2

    It works! But … why?

    This could be a great moment to just carry on – the problem is solved. You’ve heard about lazy evaluation and know that force() is useful in fixing such issues. But then again, what does lazy evaluation mean in this context?

    Let’s take a look at the magical force function. It consists of two lines:


    Huh?

force
# function (x)
# x
#
# <environment: namespace:base>

    Wait, what’s going on here? Does this mean that I can simply call index instead of force(index) and it will still work?

makePrintFunction <- function(index) {
  index
  function() {
    print(index)
  }
}

printFunctions <- lapply(1:3, function(i) makePrintFunction(i))
printFunctions[[2]]()
# 2

printFunctions <- map(1:3, function(i) makePrintFunction(i))
printFunctions[[2]]()
# 2

    Let’s get to the bottom of this

    There are two factors that cause the issue we are facing. The first one is lazy evaluation. The second is the way environments work in R.

    Lazy evaluation

    The way R works is that it doesn’t evaluate an expression when it is not used. Let’s take a look at an example that you can find in Hadley’s book http://adv-r.had.co.nz/Functions.html:

f <- function(x) {
  42
}
f(stop("This is an error!"))
# 42

f <- function(x) {
  force(x)
  42
}
f(stop("This is an error!"))
# Error in force(x): This is an error!

    Another useful example to better understand that expressions are evaluated at the moment they are used:

printLabel <- function(x, label = toupper(x)) {
  x <- "changed"
  print(label)
}
printLabel("original")
# CHANGED

printLabel <- function(x, label = toupper(x)) {
  force(label)
  x <- "changed"
  print(label)
}
printLabel("original")
# ORIGINAL

Please note that the promises mentioned here are something different than the promises package used to handle concurrent computations.
These semantics are described in the R language definition:

    The mechanism is implemented via promises. When a function is being evaluated the actual expression used as an argument is stored in the promise together with a pointer to the environment the function was called from. When (if) the argument is evaluated the stored expression is evaluated in the environment that the function was called from. Since only a pointer to the environment is used any changes made to that environment will be in effect during this evaluation. The resulting value is then also stored in a separate spot in the promise. Subsequent evaluations retrieve this stored value (a second evaluation is not carried out).
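The promise mechanism can be poked at directly with delayedAssign(), which creates a promise by hand (a small illustration added here, not from the original post):

delayedAssign("x", {message("evaluating now"); 40 + 2})
# nothing has been evaluated yet at this point
x  # prints "evaluating now" and returns 42 -- the first use forces the promise
x  # returns 42 silently -- the stored value is reused, no second evaluation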

    How environments work

    Every function object has an environment assigned when it is created. Let’s call it environment A. When the function is invoked, a new environment is created and used in the function call. This new environment inherits from environment A.

a <- 1
f <- function(a) {
  a <- a + 1
  a # <-- debug here
}
f(5)

# Browse[2]> environment(f)
# <environment: R_GlobalEnv>
#
# Browse[2]> environment(f)[["a"]]
# [1] 1
#
# Browse[2]> environment()
# <environment: 0x...>  (the new environment created for this call)
#
# Browse[2]> environment()[["a"]]
# [1] 6

This is what the environment hierarchy looks like at this point:

    How does our example work without force

Environment 0x3fa2db2 inherits from mpfEnv and points to the index variable, which is stored in 0x3fa2db0. The index variable is not going to be copied to environment 0x3fa2db2 until it is used there.

    How does our example work with force

    You shouldn’t come across this issue while using most high-order functions:

    R 3.2.0 (2015) changelog:

    • Higher-order functions such as the apply functions and Reduce()
      now force arguments to the functions they apply in order to
      eliminate undesirable interactions between lazy evaluation and
      variable capture in closures. This resolves PR#16093.

    Purrr issue fixed in March 2017: https://github.com/tidyverse/purrr/issues/191

    I hope this knowledge will save you some time if you stumble upon such issues in the future.

    Until next time!


    Simulating a cost-effectiveness analysis to highlight new functions for generating correlated data

    Mon, 01/08/2018 - 01:00

    (This article was first published on ouR data generation, and kindly contributed to R-bloggers)

My dissertation work (which I only recently completed – in 2012 – even though I am not exactly young, a whole story on its own) focused on inverse probability weighting methods to estimate a causal cost-effectiveness model. I don’t really do any cost-effectiveness analysis (CEA) anymore, but it came up very recently when some folks in the Netherlands contacted me about using simstudy to generate correlated (and clustered) data to compare different approaches to estimating cost-effectiveness. As part of this effort, I developed two more functions in simstudy that allow users to generate correlated data drawn from different types of distributions. Earlier I had created the CorGen functions to generate multivariate data from a single distribution – e.g. multivariate gamma. Now, with the new CorFlex functions (genCorFlex and addCorFlex), users can mix and match distributions. The new version of simstudy is not yet up on CRAN, but is available for download from my github site. If you use RStudio, you can install using devtools::install_github("kgoldfeld/simstudy").

    I thought I’d introduce this new functionality by generating some correlated cost and outcome data, and show how to estimate a cost-effectiveness analysis curve (CEAC). The CEAC is based on a measure called the incremental net benefit (INB). It is far more common in cost-effectiveness analysis to measure the incremental cost-effectiveness ratio (ICER). I was never enamored of ICERs, because ratios can behave poorly when denominators (in this case the changes in outcomes) get very small. Since it is a difference, the INB behaves much better. Furthermore, it seems relatively intuitive that a negative INB is not a good thing (i.e., it is not good if costs are greater than benefits), but a negative ICER has an unclear interpretation. My goal isn’t to give you a full explanation of CEA, but to provide an application to demonstrate the new simstudy functions. If you really want to learn more about this topic, you can find a paper here that described my dissertation work. Of course, this is a well-established field of study, so naturally there is much more out there…

    Simulation scenario

    In the simulation scenario I’ve concocted, the goal is to increase the number of patients that come in for an important test. A group of public health professionals have developed a new outreach program that they think will be able to draw in more patients. The study is conducted at the site level – some sites will implement the new approach, and the others, serving as controls, will continue with the existing approach. The cost for the new approach is expected to be higher, and will vary by site. In the first scenario, we assume that costs and recruitment are correlated with each other. That is, sites that tend to spend more generally have higher recruitment levels, even before introducing the new recruitment method.

    The data are simulated using the assumption that costs have a gamma distribution (since costs are positive, continuous and skewed to the right) and that recruitment numbers are Poisson distributed (since they are non-negative counts). The intervention sites will have costs that are on average $1000 greater than the control sites. Recruitment will be 10 patients higher for the intervention sites. This is an average expenditure of $100 per additional patient recruited:

library(simstudy)

# Total of 500 sites, 250 control/250 intervention
set.seed(2018)
dt <- genData(500)
dt <- trtAssign(dtName = dt, nTrt = 2, balanced = TRUE, grpName = "trtSite")

# Define data - intervention costs $1000 higher on average
def <- defDataAdd(varname = "cost", formula = "1000 + 1000*trtSite",
                  variance = 0.2, dist = "gamma")
def <- defDataAdd(def, varname = "nRecruits", formula = "100 + 10*trtSite",
                  dist = "poisson")

# Set correlation parameter (based on Kendall's tau)
tau <- 0.2

# Generate correlated data using new function addCorFlex
dOutcomes <- addCorFlex(dt, defs = def, tau = tau)

dOutcomes
##       id trtSite      cost nRecruits
##   1:   1       1 1553.7862        99
##   2:   2       1  913.2466        90
##   3:   3       1 1314.5522        91
##   4:   4       1 1610.5535       112
##   5:   5       1 3254.1100        99
##  ---
## 496: 496       1 1452.5903        99
## 497: 497       1  292.8769       109
## 498: 498       0  835.3930        85
## 499: 499       1 1618.0447        92
## 500: 500       0  363.2429       101

    The data have been generated, so now we can examine the means and standard deviations of costs and recruitment:

dOutcomes[, .(meanCost = mean(cost), sdCost = sd(cost)), keyby = trtSite]
##    trtSite  meanCost   sdCost
## 1:       0  992.2823 449.8359
## 2:       1 1969.2057 877.1947

dOutcomes[, .(meanRecruit = mean(nRecruits), sdRecruit = sd(nRecruits)), keyby = trtSite]
##    trtSite meanRecruit sdRecruit
## 1:       0      99.708  10.23100
## 2:       1     108.600  10.10308

    And here is the estimate of Kendall’s tau within each intervention arm:

dOutcomes[, .(tau = cor(cost, nRecruits, method = "kendall")), keyby = trtSite]
##    trtSite       tau
## 1:       0 0.2018365
## 2:       1 0.1903694

Cost-effectiveness: ICER

    The question is, are the added expenses of the program worth it when we look at the difference in recruitment? In the traditional approach, the incremental cost-effectiveness ratio is defined as

\[ICER = \frac{ \bar{C}_{intervention} - \bar{C}_{control} }{ \bar{R}_{intervention} - \bar{R}_{control}}\]

    where \(\bar{C}\) and \(\bar{R}\) represent the average costs and recruitment levels, respectively.

    We can calculate the ICER in this simulated study:

(costDif <- dOutcomes[trtSite == 1, mean(cost)] - dOutcomes[trtSite == 0, mean(cost)])
## [1] 976.9235

(nDif <- dOutcomes[trtSite == 1, mean(nRecruits)] - dOutcomes[trtSite == 0, mean(nRecruits)])
## [1] 8.892

# ICER
costDif/nDif
## [1] 109.8654

    In this case the average cost for the intervention group is $976 higher than the control group, and recruitment goes up by about 9 people. Based on this, the ICER is $110 per additional recruited individual. We would deem the initiative cost-effective if we are willing to pay at least $110 to recruit a single person. If, for example, we save $150 in future health care costs for every additional person we recruit, we should be willing to invest $110 for a new recruit. Under this scenario, we would deem the program cost effective (assuming, of course, we have some measure of uncertainty for our estimate).

    Cost-effectiveness: INB & the CEAC

    I alluded to the fact that I believe that the incremental net benefit (INB) might be a preferable way to measure cost-effectiveness, just because the measure is more stable and easier to interpret. This is how it is defined:

\[INB = \lambda (\bar{R}_{intervention} - \bar{R}_{control}) - (\bar{C}_{intervention} - \bar{C}_{control})\]

    where \(\lambda\) is the willingness-to-pay I mentioned above. One of the advantages to using the INB is that we don’t need to specify \(\lambda\), but can estimate a range of INBs based on a range of willingness-to-pay values. For all values of \(\lambda\) where the INB exceeds $0, the intervention is cost-effective.

    The CEAC is a graphical approach to cost-effectiveness analysis that takes into consideration uncertainty. We estimate uncertainty using a bootstrap approach, which entails sampling repeatedly from the original “observed” data set with replacement. Each time we draw a sample, we estimate the mean differences in cost and recruitment for the two treatment arms. A plot of these estimated means gives a sense of the variability of our estimates (and we can see how strongly these means are correlated). Once we have all these bootstrapped means, we can calculate a range of INB’s for each pair of means and a range of \(\lambda\)’s. The CEAC represents the proportion of bootstrapped estimates with a positive INB at a particular level of \(\lambda\).

    This is much easier to see in action. To implement this, I wrote a little function that randomly samples the original data set and estimates the means:

estMeans <- function(dt, grp, boot = FALSE) {

  dGrp <- dt[trtSite == grp]

  if (boot) {
    size <- nrow(dGrp)
    bootIds <- dGrp[, sample(id, size = size, replace = TRUE)]
    dGrp <- dt[bootIds]
  }

  dGrp[, .(mC = mean(cost), mN = mean(nRecruits))]
}

    First, we calculate the differences in means of the observed data:

(estResult <- estMeans(dOutcomes, 1) - estMeans(dOutcomes, 0))
##          mC    mN
## 1: 976.9235 8.892

    Next, we draw 1000 bootstrap samples:

bootResults <- data.table()

for (i in 1:1000) {
  changes <- estMeans(dOutcomes, 1, boot = TRUE) - estMeans(dOutcomes, 0, boot = TRUE)
  bootResults <- rbind(bootResults, changes)
}

bootResults
##              mC    mN
##    1:  971.3087 9.784
##    2:  953.2996 8.504
##    3: 1053.0340 9.152
##    4:  849.5292 8.992
##    5: 1008.9378 8.452
##   ---
##  996:  894.0251 8.116
##  997: 1002.0393 7.948
##  998:  981.6729 8.784
##  999: 1109.8255 9.596
## 1000:  995.6786 8.736

    Finally, we calculate the proportion of INBs that exceed zero for a range of \(\lambda\)’s from $75 to $150. We can see that at willingness-to-pay levels higher than $125, there is a very high probability (~90%) of the intervention being cost-effective. (At the ICER level of $110, the probability of cost-effectiveness is only around 50%.)

CEAC <- data.table()

for (wtp in seq(75, 150, 5)) {
  propPos <- bootResults[, mean((wtp * mN - mC) > 0)]
  CEAC <- rbind(CEAC, data.table(wtp, propPos))
}

CEAC
##     wtp propPos
##  1:  75   0.000
##  2:  80   0.000
##  3:  85   0.002
##  4:  90   0.018
##  5:  95   0.075
##  6: 100   0.183
##  7: 105   0.339
##  8: 110   0.505
##  9: 115   0.659
## 10: 120   0.776
## 11: 125   0.871
## 12: 130   0.941
## 13: 135   0.965
## 14: 140   0.984
## 15: 145   0.992
## 16: 150   0.998

A visual CEA

    Here are three series of plots, shown for different levels of correlation between cost and recruitment. Each series includes a plot of the original cost and recruitment data, where each point represents a site. The second plot shows the average difference in means between the intervention and control sites in purple and the bootstrapped differences in grey. The third plot is the CEAC with a horizontal line drawn at 90%. The first series is the data set we generated with tau = 0.2:
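The plots themselves are not reproduced in this feed; a sketch of how the first (site-level) panel could be drawn is below. The post's own plotting code isn't shown, so treat this as an assumption:

# Sketch only: site-level cost vs recruitment, coloured by treatment arm.
library(ggplot2)

ggplot(dOutcomes,
       aes(x = cost, y = nRecruits,
           colour = factor(trtSite, labels = c("control", "intervention")))) +
  geom_point(alpha = 0.6) +
  labs(x = "Site cost", y = "Number recruited", colour = "Site type")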

    When there is no correlation between costs and recruitment across sites (tau = 0):

    And finally – when there is a higher degree of correlation, tau = 0.4:

    Effect of correlation?

    In all three scenarios (with different levels of tau), the ICER is approximately $110. Of course, this is directly related to the fact that the estimated differences in means of the two intervention groups is the same across the scenarios. But, when we look at the three site-level and bootstrap plots, we can see the varying levels of correlation.

And while there also appears to be a subtle visual difference between the CEACs for different levels of correlation, it is not clear if this is a real difference or random variation. To explore this a bit further, I generated 250 data sets and their associated CEACs (which in turn are generated by 1000 bootstrap steps each) under a range of tau’s, starting with no correlation (tau = 0) up to a considerable level of correlation (tau = 0.4). In these simulations, I used a larger sample size of 2000 sites to reduce the variation a bit. Here are the results:

It appears that the variability of the CEAC curves decreases as the correlation between cost and recruitment (determined by tau) increases; the range of the curves is smallest when tau is 0.4. In addition, it looks like the “median” CEAC moves slightly rightward as tau increases, which suggests that the probability of cost-effectiveness will vary across different levels of tau. All this is to say that correlation appears to matter, so it might be an important factor to consider when both simulating these sorts of data and actually conducting a CEA.

    Next steps?

    In this example, I based the entire analysis on a simple non-parametric estimate of the means. In the future, I might explore copula-based methods to fit joint models of costs and outcomes. In simstudy, a Gaussian copula generates the correlated data. However there is a much larger world of copulas out there that can be used to model correlation between measures regardless of their marginal distributions. And some of these methods have been applied in the context of CEA. Stay tuned on this front (though it might be a while).
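For a flavour of what a copula-based approach might look like (purely illustrative, and not something this post implements), the copula package can couple arbitrary marginals; here a Clayton copula joins gamma costs with Poisson recruitment:

# Illustrative only: dependence parameter and marginal parameters chosen arbitrarily.
library(copula)

set.seed(2018)
cl <- claytonCopula(param = 1)
u <- rCopula(1000, cl)            # correlated uniforms

cost      <- qgamma(u[, 1], shape = 5, rate = 5/1000)   # right-skewed, mean ~ 1000
nRecruits <- qpois(u[, 2], lambda = 100)

cor(cost, nRecruits, method = "kendall")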


To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

    The Trouble with Tibbles

    Mon, 01/08/2018 - 01:00

    (This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers)

Let’s get something straight, there isn’t really any trouble with tibbles. I’m hoping you’ve noticed this is a play on the 1967 Star Trek episode, “The Trouble with Tribbles”. I’ve recently got myself a job as a Data Scientist, here, at Jumping Rivers. Having never come across tibbles until this point, I now find myself using them in nearly every R script I compose. Be that your timeless standard R script, your friendly Shiny app or an analytical Markdown document.

    What are tibbles?

    Presumably this is why you came here, right?

Tibbles are a modern take on data frames, but crucially they are still data frames. Well, what’s the difference then? There’s a quote I found somewhere on the internet that describes the difference quite well:

    “keeping what time has proven to be effective, and throwing out what is not”.

    Basically, some clever people took the classic data.frame(), shook it til the ineffective parts fell out, then added some new, more appropriate features.

Precursors

# The easiest way to get access is to install the tibble package.
install.packages("tibble")
# Alternatively, tibbles are a part of the tidyverse and hence installing
# the whole tidyverse will give you access.
install.packages("tidyverse")
# I am just going to use tibble.
library(tibble)

Tribblemaking

    There are three ways to form a tibble. It pretty much acts as your friendly old pal data.frame() does. Just like standard data frames, we can create tibbles, coerce objects into tibbles and import data sets into R as a tibble. Below is a table of the traditional data.frame() commands and their respective tidyverse commands.

Formation Type   Data Frame Commands   Tibbles Commands
Creation         data.frame()          data_frame(), tibble(), tribble()
Coercion         as.data.frame()       as_data_frame(), as_tibble()
Importing        read.*()              read_delim(), read_csv(), read_csv2(), read_tsv()

    Let’s take a closer look…

    1) Creation.

Just as data.frame() creates data frames, tibble(), data_frame() and tribble() all create tibbles.

    Standard data frame.

data.frame(a = 1:5, b = letters[1:5])
##   a b
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e

    A tibble using tibble() (identical to using data_frame).

tibble(a = 1:5, b = letters[1:5])
## # A tibble: 5 x 2
##   a b
##
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e

    A tibble using tribble().

tribble(
  ~a, ~b,
  #---|----
  1, "a",
  2, "b")
## # A tibble: 2 x 2
##   a b
##
## 1 1.00 a
## 2 2.00 b

    Notice the odd one out? tribble() is different. It’s a way of laying out small amounts of data in an easy to read form. I’m not too keen on these, as even writing out that simple 2 x 2 tribble got tedious.

    2) Coercion.

    Just as as.data.frame() coerces objects into data frames, as_data_frame() and as_tibble() coerce objects into tibbles.

df = data.frame(a = 1:5, b = letters[1:5])
as_data_frame(df)
## # A tibble: 5 x 2
##   a b
##
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
as_tibble(df)
## # A tibble: 5 x 2
##   a b
##
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e

You can coerce more than just data frames, too. Objects such as lists, matrices, vectors and single instances of a class are convertible.
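For illustration, here is a quick sketch coercing a named matrix and a list (assuming named elements of equal length):

# A matrix with column names becomes a tibble directly.
m = matrix(1:10, ncol = 2, dimnames = list(NULL, c("a", "b")))
as_tibble(m)

# A list of equal-length, named elements works too.
l = list(a = 1:5, b = letters[1:5])
as_tibble(l)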

    3) Importing.

There are a few options to read in data files within the tidyverse, so we’ll just compare read_csv() and its representative data.frame() pal, read.csv(). Let’s take a look at them. I have here an example data set that I’ve created in MS Excel. You can download/look at this data here. To get access to this function you’ll need the readr package. Again, this is part of the tidyverse, so installing either readr or the whole tidyverse will do.

library("readr")
url = "https://gist.githubusercontent.com/theoroe3/8bc989b644adc24117bc66f50c292fc8/raw/f677a2ad811a9854c9d174178b0585a87569af60/tibbles_data.csv"
tib = read_csv(url)
## Parsed with column specification:
## cols(
##   `<-` = col_integer(),
##   `8` = col_integer(),
##   `%` = col_double(),
##   name = col_character()
## )
tib
## # A tibble: 4 x 4
##   `<-` `8` `%` name
##
## 1 1 2 0.250 t
## 2 2 4 0.250 h
## 3 3 6 0.250 e
## 4 4 8 0.250 o
df = read.csv(url)
df
##   X.. X8 X. name
## 1 1 2 0.25 t
## 2 2 4 0.25 h
## 3 3 6 0.25 e
## 4 4 8 0.25 o

    Not only does read_csv() return a pretty tibble, it is also much faster. For proof, check out this article by Erwin Kalvelagen. The keen eyes amongst you will have noticed something odd about the variable names… we’ll get on to that soon.

    Tibbles vs Data Frames

    Did you notice a key difference in the tibble()s and data.frame()s above? Take a look again.

tibble(a = 1:26, b = letters)
## # A tibble: 26 x 2
##   a b
##
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## # ... with 21 more rows

The first thing you should notice is the pretty print process. The class of each column is now displayed above it and the dimensions of the tibble are shown at the top. The default print option within tibbles means they will only display 10 rows if the data frame has more than 20 rows (I’ve changed mine to display 5 rows). Neat. Alongside that, we now only view columns that will fit on the screen. This is already looking quite the part. The row settings can be changed via

    options(tibble.print_max = 3, tibble.print_min = 1)

So now if there are more than 3 rows, we print only 1 row. Tibbles of length 3 and 4 would now print like so.

tibble(1:3)
## # A tibble: 3 x 1
##   `1:3`
##
## 1 1
## 2 2
## 3 3
tibble(1:4)
## # A tibble: 4 x 1
##   `1:4`
##
## 1 1
## # ... with 3 more rows

    Yes, OK, you could do this with the traditional data frame. But it would be a lot more work, right?
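For comparison, here is a rough sketch of what mimicking that print behaviour with a plain data frame might look like; the df_long object is purely illustrative and not from the original post:

# The "manual" data.frame equivalent of what a tibble prints for free.
df_long = data.frame(a = 1:26, b = letters)
dim(df_long)          # dimensions, printed by hand
sapply(df_long, class)  # column classes, printed by hand
head(df_long, 5)      # only the first few rows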

As well as the fancy printing, tibbles don’t drop the variable type, don’t partial match and they allow non-syntactic column names when importing data. We’re going to use the data from before. Again, it is available here. Notice it has 3 non-syntactic column names and one column of characters. Reading this in as a tibble and as a data frame we get

tib
## # A tibble: 4 x 4
##   `<-` `8` `%` name
##
## 1 1 2 0.250 t
## 2 2 4 0.250 h
## 3 3 6 0.250 e
## 4 4 8 0.250 o
df
##   X.. X8 X. name
## 1 1 2 0.25 t
## 2 2 4 0.25 h
## 3 3 6 0.25 e
## 4 4 8 0.25 o

We see already that in the read.csv() process we’ve lost the original column names; they have been mangled into syntactically valid ones.

    Let’s try some partial matching…

tib$n
## Warning: Unknown or uninitialised column: 'n'.
## NULL
df$n
## [1] t h e o
## Levels: e h o t

With the tibble we get a warning and NULL, yet with the data frame partial matching leads us straight to our name variable. To read more about why partial matching is bad, check out this thread.

    What about subsetting? Let’s try it out using the data from our csv file.

tib[,2]
## # A tibble: 4 x 1
##   `8`
##
## 1 2
## 2 4
## 3 6
## 4 8
tib[2]
## # A tibble: 4 x 1
##   `8`
##
## 1 2
## 2 4
## 3 6
## 4 8
df[,2]
## [1] 2 4 6 8
df[2]
##   X8
## 1 2
## 2 4
## 3 6
## 4 8

Using a normal data frame, [,2] gives us a vector while [2] gives a data frame. Using tibbles, single square brackets, [, will always return another tibble. Much neater. Now for double brackets.

tib[[1]]
## [1] 1 2 3 4
tib$name
## [1] "t" "h" "e" "o"
df[[1]]
## [1] 1 2 3 4
df$name
## [1] t h e o
## Levels: e h o t

Double square brackets, [[, and the traditional dollar, $, are ways to access individual columns as vectors. So, with tibbles, we have separate operations for data frame operations and single-column operations, and we no longer have to use that pesky drop = FALSE. Note, these are actually quicker than the [[ and $ of the data.frame(), as shown in the documentation for the tibble package.
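For comparison, the old-school way of keeping a one-column data frame needs drop = FALSE; a small illustration using the df and tib objects from above:

df[, 2, drop = FALSE]  # stays a data.frame only because of drop = FALSE
tib[, 2]               # a tibble stays a tibble by default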

    At last, no more strings as factors! Upon reading the data in, tibbles recognise strings as strings, not factors. For example, with the name column in our data set.

class(df$name)
## [1] "factor"
class(tib$name)
## [1] "character"

    I quite like this, it’s much easier to turn a vector of characters into factors than vice versa, so why not give me everything as strings? Now I can choose whether or not to convert to factors.
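As a small illustration (not from the original post), going from character to factor is a one-liner, while the reverse direction for number-like factors needs the extra as.character() step:

factor(tib$name)                 # character -> factor, whenever I choose to
f = factor(c("10", "20", "30"))  # a hypothetical factor holding numbers
as.numeric(as.character(f))      # factor -> numeric, the safe way round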

    Disadvantages

This won’t be long, there’s only one. Some older packages don’t work with tibbles because of their alternative subsetting method. They expect tib[,1] to return a vector, when in fact it will now return another tibble. Until this functionality is added in, you must convert your tibble back to a data frame using as.data.frame(). Whilst adding this functionality will give users the chance to use packages with tibbles and normal data frames, it of course puts extra work on the shoulders of package writers, who now have to change every package to be compatible with tibbles. For more on this discussion, see this thread.
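A minimal sketch of that workaround:

old_df = as.data.frame(tib)  # back to a classic data frame
old_df[, 1]                  # now drops to a plain vector again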

    To summarise..

So, most of the things you can accomplish with tibbles, you can accomplish with data.frame(), but it’s a bit of a pain. Simple things like checking the dimensions of your data or converting strings to factors are small jobs. Small jobs that take time. With tibbles they take no time. Tibbles force you to look at your data earlier and confront the problems earlier, ultimately leading to cleaner code.

    Thanks for chatting!


To leave a comment for the author, please follow the link and comment on their blog: R on The Jumping Rivers Blog.

    Written R/exams around the World

    Mon, 01/08/2018 - 00:00

    (This article was first published on R/exams, and kindly contributed to R-bloggers)

    How to internationalize exams2nops() by adding support for new natural languages in written R/exams (that can be automatically scanned and evaluated).

    Idea

    In addition to completely customizable PDF output, R/exams provides a standardized format called
    “NOPS” for written exams with multiple-choice
    (and/or single-choice) exercises that can be
    automatically generated, scanned, and evaluated.
In order to ensure that the automatic scanning works, the title page has a fixed layout
    that cannot be modified by R/exams users. However, to internationalize the format there
is a possibility for customizing the natural language support. A number of languages
are already available, but it is not difficult to add further languages or to tweak
    existing languages if desired.

    Example

    To illustrate how the language support works, once it has been fully incorporated into the
    exams package, we set up a short exam with three exercises:
    deriv2, tstat2,
    swisscapital. All of these are readily available
    in the package (and are actually in English).

library("exams")
myexam <- c("deriv2.Rnw", "tstat2.Rnw", "swisscapital.Rnw")

    Then we set up PDF output in English (en), German (de), and Spanish (es).
    By setting language most text on the title page is modified, only the name of the
    institution and the title of the exam have to be set separately. For the English
    example we produce n = 1 PDF output file in the output directory nops_pdf (created
    automatically).

set.seed(403)
exams2nops(myexam, n = 1, language = "en",
  institution = "R University", title = "Exam",
  dir = "nops_pdf", name = "en", date = "2018-01-08")

    Then we do the same for the other two languages.

set.seed(403)
exams2nops(myexam, n = 1, language = "de",
  institution = "R Universit\\\"at", title = "Klausur",
  dir = "nops_pdf", name = "de", date = "2018-01-08")
set.seed(403)
exams2nops(myexam, n = 1, language = "es",
  institution = "R Universidad", title = "Examen",
  dir = "nops_pdf", name = "es", date = "2018-01-08")

    The title pages of the resulting PDF files then have the desired languages.



    Language specification

    To add a new language, essentially just a single text file (say lang.dcf) is needed containing
    suitable translations of all the phrases on the title page as well as a few additional phrases,
e.g., occurring in the HTML evaluation reports etc.
    As an example, the first few phrases in English (en.dcf) are:

PersonalData: Personal Data
Name: Name
FamilyName: Family Name
GivenName: Given Name
Signature: Signature
RegistrationNumber: Registration Number
Checked: checked
NoChanges: In this section \textbf{no} changes or modifications must be made!
...

    And the corresponding translations to German (de.dcf) are:

PersonalData: Pers{\"o}nliche Daten
Name: Name
FamilyName: Nachname
GivenName: Vorname
Signature: Unterschrift
RegistrationNumber: Matrikelnummer
Checked: gepr{\"u}ft
NoChanges: In diesem Feld d{\"u}rfen \textbf{keine} Ver{\"a}nderungen der Daten vorgenommen werden!
...

Note that here LaTeX markup is used for the German umlauts and for bold highlighting. Alternatively,
    special characters can be added in a suitable encoding (typically UTF-8) but then the encoding has
    to be declared when calling exams2nops() (e.g., encoding = "UTF-8").

    Most of the phrases required in the .dcf are very straightforward and only some are a bit technical.
    There are also a couple of coordinates (MarkExample*) necessary for aligning some text lines.
    If you have set up your own lang.dcf you can easily pass it to exams2nops() by setting
    language = "/path/to/lang.dcf". The same has to be done for nops_eval() when evaluating the exam.
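A sketch of what that might look like; the file names and the nops_eval() arguments shown here are illustrative assumptions, so consult ?exams2nops and ?nops_eval for the authoritative interface:

# Sketch only: custom language file plus UTF-8 encoding for the exam PDFs,
# and the same language file again for the evaluation step.
exams2nops(myexam, n = 1, language = "/path/to/lang.dcf", encoding = "UTF-8",
  institution = "R University", title = "Exam",
  dir = "nops_pdf", name = "custom", date = "2018-01-08")
nops_eval(register = "registration.csv", solutions = "nops_pdf/custom.rds",
  scans = "nops_scan/scans.zip", language = "/path/to/lang.dcf")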

    Currently available languages

    Due to the kind support from friends and various dedicated R/exams users, there is already support
    for many important Western languages as well as a few other languages/countries. All of these
    are directly available in the R package (note that "tr" requires the current
    development version from R-Forge, though). But for convenience
    and manual inspection the .dcf files are also linked here.

File     Language    Contributor
de.dcf   German      Achim Zeileis
en.dcf   English     Achim Zeileis
es.dcf   Spanish     Maria Kogelnik
fi.dcf   Finnish     Klaus Nordhausen
fr.dcf   French      Arthur Allignol
hu.dcf   Hungarian   Gergely Daróczi & Dénes Tóth
it.dcf   Italian     Domenico Zambella
nl.dcf   Dutch       Niels Smits
pt.dcf   Portuguese  Mauricio Calvão & Fabian Petutschnig
ro.dcf   Romanian    Cristian Gatu
tr.dcf   Turkish     Emrah Er

Contributing new languages

    If you want to contribute a new language, simply set up a .dcf file starting out from one of the examples
    above and send the file or a link to
    .
Do not worry if not everything is 100% perfect yet; we can still sort this out together!
For Western languages (e.g., dk, se, no are still missing) the most robust solution is probably to
    code special characters in LaTeX. For languages requiring other alphabets (e.g., ru would be nice or Asian languages…)
    it is probably easiest to use UTF-8 encoding. Get in touch through e-mail, the
    support forum
    or on Twitter (@AchimZeileis) if you want to know more or need further details.


To leave a comment for the author, please follow the link and comment on their blog: R/exams.

    Diffusion/Wiener Model Analysis with brms – Part II: Model Diagnostics and Model Fit

    Sun, 01/07/2018 - 21:08

    (This article was first published on Computational Psychology - Henrik Singmann, and kindly contributed to R-bloggers)

This is the considerably belated second part of my blog series on fitting diffusion models (or better, the 4-parameter Wiener model) with brms. The first part discusses how to set up the data and model. This second part is concerned with perhaps the most important steps in each model-based data analysis, model diagnostics and the assessment of model fit. Note, the code in this part is completely self-sufficient and can be run without running the code of part I.

    Setup

    At first, we load quite a few packages that we will need down the way. Obviously brms, but also some of the packages from the tidyverse (i.e., dplyr, tidyr, tibble, and ggplot2). It took me a little time to jump on the tidyverse bandwagon, but now that I use it more and more I cannot deny its utility. If your data can be made ‘tidy’, the coherent set of tools offered by the tidyverse make many seemingly complicated tasks pretty easy. A few examples of this will be shown below. If you need more introduction, I highly recommend the awesome ‘R for Data Science’ book by Grolemund and Wickham, which they made available for free! We also need gridExtra for combining plots and DescTools for the concordance correlation coefficient CCC used below.

library("brms")
library("dplyr")
library("tidyr")
library("tibble")    # for rownames_to_column
library("ggplot2")
library("gridExtra") # for grid.arrange
library("DescTools") # for CCC

    As in part I, we need package rtdists for the data.

data(speed_acc, package = "rtdists")
speed_acc <- droplevels(speed_acc[!speed_acc$censor,]) # remove extreme RTs
speed_acc <- droplevels(speed_acc[ speed_acc$frequency %in% c("high", "nw_high"),])
speed_acc$response2 <- as.numeric(speed_acc$response)-1

    I have uploaded the binary R data file containing the fitted model object as well as the generated posterior predictive distributions to github, from which we can download them directly into R. Note that I needed to go the way via a temporary folder. If there is a way without that I would be happy to learn about it.

tmp <- tempdir()
download.file("https://singmann.github.io/files/brms_wiener_example_fit.rda",
              file.path(tmp, "brms_wiener_example_fit.rda"))
download.file("https://singmann.github.io/files/brms_wiener_example_predictions.rda",
              file.path(tmp, "brms_wiener_example_predictions.rda"))
load(file.path(tmp, "brms_wiener_example_fit.rda"))
load(file.path(tmp, "brms_wiener_example_predictions.rda"))

Model Diagnostics

    We already know from part I that there are a few divergent transitions. If this were a real analysis we therefore would not be satisfied with the current fit and try to rerun brm with an increased adapt_delta with the hope that this removes the divergent transitions. The Stan warning guidelines clearly state that “the validity of the estimates is not guaranteed if there are post-warmup divergences”. However, it is unclear what the actual impact of the small number of divergent transitions (< 10) observed here is on the posterior. Also, it is unclear what one can do if adapt_delta cannot be increased anymore and the model also cannot be reparameterized. Should all fits with any divergent transitions be completely disregarded? I hope the Stan team provides more guidelines to such questions in the future.
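For completeness, a sketch of that remedy (not run here); it assumes brms’ update() method passes the control settings on to Stan, and the particular values are placeholders:

# Refit with smaller sampler steps; adapt_delta/max_treedepth values are
# placeholders, not settings used in the original analysis.
fit_wiener2 <- update(fit_wiener,
                      control = list(adapt_delta = 0.99, max_treedepth = 15))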

Coming back to our fit, as a first step in our model diagnostics we check the R-hat statistic as well as the number of effective samples. Specifically, we look at the parameters with the highest R-hat values and the lowest numbers of effective samples.

tail(sort(rstan::summary(fit_wiener$fit)$summary[,"Rhat"]))
#                      sd_id__conditionaccuracy:frequencyhigh
#                                                        1.00
#                              r_id__bs[15,conditionaccuracy]
#                                                        1.00
#                                    b_bias_conditionaccuracy
#                                                        1.00
# cor_id__conditionspeed:frequencyhigh__ndt_conditionaccuracy
#                                                        1.00
#                                   sd_id__ndt_conditionspeed
#                                                        1.00
#  cor_id__conditionspeed:frequencynw_high__bs_conditionspeed
#                                                        1.01
head(sort(rstan::summary(fit_wiener$fit)$summary[,"n_eff"]))
#                                     lp__
#                                      462
#        b_conditionaccuracy:frequencyhigh
#                                      588
#                sd_id__ndt_conditionspeed
#                                      601
#      sd_id__conditionspeed:frequencyhigh
#                                      646
#           b_conditionspeed:frequencyhigh
#                                      695
# r_id[12,conditionaccuracy:frequencyhigh]
#                                      712

    Both are unproblematic (i.e., R-hat < 1.05 and n_eff > 100) and suggest that the sampler has converged on the stationary distribution. If anyone has a similar oneliner to return the number of divergent transitions, I would be happy to learn about it.
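One candidate, based on rstan’s sampler diagnostics (a sketch rather than an official brms helper), would be:

# Post-warmup divergent transitions per chain, and the total.
sapply(rstan::get_sampler_params(fit_wiener$fit, inc_warmup = FALSE),
       function(x) sum(x[, "divergent__"]))          # per chain
sum(sapply(rstan::get_sampler_params(fit_wiener$fit, inc_warmup = FALSE),
           function(x) sum(x[, "divergent__"])))     # total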

    We also visually inspect the chain behavior of a few semi-randomly selected parameters.

pars <- parnames(fit_wiener)
pars_sel <- c(sample(pars[1:10], 3), sample(pars[-(1:10)], 3))
plot(fit_wiener, pars = pars_sel, N = 6, ask = FALSE,
     exact_match = TRUE, newpage = TRUE, plot = TRUE)

This visual inspection confirms the earlier conclusion. For all parameters the posteriors look well-behaved and the chains appear to mix well.

    Finally, in the literature there are some discussions about parameter trade-offs for the diffusion and related models. These trade-offs supposedly make fitting the diffusion model in a Bayesian setting particularly complicated. To investigate whether fitting the Wiener model with HMC as implemented in Stan (i.e., NUTS) also shows this pattern we take a look at the joint posterior of the fixed-effects of the main Wiener parameters for the accuracy condition. For this we use the stanfit method of the pairs function and set the condition to "divergent__". This plots the few divergent transitions above the diagonal and the remaining samples below the diagonal.

    pairs(fit_wiener$fit, pars = pars[c(1, 3, 5, 7, 9)], condition = "divergent__")

    This plot shows some correlations, but nothing too dramatic. HMC appears to sample quite efficiently from the Wiener model.

    Next we also take a look at the correlations across all parameters (not only the fixed effects).

posterior <- as.mcmc(fit_wiener, combine_chains = TRUE)
cor_posterior <- cor(posterior)
cor_posterior[lower.tri(cor_posterior, diag = TRUE)] <- NA
cor_long <- as.data.frame(as.table(cor_posterior))
cor_long <- na.omit(cor_long)
tail(cor_long[order(abs(cor_long$Freq)),], 10)
#                              Var1                         Var2   Freq
# 43432        b_ndt_conditionspeed  r_id__ndt[1,conditionspeed] -0.980
# 45972 r_id__ndt[4,conditionspeed] r_id__ndt[11,conditionspeed]  0.982
# 46972        b_ndt_conditionspeed r_id__ndt[16,conditionspeed] -0.982
# 44612        b_ndt_conditionspeed  r_id__ndt[6,conditionspeed] -0.983
# 46264        b_ndt_conditionspeed r_id__ndt[13,conditionspeed] -0.983
# 45320        b_ndt_conditionspeed  r_id__ndt[9,conditionspeed] -0.984
# 45556        b_ndt_conditionspeed r_id__ndt[10,conditionspeed] -0.985
# 46736        b_ndt_conditionspeed r_id__ndt[15,conditionspeed] -0.985
# 44140        b_ndt_conditionspeed  r_id__ndt[4,conditionspeed] -0.990
# 45792        b_ndt_conditionspeed r_id__ndt[11,conditionspeed] -0.991

This table lists the ten largest absolute correlations among the posteriors for all pairwise combinations of parameters. The value in column Freq is, somewhat unintuitively, the observed correlation between the posteriors of the two parameters listed in the previous columns. To create this table I used a trick from SO using as.table, which is responsible for labeling the column containing the correlation value Freq.

What the table shows is some extreme correlations for the individual-level deviations (the first index in the square brackets of the parameter names seems to be the participant number). Let us visualize these correlations as well.

    pairs(fit_wiener$fit, pars = c("b_ndt_conditionspeed", "r_id__ndt[11,conditionspeed]", "r_id__ndt[4,conditionspeed]"), condition = "divergent__")

    This plot shows that some of the individual-level parameters are not well estimated.

    However, overall these extreme correlations appear rather rarely.

    hist(cor_long$Freq, breaks = 40)

    Overall the model diagnostics do not show any particularly worrying behavior (with the exception of the divergent transitions). We have learned that a few of the individual-level estimates for some of the parameters are not very trustworthy. However, this does not disqualify the overall fit. The main take away from this fact is that we would need to be careful in interpreting the individual-level estimates. Thus, we assume the fit is okay and continue with the next step of the analysis.

    Assessing Model Fit

    We will now investigate the model fit. That is, we will investigate whether the model provides an adequate description of the observed data. We will mostly do so via graphical checks. To do so, we need to prepare the posterior predictive distribution and the data. As a first step, we combine the posterior predictive distributions with the data.

    d_speed_acc <- as_tibble(cbind(speed_acc, as_tibble(t(pred_wiener))))

    Then we calculate three important measures (or test statistics T()) on the individual level for each cell of the design (i.e., combination of condition and frequency factors):

    • Probability of giving an upper boundary response (i.e., respond “nonword”).
    • Median RTs for responses to the upper boundary.
• Median RTs for responses to the lower boundary.

    We first calculate this for each sample of the posterior predictive distribution. We then summarize these three measures by calculating the median and some additional quantiles across the posterior predictive distribution. We calculate all of this in one step using a somewhat long combination of dplyr and tidyr magic.

d_speed_acc_agg <- d_speed_acc %>%
  group_by(id, condition, frequency) %>%  # select grouping vars
  summarise_at(
    .vars = vars(starts_with("V")),
    funs(prob.upper = mean(. > 0),
         medrt.lower = median(abs(.[. < 0]) ),
         medrt.upper = median(.[. > 0] )
    )) %>%
  ungroup %>%
  gather("key", "value", -id, -condition, -frequency) %>%  # remove grouping vars
  separate("key", c("rep", "measure"), sep = "_") %>%
  spread(measure, value) %>%
  group_by(id, condition, frequency) %>%  # select grouping vars
  summarise_at(
    .vars = vars(prob.upper, medrt.lower, medrt.upper),
    .funs = funs(median = median(., na.rm = TRUE),
                 llll = quantile(., probs = 0.01, na.rm = TRUE),
                 lll = quantile(., probs = 0.025, na.rm = TRUE),
                 ll = quantile(., probs = 0.1, na.rm = TRUE),
                 l = quantile(., probs = 0.25, na.rm = TRUE),
                 h = quantile(., probs = 0.75, na.rm = TRUE),
                 hh = quantile(., probs = 0.9, na.rm = TRUE),
                 hhh = quantile(., probs = 0.975, na.rm = TRUE),
                 hhhh = quantile(., probs = 0.99, na.rm = TRUE)
    ))

    Next, we calculate the three measures also for the data and combine it with the results from the posterior predictive distribution in one data.frame using left_join.

speed_acc_agg <- speed_acc %>%
  group_by(id, condition, frequency) %>%  # select grouping vars
  summarise(prob.upper = mean(response == "nonword"),
            medrt.upper = median(rt[response == "nonword"]),
            medrt.lower = median(rt[response == "word"])
  ) %>%
  ungroup %>%
  left_join(d_speed_acc_agg)

Aggregated Model-Fit

    The first important question is whether our model can adequately describe the overall patterns in the data aggregated across participants. For this we simply aggregate the results obtained in the previous step (i.e., the summary results from the posterior predictive distribution as well as the test statistics from the data) using mean.

d_speed_acc_agg2 <- speed_acc_agg %>%
  group_by(condition, frequency) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE) %>%
  ungroup

    We then use these summaries and plot predictions (in grey and black) as well as data (in red) for the three measures. The inner (fat) error bars show the 80% credibility intervals (CIs), the outer (thin) error bars show the 95% CIs. The black circle shows the median of the posterior predictive distributions.

new_x <- with(d_speed_acc_agg2,
              paste(rep(levels(condition), each = 2), levels(frequency), sep = "\n"))
p1 <- ggplot(d_speed_acc_agg2, aes(x = condition:frequency)) +
  geom_linerange(aes(ymin = prob.upper_lll, ymax = prob.upper_hhh), col = "darkgrey") +
  geom_linerange(aes(ymin = prob.upper_ll, ymax = prob.upper_hh), size = 2, col = "grey") +
  geom_point(aes(y = prob.upper_median), shape = 1) +
  geom_point(aes(y = prob.upper), shape = 4, col = "red") +
  ggtitle("Response Probabilities") +
  ylab("Probability of upper response") + xlab("") +
  scale_x_discrete(labels = new_x)
p2 <- ggplot(d_speed_acc_agg2, aes(x = condition:frequency)) +
  geom_linerange(aes(ymin = medrt.upper_lll, ymax = medrt.upper_hhh), col = "darkgrey") +
  geom_linerange(aes(ymin = medrt.upper_ll, ymax = medrt.upper_hh), size = 2, col = "grey") +
  geom_point(aes(y = medrt.upper_median), shape = 1) +
  geom_point(aes(y = medrt.upper), shape = 4, col = "red") +
  ggtitle("Median RTs upper") +
  ylab("RT (s)") + xlab("") +
  scale_x_discrete(labels = new_x)
p3 <- ggplot(d_speed_acc_agg2, aes(x = condition:frequency)) +
  geom_linerange(aes(ymin = medrt.lower_lll, ymax = medrt.lower_hhh), col = "darkgrey") +
  geom_linerange(aes(ymin = medrt.lower_ll, ymax = medrt.lower_hh), size = 2, col = "grey") +
  geom_point(aes(y = medrt.lower_median), shape = 1) +
  geom_point(aes(y = medrt.lower), shape = 4, col = "red") +
  ggtitle("Median RTs lower") +
  ylab("RT (s)") + xlab("") +
  scale_x_discrete(labels = new_x)
grid.arrange(p1, p2, p3, ncol = 2)


Inspection of the plots shows no dramatic misfit. Overall the model appears to be able to describe the general patterns in the data. Only the response probabilities for words (i.e., frequency = high) appear to be estimated too low. The red x's appear to be outside the 80% CIs and possibly also outside the 95% CIs.

    The plots of the RTs show an interesting (but not surprising) pattern. The posterior predictive distributions for the rare responses (i.e., “word” responses for upper/non-word stimuli and “nonword” response to lower/word stimuli) are relatively wide. In contrast, the posterior predictive distributions for the common responses are relatively narrow. In each case, the observed median is inside the 80% CI and also quite near to the predicted median.

    Individual-Level Fit

    To investigate the pattern of predicted response probabilities further, we take a look at them on the individual level. We again plot the response probabilities in the same way as above, but separated by participant id.

ggplot(speed_acc_agg, aes(x = condition:frequency)) +
  geom_linerange(aes(ymin = prob.upper_lll, ymax = prob.upper_hhh), col = "darkgrey") +
  geom_linerange(aes(ymin = prob.upper_ll, ymax = prob.upper_hh), size = 2, col = "grey") +
  geom_point(aes(y = prob.upper_median), shape = 1) +
  geom_point(aes(y = prob.upper), shape = 4, col = "red") +
  facet_wrap(~id, ncol = 3) +
  ggtitle("Predicted (in grey) and observed (red) response probabilities by ID") +
  ylab("Probability of upper response") + xlab("") +
  scale_x_discrete(labels = new_x)

    This plot shows a similar pattern as the aggregated data. For none of the participants do we observe dramatic misfit. Furthermore, response probabilities to non-word stimuli appear to be predicted rather well. In contrast, response probabilities for word-stimuli are overall predicted to be lower than observed. However, this misfit does not seem to be too strong.

    As a next step we look at the coverage probabilities of our three measures across individuals. That is, we calculate for each of the measures, for each of the cells of the design, and for each of the CIs (i.e., 50%, 80%, 95%, and 99%), the proportion of participants for which the observed test statistics falls into the corresponding CI.

speed_acc_agg %>%
  mutate(prob.upper_99 = (prob.upper >= prob.upper_llll) & (prob.upper <= prob.upper_hhhh),
         prob.upper_95 = (prob.upper >= prob.upper_lll) & (prob.upper <= prob.upper_hhh),
         prob.upper_80 = (prob.upper >= prob.upper_ll) & (prob.upper <= prob.upper_hh),
         prob.upper_50 = (prob.upper >= prob.upper_l) & (prob.upper <= prob.upper_h),
         medrt.upper_99 = (medrt.upper > medrt.upper_llll) & (medrt.upper < medrt.upper_hhhh),
         medrt.upper_95 = (medrt.upper > medrt.upper_lll) & (medrt.upper < medrt.upper_hhh),
         medrt.upper_80 = (medrt.upper > medrt.upper_ll) & (medrt.upper < medrt.upper_hh),
         medrt.upper_50 = (medrt.upper > medrt.upper_l) & (medrt.upper < medrt.upper_h),
         medrt.lower_99 = (medrt.lower > medrt.lower_llll) & (medrt.lower < medrt.lower_hhhh),
         medrt.lower_95 = (medrt.lower > medrt.lower_lll) & (medrt.lower < medrt.lower_hhh),
         medrt.lower_80 = (medrt.lower > medrt.lower_ll) & (medrt.lower < medrt.lower_hh),
         medrt.lower_50 = (medrt.lower > medrt.lower_l) & (medrt.lower < medrt.lower_h)
  ) %>%
  group_by(condition, frequency) %>%  ## grouping factors without id
  summarise_at(vars(matches("\\d")), mean, na.rm = TRUE) %>%
  gather("key", "mean", -condition, -frequency) %>%
  separate("key", c("measure", "ci"), "_") %>%
  spread(ci, mean) %>%
  as.data.frame()
#    condition frequency     measure    50     80    95    99
# 1   accuracy      high medrt.lower 0.706 0.8824 0.882 1.000
# 2   accuracy      high medrt.upper 0.500 0.8333 1.000 1.000
# 3   accuracy      high  prob.upper 0.529 0.7059 0.765 0.882
# 4   accuracy   nw_high medrt.lower 0.500 0.8125 0.938 0.938
# 5   accuracy   nw_high medrt.upper 0.529 0.8235 1.000 1.000
# 6   accuracy   nw_high  prob.upper 0.529 0.8235 0.941 0.941
# 7      speed      high medrt.lower 0.471 0.8824 0.941 1.000
# 8      speed      high medrt.upper 0.706 0.9412 1.000 1.000
# 9      speed      high  prob.upper 0.000 0.0588 0.588 0.647
# 10     speed   nw_high medrt.lower 0.706 0.8824 0.941 0.941
# 11     speed   nw_high medrt.upper 0.471 0.7647 1.000 1.000
# 12     speed   nw_high  prob.upper 0.235 0.6471 0.941 1.000

    As can be seen, for the RTs, the coverage probability is generally in line with the width of the CIs or even above it. Furthermore, for the common response (i.e., upper for frequency = nw_high and lower for frequency = high), the coverage probability is 1 for the 99% CIs in all cases.

Unfortunately, for the response probabilities, the coverage is not that great, especially in the speed condition and for tighter CIs. However, for the wide CIs the coverage probabilities are at least acceptable. Overall the results so far suggest that the model provides an adequate account. There are some misfits that should be kept in mind if one is interested in extending the model or fitting it to new data, but overall it provides a satisfactory account.

    QQ-plots: RTs

The final approach for assessing the fit of the model will be based on more quantiles of the RT distribution (i.e., so far we only looked at the .5 quantile, the median). We will then plot individual observed versus predicted (i.e., mean from the posterior predictive distribution) quantiles across conditions. For this we first calculate the quantiles per sample from the posterior predictive distribution and then aggregate across the samples. This is achieved via dplyr::summarise_at using a list column and tidyr::unnest to unstack the columns (see section 25.3 in “R for Data Science”). We then combine the aggregated predicted RT quantiles with the observed RT quantiles.

quantiles <- c(0.1, 0.25, 0.5, 0.75, 0.9)
pp2 <- d_speed_acc %>%
  group_by(id, condition, frequency) %>%  # select grouping vars
  summarise_at(
    .vars = vars(starts_with("V")),
    funs(lower = list(rownames_to_column(
           data.frame(q = quantile(abs(.[. < 0]), probs = quantiles)))),
         upper = list(rownames_to_column(
           data.frame(q = quantile(.[. > 0], probs = quantiles ))))
    )) %>%
  ungroup %>%
  gather("key", "value", -id, -condition, -frequency) %>%  # remove grouping vars
  separate("key", c("rep", "boundary"), sep = "_") %>%
  unnest(value) %>%
  group_by(id, condition, frequency, boundary, rowname) %>%  # grouping vars + new vars
  summarise(predicted = mean(q, na.rm = TRUE))

rt_pp <- speed_acc %>%
  group_by(id, condition, frequency) %>%  # select grouping vars
  summarise(lower = list(rownames_to_column(
              data.frame(observed = quantile(rt[response == "word"], probs = quantiles)))),
            upper = list(rownames_to_column(
              data.frame(observed = quantile(rt[response == "nonword"], probs = quantiles ))))
  ) %>%
  ungroup %>%
  gather("boundary", "value", -id, -condition, -frequency) %>%
  unnest(value) %>%
  left_join(pp2)

To evaluate the agreement between observed and predicted quantiles we calculate, for each cell and quantile, the concordance correlation coefficient (CCC; e.g., Barchard, 2012, Psych. Methods). The CCC is a measure of absolute agreement between two values and thus better suited than a simple correlation. It is scaled from -1 to 1, where 1 represents perfect agreement, 0 no relationship, and -1 corresponds to a correlation of -1 with the same mean and variance for the two variables.
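As a quick toy illustration (not part of the original analysis), a constant offset leaves the Pearson correlation at 1 but clearly lowers the CCC:

# Perfect linear relationship, but shifted by a constant offset.
x <- 1:10
y <- 1:10 + 2
cor(x, y)            # 1
CCC(x, y)$rho.c$est  # noticeably below 1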

    The following code produces QQ-plots for each condition and quantile separately for responses to the upper boundary and lower boundary. The value in the upper left of each plot gives the CCC measures of absolute agreement.

plot_text <- rt_pp %>%
  group_by(condition, frequency, rowname, boundary) %>%
  summarise(ccc = format(
    CCC(observed, predicted, na.rm = TRUE)$rho.c$est,
    digits = 2))

p_upper <- rt_pp %>%
  filter(boundary == "upper") %>%
  ggplot(aes(x = observed, predicted)) +
  geom_abline(slope = 1, intercept = 0) +
  geom_point() +
  facet_grid(condition+frequency~ rowname) +
  geom_text(data=plot_text[ plot_text$boundary == "upper", ],
            aes(x = 0.5, y = 1.8, label=ccc),
            parse = TRUE, inherit.aes=FALSE) +
  coord_fixed() +
  ggtitle("Upper responses") +
  theme_bw()

p_lower <- rt_pp %>%
  filter(boundary == "lower") %>%
  ggplot(aes(x = observed, predicted)) +
  geom_abline(slope = 1, intercept = 0) +
  geom_point() +
  facet_grid(condition+frequency~ rowname) +
  geom_text(data=plot_text[ plot_text$boundary == "lower", ],
            aes(x = 0.5, y = 1.6, label=ccc),
            parse = TRUE, inherit.aes=FALSE) +
  coord_fixed() +
  ggtitle("Lower responses") +
  theme_bw()

grid.arrange(p_upper, p_lower, ncol = 1)

    Results show that overall the fit is better for the accuracy than the speed conditions. Furthermore, fit is better for the common response (i.e., nw_high for upper and high for lower responses). This latter observation is again not too surprising.

    When comparing the fit for the different quantiles it appears that at least the median (i.e., 50%) shows acceptable values for the common response. However, especially in the speed condition the account of the other quantiles is not great. Nevertheless, dramatic misfit is only observed for the rare responses.

One possibility for some of the low CCCs in the speed conditions may be the comparatively low variances in some of the cells. For example, for both speed conditions that are common (i.e., speed & nw_high for upper responses and speed & high for lower responses) the visual inspection of the plot suggests an acceptable account while at the same time some CCC values are low (i.e., < .5). Only for the 90% quantile in the speed conditions (and somewhat less the 75% quantile) do we see some systematic deviations. The model predicts lower RTs than observed.

Taken together, the model appears to provide an at least acceptable account. The only slightly worrying patterns are (a) that the model predicts slightly better performance for the word stimuli than observed (i.e., a lower predicted rate of non-word responses than observed for word stimuli) and (b) that in the speed conditions the model predicts somewhat longer RTs for the 75% and 90% quantiles than observed.

    The next step will be to look at differences between parameters as a function of the speed-accuracy condition. However, this will be the topic of the third blog post. I am hopeful it will not take two months this time.



To leave a comment for the author, please follow the link and comment on their blog: Computational Psychology - Henrik Singmann.

    Anscombe’s Quartet: 1980’s Edition

    Sun, 01/07/2018 - 19:57

    (This article was first published on Method Matters, and kindly contributed to R-bloggers)

    In this post, I’ll describe a fun visualization of Anscombe’s quartet I whipped up recently.

If you aren’t familiar with Anscombe’s quartet, here’s a brief description from its Wikipedia entry: “Anscombe's quartet comprises four datasets that have nearly identical simple descriptive statistics, yet appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties. He described the article as being intended to counter the impression among statisticians that ‘numerical calculations are exact, but graphs are rough.’”

    In essence, there are 4 different datasets with quite different patterns in the data. Fitting a linear regression model through each dataset yields (nearly) identical regression coefficients, while graphing the data makes it clear that the underlying patterns are very different. What’s amazing to me is how these simple data sets (and accompanying graphs) make immediately intuitive the importance of data visualization, and drive home the point of how well-constructed graphs can help the analyst understand the data he or she is working with. 
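As a quick check of that claim (not from the original post), fitting the four regressions returns nearly identical intercepts and slopes:

# Fit y_i ~ x_i for each of the four built-in anscombe datasets and
# compare the estimated coefficients side by side.
sapply(1:4, function(i) {
  coef(lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]]))
})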

    The Anscombe data are included in base R, and these data (and R, of course!) are used to produce the plot that accompanies the Wikipedia entry on Anscombe’s quartet.

    Because the 1980’s are back, I decided to make a visualization of Anscombe’s quartet using the like-most-totally-rad 1980’s graphing elements I could come up with. I was aided with the colors by a number of graphic design palettes with accompanying hex codes. I used the excellent showtext package for the 1980’s font, which comes from the Google font “Press Start 2P.” (Note, if you’re reproducing the graph at home, the fonts won’t show properly in RStudio. Run the code in the standalone R program and everything works like a charm). I had to tweak a number of graphical parameters in order to get the layout right, but in the end I’m quite pleased with the result.

    The Code

    # showtext library to get 1980's font
    # "Press Start 2P"
    library(showtext)
    # add the font from Google fonts
    font.add.google("Press Start 2P", "start2p")


    png("C:\\Directory\\Anscombe_80s.png",
    width=11,height=8, units='in', res=600)

    showtext.begin()
# 2x2 layout, black background, retro colours and the arcade font for axes/labels
op <- par(las=1, mfrow=c(2,2), mar=1.5+c(4,4,.5,1), oma=c(0,0,5,0),
    lab=c(6,6,7), cex.lab=12.0, cex.axis=5, mgp=c(3,1,0), bg = 'black',
    col.axis = '#F2CC00', col.lab = '#A10EEC', family = 'start2p')
    ff <- y ~ x
# loop over the four Anscombe datasets: build the formula y_i ~ x_i,
# fit the regression and plot the points with the fitted line
for(i in 1:4) {
    ff[[2]] <- as.name(paste("y", i, sep=""))
    ff[[3]] <- as.name(paste("x", i, sep=""))
    lmi <- lm(ff, data= anscombe)
    xl <- substitute(expression(x[i]), list(i=i))
    yl <- substitute(expression(y[i]), list(i=i))
    plot(ff, data=anscombe, col="#490E61", pch=21, cex=2.4, bg = "#32FA05",
    xlim=c(3,19), ylim=c(3,13)
    , xlab=eval(xl), ylab=yl,
    family = 'start2p'
    )
    abline(lmi, col="#FA056F", lwd = 5)
    axis(1, col = '#FA1505')
    axis(2, col = '#FA1505')
    box(col="#FA1505")
    }
    mtext("Anscombe's Quartet", outer = TRUE,
    cex = 20, family = 'start2p', col="#FA1505")
    showtext.end()

    dev.off()

    The Plot

    Conclusion

    In this post, I used data available in R to make a 1980’s-themed version of the Anscombe quartet graphs. The main visual elements I manipulated were the colors and the fonts. R’s wonderful and flexible plotting capabilities (here using base R!) made it very straightforward to edit every detail of the graph to achieve the desired retro-kitsch aesthetic.

    OK, so maybe this isn’t the most serious use of R for data analysis and visualization. There are doubtless more important business cases and analytical problems to solve. Nevertheless, this was super fun to do. Data analysis (or data science, or whatever you’d like to call it) is a field in which there are countless ways to be creative with data. It’s not always easy to bring this type of creativity to every applied project, but this blog is a place where I can do any crazy thing I set my mind to and just have fun. Judging by that standard, I think this project was a success.

    Coming Up Next

    In the next post, I’ll do something a little bit different with data. Rather than doing data analysis, I’ll describe a project in which I used Python to manage and edit meta-data (ID3 tags) in mp3 files. Stay tuned!


To leave a comment for the author, please follow the link and comment on their blog: Method Matters.
